This week I mainly worked on three projects: the Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network. For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets. Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday. By Wednesday I had ticked off most of the items, which I’ll summarise here.
Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only advanced searches for words, so I investigated and fixed this. I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to’ and ‘the’). This worked well and has bumped matches up to better colour levels, but has resulted in some words getting matched multiple times: removing ‘to’, ‘of’ etc. can leave a stripped form that then appears more than once. For example, in one category the OED has ‘bless’ twice (presumably an error?) and the HT has ‘bless’ and ‘bless to’. With ‘to’ removed there then appear to be more matches than there should be. However, this is not an issue when dates are also taken into consideration. I also updated the script so that categories with 3 matches where at least 66% of words match are promoted from orange to yellow.
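The stripping step can be sketched roughly as follows. This is a minimal illustration only: `strip_lexeme` is a made-up name, and the real stoplist contains more entries than the few words shown here.

```python
# Illustrative only: the real stoplist and matching scripts are larger.
STOPLIST = {'to', 'the', 'of'}

def strip_lexeme(lexeme):
    """Remove stoplist words so that e.g. 'bless to' compares equal to 'bless'."""
    return ' '.join(w for w in lexeme.lower().split() if w not in STOPLIST)

# The downside described above: distinct forms can collapse together,
# producing apparent duplicate matches unless dates are also compared.
# strip_lexeme('bless to') and strip_lexeme('bless') both yield 'bless'.
```

This is why the duplicate-match problem only bites when dates are ignored: two lexemes that collapse to the same stripped form will usually still have different first dates.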
When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off, and whether category heading pattern matching had worked properly. After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is that they contain too few words to meet the criteria for ticking off.
Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but with different numbers. I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria). There are currently 19 green, 17 lime green and 25 yellow matches that could be ticked off. I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ I ticked off the few matches that there were. Most were empty categories and there were fewer than 15 in total.
Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms. We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories. It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms. This has helped to more accurately identify matched categories. I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.
I also worked on the date fingerprinting script. This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data. I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified. I ticked off all green (1556), lime green (22) and yellow (123) matches.
Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year. The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process. For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier). There are 7148 OED categories that are currently unmatched but were matched in V1. Almost 4000 of these are empty categories. There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match. But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words. It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2. For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02. There are possibly patterns in the 1504 orange matches that could be exploited too.
Finally, I updated the stats page to include information about main and subcats. Here are the current unmatched figures:
Unmatched (with POS): 8629
Unmatched (with POS and not empty): 3414
Unmatched Main Categories (with POS): 5036
Unmatched Main Categories (with POS and not empty): 1661
Unmatched Subcategories (with POS): 3573
Unmatched Subcategories (with POS and not empty): 1753
So we are getting there!
For the Bilingual Thesaurus I completed an initial version of the website this week. I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise. This might be changed again, but for now here is an example of how the resource looks:
The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters. As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’ but ‘bread’ won’t match anything because there are no words or categories with this exact text. You need to use an asterisk wildcard to find text within word or category text: ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.
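Under the hood this sort of asterisk handling usually amounts to translating the wildcards into an SQL `LIKE` pattern. A small sketch of the idea (`wildcard_to_like` is a hypothetical helper name, not the site’s actual code):

```python
def wildcard_to_like(term):
    """Translate the quick search's asterisk wildcards into an SQL LIKE
    pattern; a term with no asterisk stays an exact match."""
    # Escape LIKE's own metacharacters in the user's term first
    escaped = term.replace('%', r'\%').replace('_', r'\_')
    return escaped.replace('*', '%')

# 'bread*'  -> 'bread%'   (starts with)
# '*bread'  -> '%bread'   (ends with)
# '*bread*' -> '%bread%'  (contains)
# 'ale'     -> 'ale'      (exact match)
```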
The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates, and languages of origin and citation. Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period. E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.
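The ‘active in the period’ check is a standard interval overlap test; a tiny sketch (the function name is illustrative):

```python
def was_active(word_start, word_end, search_start, search_end):
    """A word is 'active' in the searched period if its date span
    overlaps the searched range at all (standard interval overlap)."""
    return word_start <= search_end and word_end >= search_start

# 'Edifier' (1100-1350) overlaps a 1330-1360 search, so it is returned.
```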
As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned. Where search types allow multiple options to be selected (i.e. part of speech and languages of origin and citation), selected options within each list are joined by ‘OR’. E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages of origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
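The AND-between-fields / OR-within-a-field behaviour maps naturally onto SQL `IN` clauses. A hedged sketch of how such a query might be assembled (the function and field names are my assumptions, not the site’s actual code):

```python
def build_where(criteria):
    """criteria maps a field to the list of options ticked for it.
    Options within a field are ORed (via IN); fields are ANDed together."""
    clauses, params = [], []
    for field, values in criteria.items():
        placeholders = ', '.join('?' for _ in values)
        clauses.append(f'{field} IN ({placeholders})')  # OR within a field
        params.extend(values)
    return ' AND '.join(clauses), params  # AND between fields

sql, params = build_where({
    'pos': ['noun', 'verb'],
    'language_of_origin': ['Dutch', 'Flemish', 'Italian'],
})
# sql == 'pos IN (?, ?) AND language_of_origin IN (?, ?, ?)'
```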
For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on PowerPoint presentations that had been sent to me. This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations, formatting the text as HTML, reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site), uploading the audio files, adding in the HTML5 audio tags to get the audio files to play, and creating the individual pages for each timeline entry / storymap entry. It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 PowerPoint slides. Still, the end result works really well, so I think it’s worth putting the effort in.
In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app. I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create. The app would be interactive and highly dependent on logging user interactions as accurately as possible. I looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, which the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough. In fact, on touchscreens more often than not a ‘tap’ wasn’t even being registered. D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are, so unfortunately it looks like I won’t be able to help out with this project. Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in this sort of timescale.
Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server. I think that’s just about everything to report.
I spent most of my time this week split between three projects: the HT / OED category linking, the REELS project and the Bilingual Thesaurus. For the HT I continued to work on scripts to try and match up the HT and OED categories. This week I updated all of the scripts currently in use so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data. Doing this has significantly improved the matching on the first date lexeme matching script. Greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788. I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists, and included two new columns. The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match, while the second is this figure as a percentage of the total number of OED lexemes.
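The date extraction step might look something like this – an illustrative sketch only (`first_four_digits` is a made-up name; the real scripts work directly against the project databases):

```python
import re

def first_four_digits(date_str):
    """Pull the first four numeric characters out of a date field,
    converting 'OE' to 1000 first, as described above."""
    digits = re.sub(r'\D', '', date_str.replace('OE', '1000'))
    return digits[:4] if len(digits) >= 4 else None

# 'OE-1350' -> '1000';  'c1205-' -> '1205'
```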
However, taking dates in isolation currently isn’t working very well, as a date that appears multiple times generates multiple matches. So, for example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date. But lots of these have the same dates, meaning the total count of matched dates is 99, or 152% of the number of OED words. Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times it appears in a category as its ‘date fingerprint’.
I created a new script to look at ‘date fingerprints’. The script generates arrays of categories for HT and OED unmatched categories. The dates of each word (or each word with a GHT date in the case of OED) in every category are extracted and a count of these is created (e.g. if the OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]). I ran this against the HT database to see what matches.
The script takes about half an hour to process. It grabs each unmatched OED category that contains words, picks out those that have GHT dates, gets the first four numerical figures of each and counts how many times this appears in the category. It does the same for all unmatched HT categories and their ‘fulldate’ column too. The script then goes through each OED category and for each goes through every HT category to find any that have not just the same dates, but the same number of times each date appears too. If everything matches the information about the matched categories is displayed.
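The whole fingerprint comparison can be sketched as follows. This is illustrative only: the real script works against the database rather than in-memory dictionaries, and the function names are mine.

```python
from collections import Counter

def fingerprint(first_dates):
    """A category's 'date fingerprint': each start date plus the number of
    times it occurs, e.g. [1000, 1000, 1000, 1234] -> {1000: 3, 1234: 1}."""
    return Counter(first_dates)

def find_fingerprint_matches(oed_cats, ht_cats):
    """oed_cats / ht_cats map category ids to lists of first dates.
    Two categories match only if the dates AND their counts agree."""
    return [(oed_id, ht_id)
            for oed_id, oed_dates in oed_cats.items()
            for ht_id, ht_dates in ht_cats.items()
            if fingerprint(oed_dates) == fingerprint(ht_dates)]
```

Because `Counter` equality compares both the dates and their frequencies, a category with three words dated 1000 will not match one with only two, which is exactly the ‘same dates, same number of times’ rule described above.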
The output has the same layout as the other scripts but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories. This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique. For an example of this search for our old favourite ‘extra-terrestrial’ and you’ll see that as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches. Nothing other than the dates are used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy. But I think this script is going to be very useful.
I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success. It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are being complicated by polysemous HT forms in different parts of speech.
With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off. Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off. There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off. That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones. I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words. I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong. 1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388. On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.
For the REELS project I continued going through my list of things to do before the project launch. This included reworking the Advanced Search layout, adding in tooltip text, updating the start date browse (which was including ‘inactive’ data in its count), creating some further icons for combinations of classification codes, adding in Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that allows you to specify whether your element search looks at current forms, historical forms or both, adding in Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live. You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names)
I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a historical form from 1797. The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’. The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source. Similarly, if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though the only historical form before 1500 is ‘(lands of) Westmaynis’. This is because, again, the search is ‘get me all of the place-names with a historical form before 1500 that have any historical form including the text “N. mains”’. We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website. Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.
For the Bilingual Thesaurus I continued to implement the search facilities for the resource. This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work. After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things. I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working. I then made a start on the advanced search form.
Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.
I returned to work on Monday after being off last week. As usual there were a bunch of things waiting for me to sort out when I got back, so most of Monday was spent catching up with things. This included replying to Scott Spurlock about his Crowdsourcing project, responding to a couple of DSL related issues, updating access restrictions on the SPADE website, reading through the final versions of the DMP and other documentation for Matt Sangster and Katie Halsey’s project, updating some details on the Medical Humanities Network website, responding to a query about the use of the Thesaurus of Old English and speaking to Thomas Clancy about his Iona proposal.
With all that out of the way I returned to the OED / HT data linking issues for the Historical Thesaurus. In my absence last week Marc and Fraser had made some further progress with the linking, and had made further suggestions as to what strategies I should attempt to implement next. Before I left I was very much in the middle of working on a script that matched words and dates, and I hadn’t had time to figure out why this script was bringing back no matches. It turns out the HT ‘fulldate’ field was using long dashes, whereas I was joining the OED GHT dates with a short dash. So all matches failed. I replaced the long dashes with short ones and the script then displayed 2733 ‘full matches’ (where every stripped lexeme and its dates match) and 99 ‘partial matches’ (where more than 6 and 80% match both dates and stripped lexeme text). I also added in a new column that counts the number of matches not including dates.
Marc had alerted me to an issue where the number of OED matches was coming back as more than 100% so I then spent some time trying to figure out what was going on here. I updated both the ‘with dates’ and ‘no date check’ versions of the lexeme pattern matching scripts to add in the text ‘perc error’ to any percentage that’s greater than 100, to more easily search for all occurrences. There are none to be found in the script with dates, as matches are only added to the percentage score if their dates match too. On the ‘no date check’ script there are several of these ‘perc error’ rows and they’re caused for the most part by a stripped form of the word being identical to an existing non-stripped form. E.g. there are separate lexemes ‘she’ and ‘she-‘ in the HT data, and the dash gets stripped, so ‘she’ in the OED data ends up matching two HT words. There are some other cases that look like errors in the original data, though. E.g. in OED catid 91505 severity there’s the HT word ‘hard (OE-)’ and ‘hard (c1205-)’ and we surely shouldn’t have this word twice. Finally there are some forms where stripping out words results in duplicates – e.g. ‘pro and con’ and ‘pro or con’ both end up as ‘pro con’ in both OED and HT lexemes, leading to 4 matches where there should only be 2. There are no doubt situations where the total percentage is pushed over the 80% threshold or to 100% by a duplicate match – any duplicate matches where the percentage doesn’t get over 100 are not currently noted in the output. This might need some further work. Or, as I previously said, with the date check incorporated the duplicates are already filtered out, so it might not be so much of an issue.
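A tiny example of how a stripped-form collision pushes the percentage over 100. The stripping rule here is deliberately simplified to just the hyphen case mentioned above:

```python
def stripped(lexeme):
    """Deliberately simplified stripping: just drop trailing hyphens."""
    return lexeme.rstrip('-').lower()

ht_words = ['she', 'she-']
oed_words = ['she']
# Each OED word is compared against every HT word's stripped form,
# so a single OED word can score two matches:
matches = sum(1 for o in oed_words for h in ht_words
              if stripped(o) == stripped(h))
percentage = matches / len(oed_words) * 100  # 200.0 -> flagged as 'perc error'
```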
I also then moved on to a new script that looks at monosemous forms. This script gets all of the unmatched OED categories that have a POS and at least one word and for each of these categories it retrieves all of the OED words. For each word the script queries the OED lexeme table to get a count of the number of times the word appears. Note that this is the full word, not the ‘stripped’ form, as the latter might end up with erroneous duplicates, as mentioned above. Each word, together with its OED date and GHT dates (in square brackets) and a count of the number of times it appears in the OED lexeme table is then listed. If an OED word only appears once (i.e. is monosemous) it appears in bold text. For each of these monosemous words the script then queries the HT data to find out where and how many times each of these words appears in the unmatched HT categories. All queries keep to the same POS but otherwise look at all unmatched categories, including those without an OEDmaincat. Four different checks are done, with results appearing in different columns: HT words where full word (not the stripped variety) matches and the GHT start date matches the HT start date; failing that, HT words where the full word matches but the dates don’t; failing either of these, HT words where the stripped forms of the words match and the dates match; failing all these, HT words where the stripped forms match but the dates don’t. For each of these the HT catid, OEDmaincat (or the text ‘No Maincat’ if there isn’t one), subcat, POS, heading, lexeme and fulldate are displayed. There are lots of monosemous words that just don’t appear in the HT data. These might be new additions or we might need to try pattern matching. Also, sometimes words that are monosemous in the OED data are polysemous in the HT data. These are marked with a red background in the data (as opposed to green for unique matches). Examples of these are ‘sedimental’, ‘meteorologically’, ‘of age’. 
Any category that has a monosemous OED word that is polysemous in the HT has a red border. I also added in some stats below the table. In our unmatched OED categories there are 24184 monosemous forms. There are 8086 OED categories that have at least one monosemous form that matches exactly one HT form. There are 220 OED monosemous forms that are polysemous in the HT. Now we just need to decide how to use this data.
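The four-tier check described above boils down to something like the following. This is a simplified sketch: `classify_match` and the pluggable `strip` function are my illustrative names, not the script’s actual code.

```python
def classify_match(oed_word, oed_date, ht_word, ht_date, strip):
    """The four checks applied in order, strongest first; 'strip' is
    whatever lexeme-stripping function is in use. Returns None if the
    pair fails every check."""
    if oed_word == ht_word and oed_date == ht_date:
        return 'full word + date'
    if oed_word == ht_word:
        return 'full word only'
    if strip(oed_word) == strip(ht_word) and oed_date == ht_date:
        return 'stripped + date'
    if strip(oed_word) == strip(ht_word):
        return 'stripped only'
    return None
```

Each monosemous OED word is only credited with the strongest check it passes, which is why the results appear in four separate columns rather than being double-counted.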
Also this week I looked into an issue one of the REELS team was having when accessing the content management system (it turns out that some anti-virus software was mislabelling the site as having some kind of phishing software in it), and responded to a query about the Decadence and Translation Network website I’d set up. I also started to look at sourcing some Data Management Plans for an Arts Lab workshop that Dauvit Broun has asked me to help with next week. I also started to prepare my presentation for the Digital Editions workshop next week, which took a fair amount of time. I also met with Jennifer Smith and a new member of the SCOSYA project team on Friday morning to discuss the project and to show the new member of staff how the content management system works. It looks like my involvement with this project might be starting up again fairly soon.
On Tuesday Jeremy Smith contacted me to ask me to help out with a very last minute proposal that he is putting together. I can’t say much about the proposal, but it had a very tight deadline and required rather a lot of my time from the middle of the week onwards (and even into the weekend). This involved lots of email exchanges, time spent reading documentation, meeting with Luca, who might be doing the technical work for the project if it gets funded, and writing a Data Management Plan for the project. This all meant that I was unable to spend time working on other projects I’d hoped to work on this week, such as the Bilingual Thesaurus. Hopefully I’ll have time to get back into this next week, once the workshops are out of the way.
This was a week of many different projects, most of which involved fairly small jobs, though some took up most of my time. I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app. I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on. I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and had a chat with Rachel Macdonald about some further updates to the SPADE website. I made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies, had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly. I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley, and made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.
I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website. There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through. I also need to look into developing an overarching timeline with contextual events.
I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock. Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image. I updated the site so that users can no longer right click on an image to save or view it. This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images. If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image. I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.
I also spent a bit of time continuing to work on the Bilingual Thesaurus. I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously. This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’. I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me. I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”. My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”). I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one. If needs be I’ll change this. Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus. I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this. I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout. Next week I’ll see about starting to integrate the data itself.
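The de-duplication rule described above – keep one entry per language, preferring the uncertain ‘?’ form where both occur – can be sketched like this (the function name is hypothetical, and the behaviour may change if Louise’s answer requires it):

```python
def dedupe_languages(raw):
    """Collapse a pipe-separated language list to one entry per language;
    where both an uncertain ('?Old English') and a certain ('Old English')
    form occur, keep the uncertain one."""
    langs = raw.split('|')
    result = []
    for lang in langs:
        base = lang.lstrip('?')
        keep = '?' + base if '?' + base in langs else base
        if keep not in result:
            result.append(keep)
    return result

# 'Old English|?Old English|Middle Dutch|Middle Dutch|Old English'
#   -> ['?Old English', 'Middle Dutch']
```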
This just leaves the big project of the week to discuss: the ongoing work to align the HT and OED datasets. I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week. Last week I ‘dematched’ 2412 categories that don’t have a perfectly matching number of lexemes but do share the same parent category. I created a further script that checks how many lexemes in these potentially matched categories are the same. This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped). A percentage of the number of HT words that are matched is also displayed. If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green. If the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1 this is also considered a match. If the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1 this is also considered a match. The total matches given are 1154 out of 2412.
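Those count-based criteria can be expressed compactly – an illustrative sketch, not the script’s actual code:

```python
def counts_match(ht_count, oed_count, matched):
    """A perfect count match, or all of one side's words matched with the
    other side's count off by at most one."""
    return (ht_count == oed_count == matched
            or (matched == ht_count and abs(oed_count - ht_count) <= 1)
            or (matched == oed_count and abs(ht_count - oed_count) <= 1))
```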
I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process. There are 1407 manual matches in the system. Of these:
- 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1, and 100% of HT words match OED words, or the categories are empty)
- 205 are rows where all words match, or where one side’s word count equals the total number of matches and the other side’s count differs from it by one
- 122 are rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
- 18 are part of speech mismatches
- 267 are rows where nothing matches
I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches. The following patterns were attempted:
- inhabitant of the -> inhabitant
- inhabitant of -> inhabitant
- relating to -> pertaining to
- spec. -> specific
- spec -> specific
- specific -> specifically
- assoc. -> associated
- esp. -> especially
- north -> n.
- south -> s.
- january -> jan.
- march -> mar.
- august -> aug.
- september -> sept.
- october -> oct.
- november -> nov.
- december -> dec.
- Levenshtein difference of 1
- Adding ‘ing’ onto the end
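Roughly, the script generates rewritten variants of each heading and also accepts headings within a Levenshtein distance of 1. A hedged Python sketch with a representative subset of the patterns above (the real script is PHP and handles the full list, including the month abbreviations):

```python
# A representative subset of the heading substitutions listed above.
PATTERNS = [
    ("inhabitant of the", "inhabitant"),
    ("inhabitant of", "inhabitant"),
    ("relating to", "pertaining to"),
    ("spec.", "specific"),
    ("assoc.", "associated"),
    ("esp.", "especially"),
    ("north", "n."),
    ("january", "jan."),
]

def heading_variants(heading):
    """All rewritten forms of a heading, plus the 'ing' suffix variant."""
    variants = {heading + "ing"}
    for old, new in PATTERNS:
        if old in heading:
            variants.add(heading.replace(old, new))
    return variants

def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion or
    substitution, i.e. a Levenshtein distance of 0 or 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Walk to the first mismatch, then compare the tails for each of
    # the three possible single edits.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[i + 1:] == b[i + 1:] or a[i:] == b[i + 1:] or a[i + 1:] == b[i:]
```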
The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum. Where there is a matching category number, the lexeme count / last lexeme / total matched lexemes checks described above are applied and rows are colour coded accordingly.
On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week. It feels like real progress is being made.
I returned to work this week after two and a bit weeks off on holiday. Monday this week was a bank holiday so it was a four-day week for me, and in actual fact I worked these four days over the past two weeks whilst I was away. Despite this I managed to get quite a lot done, although I possibly ended up working more than my required four days. Over the course of my holiday I dealt with a number of smaller issues that staff had contacted me about. This included some Apple developer account duties, mainly setting up user accounts to allow someone in Computing Science to create apps, sorting out adding support for ‘markdown’ syntax to the SPADE website, responding to some queries from Chris McGlashan about some old websites, talking to Luca Guariento about possible job opportunities and the DH2018 conference that Luca attended, and responding to a query about the DSL website for Rhona Alcorn. I also had a chat with Marc about office arrangements. I’m apparently being booted out of 13 University Gardens sometime later on in the summer and will be given an office somewhere else around University Gardens. I’ll just need to see how that works out.
Other than these matters I worked on three projects during these four days: DSL, Historical Thesaurus and SCOSYA. For DSL Thomas Widmann had been in touch to say that the DSL theme I’d created for the WordPress powered version of the DSL website that is currently in development was not allowing nested menus that were more than two levels deep. The issue is that the DSL WordPress theme I created was purely a WordPress version of the existing DSL website interface, and this only has two levels of menu, top and sub. It doesn’t support any additional levels of nesting. Therefore, when such nesting is specified in WordPress the theme doesn’t know how to handle it.
I’d noticed that for words in subcategories only the maincat heading was being displayed in the visualisation, so I updated this to include both maincat heading and subcat heading. However, this did mean that more text bleeds into the timeline area, which Marc didn’t like very much. I wondered whether the full category information could instead appear in a tooltip, and Fraser suggested truncating the heading after a certain number of characters so as to avoid the text bleeding into the visualisation. I implemented this, adding in ellipsis where the labels were too long. However, I didn’t add a tooltip to the labels as these already have a click action on them (opening the category) and I decided that adding a tooltip would confuse matters, especially as we’re using ‘on click’ tooltips elsewhere in the visualisation and we’d have to use ‘hover-over’ here. Also ‘hover-over’ doesn’t work on touchscreens. However, I don’t think we actually need tooltips on the labels, as the full title is already available in the tooltip for the visualisation bar / spot. Click on a bar / spot and you get to see the full title, which I think works ok, as you can see from the following screenshot:
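The truncation itself is trivial; a sketch, with the length limit an arbitrary assumption (the real code is JavaScript in the visualisation):

```python
def truncate_label(heading, max_len=30):
    """Truncate a category heading for display beside the timeline,
    adding an ellipsis when it is too long, so that labels never bleed
    into the visualisation area. max_len here is an assumption."""
    if len(heading) <= max_len:
        return heading
    # Leave room for the ellipsis character itself.
    return heading[:max_len - 1].rstrip() + "…"
```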
Also for the HT, Fraser wanted me to run a query that would identify and export all of the OED lexemes that are marked as ‘revised’ but don’t currently match an HT lexeme. The last time I looked at the HT / OED data was in October last year so it took a little time to get my head round the data again. Once I had familiarised myself with it I managed to run a query that extracted the data for 18,939 unmatched but revised lexemes.
For SCOSYA I continued to work on the group statistics feature for the atlas. By the end of the week I’d managed to implement pretty much everything that I had included in my specification document for the feature. This includes showing and hiding groups on the map, editing a group (allowing the group name and the locations to be edited) and deleting a group. I also updated the attribute search so that when a new search is executed the visible group remains selected.
The only issue seems to be related to scalability. I had only been working with groups of 3-10 locations whereas Gary had created a group that featured all locations. He reported problems accessing the stats with larger groups, so I’ll need to look into this next week, and also add in features to download the data.
With the strike action over (for now, at least) I returned to a full week of work, and managed to tackle a few items that had been pending for a while. I’d been asked to write a Technical Plan for an AHRC application for Faye Hammill in English Literature, but since then the changeover from four-page, highly structured Technical Plans to two-page more free-flowing Data Management Plans has taken place. This was a good opportunity to write an AHRC Data Management Plan, and after following the advice on the AHRC website (http://www.ahrc.ac.uk/documents/guides/data-management-plan/) and consulting the additional documentation on the DCC’s DMPonline tool (https://dmponline.dcc.ac.uk/) I managed to write a plan that covered all of the points. There are still some areas where I need further input from Faye, but we do at least have a first draft now.
I also created a project website for Anna McFarlane’s British Academy funded project. The website isn’t live yet, so I can’t include the URL here, but Anna is happy with how it looks, which is good. After sorting that out I then returned to the REELS project. I created the endpoints in the API that would allow the various browse facilities we had agreed upon to function, and then built these features in the front-end. It’s now possible to (for example) list all sources and see which has the most place-names associated with it, or bring up a list of all of the years in which historical forms were first attested.
I spent quite a bit of time this week working on the extraction of words and their thematic headings from EEBO for the Linguistic DNA project. Before the strike I’d managed to write a script that went through a single file and counted up all of the occurrences of words, parts of speech and associated thematic headings, but I was a little confused that there appeared to be thematic heading data in column 6 and also column 10 of the data files. Fraser looked into this and figured out that the most likely thematic heading appeared in column 10, while other possible ones appeared in column 6. This was a rather curious way to structure the data, but once I knew about it I could set my script to focus on column 10, as we’re only interested in the most likely thematic heading.
I updated my script to insert data into a database rather than just hold things temporarily in an array, and I also wrapped the script in another function that then applied the processing to every file in a directory rather than just a single file. With this in place I set the script running on the entire EEBO directory. I was unsure whether running this on my desktop PC would be fast enough, but thankfully the entire dataset was processed in just a few hours.
My script finished processing all 14590 files that I had copied from the J drive to my local PC, resulting in a whopping 70,882,064 rows entered into my database. Everything seemed to be going very well, but Fraser wasn’t sure I had all of the files, and he was correct. When I checked the J drive there were 25,368 items, so when I had copied the files across the process must have silently failed at some point. And even more annoyingly it didn’t fail in an orderly manner. E.g. the earliest file I have on my PC is A00018 while there are several earlier ones on the J drive.
I copied all of the files over again and decided that rather than dropping the database and starting from scratch I’d update my script to check whether a file had already been processed, meaning that only the missing 10,000 or so would be dealt with. However, in order to do this the script would need to query a 70 million row database on the ‘filename’ column, which didn’t have an index. I began the process of creating an index, but indexing 70 million rows took a long time – several hours, in fact. I almost gave up and inserted all the data again from scratch, but I knew I would need this index in order to query the data anyway, so I decided to persevere. Thankfully the index finally finished building and I could then run my script to insert the missing 10,000 files, a process that took a bit longer as the script now had to query the database and update the index as well as insert the data. But finally all 25,368 files were processed, resulting in 103,926,008 rows in my database.
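The resume-after-failure check can be sketched as follows. SQLite stands in here for the actual MySQL database, and the set of processed filenames is fetched once up front rather than with one indexed lookup per file, which keeps the check to a single scan:

```python
import sqlite3  # a stand-in for the MySQL database used in practice

def unprocessed(filenames, conn):
    """Return the files not yet recorded in the database, so a re-run
    only handles the missing ones. Assumes a 'frequencies' table with
    an indexed 'filename' column, as described above."""
    cur = conn.execute("SELECT DISTINCT filename FROM frequencies")
    done = {row[0] for row in cur}
    return [name for name in sorted(filenames) if name not in done]
```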
The script and the data are currently located on my desktop PC, but if Fraser and Marc want to query it I’ll need to get this migrated to a web server of some sort, so I contacted Chris about this. Chris said he’d sort a temporary solution out for me, which is great. I then set to work writing another script that would extract summary information for the thematic headings and insert this into another table. After running the script this table now contains a total count of each word / part of speech / thematic heading across the entire EEBO collection. Where a lemma appears with multiple parts of speech these are treated as separate entities and are not added together. For example, ‘AA Creation NN1’ has a total count of 4609 while ‘AA Creation NN2’ has a total count of 19, and these are separate rows in the table.
Whilst working with the data I noticed that a significant amount of it is unusable. Of the almost 104 million rows of data, over 20 million have been given the heading ’04:10’ and a lot of these are words that probably could have been cleaned up before the data was fed into the tagger. A lot of these are mis-classified words that have an asterisk or a dash at the start. If the asterisk / dash had been removed then the word could have been successfully tagged. E.g. there are 88 occurrences of ‘*and’ that have been given the heading ’04:10’ and part of speech ‘FO’. Basically about a fifth of the dataset has an unusable thematic heading, and much of this data could have been useful if it had been pre-processed a little more thoroughly.
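The clean-up being described would have been a one-liner before tagging; sketched here along with a helper that measures how much of the data carries the junk heading (the three-column row layout is assumed from the description):

```python
def pre_clean(token):
    """Strip leading asterisks and dashes; tokens such as '*and' would
    then have stood a chance of being tagged properly."""
    return token.lstrip("*-")

def junk_share(rows, junk_heading="04:10"):
    """Fraction of (lemma, pos, heading) rows carrying the unusable
    heading, e.g. roughly a fifth of the EEBO dataset as described."""
    junk = sum(1 for _, _, heading in rows if heading == junk_heading)
    return junk / len(rows) if rows else 0.0
```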
Anyway, after tallying up the frequencies across all texts I then wrote a script to query this table and extract a ‘top 10’ list of lemma / pos combinations for each of the 3,972 headings that are used. The output has one row per heading and a column for each of the top 10 (or fewer if there are fewer than 10). Each cell currently has the lemma, then the pos in brackets and, after a bar, the total frequency across all 25,000 texts, as follows: christ (NP1) | 1117625. I’ve sent this to Fraser and once he gets back to me I’ll proceed further.
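The ‘top 10’ extraction amounts to a group-by and sort. A Python sketch of the logic, with an in-memory dict standing in for the summary table:

```python
from collections import defaultdict

def top_ten_per_heading(counts):
    """counts maps (heading, lemma, pos) -> total frequency. Returns one
    row per heading: up to ten 'lemma (POS) | freq' cells ordered by
    descending frequency, matching the output format described above."""
    by_heading = defaultdict(list)
    for (heading, lemma, pos), freq in counts.items():
        by_heading[heading].append((freq, lemma, pos))
    rows = {}
    for heading, items in by_heading.items():
        items.sort(reverse=True)  # highest frequency first
        rows[heading] = ["%s (%s) | %d" % (lemma, pos, freq)
                         for freq, lemma, pos in items[:10]]
    return rows
```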
In addition to the above big tasks, I also dealt with a number of smaller issues. Thomas Widmann of SLD had asked me to get some DSL data from the API for him, so I sent that on to him. I updated the ‘favicon’ for the SPADE website, I fixed a couple of issues for the Medical Humanities Network website, and I dealt with a couple of issues with legacy websites: For SWAP I deleted the input forms as these were sending spam to Carole. I also fixed an encoding issue with the Emblems websites that had crept in when the sites had been moved to a new server.
I also heard this week that IT Services are going to move all project websites to HTTPS from HTTP. This is really good news as Google has started to rank plain HTTP sites lower than HTTPS sites, plus Firefox and Chrome give users warnings about HTTP websites. Chris wanted to try migrating one of my sites to HTTPS and we did this for the Scots Corpus. There were some initial problems with the certificate not working for the ‘www’ subdomain but Chris quickly fixed this and everything appeared to be working fine. Unfortunately, although everything was fine within the University network, the University’s firewall was blocking HTTPS requests from external users, meaning no-one outside of the University network could access the site. Thankfully someone contacted Wendy about this and Chris managed to get the firewall updated.
I also did a couple of tasks for the SCOSYA project, and spoke to Gary about the development of the front-end, which I think is going to need to start soon. Gary is going to try and set up a meeting with Jennifer about this next week. On Friday afternoon I attended a workshop about digital editions that Sheila Dickson in German had organised. There were talks about the Cullen project, the Curious Travellers project, and Sheila’s Magazin zur Erfahrungsseelenkunde project. It was really interesting to hear about these projects and their approaches to managing transcriptions.
This was the third week of the strike action and I therefore only worked on Friday. I started the day making a couple of further tweaks to the ‘Storymap’ for the RNSN project. I’d inadvertently uploaded the wrong version of the data just before I left work last week, which meant the embedded audio players weren’t displaying, so I fixed that. I also added a new element language to the REELS database and added the new logo to the SPADE project website (see http://spade.glasgow.ac.uk/).
With these small tasks out of the way I spent the rest of the day on Historical Thesaurus and Linguistic DNA duties. For the HT I had previously created a ‘fixed’ header that appears at the top of the page if you start scrolling down, so you can always see what it is you’re looking at, and also quickly jump to other parts of the hierarchy. You can also click on a subcategory to select it, which adds the subcategory ID to the URL, allowing you to quickly bookmark or cite a specific subcategory. I made this live today, and you can test it out here: http://historicalthesaurus.arts.gla.ac.uk/category/#id=157035. I also fixed a layout bug that was making the quick search box appear in less than ideal places on certain screen widths and updated the display of the category and tree on narrow screens: now the tree is displayed beneath the category information and a ‘jump to hierarchy’ button appears. This in combination with the ‘top’ button makes navigation much easier on narrow screens.
I then started looking at the tagged EEBO data. This is a massive dataset (about 50Gb of text files) that contains each word on a subset of EEBO that has been semantically tagged. I need to extract frequency data from this dataset – i.e. how many times each tag appears both in each text and overall. I have initially started to tackle this using PHP and MySQL as these are the tools I know best. I’ll see how feasible it is to use such an approach and if it’s going to take too long to process the whole dataset I’ll investigate using parallel computing and shell scripts, as I did for the Hansard data. I managed to get a test script working that managed to go through one of the files in about a second, which is encouraging. I did encounter a bit of a problem processing the lines, though. Each line is tab delimited and rather annoyingly, PHP’s fgetcsv function doesn’t treat ‘empty’ tabs as separate columns. This was giving me really weird results as if a row had any empty tabs the data I was expecting to appear in columns wasn’t there. Instead I had to use the ‘explode’ function on each line, splitting it up by the tab character (\t), and this thankfully worked. I still need confirmation from Fraser that I’m extracting the right columns, as strangely there appear to be thematic heading codes in multiple columns. Once I have confirmation I’ll be able to set the script running on the whole dataset (once I’ve incorporated the queries for inserting the frequency data into the database I’ve created).
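The equivalent of that fix in Python terms: splitting each line on the tab character directly (like PHP's explode) preserves empty columns, so field positions stay stable. The example line and column layout below are illustrative only, with the ‘most likely’ thematic heading sitting in column 10 as later confirmed:

```python
# A tab-delimited EEBO-style line with several empty columns. The layout
# here is an illustration: part of speech in column 3, a possible
# thematic heading in column 6, the most likely heading in column 10.
line = "vvord\tword\tNN1\t\t\t03.01\t\t\t\tAA\n"

# Splitting on the tab character keeps empty fields as empty strings,
# which is the behaviour fgetcsv was not giving.
cols = line.rstrip("\n").split("\t")

assert cols[2] == "NN1"    # part of speech
assert cols[5] == "03.01"  # a possible thematic heading (column 6)
assert cols[9] == "AA"     # the most likely thematic heading (column 10)
```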
This was my first full, five-day week back after the Christmas holidays, and I spent the majority of it continuing to work on the new timeline visualisation for the Historical Thesaurus, plus some other interface updates that were proposed during the meeting Marc, Fraser and I had last week. I managed to make quite a bit of progress on the visualisation and also the way in which dates are stored in the underlying database. The HT has many different date fields, but the main ones are ‘firstd’, ‘midd’, and ‘lastd’. Each of these has a second ‘b’ field where a potential second, later date can be added, which gives (for example) ‘1400/50’ as a date. These ‘b’ fields generally (but not always) contain dates as two-digit or even one-digit numbers, so in the previous example the ‘b’ field just holds ‘50’ and not ‘1450’. If a date was ‘1400/6’ the ‘b’ field might just have a ‘6’ in it, while if a date was 1395/1410 all four digits would be stored in the ‘b’ field. The current setup is therefore inconsistent and difficult for scripts to work with, so we decided to update the ‘b’ fields to always use four digits. I wrote a script to do this, and successfully updated all of the ‘b’ dates. I also then updated the timeline visualisation to always use the ‘b’ date for the end date of a timeline, if it existed. I then wrote two further scripts, one to check that all ‘b’ dates are actually after the main dates (it turns out there are a handful that aren’t, or are identical to the main date), and the other to list all of the words that have a ‘b’ date that is less than five years away from the main date, as in such cases it is likely that the date should actually just be a ‘circa’ instead.
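The ‘b’ field expansion is a neat little string operation: replace the trailing digits of the main date with the abbreviated digits. A sketch (the real script is PHP; the function name is mine):

```python
def expand_b_date(main, b):
    """Expand an abbreviated 'b' date to four digits using the main
    date, so '1400/50' -> 1450 and '1400/6' -> 1406; four-digit values
    such as '1395/1410' pass through unchanged."""
    b = str(b)
    if len(b) >= 4:
        return int(b)
    main = str(main)
    # Replace the trailing digits of the main date with the 'b' digits.
    return int(main[:4 - len(b)] + b)
```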
I also wrote some further checking scripts for dates, including one to pull out all occasions where the fields connecting dates together (which can either be a dash to indicate a range or a plus to indicate separate occurrences) have two dashes in a row, or where there is a final dash where the word is set as ‘current’. These are probably errors as it means two ranges are next to each other, which shouldn’t happen. E.g. ‘1200-1400-1600’, or ‘1600-1800-‘ don’t make much sense. Another date checking script I wrote was to find all words that have a ‘plus’ connecting dates together (e.g. ‘1400 + 1800’) where the amount of time between the two dates is less than 150 years. There was a rule when compiling the HT that if there were fewer than 150 years between dates these shouldn’t be treated as a ‘plus’ gap. There were quite a few words that had a gap of less than 150 years and I sent the resulting output of my script to Fraser and Marc for them to check through.
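The three checks can be sketched as a simple token walk over a flattened date string. The single-string input format here is an assumption made purely for illustration; the real data holds the dates and their connectors in separate fields:

```python
import re

def check_date_string(s, current=False):
    """Flag the probable errors described above: two range dashes in a
    row (two ranges butted together), a trailing dash on a word marked
    'current', and a '+' gap of under 150 years."""
    tokens = re.findall(r"\d+|[-+]", s.replace(" ", ""))
    issues = []
    # Two dashes with only one date between them means two adjacent
    # ranges, e.g. '1200-1400-1600'.
    for i in range(len(tokens) - 3):
        if tokens[i + 1] == "-" and tokens[i + 3] == "-":
            issues.append("consecutive ranges")
            break
    if current and tokens and tokens[-1] == "-":
        issues.append("trailing dash on current word")
    # The HT compilation rule: a '+' gap must span at least 150 years.
    for i, t in enumerate(tokens):
        if t == "+" and 0 < i < len(tokens) - 1 \
                and int(tokens[i + 1]) - int(tokens[i - 1]) < 150:
            issues.append("plus gap under 150 years")
    return issues
```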
In the test version, when the top of the category heading section scrolls off the page the fixed header fades in, and when it scrolls into view again the fixed header fades out. Currently the header takes up the full width of the screen and has the same background colour as the main HT banner. I’ve also added in the HT logo, which you can click to return to the homepage. It’s a bit fuzzy looking in Chrome (but not other browsers), though. The heading displays the noun hierarchy for the current category, which reflects the tree structure that is currently open on the page. You can click on any level in the hierarchy to jump to it. The current category’s Catnum, PoS and Heading are also displayed. After some helpful feedback from Fraser I also added in a means of selecting a subcategory and for the subcategory hierarchy to be added to the fixed header too, which works as follows:
- Clicking on a subcategory gives its box a yellow border, which I think is pretty useful as you can then scroll about the page and quickly find the thing you’re interested in again.
- Clicking on the box also replaces the ID in the URL with the subcat URL, so you can now much more easily bookmark a subcat, or share the URL. Previously you had to open the ‘cite’ box for the subcat to get the URL for a specific subcat.
- Clicking on a highlighted subcat removes the highlighting, in case you don’t like the yellow. Note that this does not currently reset the ID in the URL to the maincat URL, but I think I will update this.
- Highlighting a category adds the subcat hierarchy to the fixed header so you can see at a glance the pathway from the very top of the HT to the subcat you’re looking at.
- When you follow a URL to a subcat ID the subcat is automatically highlighted and the subcat hierarchy is automatically added to the fixed header, in addition to the page scrolling to the subcat (as it previously did).
I think this will all be very helpful to users, and although it is not currently live, here is a screenshot showing how it works:
Returning to the timeline, I have changed the x axis so that it now starts at 1100 rather than 1000. The 1100 label now displays as ‘OE*’ and if you click on it you now get the same message that is displayed on the MM timeline, namely “The English spoken by the Anglo-Saxons before c.1150, with the earliest written sources c.700”. OE words on the timeline are no longer displayed as dots but instead have rectangles starting at the left edge of the visualisation and ending at 1150. Once I figure out how to add in curved and pointy ends these will be given a pointy arrow on the left and a curve on the right. I also added in faint horizontal lines between the individual timelines, to help keep your eye in a line. Here’s an example of how things currently look:
I also started to investigate how to add in these ‘curved’ and ‘pointy’ ends to the rectangles in the timeline. This is going to be rather tricky to implement as it means reverse engineering and then extending the timeline library I’m using, and also trying to figure out just how to give rectangles curved edges in D3, or how to append an arrow to a rectangle. I’ll also need to find a way to pass data about ‘circa’ and ‘ante’ dates to the timeline library. Thankfully I made a bit of progress on all of this. It turns out I can add any additional fields that I want to the timeline’s JSON structure, so adding in ‘circa’ fields etc. will not be a problem. Also, the timeline library’s code is pretty well structured and easy to follow. I’ve managed to update it so that it checks for my ‘circa’ fields (but doesn’t actually do anything about them yet). Also, there are ways of giving rectangles rounded corners in D3 (e.g. https://bl.ocks.org/mbostock/3468167) so this might work ok (although it’s not quite so simple as I will need to extend the rectangle beyond its allotted space in the timeline before the curves start). Arrows still might prove tricky, though. I’ll continue with this next week.
Other than HT related work I did a few other bits and bobs. I met with Graeme to discuss a UTF8 issue he was experiencing with a database of his. I met with Megan Coyer to discuss an upcoming project that will involve OCR, I had a chat with Luca about a Technical Plan he is putting together, I responded to a request from Stuart Gillespie about a URL he needs to incorporate into a printed volume, I helped Craig Lamont out with an issue relating to Google Analytics for the ‘Edinburgh’s Enlightenment’ site we put together a while back, I tracked down some missing sound files for the SPADE project and read through and gave feedback on a document Rachel had written about setting up Polyglot, and I had a conversation with Eleanor Lawson and Jane Stuart-Smith about future updates to the Seeing Speech website. All in all it’s been a pretty busy week.
I mostly split my time this week between three projects: The Dictionary of the Scots Language, SPADE and SCOSYA. For DSL I managed to complete the initial migration of the DSL website to WordPress and all pages and functionality have now been transferred over. Here’s what’s in place so far:
- I have created a WordPress theme replicating the DSL website interface
- I have updated the ‘compact’ menu that gets displayed on narrow screens so that it looks more attractive
- I have replaced the existing PNG icon files with scalable Font Awesome icons, which look a lot better on high resolution screens like iPads. In addition I’ve added a magnifying glass icon to search buttons.
- I have created WordPress widgets for the boxes on the front page, which can be edited via the WordPress Admin interface. ‘DSL quick search’ displays the quick search box. ‘DSL welcome text’ contains the HTML of the welcome box, which can be edited. ‘DSL word of the day’ is what you’d expect, and ‘DSL announcement text’ and ‘social media links’ contain the HTML of these boxes, which can also be edited. I decided against tying the ‘announcement’ section into the ‘news’ posts as I figured it would be better to manually control the contents here rather than always have it updating to reflect the most recent news item. Any of the widgets can be dragged and dropped into the front page widget areas and their order can also be changed. My DSL widgets can also be added to the standard sidebar widget area, and I added the ‘word of the day’ feature to this area for now.
- The ‘core’ dictionary pages are not part of WordPress but ‘hook’ into WordPress to grab the current theme and to format the header and footer of the page. So, for example, the ‘Advanced search’ page and the ‘entry’ page are not found anywhere in the WordPress admin interface, but if you add a new page to the site menu then these non-Wordpress pages will automatically reflect these changes.
- I created a ‘News’ page as the ‘blog’ page and content can be added to this or edited by using the ‘Posts’ menu in the admin interface. Currently all news items get listed on one page and you can’t click through to individual news items, but I might change this.
- All other pages are WordPress pages, and can be edited (or new ones created) via the ‘Pages’ menu in the admin interface. I have migrated all of the DSL pages across, as some of these are rather structurally complicated due to there being lots of custom HTML involved.
- I have created one additional SLD page: ‘About SLD’. I copied the contents from the page on the current SLD site – just highlighted the text in my browser and pasted it in and the formatting and links carried over. I did this mainly to show the SLD people how such pages can be migrated across.
- I created two page templates that are available when you create or edit a page. These can be selected in the ‘page attributes’ section on the right of the page. The default template is used for the main DSL pages and it is full width – it doesn’t feature the WordPress sidebar with widgets in it. The other template is called ‘With Sidebar’ and it allows you to create pages that display the sidebar. The sidebar will feature any widgets added to it via the ‘widgets’ menu. It took a bit of time to figure out how to create multiple page templates but once I figured it out it’s actually really simple: You just make a new version of the index.php page and call it something else (e.g. index-2col.php) and then add text like the following at the very top of the file: <?php /* Template Name: With Sidebar */ ?>. Then you can make whatever changes you want to the page design (e.g. changing the number of columns) and the user can select this template via the ‘page attributes’ section. I needed to update which template file some of my structural elements were found in, so that I could include or exclude my side column, but with these changes in place the different layout options all worked perfectly.
I emailed Thomas and Ann at SLD about the new version of the site and they are going to play around with it for a while and get back to me. I think I’ve finished work on this rather more swiftly than they were expecting so it may be a while before any further work is done on this.
For the SPADE project I met up with Rachel again and we spent a morning continuing to work on the Polyglot server. Following some helpful advice from the team in Montreal we managed to make some good progress this week with running some sample texts through the Polyglot system. Rachel had picked out six audio files and accompanying text grid files from the ‘sounds of the city’ project and we prepared these to be run through the system, and updated the system config files so it knew where to find the files, and where to place the outputs. With it all set up we ran the script for extracting sibilant data, and it began processing the files. This took some time but progress was looking good, until we received an error message stating that PRAAT had encountered errors when processing consonants. As the error message said ‘please forward the details to the developers’ we did just that. We received some further replies but as of yet we’ve not managed to get the script to work and we’ll need to return to this next week. Progress is being made, at least.
For SCOSYA I finally managed to return to updating the Atlas interface. The Atlas uses the Leaflet.js mapping library, and when I first put the Atlas together the current stable version was 0.7. It’s now up to version 1.2 and I’ve upgraded the library to this version. As there are some fairly major differences between versions it’s taken me some time to get things working, but it’s been worth doing. With the previous version the tooltip hover-overs had to be implemented by a further library, but these are now included in the default Leaflet library, and look a bit nicer. Another advantage is the sidebar will now scroll using the mouse wheel when it’s taller than the page. I’m also now using a more up to date version of the Leaflet data visualisation toolkit library, which produces the polygon markers. These things may seem rather minor but making the update will ensure the atlas continues to function for longer.
I’ve also updated the attribute search. Previously when a code had multiple attributes the code ended up appearing multiple times in the drop-down list. This gave the impression that selecting one or other of the options would give different results, but that wasn’t the case – e.g. A8 has attributes NPPDoes and NSVA so A8 appeared twice in the list, but whether you select one or the other the search is still for A8. I’ve now amalgamated any duplicates. This actually proved to be more complicated to sort out than I’d expected. Basically I had to create a new search type in the API to search based on code rather than attribute ID. I then needed to update the Atlas search to work with this new option.
I’ve also updated the ‘or’ search to get rid of the star markers, as requested, and to limit the number of differently sided polygons that are used. The ‘or’ search still brings back random colours and shapes each time, though I might change this. I made all of these changes live in the CMS and on the ‘atlas guest’ URL, and I’m hoping to start work on the ‘advanced attribute search’ next week.
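One possible way to make those markers stable rather than random would be to derive each attribute’s style from a hash of its code, so the same attribute always gets the same colour and shape across searches. This is purely a sketch of the idea, with made-up colour and side-count lists, not anything currently in the Atlas:

```python
import hashlib

# Hypothetical palette and polygon side counts for illustration.
COLOURS = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"]
SIDES = [3, 4, 5, 6]


def marker_style(code):
    """Map an attribute code to a fixed (colour, sides) pair.

    Hashing the code means a given attribute is drawn identically
    on every search, instead of getting a fresh random marker.
    """
    h = int(hashlib.md5(code.encode("utf-8")).hexdigest(), 16)
    return COLOURS[h % len(COLOURS)], SIDES[h % len(SIDES)]
```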
I had a bit of an unsettled week, as one of our two cats was hit by a car and killed on Monday night. It’s all very sad. Anyway, in terms of work, I was on an interview panel on Tuesday, so I spent some of Monday afternoon preparing for this and a fair amount of Tuesday morning participating in the interviews. On Monday I also ran a few more queries for Fraser relating to the HT and OED data matching. I took care of a few more WordPress administrative tasks, and spent some of this week on AHRC review duties as well.
In addition, I spent about a day working on the SPADE project. The PC belonging to Rachel, the project RA, had finally been set up, and the Linux subsystem for Windows 10 had been activated by Arts IT Support. Our task was to install and configure the Polyglot server and database software that will be used by the project, which is being developed by our Montreal partners. Thankfully there was quite a lot of documentation about the process, and we could follow the many steps required to install the software and its dependencies. I was a little sceptical that we would manage all of this without needing further administrator access on Rachel’s PC (if we need administrator access we have to ask someone in Arts IT Support to come over and enter their password), but rather wonderfully, once the Linux subsystem has been set up it basically works as a virtual machine, with its own admin user that Rachel is in control of. I have to say it was a little strange working from a Linux command prompt in Windows, knowing that this was all running locally rather than connecting to a remote server. The Polyglot server sets up a Django web server through which various endpoints can be accessed. I wondered whether it would be possible to access this ‘local’ server from a web browser in the main Windows 10 instance, and the answer is yes, it most certainly is. So I think this setup is going to work rather well: Rachel will just need to open the Linux subsystem command prompt and start the server running, after which she will be able to access everything through her browser in Windows.
We did, however, run into a few difficulties with the installation process, specifically relating to the setting up of the Polyglot database to which the server connects. The documentation got a little shaky at this point and it was unclear whether by installing the server we had also automatically installed the database, or whether we still needed to manually get this set up. We contacted Montreal and were quickly told that we didn’t need to install the database separately, which was good. We’re now at the stage where we can try to start running some tests on some sample data, although once more we’re not entirely sure how to proceed. It’s a bit tricky when we don’t actually know exactly what the software does and how it does it. It would have been useful to have had a demo of the system before we tried to set up our own. We’ll press on with the test scripts next week. Also for SPADE this week I extracted some data from the SCOTS corpus that had been missed out of the dataset that Montreal had previously been given.
I had a meeting with Graeme to discuss some development issues, and I spent most of the rest of the week continuing with the reworking of the DSL website. I updated the ‘support DSL’ page of the live site, as I’d realised it had a broken link and some out-of-date information on it. I then continued migrating the DSL website to WordPress. The big task this week was to handle the actual dictionary pages – i.e. the search results page, the advanced search page, the bibliography page and the entry page itself – all of the pages that connect to the DSL’s database and display dictionary data. For the most part this was a fairly straightforward process: I needed to strip out any references to my own layout scripts, incorporate a link to the WordPress system and then add in the WordPress calls that display the WordPress header, footer and sidebar content. This means any changes to the installed WordPress theme are reflected on the dictionary pages, even though the pages themselves are not part of the WordPress instance. There were of course a few more things that needed done. I’m replacing all of the PNG icons with Font Awesome icons, so I needed to update every occurrence of these. I also noticed that the bibliography search results page wasn’t working properly if the user entered text and then pressed ‘search’ rather than selecting an option from the autocomplete facility: the results page loaded with the ‘next’ and ‘previous’ links all messed up and not actually working. I spent some time fixing this in my new WordPress version of the site, but I’m not going to fix the live site, as it’s a minor thing that I’m guessing no-one has actually noticed. With these pages all working I spent some time testing things out, and it would appear that the new DSL dictionary pages all work in the same way as the old pages, which is great.
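The kind of next/previous calculation involved in that pagination fix can be sketched like this. It is a generic illustration rather than the DSL’s actual code, and the page size of 20 is an assumption; the point is that clamping the requested page keeps the links sane even when the query arrives without the values the autocomplete would normally supply.

```python
def page_links(total_results, page, per_page=20):
    """Return (prev, next) page numbers, or None at either end.

    Clamping the requested page to the valid range prevents broken
    'next'/'previous' links when the incoming parameters are odd,
    e.g. a free-text search submitted without autocomplete data.
    """
    last = max(1, -(-total_results // per_page))  # ceiling division
    page = min(max(1, page), last)
    prev_page = page - 1 if page > 1 else None
    next_page = page + 1 if page < last else None
    return prev_page, next_page
```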
With the dictionary pages out of the way I continued migrating some of the ancillary pages to the new site. I’ve now completed all pages other than the ‘history of scots’ pages, and I’m about half-way through those, which comprise a massive amount of content. It’s taking some time to migrate, not just because of the length, but also because the section incorporates lots of images that I need to upload to WordPress, and even a couple of OpenLayers-powered maps that needed migrating too. Hopefully I’ll get this section of the site fully migrated early next week. After that I’ll need to think about incorporating a sidebar, and also tweak a few more aspects of the site, such as the HTML titles displayed on the entry pages. I’ll then need some further input from SLD about how we’re going to include the pages from the main SLD website.