I was off on Monday this week for the September Weekend holiday. My four working days were split across many different projects, but the main ones were the Historical Thesaurus and the Anglo-Norman Dictionary.
For the HT I continued with the preparations for the second edition. I updated the front-end so that multiple changelog items are now checked for and displayed (these are the little tooltips that say whether a lexeme’s dates have been updated in the second edition). Previously only one changelog was being displayed but this approach wasn’t sufficient as a lexeme may have a changed start and end date. I also fixed a bug in the assigning of the ‘end date verified as after 1945’ code, which was being applied to some lexemes with much earlier end dates. My script set the type to 3 in all cases where the last HT date was 9999. What it needed to do was to only set it to type 3 if the last HT date was 9999 and the last OED date was after 1945. I wrote a little script to fix this, which affected about 7,400 lexemes.
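The corrected condition is small enough to show in full. This is only a sketch in Python, not the script I actually ran; the function name and the constant names are made up for illustration, while the values 9999, 1945 and type 3 come from the description above.

```python
CURRENT_DATE = 9999     # value stored when a lexeme is still in use
VERIFIED_AFTER_1945 = 3  # changelog type for 'end date verified as after 1945'

def should_set_type_3(last_ht_date: int, last_oed_date: int) -> bool:
    """Return True only when both conditions hold.

    The buggy version checked last_ht_date alone, so lexemes whose OED
    end date was much earlier were wrongly given type 3.
    """
    return last_ht_date == CURRENT_DATE and last_oed_date > 1945

# A lexeme marked as current in the HT but last attested in 1800 in the OED
# is no longer flagged.
print(should_set_type_3(9999, 1990))  # both conditions met
print(should_set_type_3(9999, 1800))  # OED end date too early
```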
I also wrote a script to check off a bunch of HT and OED categories that had been manually matched by an RA. I needed to make a few tweaks to the script after testing it out, but after running it on the data we had a further 846 categories matched up, which is great. Fraser had previously worked on a document listing a set of criteria for working out whether an OED lexeme was ‘new’ or not (i.e. unlinked to an HT lexeme). This was a pretty complicated document with many different stages, with the output of the various stages needing to be saved into seven different spreadsheets, and it took quite a long time to write and test a script that would handle all of these stages. However, I managed to complete work on it and after a while it finished executing and resulted in the 7 CSV files, one for each code mentioned in the document. I was very glad that I had my new PC as I’m not sure my old one could have coped with it – for the Levenshtein tests, for example, every word in the HT had to be stored in memory throughout the script’s execution. On Friday I had a meeting with Marc and Fraser where we discussed the progress we’d been making and further tweaks to the script were proposed that I’ll need to implement next week.
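For anyone unfamiliar with the Levenshtein test, here is a minimal Python sketch of the edit distance calculation itself. It is the textbook dynamic-programming version, not the code from my script, and it says nothing about the thresholds or the other criteria in Fraser's document.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]

print(levenshtein("colour", "color"))
```

Comparing one OED word against every HT word in this way is why the whole HT word list has to sit in memory while the script runs.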
For the Anglo-Norman Dictionary I continued to work on the ‘Entry’ page, implementing a mixture of major features and minor tweaks. I updated the way the editor’s initials were being displayed as previously these were the initials of the editor who made the most recent update in the changelog where what was needed were the initials of the person who created the record, contained in the ‘lead’ attribute of the main entry. I also attempted to fix an issue with references in the entry that were set to ‘YBB’. Unlike other references, these were not in the data I had as they were handled differently. I thought I’d managed to fix this, but it looks like ‘YBB’ is used to refer to many different sources so can’t be trusted to be a unique identifier. This is going to need further work.
Minor tweaks included changing the font colour of labels, making the ‘See Also’ header bigger and clearer, removing the final semi-colon from lists of items, adding in line breaks between parts of speech in the summary and other such things. I then spent quite a while integrating the commentaries. These were another thing that weren’t properly integrated with the entries but were added in as some sort of hack. I decided it would be better to have them as part of the editors’ XML rather than attempting to inject them into the entries when they were requested for display. I managed to find the commentaries in another hash file and thankfully managed to extract the XML from this using the Python script I’d previously written for the main entry hash file. I then wrote a script that identified which entry the commentary referred to, retrieved the entry and then inserted the commentary XML into the middle of it (underneath the closing </head> element).
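The insertion step can be sketched roughly as follows. This is an illustration rather than the script I used: it assumes the head element is a direct child of the entry's root element, and the element names ‘entry’ and ‘commentary’ in the example are placeholders rather than the real AND markup.

```python
import xml.etree.ElementTree as ET

def insert_commentary(entry_xml: str, commentary_xml: str) -> str:
    """Insert the commentary element directly after the entry's <head> child."""
    entry = ET.fromstring(entry_xml)
    commentary = ET.fromstring(commentary_xml)
    for position, child in enumerate(list(entry)):
        if child.tag == "head":
            entry.insert(position + 1, commentary)
            break
    else:
        # No <head> found: better to fail loudly than to guess a position
        raise ValueError("entry has no <head> child")
    return ET.tostring(entry, encoding="unicode")

# Placeholder markup, not a real AND entry
merged = insert_commentary(
    "<entry><head>abc</head><sense>x</sense></entry>",
    "<commentary>note</commentary>",
)
print(merged)
```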
It took somewhat longer than I expected to integrate the data as some of the commentaries contained Greek, and the underlying database was not set up to handle multi-byte UTF-8 characters (which Greek characters are), meaning these commentaries could not be added to the database. I needed to change the structure of the database and re-import all of the data as simply changing the character encoding of the columns gave errors. I managed to complete this process and import the commentaries and then begin the process of making them appear in the front-end. I still haven’t completely finished this (no formatting or links in the commentaries are working yet) and I’ll need to continue with this next week.
Also this week I added numbers to the senses. This also involved updating the editor’s XML to add a new ‘n’ attribute to the <sense> tag, e.g. <sense id="AND-201-47B626E6-486659E6-805E33CE-A914EB1F-S001" n="1">. As with the current site, the senses reset to 1 when a new part of speech begins. I also ensured that [sic] now appears, as does the language tag, with a question mark if the ‘cert’ attribute is present and not 100. Uncertain parts of speech are also now visible (again if ‘cert’ is present and not 100). I also increased the font size of the variant forms, and citation dates are now visible. There is still a huge amount of work to do, but progress is definitely being made.
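Here is a rough Python sketch of the numbering logic. The real AND XML is structured differently and in more detail than this; in particular the ‘pos’ element used here to mark the start of a new part of speech is an assumption made purely for the example, as are the short ids.

```python
import xml.etree.ElementTree as ET

def number_senses(entry_xml: str) -> str:
    """Add an 'n' attribute to each <sense>, restarting at 1 whenever a
    new part of speech (a hypothetical <pos> element) is encountered."""
    entry = ET.fromstring(entry_xml)
    counter = 0
    for element in entry.iter():  # document order
        if element.tag == "pos":
            counter = 0
        elif element.tag == "sense":
            counter += 1
            element.set("n", str(counter))
    return ET.tostring(entry, encoding="unicode")

sample = (
    '<entry><pos>s.</pos><sense id="S001"/><sense id="S002"/>'
    '<pos>v.</pos><sense id="S003"/></entry>'
)
numbered = ET.fromstring(number_senses(sample))
print([(s.get("id"), s.get("n")) for s in numbered.iter("sense")])
```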
Also this week I reviewed the transcriptions from a private library that we are hoping to incorporate into the Books and Borrowing project and tweaked the way ‘additional fields’ are stored to enable the RAs to enter HTML characters into them. I also created a spreadsheet template for recording the correspondence of Robert Burns for Craig Lamont and spoke to Eila Williamson about the design of the new Names Studies website. I updated the text on the homepage of this site with the new text that Lorna Hughes sent me, and gave some advice to Luis Gomes about a data management plan he is preparing. I also updated the wording on the search results page for ‘V3’ of the DSL to bring it into line with ‘V2’ and participated in a Zoom call for the Iona project where we discussed the new website and images that might be used in the design.
This was a three-day week for me as I was participating in the UCU strike action on Thursday and Friday. I spent the majority of Monday to Wednesday tying up loose ends and finishing off items on my ‘to do’ list. The biggest task I tackled was to relaunch the Digital Humanities at Glasgow site. This involved removing all of the existing site and moving the new site from its test location to the main URL. I had to write a little script that changed all of the image URLs so they would work in the new location and I needed to update WordPress so it knew where to find all of the required files. Most of the migration process went pretty smoothly, but there were some slightly tricky things, such as ensuring the banner slideshow continued to work. I also needed to tweak some of the static pages (e.g. the ‘Team’ and ‘About’ pages) and I added in a ‘contact us’ form. I also put in redirects from all of the old pages so that any bookmarks or Google links will continue to work. As of yet I’ve had no feedback from Marc or Lorna about the redesign of the site, so I can only assume that they are happy with how things are looking. The final task in setting up the new site was to migrate my blog over from its old standalone URL to become integrated in the new DH site. I exported all of my blog posts and categories and imported them into the new site using WordPress’s easy to use tools, and that all went very smoothly. The only thing that didn’t transfer over using this method was the media. All of the images embedded in my blog posts still pointed to image files located on the old server so I had to manually copy these images over and then I wrote a script that went through every blog post and found and replaced all image URLs. All now appears to be working, and this is the first blog post I’m making using the new site so I’ll know for definite once I add this post to the site.
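The image URL replacement is the sort of thing that can be done with a single regular expression. The following is a Python sketch rather than the script I ran, and the two base URLs are invented placeholders, not the real old and new locations.

```python
import re

def rewrite_image_urls(post_html: str, old_base: str, new_base: str) -> str:
    """Point <img> src attributes at the new server, leaving ordinary
    links to the old site untouched."""
    pattern = re.compile(r'(<img\b[^>]*?\bsrc=["\'])' + re.escape(old_base))
    # A function is used for the replacement so that new_base is inserted literally
    return pattern.sub(lambda match: match.group(1) + new_base, post_html)

OLD = "https://old.example.org/wp-content/uploads/"    # placeholder
NEW = "https://new.example.org/dh/wp-content/uploads/"  # placeholder

post = (
    '<p><a href="https://old.example.org/about">About</a>'
    '<img src="https://old.example.org/wp-content/uploads/a.png" alt=""></p>'
)
print(rewrite_image_urls(post, OLD, NEW))
```

Restricting the match to the src attribute of image tags means that any deliberate links to the old site in the post text are left alone.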
Other than setting up this new resource I made a further tweak to the new data for Matthew Sangster’s 18th century student borrowing records that I was working on last week. I had excluded rows from his spreadsheet that had ‘Yes’ in the ‘omit’ column, assuming that these were not to be processed by my upload script, but actually Matt wanted these to be in the system and displayed when browsing pages, but omitted from any searches. I therefore updated the online database to include a new ‘omit’ column, updated my upload script to only process these omitted rows and then changed the search facilities to ignore any rows that have a ‘Y’ in the ‘omit’ column.
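The browse / search distinction boils down to one extra condition in the search query. Here is a self-contained Python sketch using an in-memory SQLite database; the real system is different, and the table name, the ‘title’ column and the sample rows are all invented for the example. Only the ‘omit’ column and its ‘Y’ value come from the actual data.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE borrowing (id INTEGER PRIMARY KEY, title TEXT, omit TEXT)"
    )
    conn.executemany(
        "INSERT INTO borrowing (title, omit) VALUES (?, ?)",
        [("Sermons", ""), ("Sermons, vol. 2", "Y"), ("Voyages", "")],  # invented rows
    )
    return conn

def browse_rows(conn):
    """Browsing shows every row, including omitted ones."""
    return [row[0] for row in conn.execute("SELECT title FROM borrowing ORDER BY id")]

def search_rows(conn, term):
    """Searching ignores rows flagged with 'Y' in the omit column."""
    sql = (
        "SELECT title FROM borrowing "
        "WHERE title LIKE ? AND COALESCE(omit, '') <> 'Y' ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]

conn = build_demo_db()
print(browse_rows(conn))
print(search_rows(conn, "Sermons"))
```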
I also responded to a query from Alasdair Whyte regarding parish boundaries for his Place-names of Mull and Ulva project, and investigated an issue that Marc Alexander was experiencing with one of the statistics scripts for the HT / OED matching process (it turned out to be an issue with a bad internet connection rather than an issue with the script). I’d had a request from Paul Malgrati in Scottish Literature about creating an online resource that maps Burns Suppers so I wrote a detailed reply discussing the various options.
I also spent some time fixing the issue of citation links not displaying exactly as the DSL people were wanting in the DSL website. This was surprisingly difficult to implement because the structure can vary quite a bit. The ID needed for the link is associated with the ‘cref’ that wraps the whole reference, but the link can’t be applied to the full contents as only authors and titles should be links, not geo and date tags or other non-tagged text that appears in the element. There may be multiple authors or no author so sometimes the link needs to start before the first (and only the first) author whereas other times the link needs to start before the title. As there is often text after the last element that needs to be linked the closing tag of the link can’t just be appended to the text but instead the script needs to find where this last element ends. However, it looks like I’ve figured out a way to do it that appears to work.
I devoted a few spare hours on Wednesday to investigating the Anglo-Norman Dictionary. Heather was trying to figure out where on the server the dataset that the website uses is located, and after reading through the documentation I managed to figure out that the data is stored in a Berkeley DB and it looks like the data the system uses is stored in a file called ‘entry_hash’. There is a file with this name in ‘/var/data’ and it’s just over 200Mb in size, which suggests it contains a lot of data. Software that can read this file can be downloaded from here: https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html and the Java edition should work on a Windows PC if it has Java installed. Unfortunately you have to register to access the files and I haven’t done so yet, but I’ve let Heather know about this.
I then experimented with setting up a simple AND website using the data from ‘all.xml’, which (more or less) contains all of the data for the dictionary. My test website consists of a database, one script that goes through the XML file and inserts each entry into the database, and one script that displays a page allowing you to browse and view all entries. My AND test is really very simple (and is purely for test purposes) – the ‘browse’ from the live site (which can be viewed here: http://www.anglo-norman.net/gate/) is replicated, only it currently displays in its entirety, which makes the page a little slow to load. Cross references are in yellow, main entries in white. Click on a cross reference and it loads the corresponding main entry, click on a regular headword to load its entry. Currently only some of the entry page is formatted, and some elements don’t always display correctly (e.g. the position of some notes). The full XML is displayed below the formatted text. Here’s an example:
Clearly there would still be a lot to do to develop a fully functioning replacement for AND, but it wouldn’t take a huge amount of time (I’d say it could be measured in days not weeks). It just depends whether the AND people want to replace their old system.
I’ll be on strike again next week from Monday to Wednesday.
I divided most of my time between three projects this week. For the Place-Names of Mull and Ulva my time was spent working with the GB1900 dataset. On Friday last week I’d created a script that would go through the entire 2.5 million row CSV file and extract each entry, adding it to a database for easier querying. This process had finished on Monday, but unfortunately things had gone wrong during the processing. I was using the PHP function ‘fgetcsv’ to extract the data a line at a time. This splits the CSV up based on a delimiting character (in this case a comma) and adds each part into an array, thus allowing the data to be inserted into my database. Unfortunately some of the data contained commas. In such cases the data was enclosed in double quotes, which is the standard way of handling such things, and I had thought the PHP function would automatically handle this, but alas it didn’t, meaning whenever a comma appeared in the data the row was split up into incorrect chunks and the data was inserted incorrectly into the database. After realising this I added another option to the ‘fgetcsv’ command to specify a character to be identified as the ‘enclosure’ character and set the script off running again. It had completed the insertion by Wednesday morning, but when I came to query the database again I realised that the process had still gone wrong. Further investigation revealed the cause to be the GB1900 CSV file itself, which was encoded with UCS-2 character encoding rather than the more usual UTF-8. I’m not sure why the data was encoded in this way, as it’s not a current standard and it results in a much larger file size than using UTF-8. It also meant that my script was not properly identifying the double quote characters, which is why my script failed a second time.
However, after identifying this issue I converted the CSV to UTF-8, picked out a section with commas in the data, tested my script, discovered things were working this time, and let the script loose on the full dataset yet again. Thankfully it proved to be ‘third time lucky’ and all 2.5 million rows had been successfully inserted by Friday morning.
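To illustrate the two problems (the encoding and the quoted commas), here is a small Python sketch. My actual script was written in PHP, and the column names and the sample row below are invented rather than taken from the GB1900 file. UCS-2 text can be decoded with Python's ‘utf-16’ codec, which is what the sketch assumes.

```python
import csv
import io

def convert_to_utf8(raw: bytes, source_encoding: str = "utf-16") -> bytes:
    """Re-encode the file contents as UTF-8 before any CSV parsing."""
    return raw.decode(source_encoding).encode("utf-8")

def parse_rows(utf8_bytes: bytes):
    """Parse comma-delimited rows, treating double quotes as the enclosure
    character so that commas inside quoted values do not split a field."""
    text = utf8_bytes.decode("utf-8")
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    return list(reader)

# Invented sample: one place-name containing a comma, stored as UTF-16
sample = 'id,name,lat,lon\r\n1,"Burn, Mill of",56.4,-6.0\r\n'.encode("utf-16")

rows = parse_rows(convert_to_utf8(sample))
print(rows[1])

# A naive split on commas breaks the quoted name into two pieces
naive = sample.decode("utf-16").splitlines()[1].split(",")
print(len(naive), "fields instead of", len(rows[1]))
```

For a file the size of GB1900 it would be better to stream the conversion and parsing a line at a time rather than loading everything into memory as this toy example does.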
After that I was then able to extract all of the place-names for the three parishes we’re interested in, which is a rather more modest 3908 rows. I then wrote another script that would take this data and insert it into the project’s place-name table. The place-names are a mixture of Gaelic and English (e.g. ‘A’ Mhaol Mhòr’ is pretty clearly Gaelic while ‘Argyll Terrace’ is not) and for now I set the script to just add all place-names to the ‘English’ rather than ‘Gaelic’ field. The script also inserts the latitude and longitude values from the GB1900 data, and associates the appropriate parish. I also found a bit of code that takes latitude and longitude figures and generates a 6 figure OS grid reference from them. I tested this out and it seemed pretty accurate, so I also added this to my script, meaning all names also have the grid reference field populated.
The other thing I tried to do was to grab the altitude for each name via the Google Maps service. This proved to be a little tricky as the service blocks you if you make too many requests all at once. Also, our server was blacklisting my computer for making too many requests in a short space of time too, meaning for a while afterwards I was unable to access any page on the site or the database. Thankfully Arts IT Support managed to stop me getting blocked and I managed to set the script to query Google Maps at a rate that was acceptable to it, so I was able to grab the altitudes for all 3908 place-names (although 16 of them are at 0m so may look like it’s not worked for these). I also added in a facility to upload, edit and delete one or more sound files for each place-name, together with optional captions for them in English and Gaelic. Sound files must be in the MP3 format.
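The throttling can be sketched as below. This is an illustration in Python, not my actual script: the ‘fetch’ function stands in for whatever makes the request to the elevation service, and the interval is a hypothetical figure, not the rate I ended up using. Returning None for a failed lookup would be one way of telling a genuine altitude of 0m apart from a request that didn't work.

```python
import time

def fetch_throttled(items, fetch, min_interval_seconds,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch(item) for each item, leaving at least
    min_interval_seconds between the start of consecutive calls."""
    results = {}
    last_call = None
    for item in items:
        if last_call is not None:
            wait = min_interval_seconds - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[item] = fetch(item)  # may be None if the lookup failed
    return results

def fake_fetch(name):
    # Stand-in for a real request; 'B' simulates a failed lookup
    return {"A": 0.0, "B": None, "C": 112.5}[name]

altitudes = fetch_throttled(["A", "B", "C"], fake_fetch, min_interval_seconds=0.01)
failed = [name for name, value in altitudes.items() if value is None]
print(altitudes, failed)
```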
The second project I worked on this week was my redevelopment of the ‘Digital Humanities at Glasgow’ site. I have now finished going through the database of DH projects, trimming away the irrelevant or broken ones and creating new banners, icons, screenshots, keywords and descriptions for the rest. There are now 75 projects listed, including 15 that are currently set as ‘Showcase’ projects, meaning they appear in the banner slideshow and on the ‘showcase’ page. I also changed the site header font and fixed an issue with the banner slideshow and images getting too small on narrow screens. I’ve asked Marc Alexander and Lorna Hughes to give me some feedback on the new site and I hope to be able to launch it in two weeks or so.
My third major project of the week was the Historical Thesaurus. Marc, Fraser and I met last Friday to discuss a new way of storing dates that I’ve been wanting to implement for a while, and this week I began sorting this out. I managed to create a script that can process any date, including associating labels with the appropriate date. Currently the script allows you to specify a category (or to load a random category) and the dates for all lexemes therein are then processed and displayed on screen. As of yet nothing is inserted into the database. I have also updated the structure of the (currently empty) dates table to remove the ‘date order’ field. I have also changed all date fields to integers rather than varchars to ensure that ordering of the columns is handled correctly. At last Friday’s meeting we discussed replacing ‘OE’ and ‘_’ with numerical values. We had mentioned using ‘0000’ for OE, but I’ve realised this isn’t a good idea as ‘0’ can easily be confused with null. Instead I’m using ‘1100’ for OE and ‘9999’ for ‘current’. I’ve also updated the lexeme table to add in new fields for ‘firstdate’ and ‘lastdate’ that will be the cached values of the first and last dates stored in the new dates table.
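As a sketch of the numeric substitution and the cached fields, here is a few lines of Python. My script is not written like this, and the function names are invented; I'm also assuming for the example that ‘_’ is the marker that means ‘current’. The values 1100 and 9999 are the ones described above.

```python
OE_VALUE = 1100       # stands in for 'OE'
CURRENT_VALUE = 9999  # stands in for 'current' (assumed here to be '_')

def to_year(raw: str) -> int:
    """Convert a stored date value to an integer year."""
    if raw == "OE":
        return OE_VALUE
    if raw == "_":
        return CURRENT_VALUE
    return int(raw)

def cached_first_last(raw_dates):
    """Values for the lexeme table's 'firstdate' and 'lastdate' fields."""
    years = [to_year(raw) for raw in raw_dates]
    return min(years), max(years)

print(cached_first_last(["OE", "1459", "_"]))

# Why integers rather than varchars: text sorts character by character
print(sorted(["950", "1100"]))   # string ordering puts 1100 first
print(sorted([950, 1100]))       # integer ordering is chronological
```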
The script displays each lexeme in a category with its ‘full date’ column. It then displays what each individual entry in the new ‘date’ table would hold for the lexeme in boxes beneath this, and then finishes off with displaying what the new ‘firstdate’ and ‘lastdate’ fields would contain. Processing all of the date variations turned out to be somewhat easier than it was for generating timeline visualisations, as the former can be treated as individual dates (an OE, a first, a mid, a last, a current) while the latter needed to transform the dates into ranges, meaning the script had to check how each individual date connected to the next, had to possibly use ‘b’ dates etc.
I’ve tested the script out and I have so far only encountered one issue, and that is there are 10 rows that have first dates and mid dates and last dates but instead of the ‘firmidcon’ field joining the first and the mid dates together the ‘firlastcon’ field is used instead. Then the ‘midlascon’ field is used to join the mid date to the last. This is an error as ‘firlastcon’ should not be used to join first and mid dates. An example of this happening is htid 28903 in catid 8880 where the ‘full date’ is ‘1459–1642/7 + 1856’. There may be other occasions where the wrong joining column has been used, but I haven’t checked for these so far.
After getting the script to sort out the dates I then began to look at labels. I started off using the ‘label’ field in order to figure out where in the ‘full date’ the label appeared. However, I noticed that where there are multiple labels these appear all joined together in the label field, meaning in such cases the contents of the label field will never be matched to any text in the ‘full date’ field. E.g. htid 6463 has the full date ‘1611 Dict. + 1808 poet. + 1826 Dict.’ and the label field is ‘Dict. poet. Dict.’ which is no help at all.
Instead I abandoned the ‘label’ field and just used the ‘full date’ field. Actually, I still use the ‘label’ field to check whether the script needs to process labels or not. Here’s a description of the logic for working out where a label should be added:
The dates are first split up into their individual boxes. Then, if there is a label for the lexeme I go through each date in turn. I split the full date field and look at the part after the date. I go through each character of this in turn. If the character is a ‘+’ then I stop. If I have yet to find label text (they all start with an a-z character) and the character is a ‘-’ and the following character is a number then I stop. Otherwise if the character is a-z I note that I’ve started the label. If I’ve started the label and the current character is a number then I stop. Otherwise I add the current character to the label and proceed to the next character until all remaining characters are processed or a ‘stop’ criterion is reached. After that if there is any label text it’s added to the date. This process seems to work. I did, however, have to fix how labels applied to ‘current’ dates are processed. For a current date my algorithm was adding the label to the final year and not the current date (stored as 9999) as the label is found after the final year and ‘9999’ isn’t found in the full date string. I added a further check for ‘current’ dates after the initial label processing that moves labels from the penultimate date to the current date in such cases.
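The character-by-character logic described above can be sketched in Python as follows. This is an approximation rather than my real code: it treats any letter, upper or lower case, as the start of a label (since ‘Dict.’ begins with a capital), it accepts either a hyphen or an en dash as the range character, it only collects characters once a label has started, and the dictionary keys ‘year’ and ‘label’ are invented for the example.

```python
def extract_label(text_after_date: str) -> str:
    """Pull the label (if any) out of the text that follows one date."""
    label = ""
    started = False
    for index, char in enumerate(text_after_date):
        following = text_after_date[index + 1:index + 2]
        if char == "+":
            break  # next date begins
        if not started and char in "-–" and following.isdigit():
            break  # a date range continues, so there is no label here
        if char.isalpha():
            started = True
        elif started and char.isdigit():
            break  # ran into the next date
        if started:
            label += char
    return label.strip()

def move_label_to_current(dates):
    """If the last date is 'current' (9999), move a label found on the
    penultimate date across to it."""
    if (len(dates) >= 2 and dates[-1]["year"] == 9999
            and dates[-2]["label"] and not dates[-1]["label"]):
        dates[-1]["label"] = dates[-2]["label"]
        dates[-2]["label"] = ""
    return dates

# Text following each date in '1611 Dict. + 1808 poet. + 1826 Dict.'
print(extract_label(" Dict. + 1808 poet. + 1826 Dict."))
print(extract_label(" poet. + 1826 Dict."))
```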
In addition to these three big projects I also had an email conversation with Jane Roberts about some issues she’d been having with labels in the Thesaurus of Old English, I liaised with Arts IT Support to get some server space set up for Rachel Smith and Ewa Wanat’s project, I gave some feedback on a job description for an RA for the Books and Borrowing project, helped Carole Hough with an issue with a presentation of the Berwickshire Place-names resource, gave the PI response for Thomas’s Iona project a final once-over, gave a report on the cost of Mapbox to Jennifer Smith for the SCOSYA project and arranged to meet Matthew Creasey next week to discuss his Decadence and Translation project.
I spent quite a bit of time this week continuing to work on the systems for the Place-names of Mull and Ulva project. The first thing I did was to figure out how WordPress sets the language code. It has a function called ‘get_locale()’ that brings back the code (e.g. ‘En’ or ‘Gd’). Once I knew this I could update the site’s footer to display a different logo and text depending on the language the page is in. So now if the page is in English the regular UoG logo and English text crediting the map and photo are displayed whereas if the page is in Gaelic the Gaelic UoG logo and credit text are displayed. I think this is working rather well.
I managed to get all of the new Gaelic fields added into the CMS and fully tested by Thursday and asked Alasdair to test things out. I also had a discussion with Rachel Opitz in Archaeology about incorporating LIDAR data into the maps and started to look at how to incorporate data from the GB1900 project for the parishes we are covering. GB1900 (http://www.gb1900.org/) was a crowdsourced project to transcribe every place-name that appears on OS maps from 1888-1914, which resulted in more than 2.5 million transcriptions. The dataset is available to download as a massive CSV file (more than 600MB). It includes place-names for the three parishes on Mull and Ulva and Alasdair wanted to populate the CMS with this data as a starting point. On Friday I started to investigate how to access the information. Extracting the data manually from such a large CSV file wasn’t feasible so instead I created a MySQL database and wrote a little PHP script that iterated through each line of the CSV and added it to the database. I left this running over the weekend and will continue to work with it next week.
Also this week I continued to add new project records to the new Digital Humanities at Glasgow site. I only have about 30 more sites to add now, and I think it’s shaping up to be a really great resource that we will hopefully be able to launch in the next month or so.
I also spent a bit of further time on the SCOSYA project. I’d asked the university’s research data management people whether they had any advice on how we could share our audio recording data with other researchers around the world. The dataset we have is about 117GB, and originally we’d planned to use the University’s file transfer system to share the files. However, this can only handle files that are up to 20GB in size, which meant splitting things up. It also turned out to take an awfully long time to upload the files, a process we would have to repeat each time the data was requested. The RDM people suggested we use the University’s OneDrive system instead. This is part of Office365 and gives each member of staff 1TB of space, and it’s possible to share uploaded files with others. I tried this out and the upload process was very swift. It was also possible to share the files with users based on their email addresses, and to set expiration dates and passwords for file access. It looks like this new method is going to be much better for the project and for any researchers who want to access our data. We also set up a record about the dataset in the Enlighten Research Data repository: http://researchdata.gla.ac.uk/951/ which should help people find the data.
Also for SCOSYA we ran into some difficulties with Google’s reCAPTCHA service, which we were using to protect the contact forms on our site from spam submissions. There was an issue with version 3 of Google’s reCAPTCHA system when integrated with the contact form plugin. It works fine if Google thinks you’re not a spammer but if you somehow fail its checks it doesn’t give you the option of proving you’re a real person, it just blocks the submission of the form. I haven’t been able to find a solution for this using v3, but thankfully there is a plugin that allows the contact form plugin to revert back to using reCAPTCHA v2 (the ‘I am not a robot’ tickbox). I got this working and have applied it to both the contact form and the spoken corpus form and it works for me as someone Google somehow seems to trust and for me when using IE via remote desktop, where Google makes me select features in images before the form submits.
Also this week I met with Marc and Fraser to discuss further developments for the Historical Thesaurus. We’re going to look at implementing the new way of storing and managing dates that I originally mapped out last summer and so we met on Friday to discuss some of the implications of this. I’m hoping to find some time next week to start looking into this.
We received the reviews for the Iona place-name project this week and I spent some time during the week and over the weekend going through the reviews, responding to any technical matters that were raised and helping Thomas Clancy with the overall response, which needed to be submitted the following Monday. I also spoke to Ronnie Young about the Burns Paper Database, which we may now be able to make publicly available, and made some updates to the NME digital ode site for Bryony Randall.
This week the Scots Syntax Atlas project was officially launched, and it’s now available here: https://scotssyntaxatlas.ac.uk/ for all to use. We actually made the website live on Friday last week so by the time of the official launch on Tuesday this week there wasn’t much left for me to do. I spent a bit of time on Monday embedding the various videos of the atlas within the ‘Video Tour’ page and I updated the data download form. I also ensured that North Queensferry appeared in our grouping for Fife, as it had been omitted and wasn’t appearing as part of any group. I also created a ‘how to cite’ page in addition to the information that is already embedded in the Atlas.
The Atlas uses MapBox for its base maps, a commercial service that allows you to apply your own styles to maps. For SCOSYA we wanted a very minimalistic map, and this service enabled us to create such an effect. MapBox allows up to 200,000 map tile loads for free each month, but we figured this might not be sufficient for the launch period so arranged to apply some credit to the account to cover the extra users that the launch period might attract. The launch itself went pretty well, with some radio interviews and some pieces in newspapers such as the Scotsman. We had several thousand unique users to the site on the day it launched and more than 150,000 map tile loads during the day, so the extra credit is definitely going to be used up. It’s great that people are using the resource and it appears to be getting some very positive feedback, which is excellent.
I spent some of the remainder of the week going through my outstanding ‘to do’ items for the place-names of Kirkcudbrightshire project. This is another project that is getting close to completion and there were a few things relating to the website that I needed to tweak before we do so, namely:
I completely removed all references to the Berwickshire place-names project from the site (the system was based on the one I created for Berwickshire and there were some references to this project throughout the existing pages). I also updated all of the examples in the new site’s API to display results for the KCB data. Thomas and Gilbert didn’t want to use the icons that I’d created for the classification codes for Berwickshire so I replaced them with simpler coloured circular markers instead. I also added in the parish boundaries for the KCB parishes. I’d forgotten how I’d done this for Berwickshire, but thankfully I’d documented the process. There is an API through which the geoJSON shapes for the parishes can be grabbed: http://sedsh127.sedsh.gov.uk/arcgis/rest/services/ScotGov/AreaManagement/MapServer/1/query and through this I entered the text of the parish into the ‘text’ field and selected ‘polygon’ for ‘geometry type’ and ‘geoJSON’ for the ‘format’ and this gave me exactly what I needed. I also needed the coordinates for where the parish acronym should appear, and I grabbed these via an NLS map that displays all of the parish boundaries (https://maps.nls.uk/geo/boundaries/#zoom=10.671666666666667&lat=55.8481&lon=-2.5155&point=0,0), finding the centre of a parish by positioning my cursor over the appropriate position and noting the latitude and longitude values in the bottom right corner of the map (the order of these needed to be reversed to be used in Leaflet).
I also updated the ‘cite’ text, rearranged the place-name record page to ensure that most data appears above the map and that the extended elements view (including element certainty) appeared. I also changed the default position and zoom level of the results map to ensure that all data (apart from a couple of outliers) are visible by default and rearranged the advanced search page, including fixing the ‘parish’ part of the search. I also added the ‘download data for print’ facility to the CMS.
Also this week I met with Alasdair Whyte from Celtic and Gaelic to discuss his place-names of Mull project. I’m going to be adapting my place-names system for his project and we discussed some of the further updates to the system that would be required for his project. The largest of these will be making the entire site multilingual. This is going to be a big job as every aspect of the site will need to be available in both Gaelic and English, including search boxes, site text, multilingual place-names, sources etc. I’ll probably get started on this in the new year.
Also this week I fixed a couple of further issues regarding certificates for Stuart Gillespie’s NRECT site and set up a conference website that is vaguely connected to Critical Studies (https://spheres-of-singing.gla.ac.uk/).
I spent the rest of the week continuing with the redevelopment of the Digital Humanities Network site. I completed all of the development of this (adding in a variety of browse options for projects) and then started migrating projects over to the new system. This involved creating new icons, screenshots and banner images for each project, checking over the existing data and replacing it where necessary and ensuring all staff details and links are correct. By the end of the week I’d migrated 25 projects over, but there are still about 75 to look over. I think the new site is looking pretty great and is an excellent showcase of the DH projects that have been set up at Glasgow. It will be excellent once the new site is ready to go live and can replace the existing outdated site.
This is the last week of work for me before the Christmas holidays so there will be no further posts until the New Year. If anyone happens to be reading this I wish you a very merry Christmas.
After the disruption of the recent strike, this was my first full week back at work, and it was mostly spent making final updates to the SCOSYA website ahead of next Tuesday’s official launch. We decided to make the full resource publicly available on Friday as a soft launch so if you would like to try the public and linguists’ atlases, look at the interactive stories, view data in tables or even connect to the API you can now do it all at https://scotssyntaxatlas.ac.uk/. I’m really pleased with how it’s all looking and functioning and my only real worry is that we get lots of users and a big bill from MapBox for supplying the base tiles for the map. We’ll see what happens about that next week. But for now I’ll go into a bit more detail about the tasks I completed this week.
We had a team meeting on Tuesday morning where we finalised the menu structure for the site, the arrangement of the pages, the wording of various sections and such matters. Frankie MacLeod, the project RA, had produced some walkthrough videos for the atlas that are very nicely done and we looked at those and discussed some minor tweaks and how they might be integrated with the site. We did wonder about embedding a video in the homepage, but decided against it as E reckons ‘pivot to video’ is something people thought users did but has proved to be false. I personally never bother with videos so I kind of agree with this, but I know from previous experience with other projects such as Mapping Metaphor that other users do really appreciate videos, so I’m sure they will be very useful.
We still had the online consent form for accessing the data to finalise so we discussed the elements that would need to be included and how it would operate. Later in the week I implemented the form, which will allow researchers to request all of the audio files, or ones for specific areas. We also decided to remove the apostrophe from “Linguists’ Atlas” and to just call it “Linguist Atlas” throughout the site as the dangling apostrophe looks a bit messy. Hopefully our linguist users won’t object too much! I also added in a little feature to allow images of places to appear in the pop-ups for the ‘How do people speak in…’ map. These can now be managed via the project’s Content Management System and I think having the photos in the map really helps to make it feel more alive.
During the week I arranged with Raymond Brasas of Arts IT Support to migrate the site to a new server in preparation for the launch, as the current server hosts many different sites, is low on storage and is occasionally a bit flaky. The migration went smoothly and after some testing we updated the DNS records and the version of the site on the new server went live on Thursday. On Friday I went live with all of the updates, which is when I made changes to the menu structure and did things such as update all internal URLs in the site so that any links to the ‘beta’ version of the atlases now point to the final versions. This includes things such as the ‘cite this page’ links. I also updated the API to remove the warning asking people to not use it until the launch as we’re now ready to go. There are still a few tweaks to make next week, but pretty much all my work on the project is now complete.
In addition to SCOSYA I worked on a few other projects this week, and also attended the English Language & Linguistics Christmas lunch on Tuesday afternoon. I had a chat to Thomas Clancy about the outstanding tasks I still need to do for the Place-names of Kirkcudbrightshire project. I’m hoping to find the time to get all these done next week. I made a few updates to Stuart Gillespie’s Newly Recovered English Classical Translations annexe site (https://nrect.gla.ac.uk/), namely adding in a translation of Sappho’s Ode and fixing some issues with security certificates. I also arranged for Carole Hough’s CogTop site to have its domain renewed for a year, as it was due to expire this month, and had a chat with Alasdair Whyte from Celtic and Gaelic about his new place-names of Mull project. I’m going to be adapting the system I created for the Berwickshire Place-names project for Alasdair, which will include adding in multilingual support and other changes. I’m meeting with him next week to discuss the details. I also managed to dig out the Scots School Dictionary Google Developer account details for Ann Fergusson at DSL, as she didn’t have a record of these and needed to look at the stats.
Other than the above, I continued to work on a new version of the Digital Humanities Network site. I completed work on the new Content Management System, created a few new records (well, mostly overhauling some of the old data), and started work on the front-end. This included creating the page where project details will be viewed and starting work on the browse facilities that will let people access the projects. There’s still some work to be done but I’m pretty happy with how the new site is coming on. It’s certainly a massive improvement on the old, outdated resource. Below is a screenshot of one of the new pages:
Next week we will officially launch the SCOSYA resource and I will hopefully have enough time to tie up a few loose ends before finishing for the Christmas holidays.
I participated in the UCU strike action for all of last week and the first three days of this week, meaning I only worked on Thursday and Friday this week. I spent some of Thursday going through emails that had accumulated and tackled a few items on my ‘to do’ list. I managed to fix a couple of old websites that had lost a bit of functionality due to connecting to a remote server that had stopped accepting connections. These were the two ‘Emblems’ websites that I created about 15 years ago (http://emblems.arts.gla.ac.uk/french/ and http://emblems.arts.gla.ac.uk/alciato/). The emblems they contain are categorised using the Iconclass classification system for art and iconography. The Iconclass terms applied to each emblem, and all associated Iconclass search functionality, are stored on a server in the Netherlands, with the server at Glasgow connecting to this in order to execute an Iconclass search and display any matching results. Unfortunately the configuration of this remote server had changed and no requests from Glasgow were getting through. Thankfully Etienne Posthumus, who helped set up the system all those years ago and is still looking after the service in the Netherlands, was able to suggest an alternative means of connecting, and with the update in place the sites were restored to their original level of functionality.
I also did a bit of work for the DSL. Firstly I continued an ongoing discussion with Ann Fergusson about updates to the data. Whilst working on a new order for the ‘browse’ feature I had noticed a small selection of entries that didn’t have any data in their ‘headword’ column, despite having headwords in the XML entries. Ann had investigated this and suggested it might be caused by non-alphanumeric characters in the headword, but having looked into it myself, this doesn’t seem to tell the whole story. The headwords are only missing in the data I processed from the XML files from the work in progress server (i.e. the V3 API) – they’re present in the data from the original API. Apostrophes can cause issues when inserting data into a database, but having looked through my script I can confirm that it uses an insert method that can process apostrophes successfully. Indeed, there are some 439 DOST and SND entries that contain apostrophes that have been successfully inserted. Plus the script also successfully inserted the headwords for entries such as ‘Pedlar’s Drouth’ (one of the entries with a blank headword) into the separate ‘forms’ table during upload, but then didn’t add the headword field containing the same data. It’s all very strange. And there’s no reason why other special characters or punctuation shouldn’t have been inserted. Plus some entries that are missing headwords don’t have special characters or punctuation, such as ‘GEORGEMAS FAIR or MARKET’. I didn’t manage to figure out why the headwords for these entries were missing, but I added them to the database, and I think I’ll just need to watch out for these entries when I process the new data when it’s ready.
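To illustrate why apostrophes shouldn’t be the culprit when an insert uses placeholders, here is a small sketch in Python. It is not the actual upload script: it uses an in-memory SQLite database as a stand-in, and the `entries` table and its columns are invented for the example. It also shows the kind of check that would flag blank headwords straight after an import.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, headword TEXT)")

headwords = ["Pedlar's Drouth", "GEORGEMAS FAIR or MARKET"]

# Placeholders let the driver handle quoting, so apostrophes need no manual escaping
conn.executemany(
    "INSERT INTO entries (headword) VALUES (?)", [(h,) for h in headwords]
)
conn.commit()

# Sanity check: count rows whose headword ended up empty
blank = conn.execute(
    "SELECT COUNT(*) FROM entries WHERE headword IS NULL OR headword = ''"
).fetchone()[0]
stored = [row[0] for row in conn.execute("SELECT headword FROM entries ORDER BY id")]

print(stored, blank)
```

Running a check like this after each import would at least show which entries lost their headword, even if it doesn’t explain why.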
My second DSL task was getting some information to Rhona about the Scots School Dictionary app. I sent her a copy of all of the sound files contained within the app, wrote a query to list all entries that contained sound files and a tally of the number of sound files each of these has, and gave her some information about how the current version of the app stores and uses its sound files.
I also responded to Alasdair Whyte from Celtic and Gaelic who has a research fellowship to explore the place-names of Mull and is wanting to make use of the place-name system I initially created for the Berwickshire place-names project. Hopefully we’ll be able to arrange for this to happen.
On Friday I met with Gerry McKeever to discuss the interactive map I’m going to create for his ‘Regional Romanticism’ project. Gerry had sent me some sample data, consisting of about 40 entries with latitudes, longitudes, titles and text. I created an initial interactive map based on this data using the StoryMap service (https://storymap.knightlab.com/). I showed this to Gerry, but he thought it wasn’t quite flexible enough as the user is not able to control the zoom level of the map, plus he wanted a greater amount of freedom to style the connecting lines as well – adding in directional arrows, for example. We decided that we would see if the maps people at NLS would let us use one of their geocoded historical maps as a base map, and that I would then create my own bespoke interface based on the ‘stories’ I’d created for the SCOSYA project. Gerry contacted the NLS people and hopefully I’ll be able to proceed with things once I have the maps.
I spent the rest of Friday continuing to rework the Digital Humanities Network website, working on a new content management system, completing work on the new underlying database, migrating data over and creating facilities to create a new project record. There’s still a lot to be done here, not just from a technical point of view but also deciding what projects should continue to be featured, and I’ll continue to work on this over the next few weeks as time allows.
This was a week of many different projects, most of which required fairly small jobs doing, but some of which required most of my time. I responded to a query from Simon Taylor about a potential new project he’s putting together that will involve the development of an app. I fixed a couple of issues with the old pilot Scots Thesaurus website for Susan Rennie, and I contributed to a Data Management Plan for a follow-on project that Murray Pittock is working on. I also made a couple of tweaks to the new maps I’d created for Thomas Clancy’s Saints Places project (the new maps haven’t gone live yet) and I had a chat with Rachel Macdonald about some further updates to the SPADE website. I also made some small updates to the Digital Humanities Network website, such as replacing HATII with Information Studies. I also had a chat with Carole Hough about the launch of the REELS resource, which will happen next month, and spoke to Alison Wiggins about fixing the Bess of Hardwick resource, which is currently hosted at Sheffield and is unfortunately no longer working properly. I also continued to discuss the materials for an upcoming workshop on digital editions with Bryony Randall and Ronan Crowley. I also made a few further tweaks to the new Seeing Speech and Dynamic Dialects websites for Jane Stuart-Smith.
I had a meeting with Kirsteen McCue and Brianna Robertson-Kirkland to discuss further updates to the Romantic National Song Network website. There are going to be about 15 ‘song stories’ that we’re going to publish between the new year and the project’s performance event in March, and I’ll be working on putting these together as soon as the content comes through. I also need to look into developing an overarching timeline with contextual events.
I spent some time updating the pilot crowdsourcing platform I had set up for Scott Spurlock. Scott wanted to restrict access to the full-size manuscript images and also wanted to have two individual transcriptions per image. I updated the site so that users can no longer right click on an image to save or view it. This should stop most people from downloading the image, but I pointed out that it’s not possible to completely lock the images. If you want people to be able to view an image in a browser it is always going to be possible for the user to get the image somehow – e.g. saving a screenshot, or looking at the source code for the site and finding the reference to the image. I also pointed out that by stopping people easily getting access to the full image we might put people off from contributing – e.g. some people might want to view the full image in another browser window, or print it off to transcribe from a hard copy.
I also spent a bit of time continuing to work on the Bilingual Thesaurus. I moved the site I’m working on to a new URL, as requested by Louise Sylvester, and updated the thesaurus data after receiving feedback on a few issues I’d raised previously. This included updating the ‘language of citation’ for the 15 headwords that had no data for this, instead making them ‘uncertain’. I also added in first dates for a number of words that previously only had end dates, based on information Louise sent to me. I also noticed that several words have duplicate languages in the original data, for example the headword “Clensing (mashinge, yel, yeling) tonne” has for language of origin: “Old English|?Old English|Middle Dutch|Middle Dutch|Old English”. My new relational structure ideally should have a language of origin / citation linked only once to a word, otherwise things get a bit messy, so I asked Louise whether these duplicates are required, and whether a word can have both an uncertain language of origin (“?Old English”) and a certain language of origin (“Old English”). I haven’t heard back from her about this yet, but I wrote a script that strips out the duplicates, and where both an uncertain and certain connection exists keeps the uncertain one. If needs be I’ll change this. Other than these issues relating to the data, I spent some time working on the actual site for the Bilingual Thesaurus. I’m taking the opportunity to learn more about the Bootstrap user interface library and am developing the website using this. I’ve been replicating the look and feel of the HT website using Bootstrap syntax and have come up with a rather pleasing new version of the HT banner and menu layout. Next week I’ll see about starting to integrate the data itself.
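The de-duplication rule described above can be sketched in a few lines of Python. This is not the script itself: the function name `tidy_languages` is invented, and it assumes that a leading ‘?’ is the only marker of uncertainty and that languages are separated by a pipe character, as in the example headword.

```python
def tidy_languages(raw):
    """Collapse a pipe-separated language list to one entry per language.

    A leading '?' marks an uncertain attribution; if a language appears both
    with and without it, the uncertain form is kept.
    """
    order = []         # languages in order of first appearance
    uncertain = set()  # languages seen at least once with a '?'
    for item in raw.split("|"):
        item = item.strip()
        if not item:
            continue
        name = item.lstrip("?").strip()
        if name not in order:
            order.append(name)
        if item.startswith("?"):
            uncertain.add(name)
    return ["?" + name if name in uncertain else name for name in order]


print(tidy_languages("Old English|?Old English|Middle Dutch|Middle Dutch|Old English"))
# ['?Old English', 'Middle Dutch']
```

If it turns out the certain form should win instead, only the final line of the function needs to change.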
This just leaves the big project of the week to discuss: the ongoing work to align the HT and OED datasets. I continued to implement some of the QA and matching scripts that Marc, Fraser and I discussed at our meeting last week. Last week I ‘dematched’ 2412 categories that don’t have a perfect match in their number of lexemes and have the same parent category. I created a further script that checks how many lexemes in these potentially matched categories are the same. This script counts the number of words in the potentially matched HT and OED categories and counts how many of them are identical (stripped). A percentage of the number of HT words that are matched is also displayed. If the number of HT and OED words match and the total number of matches is the same as the number of words in the HT and OED categories the row is displayed in green. If the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1 this is also considered a match. If the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1 this is also considered a match. The total matches given are 1154 out of 2412.
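As a rough illustration of the three rules above, here is a Python sketch. It is not the script that was run: the function names are invented, and the ‘stripped’ form is assumed to mean lower-cased with everything other than letters and digits removed, which may differ from the real normalisation.

```python
import re
from collections import Counter


def strip_word(word):
    """Assumed normalisation: lower-case and drop anything that isn't a letter or digit."""
    return re.sub(r"[\W_]+", "", word.lower())


def compare_categories(ht_words, oed_words):
    ht = Counter(strip_word(w) for w in ht_words)
    oed = Counter(strip_word(w) for w in oed_words)
    ht_n, oed_n = sum(ht.values()), sum(oed.values())
    matched = sum((ht & oed).values())  # stripped words present in both lists
    percent = round(100 * matched / ht_n) if ht_n else 0

    if ht_n == oed_n == matched:
        status = "full"   # the rows displayed in green
    elif matched == ht_n and abs(oed_n - ht_n) == 1:
        status = "near"   # every HT word matched, OED count differs by one
    elif matched == oed_n and abs(ht_n - oed_n) == 1:
        status = "near"   # every OED word matched, HT count differs by one
    else:
        status = "none"
    return {"ht": ht_n, "oed": oed_n, "matched": matched,
            "percent": percent, "status": status}


print(compare_categories(["Golf", "gowf"], ["golf", "gowf"]))
print(compare_categories(["golf"], ["golf", "links"]))
```

Using a `Counter` rather than a set means a word that appears twice in one category is only counted as matched twice if it also appears twice in the other.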
I then moved onto creating a script that checks the manually matched data from our ‘version 1’ matching process. There are 1407 manual matches in the system. Of these:
- 795 are full matches (number of words and stripped last word match or have a Levenshtein score of 1 and 100% of HT words match OED words, or the categories are empty)
- There are 205 rows where all words match or the number of HT words is the same as the total number of matches and the count of OED words is less than or greater than the number of HT words by 1, or the number of OED words is the same as the total number of matches and the count of HT words is less than or greater than the number of OED words by 1
- There are 122 rows where the last word matches (or has a Levenshtein score of 1) but nothing else does
- There are 18 part of speech mismatches
- There are 267 rows where nothing matches
I then created a ‘pattern matching’ script, which changes the category headings based on a number of patterns and checks whether this then results in any matches. The following patterns were attempted:
- inhabitant of the -> inhabitant
- inhabitant of -> inhabitant
- relating to -> pertaining to
- spec. -> specific
- spec -> specific
- specific -> specifically
- assoc. -> associated
- esp. -> especially
- north -> n.
- south -> s.
- january -> jan.
- march -> mar.
- august -> aug.
- september -> sept.
- october -> oct.
- november -> nov.
- december -> dec.
- Levenshtein difference of 1
- Adding ‘ing’ onto the end
The script identified 2966 general pattern matches, 129 Levenshtein score 1 matches and 11 ‘ing’ matches, leaving 17660 OED categories that have a corresponding HT catnum with different details and a further 6529 OED categories that have no corresponding HT catnum. Where there is a matching category, the number of lexemes / last lexeme / total matched lexeme checks described above are applied and rows are colour coded accordingly.
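The pattern checks listed above could be sketched along the following lines in Python. This is an illustration rather than the real script: the function names are invented, and it assumes the patterns are applied one at a time to a lower-cased OED heading, as whole words, with the result compared against the HT heading. The direction of the substitution and of the ‘ing’ check is a guess.

```python
import re

PATTERNS = [
    ("inhabitant of the", "inhabitant"), ("inhabitant of", "inhabitant"),
    ("relating to", "pertaining to"), ("spec.", "specific"),
    ("spec", "specific"), ("specific", "specifically"),
    ("assoc.", "associated"), ("esp.", "especially"),
    ("north", "n."), ("south", "s."), ("january", "jan."),
    ("march", "mar."), ("august", "aug."), ("september", "sept."),
    ("october", "oct."), ("november", "nov."), ("december", "dec."),
]


def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_type(oed_heading, ht_heading):
    oed, ht = oed_heading.lower().strip(), ht_heading.lower().strip()
    if oed == ht:
        return "exact"
    for find, replace in PATTERNS:
        # Match the pattern only where it is not part of a longer word
        regex = r"(?<!\w)" + re.escape(find) + r"(?!\w)"
        if re.sub(regex, replace, oed) == ht:
            return "pattern"
    if levenshtein(oed, ht) == 1:
        return "levenshtein"
    if oed + "ing" == ht:
        return "ing"
    return None


print(match_type("relating to golf", "pertaining to golf"))  # pattern
print(match_type("colour", "color"))                         # levenshtein
print(match_type("fish", "fishing"))                         # ing
```

The whole-word guard matters for the shorter patterns: a plain string replace of ‘spec’ would also mangle ‘specific’ and ‘species’.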
On Friday Marc, Fraser and I had a further meeting to discuss the above, and we came up with a whole bunch of further updates that I am going to focus on next week. It feels like real progress is being made.
A brief report this week as I’m off for my Easter hols soon and I don’t have much time to write. I will be off all of next week. It was a four-day week this week as Friday is Good Friday. Last week was rather hectic with project launches and the like but this week was thankfully a little calmer. I spent some time helping Chris out with an old site that urgently needed fixing and I spent about a day on AHRC duties, which I can’t go into here. Other than that I helped Jane with the data management plan for her ESRC bid, which was submitted this week. I also had a meeting with Gavin Miller and Jenny Eklöf to discuss potential collaboration tools for medical humanities people. This was a really interesting meeting and we had a great discussion about the various possible technical solutions for the project they are hoping to put together. I also spoke to Fraser about the Hansard data for SAMUELS but there wasn’t enough time to work through it this week. We are going to get stuck into it after Easter.
I spent a day or so this week continuing to work on the Scots Thesaurus project. Last week I started to create a tool that will allow a researcher to search the Historical Thesaurus of English for a word or phrase, then select words from a list that gets returned and to then automatically search the contents of the Dictionary of the Scots Language for these terms. I completed a first version of the tool this week. In order to get the tool working I needed to get the ‘BaseX’ XML database installed on a server. The Arts Support people didn’t want to install this on a production server so they set it up for me on the Arts Testbed server instead, which is fine as it meant I had a deeper level of access to the database than I would otherwise have got from another server. Using the BaseX command-line tools I managed to create a new database for the DSL XML data and to then import this data. A PHP client is available for BaseX, allowing PHP scripts to connect to the database much in the same way as they would connect to a MySQL database and I created a test script to see how this would work, based on the FLWOR query I had experimented with last week. This worked fine and returned a set of XML formatted results.
The HTE part of the tool that I had developed last week only allowed the user to select or deselect terms before they were passed to the DSL query, but I realised this was a little too limiting – some of the HT words have letters in brackets or with dashes and may need tweaked, plus the user might want to search for additional words or word forms. For these reasons I adapted the tool to present the selected words in an editable text area before they are passed to the DSL query. Now the user can edit and augment the list as they see fit.
The BaseX database on the server currently seems to be rather slow and at least slightly flaky. During my experimentation it crashed a few times, and one of these times it somehow managed to lose the DSL database entirely and I had to recreate it. It’s probably just as well it’s not located on a production server. Having said that, I have managed to get the queries working, thus connecting up the data sources of the HTE and the DSL. For example, a user can find all of the words that are used to mean ‘golf’ in the HTE, edit the list of words and then at the click of a button search the text of the DSL (excluding citations as Susan requested) for these terms, bringing back the entry XML of each entry where the term is found. I’ve ‘borrowed’ the XSLT file I created for the DSL website to format the returned entries and the search terms are highlighted in these entries to make things easier to navigate. It’s working pretty well, although I’m not sure how useful it will prove to be. I’ll be meeting with Susan next week to discuss this.
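The ‘search the entry text but skip the citations’ idea can be shown with a small Python sketch. This is a stand-in for the BaseX query rather than a copy of it: the sample XML, the element names (`entry`, `hw`, `sense`, `cit`) and the function names are all invented for the example and won’t match the real DSL markup.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample data; the real DSL XML is structured differently
SAMPLE = """<entries>
  <entry id="e1"><hw>Gowf</hw><sense>The game of golf.<cit>He played golf at Leith.</cit></sense></entry>
  <entry id="e2"><hw>Ba</hw><sense>A ball.<cit>A golf ba.</cit></sense></entry>
</entries>"""


def text_without(element, skip_tag):
    """Collect an element's text, skipping any descendant with the given tag."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            parts.append(text_without(child, skip_tag))
        parts.append(child.tail or "")
    return " ".join(parts)


def search(xml, terms, skip_tag="cit"):
    """Return {entry id: matched terms} for entries containing a term outside citations."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    hits = {}
    for entry in ET.fromstring(xml).iter("entry"):
        found = pattern.findall(text_without(entry, skip_tag))
        if found:
            hits[entry.get("id")] = sorted({f.lower() for f in found})
    return hits


print(search(SAMPLE, ["golf", "gowf"]))  # {'e1': ['golf', 'gowf']}
```

The second entry only mentions ‘golf’ inside a citation, so it is left out of the results.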
I also spent a little time this week updating the Digital Humanities Network website. Previously it still had the look and feel of the old University website but I’ve now updated all of the pages to bring it into line with the current University website. I think it looks a lot better. I also had a further meeting with Megan Coyer this week, who is hoping to get a small grant to develop a ‘Medical Humanities Network’ based on the DH Network but with some tweaks. It was a good meeting and I think we now know exactly what is required from the website and Content Management System and how much time it will take to get the resource up and running, if we get the funding.
I spent most of the rest of the week working on the Mapping Metaphor website. Ellen had sent me some text for various pages of the site so I added this. I also continued to work through my ‘to do’ list. I finished off the outstanding tasks relating to the ‘tabular view’ of the data, for example adding in table headings, colour coding the categories that are listed and also extending the tabular view to the ‘aggregate’ level, enabling the user to see a list of all of the level 2 categories (e.g. ‘The Earth’) and see the number of connections each of these categories has to the other categories. I also added in links to the ‘drill down’ view, allowing the user to open a category while remaining in the tabular view. After completing this I turned my attention to the ‘Browse’ facility. This previously just showed metaphor connections of any strength, but I have now added a strength selector. The browse page also previously only showed the number of metaphorical connections each category has, rather than showing the number of categories within each higher level category. I’ve updated this as well now. I also had a request to allow users to view a category’s keywords from the browse page so I’ve added this facility too, using the same ‘drop-down’ mechanism that I’d previously used for the search results page. The final update I made to the browse page was to ensure that links to categories now lead to the new tabular view of the data rather than to the old ‘category’ page which is now obsolete.
After this I began working on the metaphor cards, updating the design of the cards that appear when a connecting line is clicked on in the diagram to reflect the design that was chosen at the Colloquium. I’m almost finished with this but still need to work on the ‘Start Era’ timeline and the aggregate card box. After that I’ll get the ‘card view’ of the data working.
On Friday afternoon I attended the launch event for a new ‘Video Games and Learning’ journal that my old HATII colleague Matthew Barr has been putting together. It’s called ‘Press Start’ and can be found here: http://press-start.gla.ac.uk/index.php/press-start. The launch event was excellent with a very interesting and thought-provoking lecture by Dr Esther McCallum-Stewart of the University of Surrey.