My son returned to school on Monday this week, marking an end to the home-schooling that began after the Christmas holidays. It’s quite a relief to no longer have to split my day between working and home-schooling after so long. This week I continued with some Data Management Plan related activities, completing a DMP for the metaphor project involving Duncan of Jordanstone College of Art and Design in Dundee and drafting a third version of the DMP for Kirsteen McCue’s proposal following a Zoom call with her on Wednesday.
I also spent some further time on the Books and Borrowing project, creating tilesets and page records for several new volumes. In fact, we ran out of space on the server. The project is digitising around 20,000 pages of library records from 1750-1830 and we’re approaching 5,000 pages so far. I’d originally suggested that we’d need about 60GB of server space for the images (3MB per image x 20,000). However, the JPEGs we’ve been receiving from the digitisation units have been generated at maximum quality / minimum compression and are around 9MB each, so my estimates were out. Dropping the JPEG quality setting down from 12 to 10 would result in 3MB files, so I could do this to save space if required. However, there is another issue. The tilesets I’m generating for each image so that they can be zoomed and panned like a Google Map are taking up as much as 18MB per image. So we may need a minimum of 540GB of space (possibly 600GB to be safe): 9MB×20,000 for the JPEGs plus 18MB×20,000 for the tilesets. This is an awful lot of space, and storing image tilesets isn’t actually necessary these days if an IIIF server (https://iiif.io/about/) can be set up. IIIF is now well established as the best means of hosting images online and it would be hugely useful to use. Rather than generating and hosting thousands of tilesets at different zoom levels, we could store just one image per page on the server and it would serve up the necessary subsection at the required zoom level based on the request from the client. The issue is that people in charge of servers don’t like having to support new software. I entered into discussions with Stirling’s IT people about the possibility of setting up an IIIF server, and these talks are currently ongoing, so in the meantime I still need to generate the tilesets.
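The storage arithmetic above can be sketched as a quick back-of-the-envelope calculation (using 1GB = 1,000MB for simplicity; the per-image sizes are the observed figures mentioned above):

```python
# Back-of-the-envelope storage estimate for the Books and Borrowing images.
PAGES = 20_000
JPEG_MB = 9      # observed size of the maximum-quality JPEGs
TILESET_MB = 18  # worst-case size of a generated tileset per image

def estimate_gb(pages: int = PAGES) -> float:
    """Total server space needed in GB (treating 1GB as 1,000MB)."""
    return pages * (JPEG_MB + TILESET_MB) / 1000

# estimate_gb() -> 540.0, hence the suggested minimum of 540GB
```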
Also this week I discussed a couple of issues with the Thesaurus of Old English with Jane Roberts. A search was bringing back some word results but when loading the category browser no content was being displayed. Some investigations uncovered that these words were in subcategories of ‘02.03.03.03.01’ but there was no main category with that number in the system. A subcategory needs a main category in order to display in the tree browser, and as none was available nothing was displaying. Looking at the underlying database I discovered that while there was no ‘02.03.03.03.01’ main category there were two ‘02.03.03.03.01|01’ subcategories: ‘A native people’ and ‘Natives of a country’. I bumped the former up from subcategory to main category and the search results then worked.
I spent the rest of the week continuing with the development of the Anglo-Norman Dictionary. I made the new bibliography pages live this week (https://anglo-norman.net/bibliography/), which also involved updating the ‘cited source’ popup in the entry page so that it displays all of the new information. For example, go to this page: https://anglo-norman.net/entry/abanduner and click on the ‘A-N Med’ link to see a record with multiple items in it. I also updated the advanced search for citations so that the ‘Citation siglum’ drop-down list uses the new data too.
After that I continued to update the Dictionary Management System. I updated the ‘View / Download Entry’ page so that the ‘Phase’ of the entry can be updated if necessary. In the ‘Phase’ section of the page all of the phases are now listed as radio buttons, with the entry’s phase checked. If you need to change the entry’s phase you can select a different radio button and press the ‘Update Phase’ button. I also added facilities to manage phase statements via the DMS. In the menu there’s now an ‘Add Phase’ button, through which you can add a new phase, and a ‘Browse Phases’ button which lists all of the active phases, the number of entries assigned to each, and an option to edit the phase statement. If there’s a phase statement that has no associated entries you’ll find an option to delete it here too.
I’m still working on the facilities to upload and manage XML entry files via the DMS. I’ve added in a new menu item labelled ‘Upload Entries’ that, when pressed, loads a page through which you can upload entry XML files. There’s a text box where you can supply the lead editor initials to be added to the batch of files you upload (any files that already have a ‘lead’ attribute will not be affected) and an option to select the phase statement that should be applied to the batch of files. Below this area is a section where you can either click to open a file browser and select files to upload or drag and drop files from Windows Explorer (or another file browser). When files are attached they will be processed, with the results shown in the ‘Update log’ section below the upload area. Uploaded files are kept entirely separate from the live dictionary until they’ve been reviewed and approved (I haven’t written these sections yet). The upload process will generate all of the missing attributes I mentioned last week – ‘lead’ initials, the various ID fields, POS, sense numbers etc. If any of these are present the system won’t overwrite them, so it should be able to handle various versions of files. The system does not validate the XML files – the editors will need to ensure that the XML is valid before it is uploaded. However, the ‘preview’ option (see below) will quickly let you know if your file is invalid, as the entry won’t display properly. Note also that you can change the ‘lead’ and the phase statement between batches – you can drag and drop a set of files with one lead and statement selected, then change these and upload another batch. You can of course choose to upload a single file too.
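The ‘only fill in what’s missing’ rule for the ‘lead’ attribute can be sketched as follows using Python’s standard library. The element name here is illustrative rather than the real AND schema, and the real upload process applies the same logic to several other attributes too:

```python
# Sketch of the "only fill in what's missing" rule used when processing
# uploaded entry XML: an existing 'lead' attribute is left untouched,
# while entries without one get the batch's lead editor initials.
# The <entry> element name is a stand-in, not the actual AND schema.
import xml.etree.ElementTree as ET

def apply_lead(entry_xml: str, lead_initials: str) -> str:
    root = ET.fromstring(entry_xml)
    if root.get("lead") is None:
        root.set("lead", lead_initials)
    return ET.tostring(root, encoding="unicode")

# An entry uploaded without a lead gets the batch initials; an entry
# that already carries lead="GDB" keeps it even if "HP" is supplied.
```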
When XML files are uploaded, in the ‘update log’ there will be links directly through to a preview of the entry, but you can also find all entries that have been uploaded but not yet published on the website in the ‘Holding Area’, which is linked to in the DMS menu. There are currently two test files in this. The holding area lists the information about the XML entries that have been uploaded but not yet published, such as the IDs, the slug, the phase statement etc. There is also an option to delete the holding entry. The last two columns in the table are links to any live entry. The first links to the entry as specified by the numerical ID in the XML filename, which will be present in the filename of all XML files exported via the DMS’s ‘Download Entry’ option. This is the ‘existing ID’ column in the table. The second linking column is based on the ‘slug’ of the holding entry (generated from the ‘lemma’ in the XML). The ‘slug’ is unique in the data, so if a holding entry has a link in this column it means it will overwrite this entry if it’s made live. For XML files exported via the DMS and then uploaded, both ‘live entry’ links should be the same, unless the editor has changed the lemma. For new entries both these columns should be blank.
The ‘Review’ button opens up a preview of the uploaded holding entry in the interface of the live site. This allows the editors to proofread the new entry to ensure that the XML is valid and that everything looks right. You can return to the holding area from this page by pressing on the button in the left-hand column. Note that this is just a preview – it’s not ‘live’ and no-one else can see it.
There’s still a lot I need to do. I’ll be adding in an option to publish an entry in the holding area, at which point all of the data needed for searching will be generated and stored and the existing live entry (if there is one) will be moved to the ‘history’ table. I may also need to extract the earliest date information to display in the preview and in the holding area. This information is only extracted when the data for searching is generated, but I guess it would be good to see it in the holding area / preview too. I also need to add in a preview of cross-reference entries, as these don’t display yet. I should probably also add in an option to allow the editors to view / download the holding entry XML, as they might want to check how the upload process has changed this. So still lots to tackle over the coming weeks.
I lost most of Tuesday this week to root canal surgery, which was uncomfortable and exhausting but thankfully not too painful. Unfortunately my teeth are still not right and I now have a further appointment booked for next week, but at least the severe toothache that I had previously has now stopped.
I continued to work on the requirements document for the redevelopment of the Anglo-Norman Dictionary this week, and managed to send a first completed version of it to Heather Pagan for feedback. It will no doubt need some further work, but it’s good to have a clearer picture of how the new version of the website will function. Also this week I investigated another bizarre situation with the AND’s data. I have access to the full dataset used to power the existing website as a single XML file containing all of the entries. The Editors are also working on individual entries as single XML files that are then uploaded to the existing website using a content management system. What we didn’t realise up until now is that the structure of the XML files is transformed when an entry is ingested into the online system. For example, the ‘language’ tag is changed from <language lang="M.E."/> to <lbl type="lang">M.E.</lbl>. Similarly, part of speech has been transformed from <pos type="s."/> to <gramGrp><pos>s.</pos></gramGrp>. We have no idea why the developer of the system chose to do this, as it seems completely unnecessary and it’s a process that doesn’t appear to be documented anywhere. The crazy thing is that the transformed XML still then needs to be further transformed to HTML for display, so what appears on screen is two steps removed from the data the editors work with. It also means that I don’t have access to the data in the form that the editors are working with, meaning I can’t just take their edits and use them in the new site.
As we ideally want to avoid the situation where we have two structurally different XML datasets for the dictionary I wanted to try and find a way to transform the data I have into the structure used by the editors. I attempted to do this by looking at the code for the existing content management system to try to decipher where the XML is getting transformed. There is an option for extracting an entry from the online system for offline editing and this transforms the XML into the format used by the editors. I figured that if I can understand how this process works and replicate it then I will be able to apply this to the full XML dictionary file and then I will have the complete dataset in the same format as the editors are working with and we can just use this in the redevelopment.
It was not easy to figure out what the system is up to, but I managed to ascertain that when you enter a headword for export this triggers a Perl script, which in turn uses an XSLT stylesheet. I managed to track down a version of this stylesheet that appears to have been last updated in 2014. I then wrote a little script that takes the XML of the entry for ‘padlock’ as found in the online data and applies this stylesheet to it, in the hope that it would give me an XML file identical to the one exported by the CMS.
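The little script was essentially just applying the stylesheet to an entry, something like the following sketch using lxml. The stylesheet here is a tiny stand-in that only converts the online <lbl type="lang"> form back to the editors’ <language lang="…"/> form; the real 2014 stylesheet is far more involved:

```python
# Sketch of applying an XSLT stylesheet to a single entry's XML, as my
# test script did with the CMS's export stylesheet. DEMO_XSLT is an
# illustrative stand-in, not the real AND stylesheet: it copies
# everything through unchanged except <lbl type="lang">, which it
# converts back to the editors' <language lang="..."/> form.
from lxml import etree

DEMO_XSLT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="lbl[@type='lang']">
    <language lang="{.}"/>
  </xsl:template>
</xsl:stylesheet>"""

def transform_entry(entry_xml: bytes, xslt_bytes: bytes = DEMO_XSLT) -> bytes:
    transform = etree.XSLT(etree.fromstring(xslt_bytes))
    return etree.tostring(transform(etree.fromstring(entry_xml)))
```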
The script successfully executed, but the resulting XML was not quite identical to the file exported by the CMS. There was no ‘doctype’ or DTD reference; the ‘attestation’ ID was the entry ID with an auto-incrementing ‘C’ number appended to it (AND-201-02592CE7-42F65840-3D2007C6-27706E3A-C001) rather than the ID of the <cit> element (C-11c4b015); and <dateInfo> was not processed, with only the contents of the tags within <dateInfo> being displayed.
I’m not sure why these differences exist. It’s possible I only have access to an older version of the XSLT file. I’m guessing this must be the case because the missing or differently formatted data does not appear to be instated elsewhere (e.g. in the Perl script). What I then did was to modify the XSLT file to ensure that the changes are applied: the doctype is added in, the ‘attestation’ ID is correct and the <dateInfo> section contains the full data.
I could try applying this script to every entry in the full data file I have, although I suspect there will be other situations that the XSLT file I have is not set up to successfully process.
I therefore investigated another alternative, which was to write a script that passes the headword of every dictionary item to the ‘Retrieve an entry for editing’ script in the CMS, saving the results of each. I considered that this might be more likely to work reliably for every entry, but that we might run into issues with the server refusing so many requests. After a few test runs, I set the script loose on all 53,000 or so entries in the system and, although it took several hours to run, the process did appear to work for the most part. I now have the data in the same structure as the editors work with, which should mean we can standardise on this format and abandon the XML structure used by the existing online system.
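The bulk-export approach can be sketched as a simple throttled loop; the URL and parameter name below are hypothetical stand-ins for the real CMS script, which I won’t reproduce here:

```python
# Sketch of the bulk export: request the CMS's 'retrieve an entry for
# editing' script once per headword and save each response, pausing
# between requests so the server isn't hammered. CMS_EXPORT and the
# 'headword' parameter are placeholders, not the real CMS endpoint.
import time
import urllib.parse
import urllib.request

CMS_EXPORT = "https://example.org/cms/export"  # placeholder URL

def export_url(headword: str) -> str:
    return CMS_EXPORT + "?" + urllib.parse.urlencode({"headword": headword})

def fetch_all(headwords, out_dir=".", delay=0.5):
    for hw in headwords:
        with urllib.request.urlopen(export_url(hw)) as resp:
            data = resp.read()
        with open(f"{out_dir}/{hw}.xml", "wb") as f:
            f.write(data)
        time.sleep(delay)  # throttle: ~53,000 requests took several hours
```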
Also this week I fixed an issue with links through to the Bosworth and Toller Old English dictionary from the Thesaurus of Old English. Their site has been redeveloped and they’ve changed the way their URLs work without putting redirects in place from the old URLs, meaning all our links from words in the TOE to words on their site were broken. URLs for their entries now just use a unique ID rather than the word (e.g. http://bosworthtoller.com/28286), which seems like a bit of a step backwards. They’ve also got rid of length marks and are using acute accents on characters instead, which is a bit strange. The change to an ID in the URL means we can no longer link to a specific entry, as we can’t possibly know what IDs they’re using for each word. However, we can link to their search results page instead, e.g. http://bosworthtoller.com/search?q=sōfte works, and I updated the TOE to use such links.
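Building these search links is straightforward, though the headword needs percent-encoding so that characters like the macron in ‘sōfte’ survive the URL. A minimal sketch:

```python
# Since we can't know Bosworth-Toller's new per-entry IDs, the TOE now
# links to their search results page instead, with the headword
# percent-encoded (UTF-8) so length marks survive the URL.
from urllib.parse import quote

def bt_search_url(word: str) -> str:
    return "http://bosworthtoller.com/search?q=" + quote(word)

# bt_search_url("sōfte") -> "http://bosworthtoller.com/search?q=s%C5%8Dfte"
```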
I also continued with the processing of OED dates for use in the Historical Thesaurus, after my date extraction script finished executing over the weekend. This week I investigated OED dates that have a dot in them instead of a full date. There are 4,498 such dates, and most of these have the lower date as the one recorded in the ‘year’ attribute by the OED, e.g. ‘138.’ is 1380 and ‘17..’ is 1700. However, sometimes a specific date is given in the ‘year’ attribute despite the presence of a full stop in the date tag. For example, one entry has ‘1421’ in the ‘year’ attribute but ‘14..’ in the date tag. There are just over a thousand dates where there are two dots but the ‘year’ given does not end in ‘00’. Fraser reckons this is to do with ordering the dates in the OED and I’ll need to do some further work on this next week.
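Expanding the dotted dates to their lower bound is a simple rule: each dot stands in for a missing final digit. A minimal sketch:

```python
# Expand the OED's dotted dates to the lower bound usually recorded in
# the 'year' attribute: each trailing dot is a missing digit filled
# with zero, so '138.' becomes 1380 and '17..' becomes 1700.
def expand_dotted(date: str) -> int:
    digits = date.rstrip(".")
    return int(digits) * 10 ** (len(date) - len(digits))
```

The thousand-odd cases where the ‘year’ attribute doesn’t follow this rule (e.g. ‘1421’ against ‘14..’) are exactly the ones still to be investigated.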
In addition to the above I continued to work on the Books and Borrowing project. I made some tweaks to the CMS to make it easier to edit records. When a borrowing record is edited the page automatically scrolls down to the record that was edited. This also happens for books and borrowers when accessed and edited from the ‘Books’ and ‘Borrowers’ tabs in a library. I also wrote an initial script that will help to merge some of the duplicate author records we have in the system due to existing data with different formats being uploaded from different libraries. What it does is strip all of the non-alpha characters from the forename and surname fields, make them lower case, then join them together. So for example, author ID (AID) 111 has ‘Arthur’ as forename and ‘Bedford’ as surname, while AID 1896 has nothing for forename and ‘Bedford, Arthur, 1668-1745’ as surname. When stripped and joined together these both become ‘bedfordarthur’ and we have a match.
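The matching key boils down to a few lines, something like this (joining surname before forename, which is what produces ‘bedfordarthur’ in both cases):

```python
# The duplicate-author matching key: strip everything but letters from
# the surname and forename, lower-case them, and join surname first.
# Digits and punctuation such as '1668-1745' disappear in the process.
import re

def author_key(forename: str, surname: str) -> str:
    clean = lambda s: re.sub(r"[^a-zA-Z]", "", s).lower()
    return clean(surname) + clean(forename)

# author_key("Arthur", "Bedford") and
# author_key("", "Bedford, Arthur, 1668-1745")
# both produce "bedfordarthur", so the two records match.
```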
There are 162 matches that have been identified, some consisting of more than two matched author records. I exported these as a spreadsheet. Each row includes the author’s AID, title, forename, surname, othername, born and died (each containing ‘c’ where given), a count of the number of books the record is associated with and the AID of the record that is set to be retained for the match. This defaults to the first record, which also appears in bold, to make it easier to see where a new batch of duplicates begins.
The editors can then go through this spreadsheet and reassign the ‘AID to keep’ field to a different row. E.g. for Francis Bacon the AID to keep is given as 1460. If the second record for Francis Bacon should be kept instead, the editor would just need to change the value in this column for all three Francis Bacons to the AID for this row, which is 163. Similarly, if something has been marked as a duplicate and it’s wrong, then set the ‘AID to keep’ accordingly. E.g. there are four ‘David Hume’ records, but looking at the dates at least one of these is a different person. To keep the record with AID 1610 separate, replace the AID 1623 in the ‘AID to keep’ column with 1610. It is likely that this spreadsheet will be used to manually split up the imported authors that just have all their data in the surname column. Someone could, for example, take the record that has ‘Hume, David, 1560?-1630?’ in the surname column and split this into the correct columns.
I also generated a spreadsheet containing all of the authors that appear to be unique. This will also need checking for other duplicates that haven’t been picked up, as there are a few. For example, AID 1956 ‘Heywood, Thomas, d. 1641’ and 1570 ‘Heywood, Thomas, -1641.’ haven’t been matched because of that ‘d’. Similarly, AID 1598 ‘Buffon, George Louis Leclerc, comte de, 1707-1788’ and 2274 ‘Buffon, Georges Louis Leclerc, comte de, 1707-1788.’ haven’t been matched up because one is ‘George’ and the other ‘Georges’. Accented characters have also not been properly matched, e.g. AID 1457 ‘Beze, Theodore de, 1519-1605’ and 397 ‘Bèze, Théodore de, 1519-1605.’. I could add in a Levenshtein test that matches up things that are one character different and update the script to properly take accented characters into account for matching purposes, or these are things that could just be sorted manually.
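Both possible improvements are small additions to the matcher. A sketch of the two pieces, assuming the simple approach of folding accented characters to their base letters and allowing a Levenshtein distance of 1:

```python
# Possible improvements to the author matcher: fold accented characters
# to their base letters before comparing (so 'Bèze' matches 'Beze'),
# and allow a Levenshtein distance of 1 so 'George' matches 'Georges'.
import unicodedata

def fold_accents(s: str) -> str:
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# fold_accents("Bèze, Théodore") -> "Beze, Theodore"
# levenshtein("george", "georges") -> 1, so they'd be treated as a match
```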
Ann Fergusson of the DSL got back to me this week after having rigorously tested the search facilities of our new test versions of the DSL API (V2 containing data from the original API and V3 containing data that has been edited since the original API was made). Ann had spotted some unexpected behaviour in some of the searches and I spent some time investigating these and fixing things where possible. There were some cases where incorrect results were being returned when a ‘NOT’ search was performed on a selected source dictionary, due to the positioning of the source dictionary in the query. This was thankfully easy to fix. There was also an issue with some exact searches of the full text failing to find entries. When the full text is ingested into Solr all of the XML tags are stripped out. If there are no spaces between tagged words then words end up squashed together. For example: ‘Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>’. With the tags (and punctuation) stripped out we’re left with ‘WestminsterB’. So an exact search for ‘westminster’ fails to find this entry. A search for ‘westminsterb’ finds the entry, which confirms this. I suspect this situation is going to crop up quite a lot, so I will need to update the script that prepares content for Solr so that it adds spaces after tags before stripping them and then removes multiple spaces between words.
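The planned fix amounts to replacing each tag with a space rather than deleting it outright, then collapsing the resulting runs of whitespace. A minimal sketch:

```python
# Sketch of the fix for the Solr full-text preparation: replace each
# XML tag with a space instead of deleting it, then collapse runs of
# whitespace, so words in adjacent tagged elements don't fuse together.
import re

def strip_tags(xml_fragment: str) -> str:
    no_tags = re.sub(r"<[^>]+>", " ", xml_fragment)
    return re.sub(r"\s+", " ", no_tags).strip()

# strip_tags("Westminster</q></cit></sense><sense><b>B</b>")
#   -> "Westminster B" rather than the fused "WestminsterB"
```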
This was week 18 of Lockdown, which is now definitely easing here. I’m still working from home, though, and will be for the foreseeable future. I took Friday off this week, so it was a four-day week for me. I spent about half of this time on the Books and Borrowing project, during which time I returned to adding features to the content management system, after spending recent weeks importing datasets. I added a number of indexes to the underlying database which should speed up the loading of certain pages considerably, e.g. the browse books, borrowers and authors pages. I then updated the ‘Books’ tab when viewing a library (i.e. the page that lists all of the book holdings in the library) so that it now lists the number of book holdings in the library above the table. The table itself now has separate columns for all additional fields that have been created for book holdings in the library and it is now possible to order the table by any of the headings (pressing on a heading a second time reverses the ordering). The count of ‘Borrowing records’ for each book in the table is now a button and pressing on it brings up a popup listing all of the borrowing records that are associated with the book holding record, and from this pop-up you can then follow a link to view the borrowing record you’re interested in. I then made similar changes to the ‘Borrowers’ tab when viewing a library (i.e. the page that lists all of the borrowers the library has). It also now displays the total number of borrowers at the top. This table already allowed the reordering by any column, so that’s not new, but as above, the ‘Borrowing records’ count is now a link that when clicked on opens a list of all of the borrowing records the borrower is associated with.
The big new feature I implemented this week was borrower cross references. These can be added via the ‘Borrowers’ tab within a library when adding or editing a borrower on this page. When adding or editing a borrower there is now a section of the form labelled ‘Cross-references to other borrowers’. If there are any existing cross references these will appear here, with a checkbox beside each that you can tick if you want to delete the cross reference (the user can tick the box then press ‘Edit’ to edit the borrower and the reference will be deleted). Any number of new cross references can be added by pressing on the ‘Add a cross-reference’ button (multiple times, if required). Doing so adds two fields to the form, one for a ‘description’, which is the text that shows how the current borrower links to the referenced borrowing record, and one for ‘referenced borrower’, which is an auto-complete. Type in a name or part of a name and any borrower that matches in any library will be listed. The library appears in brackets after the borrower’s name to help differentiate records. Select a borrower and then when the ‘Add’ or ‘Edit’ button is pressed for the borrower the cross reference will be made.
Cross-references work in both directions – if you add a cross reference from Borrower A to Borrower B you don’t then need to load up the record for Borrower B to add a reference back to Borrower A. The description text will sit between the borrower whose form you make the cross reference on and the referenced borrower you select, so if you’re on the edit form for Borrower A and link to Borrower B and the description is ‘is the son of’ then the cross reference will appear as ‘Borrower A is the son of Borrower B’. If you then view Borrower B the cross reference will still be written in this order. I also updated the table of borrowers to add in a new ‘X-Refs’ column that lists all cross-references for a borrower.
I spent the remainder of my working week completing smaller tasks for a variety of projects, such as updating the spreadsheet output of duplicate child entries for the DSL people, getting an output of the latest version of the Thesaurus of Old English data for Fraser, advising Eleanor Lawson on ‘.ac.uk’ domain names and having a chat with Simon Taylor about the pilot Place-names of Fife project that I worked on with him several years ago. I also wrote a Data Management Plan for a new AHRC proposal the Anglo-Norman Dictionary people are putting together, which involved a lengthy email correspondence with Heather Pagan at Aberystwyth.
Finally, I returned to the ongoing task of merging data from the Oxford English Dictionary with the Historical Thesaurus. We are currently attempting to extract citation dates from OED entries in order to update the dates of usage that we have in the HT. This process uses the new table I recently generated from the OED XML dataset which contains every citation date for every word in the OED (more than 3 million dates). Fraser had prepared a document listing how he and Marc would like the HT dates to be updated (e.g. if the first OED citation date is earlier than the HT start date by 140 years or more then use the OED citation date as the suggested change). Each rule was to be given its own type, so that we could check through each type individually to make sure the rules were working ok.
It took about a day to write an initial version of the script, which I ran on the first 10,000 HT lexemes as a test. I didn’t split the output into different tables depending on the type, but instead exported everything to a spreadsheet so Marc and Fraser could look through it.
In the spreadsheet if there is no ‘type’ for a row it means it didn’t match any of the criteria, but I included these rows anyway so we can check whether there are any other criteria the rows should match. I also included all the OED citation dates (rather than just the first and last) for reference. I noted that Fraser’s document doesn’t seem to take labels into consideration. There are some labels in the data, and sometimes there’s a new label for an OED start or end date when nothing else is different, e.g. htid 1479 ‘Shore-going’ has no ‘type’ but does have new data from the OED.
Another issue I spotted is that the same ‘type’ variable is set both when a start date matches the criteria and when an end date matches, meaning the ‘type’ set for the start date is then replaced with the ‘type’ for the end date. I think, therefore, that we might have to split the start and end processes up, or append the end process type to the start process type rather than replacing it (so e.g. type 2-13 rather than type 2 being replaced by type 13). I also noticed that there are some lexemes where the HT has ‘current’ but the OED has a much earlier last citation date (e.g. htid 73 ‘temporal’ has 9999 in the HT but 1832 in the OED). Such cases are not currently considered.
Finally, according to the document, antes and circas are only considered for update if the OED and HT date is the same, but there are many cases where the start / end OED date is picked to replace the HT date (because it’s different) and it has an ‘a’ or ‘c’ that would then be lost. Currently I’m including the ‘a’ or ‘c’ in such cases, but I can remove this if needs be (e.g. HT 37 ‘orb’ has HT start date 1601 (no ‘a’ or ‘c’) but this is to be replaced with OED 1550, which has an ‘a’). Clearly the script will need to be tweaked based on feedback from Marc and Fraser, but I feel like we’re finally making some decent progress with this after all of the preparatory work that was required to get to this point.
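The appending approach for combining start and end types can be sketched in a few lines. Only the 140-year rule comes from Fraser’s document; the type numbers and the ‘current’ end-date rule below are illustrative placeholders for how the fix might look, not the agreed rules:

```python
# Much-simplified sketch of the rule-typing logic, combining a start
# type and an end type into one value (e.g. '2-13') rather than letting
# the end type overwrite the start type. Only the 140-year rule is from
# Fraser's document; the type numbers and the end-date rule are
# illustrative placeholders.
def date_type(ht_start: int, ht_end: int,
              oed_first: int, oed_last: int) -> str:
    types = []
    if oed_first <= ht_start - 140:
        types.append("2")   # first OED citation >=140 years earlier
    if ht_end == 9999 and oed_last < ht_end:
        types.append("13")  # HT 'current' but OED last citation earlier
    return "-".join(types)  # empty string means no rule matched
```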
Next Monday is the Glasgow Fair holiday, so I won’t be back to work until the Tuesday.
This was week 8 of Lockdown and I spent the majority of it working on the content management system for the Books and Borrowing project. The project is due to begin at the start of June and I’m hoping to have the CMS completed and ready to use by the project team by then, although there is an awful lot to try and get into place. I can’t really go into too much detail about the CMS, but I have completed the pages to add a library and to browse a list of libraries with the option of deleting a library if it doesn’t have any ledgers. I’ve also done quite a lot with the ‘View library’ page. It’s possible to edit a library record, add a ledger and add / edit / delete additional fields for a library. You can also list all of the ledgers in a library with options to edit the ledger, delete it (if it contains no pages) and add a new page to it. You can also display a list of pages in a ledger, with options to edit the page or delete it (if it contains no records). You can also open a page in the ledger and browse through the next and previous pages.
At the moment I’m in the middle of creating the facility to add a new borrowing record to the page. This is the most complex part of the system as a record may have multiple borrowers, each of which may have multiple occupations, and multiple books, each of which may be associated with higher level book records. Plus the additional fields for the library need to be taken into consideration too. By the end of the week I was at the point of adding in an auto-complete to select an existing borrower record and I’ll continue with this on Monday.
In addition to the B&B project I did some work for other projects as well. For Thomas Clancy’s Place-names of Kirkcudbrightshire project (now renamed Place-names of the Galloway Glens) I had a few tweaks and updates to put in place before Thomas launched the site on Tuesday. I added a ‘Search place-names’ box to the right-hand column of every non-place-names page which takes you to the quick search results page and I added a ‘Place-names’ menu item to the site menu, so users can access the place-names part of the site. Every place-names page now features a sub-menu with access to the place-names pages (Browse, element glossary, advanced search, API, quick search). To return to the place-name introductory page you can press on the ‘Place-names’ link in the main menu bar. I had unfortunately introduced a bug to the ‘edit place-name’ page in the CMS when I changed the ordering of parishes to make KCB parishes appear first. This was preventing any place-names in BMC from having their cross references, feature type and parishes saved when the form was submitted. This has now been fixed. I also added Google Analytics to the site. The virtual launch on Tuesday went well and the site can now be accessed here: https://kcb-placenames.glasgow.ac.uk/.
I also added in links to the DSL’s email and Instagram accounts to the footer of the DSL site and added some new fields to the database and CMS of the Place-names of Mull and Ulva site. I also created a new version of the Burns Supper map for Paul Malgrati that included more data and a new field for video dimensions that the video overlay now uses. I also replied to Matthew Creasy about a query regarding the website for his new Scottish Cosmopolitanism project and a query from Jane Roberts about the Thesaurus of Old English and made a small tweak to the data of Gerry McKeever’s interactive map for Regional Romanticism.
It was another mostly SCOSYA week this week, ahead of the launch of the project that was planned for the end of next week. However, on Friday this week I bumped into Jennifer, who said that the launch will now be pushed back into December. This is because our intended launch date was the last working day before the UCU strike action begins, which is a bad time to launch the project for reasons of publicity, engaging with other scholars and risks associated with technical issues that might crop up and might not be able to be sorted until after the strike. As there’s a general election soon after the strike is due to end, it looks like the launch is going to be pushed back until closer to Christmas. But as none of this transpired until Friday, I still spent most of the week until then making what I thought were last-minute tweaks to the website and fixing bugs that had cropped up during user testing.
This included going through all of the points raised by Gary following the testing session he had arranged with his students in New York the week before, and meeting with Jennifer, E and Frankie to discuss how we intended to act on the feedback, which was all very positive but did raise a few issues relating to the user interface, the data and the explanatory text.
Also this week I had a further chat with Luca about the API he’s building, and a DMP request that came his way, and arranged for the App and Play store account administration to be moved over to Central IT Services. I also helped Jane Roberts with an issue with the Thesaurus of Old English and had a chat with Thomas Clancy and Gilbert Markus about the Place-names of Kirkcudbrightshire project, which I set the systems up for last year and which is now nearing completion, requiring some further work to develop the front-end.
I also completed an initial version of a WordPress site for Corey Gibson’s bibliography project and spoke to Eleanor Capaldi about how to get some images for her website that I recently set up. I also spent a bit of time upgrading all of the WordPress sites I manage to the latest version. Also this week I had a chat with Heather Pagan about the Anglo-Norman Dictionary data. She now has access to the data that powers the current website and gave me access to this. It’s great to finally know that the data has been retrieved and to get a copy of it to work with. I spent a bit of time looking through the XML files, but we need to get some sort of agreement about how Glasgow will be involved in the project before I do much more with it.
I had a bit of an email chat with the DSL people about adding a new ‘history’ field to their entries, something that will happen through the new editing interface that has been set up for them by another company, but will have implications for the website once we reach the point of adding the newly edited data from their new system to the online dictionary. I also arranged for the web space for Rachel Smith and Ewa Wanat’s project to be set up and spent a bit of time playing around with a new interface and design for the Digital Humanities Network website (https://digital-humanities.glasgow.ac.uk/) which is in desperate need of a makeover.
After a wonderful three weeks’ holiday I returned to work on Monday this week. I’d been keeping track of my emails whilst I’d been away so although I had a number of things waiting for me to tackle on my return at least I knew what they were, so returning to work wasn’t as much of a shock as it might otherwise have been. The biggest item waiting for me to get started on was a request from Gerry Carruthers to write a Data Management Plan for an AHRC proposal he’s putting together. He’d sent me all of the bid documentation so I read through that and began to think about the technical aspects of the project, which would mainly revolve around the creation of TEI-XML digital editions. I had an email conversation with Gerry over the course of the week where I asked him questions and he got back to me with answers. I’d also arranged to meet with my fellow developer Luca Guariento on Wednesday as he has been tasked with writing a DMP and wanted some advice. This was a good opportunity for me to ask him some details about the technology behind the digital editions that had been created for the Curious Travellers project, as it seemed like a good idea to reuse a lot of this for Gerry’s project. I finished a first version of the plan on Wednesday and sent it to Gerry, and after a few further tweaks based on feedback a final version was submitted on Thursday.
Also this week I met with Head of School Alice Jenkins to discuss my role in the School, a couple of projects that have cropped up that need my input and the state of my office. It was a really useful meeting, and it was good to discuss the work I’ve done for staff in the School and to think about how my role might be developed in future. I spent a bit of time after the meeting investigating some technology that Alice was hoping might exist, and I also compiled a list of all of the current Critical Studies Research and Teaching staff that I’ve worked with over the years. Out of 104 members of staff I have worked with 50 of them, which I think is pretty good going, considering not every member of staff is engaged in research, or if they are may not be involved with anything digital.
I spent some more time this week working on the pilot website for 18th Century Borrowers for Matthew Sangster. We met on Wednesday morning and had a useful meeting, discussing the new version of the data that Matt is working on, how my import script might be updated to incorporate some changes and investigating why some of the error rows that were outputted during my last data import were generated and how these could be addressed. We also went through the website I’d created, as Matt had uncovered a couple of bugs, such as the order of the records in the tabular view of the page not matching up with the order on the scanned image. This turned out to have been caused by the tabular order depending on an imported column that was set to hold general character data rather than numbers, meaning the database arranged all of the ones (1,10,11 etc) then all of the twos (2, 21,22 etc) rather than arranging things in proper numerical order. I also realised that I hadn’t created indexes for a lot of the columns in the database that were used in the queries, which was making the queries rather slow and inefficient. After generating these indexes the various browses are now much speedier.
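The ordering bug described above is a classic one: when record numbers are stored in a column typed for character data, the database sorts them lexicographically rather than numerically. A minimal sketch of the difference (the actual fix was on the database side, changing the column type so it sorts numerically):

```python
# Record numbers stored as text sort lexicographically: all the ones
# ("1", "10", "11"...), then all the twos, and so on.
rows = ["1", "10", "11", "2", "21", "22", "3"]

# Default string sort reproduces the bug.
as_text = sorted(rows)

# Treating the values as numbers restores the expected order — the
# equivalent of giving the column a numeric type (or ordering by a cast).
as_numbers = sorted(rows, key=int)
```

The same trap appears in any language or system that compares the strings character by character, which is why "2" sorts after "11".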
I also added authors under book titles in the various browse lists, which helps to identify specific books, and created a new section of the website for frequency lists. There are now three ‘top 20’ lists, which show the most frequently borrowed books and authors, and the student borrowers who borrowed the most books. Finally, I created the search facility for the site, allowing any combination of book title, author, student, professor and date of lending to be combined and for the results of the search to be displayed. This took a fair amount of time to implement, but I managed to get the URL for the page to Matt before the end of the week.
Also this week I investigated and fixed a bug that the Glasgow Medical Humanities Network RA Cris Sarg was encountering when creating new people records and adding these to the site. I also responded to a query from Bryony Randall about the digital edition we had made for the New Modernist Editing project, spoke to Corey Gibson about a new project he’s set up that will be starting soon and that I’ll be creating the website for, had a chat with Eleanor Capaldi about a project website I’ll be setting up for her, responded to a query from Fraser about access to data from the Thesaurus of Old English and attended the Historical Thesaurus birthday drinks. I also read through the REF digital guidelines that Jennifer Smith had sent on to me and spoke to her about the implications for the SCOSYA project, helped the SCOSYA RA Frankie MacLeod with some issues she was encountering with map stories and read through some feedback on the SCOSYA interfaces that had been sent back from the wider project team. Next week I intend to focus on the SCOSYA project, acting on the feedback and possibly creating some non-map based ways of accessing the data.
After meeting with Fraser to discuss his Scots Thesaurus project last Friday I spent some time on Monday this week writing a script that returns some random SND or DOST entries that meet certain criteria, so as to allow him to figure out how these might be placed into HT categories. The script brings back main entries (as opposed to supplements) that are nouns, are monosemous (i.e. no other noun entries with the same headword), have only one sense (i.e. not multiple meanings within the entry), have fewer than 5 variant spellings, have single word headwords and have definitions that are relatively short (100 characters or less). Whilst writing the script I realised that database queries are somewhat limited on the server and if I try to extract the full SND or DOST dataset to then select rows that meet the criteria in my script these limits are reached and the script just displays a blank page. So what I had to do was set the script up to bring back a random sample of 5000 main entry nouns that don’t have multiple words in their headword in the selected dictionary. I then apply the other checks to this set of 5000 random entries. This can mean that the number of outputted entries ends up being less than the 200 that Fraser was hoping for, but still provides a good selection of data. The output is currently an HTML table, with IDs linking through to the DSL website and I’ve given the option of setting the desired number of returned rows (up to 1000) and the number of characters that should be considered a ‘short’ definition (up to 5000). Fraser seemed pretty happy with how the script is working.
Also this week I made some further updates to the new song story for RNSN and I spent a large amount of time on Friday preparing for my upcoming PDR session. On Tuesday I met with Luca to have a bit of a catch-up, which was great. I also fixed a few issues with the Thesaurus of Old English data for Jane Roberts and responded to a request for developer effort from a member of staff who is not in the College of Arts. I also returned to working on the Books and Borrowing pilot system for Matthew Sangster, going through the data I’d uploaded in June, exporting rows with errors and sending these to Matthew for further checking. Although there are still quite a lot of issues with the data, in terms of its structure things are pretty fixed, so I’m going to begin work on the front-end for the data next week, the plan being that I will work with the sample data as it currently stands and then replace it with a cleaner version once Matthew has finished working with it.
I divided the rest of my time this week between DSL and SCOSYA. For the DSL I integrated the new APIs that I was working on last week with the ‘advanced search’ facilities on both the ‘new’ (v2 data) and ‘sienna’ (v3 data) test sites. As previously discussed, the ‘headword match type’ from the live site has been removed in favour of just using wildcard characters (*?”). Full-text searches, quotation searches and snippets should all be working, in addition to headword searches. I’ve increased the maximum number of full-text / quotation results from 400 to 500 and I’ve updated the warning messages so they tell you how many results your query would have returned if the total number is greater than this. I’ve tested both new versions out quite a bit and things are looking good to me, and I’ve contacted Ann and Rhona to let them know about my progress. I think that’s all the DSL work I can do for now, until the bibliography data is made available.
For SCOSYA I engaged in an email conversation with Jennifer and others about how to cover the costs of MapBox in the event of users exceeding the free provision of 200,000 map loads a month after the site launches next month. I also continued to work on the public atlas interface based on discussions we had at a team meeting last Wednesday. The main thing was replacing the ‘Home’ map, which previously just displayed the questionnaire locations, with a new map that highlights certain locations that have sound clips that demonstrate an interesting feature. The plan is that this will then lead users on to finding out more about these features in the stories, whilst also showing people where some of the locations the project visited are. This meant creating facilities in the CMS to manage this data, updating the database, updating the API and updating the front-end, so a fairly major thing.
I updated the CMS to include a page to manage the markers that appear on the new ‘Home’ map. Once logged into the CMS click on the ‘Browse Home Map Clips’ menu item to load the page. From here staff can see all of the locations and add / edit the information for a location (adding an MP3 file and the text for the popup). I added the data for a couple of sample locations that E had sent me. I then added a new endpoint to the API that brings back the information about the Home clips and updated the public atlas to replace the old ‘Home’ map with the new one. Markers are still the bright blue colour and drop into the map. I haven’t included the markers for locations that don’t have clips. We did talk at the meeting about including these, but I think they might just clutter the map up and confuse people.
I also reordered and relabelled the menu, and have changed things so that you can now click on an open section to close it. Currently doing so still triggers the map reload for certain menu items (e.g. Home). I’ll try to stop it doing so, but I haven’t managed to yet.
I also implemented the ‘Full screen’ slide type, although I think we might need to change the style of this. Currently it takes up about 80% of the map width, pinned to the right hand edge (which it needs to be for the animated transitions between slides to work). It’s only as tall as the content of the slide needs it to be, though, so the map is not really being obscured, which is what Jennifer was wanting. Although I could set it so that the slide is taller, this would then shift the navigation buttons down to the bottom of the map and if people haven’t scrolled the map fully into view they might not notice the buttons. I’m not sure what the best approach here might be, and this needs further discussion.
I also changed the way location data is returned from the API this week, to ensure that the GeoJSON area data is only returned from the API when it is specifically asked for, rather than by default. This means such data is only requested and used in the front-end when a user selects the ‘area’ map in the ‘Explore’ menu. The reason for doing this is to make things load quicker and to reduce the amount of data that was being downloaded unnecessarily. The GeoJSON data was rather large (several megabytes) and requesting this each time a map loaded meant the maps took some time to load on slower connections. With the areas removed the stories and ‘explore’ maps that are point based are much quicker to load. I did have to update a lot of code so that things still work without the area data being present, and I also needed to update all API URLs contained in the stories to specifically exclude GeoJSON data, but I think it’s been worth spending the time doing this.
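The principle behind this change — only attach the multi-megabyte GeoJSON payload when the client explicitly opts in — can be sketched like this. The field names and function are illustrative, not the real SCOSYA API:

```python
def location_payload(location, include_geojson=False):
    """Build a location record for an API response, omitting the heavy
    GeoJSON area polygon unless the client explicitly asks for it
    (e.g. only when the 'area' map in the 'Explore' menu is selected).
    Field names are hypothetical, not the project's real schema."""
    payload = {
        "name": location["name"],
        "lat": location["lat"],
        "lng": location["lng"],
    }
    if include_geojson:
        payload["geojson"] = location["geojson"]
    return payload
```

Point-based maps then download only a few kilobytes of coordinates, while the area map requests the full polygons, which is why the stories and ‘explore’ maps now load so much faster.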
Last Friday afternoon I met with Charlotte Methuen to discuss a proposal she’s putting together. It’s an AHRC proposal, but not a typical one as it’s in collaboration with a German funding body and it has its own template. I had agreed to write the technical aspects of the proposal, which I had assumed would involve a typical AHRC Data Management Plan, but the template didn’t include such a thing. It did however include other sections where technical matters could be added, so I wrote some material for these sections. As Charlotte wanted to submit the proposal for internal review by the end of the week I needed to focus on my text at the start of the week, and spent most of Monday and Tuesday working on it. I sent my text to Charlotte on Tuesday afternoon, and made a few minor tweaks on Wednesday and everything was finalised soon after that. Now we’ll just need to wait and see whether the project gets funded.
I also continued with the HT / OED linking process this week. Fraser had clarified which manual connections he wanted me to tick off, so I ran these through a little script that resulted in another 100 or so matched categories. Fraser had also alerted me to an issue with some OED categories. Apparently the OED people had duplicated an entire branch of the thesaurus (03.01.07.06 and 03.01.04.06) but had subsequently made changes to each of these branches independently of the other. This means that for a number of HT categories there are two potential OED category matches, and the words (and information relating to words such as dates) found in each of these may differ. It’s going to be a messy issue to fix. I spent some time this week writing scripts that will help us to compare the contents of the two branches to work out where the differences lie. First of all I wrote a script that displays the full contents (categories and words) contained in an OED category in tabular format. For example, passing the category 03.01.07.06 then lists the 207 categories found therein, and all of the words contained in these categories. For comparison, 03.01.04.06 contains 299 categories.
I then created another script that compares the contents of any two OED categories. By default, it compares the two categories mentioned above, but any two can be passed, for example to compare things lower down the hierarchy. The script extracts the contents of each chosen category and looks for exact matches between the two sets. The script looks for an exact match of the following in combination (i.e. all must be true):
- length of path (so xx.xx and yy.yy match but xx.xx and yy.yy.yy don’t)
- length of sub (so a sub of xx matches yy but a sub of xx doesn’t match xx.yyy)
- Stripped heading
In such cases the categories are listed in a table together with their lexemes, and the lexemes are also then compared. If a lexeme from cat1 appears in cat2 (or vice-versa) it is given a green background. If a lexeme from one cat is not present in the other it is given a red background, and all lexemes are listed with their dates. Unmatched categories are listed in their own tables below the main table, with links at the top of the page to each. 03.01.04.06 has 299 categories and 03.01.07.06 has 207 categories. Of these there would appear to be 209 matches, although some of these are evidently duplicates. Some further investigation is required, but it does at least look like the majority of categories in each branch can be matched.
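The three-part matching rule described above amounts to building a composite key for each category and intersecting the two sets of keys. A rough sketch, with hypothetical field names standing in for the real database columns:

```python
def match_key(cat):
    """Composite key used to pair categories across the two duplicated
    branches: length of path, length of sub, and the heading stripped
    of case and punctuation. Field names ('path', 'sub', 'heading')
    are illustrative, not the real column names."""
    return (
        cat["path"].count("."),                                # length of path
        len(cat["sub"].split(".")) if cat["sub"] else 0,       # length of sub
        "".join(ch for ch in cat["heading"].lower() if ch.isalnum()),
    )

def compare(branch1, branch2):
    """Split branch1 into categories that do / don't have an exact
    key match somewhere in branch2."""
    keys2 = {match_key(c) for c in branch2}
    matched = [c for c in branch1 if match_key(c) in keys2]
    unmatched = [c for c in branch1 if match_key(c) not in keys2]
    return matched, unmatched
```

Note that a key built this way can legitimately collide (two sibling categories with the same stripped heading), which would explain the apparent duplicates among the 209 matches.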
I also updated the lists of unmatched categories to incorporate the number of senses for each word. The overview page now gives a list of the number of times words appear in the unmatched category data. Of the 2155 OED words that are currently in unmatched OED categories we have 1763 words with 1 unmatched sense, 232 words with 2 unmatched senses, 75 words with 3 unmatched senses, 36 words with 4 unmatched senses, 15 words with 5 unmatched senses, 18 words with 6 unmatched senses and 16 words with 8 unmatched senses. I also updated the full category lists linked to from this summary information to include the count of senses (unmatched) for each individual OED word, so for example for ‘extra-terrestrial’ the following information is now displayed: extra-terrestrial (1868-1969 [1963-]) [1 unmatched sense].
Also this week I tweaked some settings relating to Rob Maslen’s ‘Fantasy’ blog, investigated some categories that had been renumbered erroneously in the Thesaurus of Old English and did a bit more investigation into travel and accommodation for the Bergamo conference.
I split the remainder of my time between RNSN and SCOSYA. For RNSN I had been sent a sizable list of updates that needed to be made to the content of a number of song stories, so I made the necessary changes. I had also been sent an entirely new timeline-based song story, and I spent a couple of hours extracting the images, text and audio from the PowerPoint presentation and formatting everything for display in the timeline.
For SCOSYA I spent some time further researching Voronoi diagrams and began trying to update my code to work with the current version of D3.js. It turns out that there have been many changes to the way in which D3 implements Voronoi diagrams since the code I based my visualisations on was released. For one thing, ‘d3-voronoi’ is going to be deprecated and replaced by a new module called d3-delaunay. Information about this can be found here: https://github.com/d3/d3-voronoi/blob/master/README.md. There is also now a specific module for applying Voronoi diagrams to spheres using coordinates, called d3-geo-voronoi (https://github.com/Fil/d3-geo-voronoi). I’m now wondering whether I should start again from scratch with the visualisation. However, I also received an email from Jennifer raising some issues with Voronoi diagrams in general so we might need an entirely different approach anyway. We’re going to meet next week to discuss this.
Also this week I created an updated version of a Data Management Plan for Thomas Clancy (the fourth version and possibly the last), updated the test version of the SCOSYA atlas to limit the attributes contained in it before a class uses the interface next week, made some further (and possibly final) tweaks to the Bilingual Thesaurus and migrated the Thesaurus of Old English to our new Thesaurus domain and ensured the site and its data all worked in the new location, which thankfully it did. I’ll need to hear back from Marc and Fraser before I make this new version ‘live’, though. I also made a small tweak to the DSL website, started to think about the next batch of updates that will be required to link up the HT and OED data, and did some App store related duties for someone elsewhere in the University. So all in all a pretty busy week.
This week marked the start of the UCU’s strike action, which I am participating in. This meant that I only worked from Monday to Wednesday. It was quite a horribly busy week as I tried to complete some of the more urgent things on my ‘to do’ list before the start of the strike, while other things I had intended to complete unfortunately had to be postponed. I spent some time on Monday writing a section containing details about the technical methodology for a proposal Scott Spurlock is intending to submit to the HLF. I can’t really say too much about it here, but it will involve crowd sourcing and I therefore had to spend time researching the technologies and workflows that might work best for the project and then writing the required text. Also on Monday I discovered that the AHRC does now have some guidance on its website about the switchover from Technical Plans to Data Management Plans. There are some sample materials and accompanying support documentation, which is very helpful. This can currently be found here: http://www.ahrc.ac.uk/peerreview/peer-review-updates-and-guidance/ although this doesn’t look like it will be a very permanent URL. Thankfully there will be a transition period up to the 29th of March when proposals can be submitted with either a Technical Plan or a DMP. This will make things easier for a few projects I’m involved with.
Also on Monday Gary Thoms contacted me to say there were some problems with the upload facilities for the SCOSYA project, so I spent some time trying to figure out what was going on there. What has happened is that Google seem to have restricted access to their geocoding API, which the upload script connects to in order to get the latitude and longitude of the ‘display town’. Instead of returning data, Google was returning an error saying we had exceeded our quota of requests. This was because previously I was just connecting to their API without registering for an API key, which used to work just fine but now is intermittent. Keep refreshing this page: https://maps.googleapis.com/maps/api/geocode/json?address=Aberdour+scotland and you’ll see it returns data sometimes and an error about exceeding the quota other times.
After figuring this out I created an API account for the project with Google. If I pass the key they gave me in the URL this now bypasses the restrictions. We are allowed up to 2,500 requests a day and up to 5000 requests in 100 seconds (that’s what they say – not sure how that works if you’re limited to 2,500 a day) so we shouldn’t encounter a quota error again.
Thankfully the errors Gary was encountering with a second file turned out to be caused by typos in the questionnaire – an invalid postcode was given. There were issues with a third questionnaire, which was giving an error on upload without stating what the error was, which was odd as I’d added in some fairly comprehensive error handling. After some further investigation it turned out to be caused by the questionnaire containing a postcode that didn’t actually exist. In order to get the latitude and longitude for a postcode my scripts connect to an external API which then returns the data in the ever so handy JSON format. However, a while ago the API I was connecting to started to go a bit flaky and for this reason I added in a connection to a second external API if the first one gave a 404. But now the initial API I used has completely gone offline, and is taking ages to even return a 404, which was really slowing down the upload script. Not only that, but the second API didn’t handle ‘unknown’ postcode errors in the same way. The first API returned a nice error message but the second one just returned an empty JSON file. This meant my error handler wasn’t picking up that there was a postcode error and was thus giving no feedback. I have now completely dropped the first API and connect directly to the second one, which speeds up the upload script dramatically. I have also updated my error handlers so they know how to handle an empty JSON file from this API.
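The gist of the fix — treating an empty JSON response as ‘postcode not found’ rather than a successful lookup — can be sketched like this. The field names in the success case are hypothetical, not the real API’s:

```python
import json

def parse_postcode_response(body):
    """Interpret a postcode-lookup API response. The replacement API
    signals an unknown postcode by returning an empty JSON object
    rather than an error message, so an empty (or blank) body is
    mapped to an explicit error that the upload script can report.
    'latitude'/'longitude' are hypothetical field names."""
    data = json.loads(body) if body.strip() else {}
    if not data:
        # Empty JSON = the API didn't recognise the postcode.
        return {"error": "Postcode not recognised"}
    return {"lat": data["latitude"], "lng": data["longitude"]}
```

With this in place a non-existent postcode in a questionnaire produces a meaningful upload error instead of a silent failure.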
On Tuesday I fixed a data upload error with the Thesaurus of Old English, spoke to Graeme about the AHRC’s DMPs and spent the morning working on the Advanced Search for the REELS project. Last week I had completed the API for the advanced search and had started on the front end, and this week I managed to complete the front end for the search, including auto-complete fields where required, and supplying facilities to export the search results in CSV and JSON format. There was a lot more to this task than I’m saying here but the upshot is that we have a search facility that can be used to build up some pretty complex queries.
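The CSV export side of the search results is conceptually simple: serialise the result rows with a header line and offer the string as a download. A minimal sketch (the fieldnames and rows here are made up, not the REELS schema):

```python
import csv
import io

def results_to_csv(rows, fieldnames):
    """Serialise search results (a list of dicts) to CSV text for
    download. Keys not listed in `fieldnames` are silently dropped,
    so the export can expose a subset of the full record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The JSON export is even simpler — the same rows passed through `json.dumps` — which is why offering both formats adds little extra work once the search API exists.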
On Tuesday afternoon we had a project meeting for the REELS project where I demonstrated the front end facilities and we discussed some further updates that would be required for the content management system. I tackled some of these on Wednesday. The biggest issue was with adding place-name elements to historical forms. If you created a new element through the page where elements are associated with historical forms an error was encountered that caused the entire script to break and display a blank page. Thankfully after a bit of investigation I figured out what was causing this and fixed it. I also implemented the following:
- Added gender to elements
- Added ‘Epexegetic’ to the ‘role’ list
- When adding new elements to a place-name or historical form no language is selected by default, meaning entering text into the ‘element’ field searches all languages. The language appears in brackets after each element in the returned list. Once an element is selected, its language is then selected in the ‘language’ list. You can still select a language before typing in an element to limit the search to that specific language
- All traces of ‘ScEng’ have been removed
- I’d noticed that when no element order was specified when you returned to the ‘manage elements’ page the various elements would sometimes just appear in a random order. I’ve made it so that if no element order is entered the order is always the order in which the elements were originally added.
- When a historical form has been given elements these now appear in the table of historical forms on the ‘edit place’ page, so you can tell which forms already have elements (and what they are) without needing to load the edit a historical form page.
- Added an ‘unknown’ element. As all elements need a language I’ve assigned this to ‘Not applicable (na)’ for now.
Also on Wednesday I had to spend some time investigating why an old website of mine wasn’t displaying characters properly. This was caused by the site being moved to a new server a couple of weeks ago. It turned out to be caused by the page fragments (of which there are several thousand) being encoded as ANSI when they need to be UTF-8. I thought it would be a simple task to batch process the files to convert them, but I’m afraid doing something as simple as batch converting from ANSI to UTF-8 is proving to be stupidly difficult. I still haven’t found a way to do it. I tried following the example in Powershell here: https://superuser.com/questions/113394/free-ansi-to-utf8-multiple-files-converter
But it turns out you can only convert to UTF8 with BOM, which adds in bad characters to the start of the file as displayed on the website. And there’s no easy way to get it without BOM, as discussed here: https://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom
I then followed some of the possible methods listed here: https://gist.github.com/dogancelik/2a88c81d309a753cecd8b8460d3098bc UTFCast used to offer a ‘lite’ version for free that would have worked, but now they only offer the paid version, plus a demo. I’ve installed the demo but it only allows conversion to UTF8 with BOM as well. I got a macro working in Notepad++ but it turns out macros are utterly pointless as you can’t set them to run on multiple files at once – you need to open each file and then play the macro each time. I also installed the python script plugin for Notepad++ and tried to run the script listed on the above page but nothing happens at all – not even an error message. It was all very frustrating and I had to give up due to a lack of time. Graeme (who was also involved in this project back in the day) had an old program that can do the batch converting and he gave me a copy so I’ll try this when I get the chance.
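For what it’s worth, a short Python script would also do the batch conversion without the BOM problem, since Python’s `utf-8` codec never writes a BOM (that’s the separate `utf-8-sig` codec). A sketch, assuming the fragments are `.html` files and the ANSI encoding is Windows-1252:

```python
import pathlib

def convert_to_utf8(folder, src_encoding="cp1252", pattern="*.html"):
    """Batch-convert page fragments from ANSI (assumed here to be
    Windows-1252) to UTF-8 *without* a BOM. Overwrites each file in
    place, so run it on a copy first. The file pattern and source
    encoding are assumptions that would need checking against the
    actual fragments."""
    for path in pathlib.Path(folder).rglob(pattern):
        text = path.read_text(encoding=src_encoding)
        path.write_text(text, encoding="utf-8")  # utf-8 codec emits no BOM
```

This avoids both the PowerShell BOM issue and the one-file-at-a-time limitation of the Notepad++ macro approach.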
So that was my three-day week. Next week I’ll be on strike on Monday to Wednesday so will be back at work on Thursday.