Week Beginning 23rd October 2023

After a delightful holiday last week I was back at work again this week.  This involved spending quite a bit of time catching up with emails and dealing with the ongoing issue of migrating sites from old servers to either our new external supplier or a newer server hosted internally.  I was involved with the migration of the SCOTS Corpus to a new server, with my work including fixing a few PHP errors that were cropping up on the more up to date server.  There were also some issues relating to database connections as the original code (which I didn’t write) uses rather a lot of connections – more than the new server was set to allow.  We had thought we’d fixed the issue but it looks like further investigation will be required.

We also migrated the thesaurus.ac.uk site and the Bilingual Thesaurus of Everyday Life in Medieval England (https://thesaurus.ac.uk/bth/) to a new server, which also required tweaking some of the code.  The new server was caching scripts that generated different output each time they were run (e.g. to generate the random category on the homepage), meaning the category wasn’t random but was constantly stuck on ‘Lard a roast’, which wasn’t very helpful.  Thankfully we managed to unstick the cache.

Also this week I investigated an issue with the advanced search of the Dictionaries of the Scots Language as the full-text search had stopped working.  It turned out that the Solr index that powers this search had entirely disappeared from the server, which is more than a little concerning.  It wasn’t a huge issue to rectify as I had the configuration scripts and the data on my PC, but we’re in the dark as to how the index could have been removed.  It had also been brought to my attention that some of the video files I’d uploaded for the Speech Star project before I went on holiday had also disappeared and I’ve reuploaded them too.  Our IT people are investigating what might have caused these issues and if they are linked, but it is concerning.

I also spent a bit of time looking through the old arts.gla.ac.uk server to try and figure out what needed to be retained from it.  It’s mostly old subject area sites that were long ago superseded by T4, plus old conference sites that are no longer needed.  A few of the other sites I’ve already moved to T4 myself (e.g. https://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/starn/ and https://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/bibliography-of-scottish-literature/).  The only sites that I think need to be retained are the STELLA apps that I developed from old teaching resources in around 2015.  I therefore requested a new subdomain be set up to host them and migrated them over.  I’ve also requested we set up external hosting for arts.gla.ac.uk, purely to host redirects from old URLs so we don’t end up with broken links.  The new sites are now available (see https://stella.glasgow.ac.uk/aries/, https://stella.glasgow.ac.uk/grammar/, https://stella.glasgow.ac.uk/eoe/, https://stella.glasgow.ac.uk/metre/ and https://stella.glasgow.ac.uk/readings/) but the redirects from the old URLs are not yet in place.  I’d really like to spend some time redeveloping all of these old apps (apart from ARIES, which has already been redeveloped).  Maybe next year I’ll find some time.

I also set up a new project website for Rhona Brown in Scottish Literature.  I’ve created a bare-bones website at the moment and I’m awaiting further instruction from her on things like themes, colour schemes, site structure and logos.  I also tweaked the project website I’d set up a couple of weeks ago for Petra Poncarova in Scottish Literature to improve the URLs for the Gaelic version of the homepage and helped a project team member get access to a site I’d set up for Matthew Creasy in English Literature.

On Wednesday morning this week I participated in a networking event for the new Research Professional Staff Network.  The event went well and it was very interesting to find out more about other people involved in research support across the University.

For the remainder of the week I began work on the development of the new ‘map first’ interface for the place-names projects, which I’m developing initially for the Iona project.  Below is a screenshot of how things look so far:

At the moment the interface consists of a narrow bar at the top of the browser window with the site’s icon, title and subtitle using the blue colour of the site’s banner as a background.  You can press on the logo or site title to navigate to the main site.  The rest of the browser is taken up with the map.  On the left is the side menu.  As discussed in the requirements document I previously wrote, it consists of four collapsible sections, with ‘Home’ open by default.  I haven’t had the time to implement the search and browse options yet, but the ‘Display options’ section is operational, as you can see above.  Pressing on the section’s title will open the section and you can access the various options.  You can show or hide the side menu by pressing on the button above it.

For the moment the map displays all data that has been marked as ‘on web’ in the CMS (362 records, I think).  By default these are colour-coded by classification code.  The legend is displayed in the top right, allowing you to turn specific features on or off.  You can also show or hide the legend to free up space.  In the bottom right are zoom options plus a ‘full screen’ button that does what you’d expect.  You can press on a map marker to open up the pop-up.  As of yet there is no link through to the full record and some Gaelic fields may be visible.  These will be removed at some point.

Using the ‘Display options’ in the side menu you can change how the map markers are classified.  We may need to be a little more fine-grained with start date and especially altitude.  Also colours for classification codes are currently arbitrarily assigned but we might want to change this – having blue for ‘field’ seems a bit daft, for example.  You can also change the base map and these options are currently the same as for the other place-name sites.  We still need to figure out if / how we can integrate another map of Iona that we discussed at a meeting before I went on holiday.  There is also an option to turn labels on or off.

That’s as far as I’ve got this week.  There’s still a lot to do but I’ve made pretty good progress.  I’ll hopefully find some time to continue with this next week.  I also discovered that the Leaflet mapping library has a method to set the map view so as to show all markers at the closest zoom possible so I’ll ensure I use this when I develop the search and the browse.  I’m already using it when the map is first opened to ensure that all of Iona, Soa in the south-west and Eilean Annraidh in the north-east are always visible, no matter what the dimensions of your screen / browser window are.
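The method in question is presumably Leaflet’s fitBounds, which takes a bounding box and picks the closest zoom that shows all of it.  As a rough sketch of the calculation behind such a box (in Python rather than the JavaScript the map actually uses, and with a hypothetical function name), the south-west and north-east corners can be derived from the marker coordinates like this:

```python
def marker_bounds(markers):
    """Return ((south, west), (north, east)) for a list of (lat, lng) pairs.

    Illustrative only: it ignores the case of markers straddling the
    180th meridian, which is not a concern for a single island.
    """
    if not markers:
        raise ValueError("at least one marker is needed to compute bounds")
    lats = [lat for lat, _ in markers]
    lngs = [lng for _, lng in markers]
    return (min(lats), min(lngs)), (max(lats), max(lngs))
```

The two corners are the sort of value a ‘fit to bounds’ call expects, so the same calculation would work for search or browse results as well as for the initial view.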

Week Beginning 6th September 2021

I spent more than a day this week preparing my performance and development review form.  It’s the first time there’s been a PDR since before covid and it took some time to prepare everything.  Thankfully this blog provides a good record of everything I’ve done so I could base my form almost entirely on the material found here, which helped considerably.

Also this week I investigated and fixed an issue with the SCOTS corpus for Wendy Anderson.  One of the transcriptions of two speakers had the speaker IDs the wrong way round compared to the IDs in the metadata.  This was slightly complicated to sort out as I wasn’t sure whether it was better to change the participant metadata to match the IDs used in the text or vice-versa.  It turned out to be very difficult to change the IDs in the metadata as they are used to link numerous tables in the database, so instead I updated the text that’s displayed.  Rather strangely, the ‘download plain text’ file contained different incorrect IDs.  I fixed this as well, but it does make me worry that the IDs might be off in other plain text transcriptions too.  However, I looked at a couple of others and they seem ok, so perhaps it’s an isolated case.

I was contacted this week by a lecturer in English Literature who is intending to put a proposal together for a project to transcribe an author’s correspondence, and I spent some time writing a lengthy email with some helpful advice.  I also spoke to Jennifer Smith about her ‘Speak for Yersel’ project that’s starting this month, and we arranged to have a meeting the week after next.  I also spent quite a bit of time continuing to work on mockups for the STAR project’s websites based on feedback I’d received on the mockups I completed last week.  I created another four mockups with different colours, fonts and layouts, which should give the team plenty of options to choose from.  I also received more than a thousand new page images of library registers for the Books and Borrowing project and processed these and uploaded them to the server.  I’ll need to generate page records for them next week.

Finally, I continued to make updates to the Textbase search facilities for the Anglo-Norman Dictionary.  I updated genre headings to make them bigger and bolder, with more of a gap between the heading and the preceding items.  I also added a larger indent to the items within a genre and reordered the genres based on a new suggested order.  For each book I included the siglum as a link through to the book’s entry on the bibliography page and in the search results where a result’s page has an underscore in it the reference now displays volume and page number (e.g. 3_801 displays as ‘Volume 3, page 801’).  I updated the textbase text page so that page dividers in the continuous text also display volume and page in such cases.
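The volume and page display described above boils down to splitting the reference on the underscore.  A minimal sketch in Python (the function name is made up for illustration, and references without an underscore are simply returned unchanged, which is an assumption):

```python
def format_page_reference(page: str) -> str:
    """Turn a reference such as '3_801' into 'Volume 3, page 801'.

    References without an underscore are returned as they are; how the
    real site displays those is not covered here.
    """
    if "_" in page:
        volume, page_number = page.split("_", 1)
        return f"Volume {volume}, page {page_number}"
    return page
```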

Highlighted terms in the textbase text page no longer have padding around them (which was causing what looked like spaces when the term appears mid-word).  The text highlighting is unfortunately a bit of a blunt instrument, as one of the editors discovered by searching for the terms ‘le’ and ‘fable’:  term 1 is located and highlighted first, then term 2 is.  In this example the first term is ‘le’ and the second term is ‘fable’.  Therefore the ‘le’ in ‘fable’ is highlighted during the first sweep and then ‘fable’ itself isn’t highlighted as it has already been changed to have the markup for the ‘le’ highlighting added to it and no longer matches ‘fable’.  Also, ‘le’ is matching some HTML tags buried in the text (‘style’), which is then breaking the HTML, which is why some HTML is getting displayed.  I’m not sure much can be done about any of this without a massive reworking of things, but it’s only an issue when searching for things like ‘le’ rather than actual content words so hopefully it’s not such a big deal.
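To illustrate the problem, here is a small Python sketch (not the site’s actual code; the function names and the mark tag are assumptions) that reproduces the two-sweep behaviour and contrasts it with one possible alternative: a single pass that tries longer terms first and leaves anything inside an HTML tag alone.

```python
import re


def highlight_sequential(text, terms):
    """Highlight each term in turn, as in the behaviour described above."""
    for term in terms:
        text = text.replace(term, f"<mark>{term}</mark>")
    return text


def highlight_single_pass(html, terms):
    """Highlight all terms in one sweep, skipping the insides of HTML tags."""
    # Longest terms first, so 'fable' is tried before 'le' at each position.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    # Splitting on a capturing group keeps the tags, at the odd indexes.
    parts = re.split(r"(<[^>]*>)", html)
    for index in range(0, len(parts), 2):
        parts[index] = pattern.sub(
            lambda match: f"<mark>{match.group(0)}</mark>", parts[index]
        )
    return "".join(parts)
```

With the terms ‘le’ and ‘fable’ the sequential version leaves ‘fable’ half highlighted, while the single pass highlights both terms and does not touch the ‘le’ in ‘style’.  Whether this could be slotted into the existing search without the reworking mentioned above is a separate question.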

The editor also wondered whether it would be possible to add in an option for searching and viewing multiple terms all together but this would require me to rework the entire search and it’s not something I want to tackle if I can avoid it.  If a user wants to view the search results for different terms they can select two terms then open the full results in a new tab, repeating the process for each pair of terms they’re interested in, switching from tab to tab as required.  Next week I’ll need to rename some of the textbase texts and split one of the texts into two separate texts, which is going to require me to regenerate the entire dataset.

Week Beginning 5th April 2021

This week began with Easter Monday, which was a holiday.  I’d also taken Tuesday and Thursday off to cover some of the Easter school holidays so it was a two-day working week for me.  I spent some of this time continuing to download and process images of library register books for the Books and Borrowing project, including 14 from St Andrews and several further books from Edinburgh.  I was also in communication with one of the people responsible for the Dictionary of the Scots Language’s new editor interface regarding the export of new data from this interface and importing it into the DSL’s website.  I was sent a ZIP file containing a sample of the data for SND and DOST, plus a sample of the bibliographical data, with some information on the structure of the files and some points for discussion.

I looked through all of the files and considered how I might be able to incorporate the data into the systems that I created for the DSL’s website.  I should be able to run the new dictionary XML files through my upload script with only a few minor modifications required.  It’s also really great that the bibliographies and cross references are getting sorted via the new Editor interface.  One point of discussion is that the new editor interface has generated new IDs for the entries, and the old IDs are not included.  I reckoned that it would be good if the old IDs were included in the XML as well, just in case we ever need to match up the current data with older datasets.  I did notice that the old IDs already appeared to be included in the <url> fields, but after discussion we decided that it would be safer to include them as an attribute of the <entry> tag, e.g. <entry oldid=”snd848”> or something like that, which is what will happen when I receive the full dataset.

There are also new labels for entries, stating when and how the entry was prepared.  The actual labels are stored in a spreadsheet and a numerical ID appears in the XML to reference a row in the spreadsheet.  This method of dealing with labels seems fine with me – I can update my system to use the labels from the spreadsheet and display the relevant labels depending on the numerical codes in the entry XML.  I reckon it’s probably better to not store the actual labels in the XML as this saves space and makes it easier to change the label text, if required, as it’s only then stored in a single place.

The bibliographies are looking good in the sample data, but I pointed out that it might be handy to have a reference of the old bibliographical IDs in the XML, if that’s possible.  There were also spurious xmlns=”” attributes in the new XML, but these shouldn’t pose any problems and I said that it’s ok to leave them in.  Once I receive the full dataset with some tweaks (e.g. the inclusion of old IDs) then I will do some further work on this.

I spent most of the rest of my available time working on the new Comparative Kingship place-names systems.  I completed work on the Scotland CMS, including adding in the required parishes and former parishes.  This means my place-name system has now been fully modernised and uses the Bootstrap framework throughout, which looks a lot better and works more effectively on all screen dimensions.

I also imported the data from GB1900 for the relevant parishes.  There are more than 10,000 names, although a lot of these could be trimmed out – lots of ‘F.P.’ for footpath etc.  It’s likely that the parishes listed are rather broader than the study will be.  All the names in and around St Andrews are in there, for example.  In order to generate altitude for each of the names imported from GB1900 I had to run a script I’d written that passes the latitude and longitude for each name in turn to Google Maps, which then returns elevation data.  I had to limit the frequency of submissions to one every few seconds otherwise Google blocks access, so it took rather a long time for the altitudes of more than 10,000 names to be gathered, but the process completed successfully.
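The throttling is the important part of this process.  As a rough Python sketch (not the script I actually ran), the loop below pauses between requests; fetch_elevation is a hypothetical stand-in for whatever call returns an elevation for a latitude and longitude, and the three-second default is an assumed value for ‘every few seconds’:

```python
import time


def gather_altitudes(places, fetch_elevation, delay_seconds=3.0, sleep=time.sleep):
    """Look up an altitude for each (place_id, lat, lng), pausing between calls.

    fetch_elevation is a hypothetical callable supplied by the caller; no
    particular web service or client library is assumed here.
    """
    altitudes = {}
    for index, (place_id, lat, lng) in enumerate(places):
        if index > 0:
            # Wait before every request except the first.
            sleep(delay_seconds)
        altitudes[place_id] = fetch_elevation(lat, lng)
    return altitudes
```

At one request every three seconds, 10,000 names would take over eight hours, which fits with the process taking rather a long time.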

Also this week I dealt with an issue with the SCOTS corpus, which had broken (the database had gone offline) and helped Raymond at Arts IT Support to investigate why the Anglo-Norman Dictionary server had been blocking uploads to the dictionary management system when thousands of files were added to the upload form.  It turns out that while the Glasgow IP address range was added into the whitelist the VPN’s IP address range wasn’t, which is why uploads were being blocked.
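The whitelist problem is easy to picture with a small Python sketch.  The address ranges below are documentation-only placeholders, not the real Glasgow or VPN ranges, and the function name is made up:

```python
import ipaddress


def is_allowed(client_ip, allowed_ranges):
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_ranges)


# Placeholder ranges: the first stands in for the campus range, the
# second for the VPN range that had been left off the list.
campus_only = ["192.0.2.0/24"]
campus_and_vpn = ["192.0.2.0/24", "198.51.100.0/24"]
```

A request from 198.51.100.7 is rejected with the first list and accepted with the second, which is the situation that was blocking uploads made over the VPN.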

Next week I’m also taking a couple of days off to cover the Easter School holidays, and will no doubt continue with the DSL and Comparative Kingship projects then.

Week Beginning 29th June 2020

This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day.  I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects.  I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records.  This week I began by focussing on the data from St Andrews University library.

The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting.  However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.

The data I’ve got are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100.  The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers.  Unfortunately the CSV version of the data doesn’t include the links or the linked to data, and as I wanted to try and pull in the data found on the linked pages I therefore needed to process the HTML instead.

I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn.  From the filenames my script could ascertain the ledger volume, its dates and the page number.  For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10.  The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
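As an illustration of the filename handling (a Python sketch rather than the script itself, with a pattern based only on the one example filename above, so other files may well need a looser expression):

```python
import re

# Pattern inferred from 'Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html'.
FILENAME_PATTERN = re.compile(
    r"^Page_[A-Za-z0-9]+_(?P<ledger>\d+)_.*?_"
    r"(?P<start>\d{4})-(?P<end>\d{4})\.djvu_(?P<page>\d+)\.html$"
)


def parse_page_filename(filename):
    """Extract ledger number, date range and page number from a filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognised filename: {filename}")
    return {key: int(value) for key, value in match.groupdict().items()}


def link_pages(page_numbers):
    """Work out the 'previous' and 'next' page for each page in a ledger."""
    ordered = sorted(page_numbers)
    links = {}
    for index, page in enumerate(ordered):
        links[page] = {
            "previous": ordered[index - 1] if index > 0 else None,
            "next": ordered[index + 1] if index < len(ordered) - 1 else None,
        }
    return links
```

Converting the page numbers to integers before sorting matters here, as otherwise page 10 would sort before page 9.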

The actual data in the file posed further problems.  As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system.  Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows).  I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically.  The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’).  I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.

I also used this approach to set up books and borrowers too.  If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to know which is which.  However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows.  I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code.  There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.
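The row-merging logic can be sketched in Python as follows.  This is an illustration under assumptions rather than the real import script: each table row is represented as a dictionary with hypothetical ‘code’, ‘text’ and ‘links’ keys, and nothing is written to a database.

```python
def group_rows(rows):
    """Merge table rows into borrowing records using the shelf-number code.

    A row with a code starts a new record and its links are treated as
    books; following rows without a code are appended to that record and
    their links are treated as borrowers.
    """
    records = []
    for row in rows:
        code = (row.get("code") or "").strip()
        links = row.get("links", [])
        if code:
            records.append({
                "code": code,
                "transcription": row["text"],
                "books": list(links),
                "borrowers": [],
            })
        elif records:
            current = records[-1]
            current["transcription"] += " " + row["text"]
            current["borrowers"].extend(links)
        # A code-less row before any record has nothing to attach to and
        # is skipped in this sketch.
    return records
```

Run over the three ‘Rollins’ rows, this gives a single record with one book link and two borrower links.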

The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system.  I created four new ‘Additional Fields’ for St Andrews as follows:

  • Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
  • Code: This contains the full text of the second column (e.g. J.5.2)
  • Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
  • Original returned text: This contains the full text of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)

In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links.  Where subsequent rows contain data in this column but no code, this data is then appended to the transcription.  E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.

The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required.  Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes.  E.g. for the above the notes field contains:

‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit &amp; au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. [1745]) <a href=”http://library.st-andrews.ac.uk/record=b2447402~S1″>http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>

----- profr_Shaws| <p><a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>

----- James_Key| <p>Possibly James Kay: <a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>
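A small Python sketch of how a notes field in this format could be put together and split apart again.  The exact whitespace around the bar and the five dashes is an assumption based on the example above, and the function names are hypothetical:

```python
SEPARATOR = "\n\n----- "  # five dashes between linked pages (spacing assumed)


def build_notes(linked_pages):
    """Join (page_link, page_content) pairs into a single notes string."""
    return SEPARATOR.join(f"{link}| {content}" for link, content in linked_pages)


def parse_notes(notes):
    """Split a notes string back into (page_link, page_content) pairs."""
    pairs = []
    for chunk in notes.split(SEPARATOR):
        # Split on the first bar only, in case the content contains one.
        link, content = chunk.split("| ", 1)
        pairs.append((link, content))
    return pairs
```

This only works if the page content never contains the five-dash separator itself, which seems a safe enough bet for this data.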


As mentioned earlier, the script also generates book and borrower records based on the linked pages.  I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews.  In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with spaces (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field.  One book item is created for each holding to be used to link to the corresponding borrowing records.

For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. ‘<p>Possibly Thomas Duncan: <a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’).  All borrowers are linked to records as ‘Main’ borrowers.

During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table.  I therefore updated my script to check for the existence of this heading row, and if it exists my script then grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page.  After my script had finished running we had 11,147 borrowing records, 996 borrowers and 6,395 book holding records for St Andrews in the system.

I then moved on to looking at the data for Selkirk library.  This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books, and with borrowers and books connected to borrowings via unique identifiers.  Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records and these will need to be manually generated.  The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.

As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records).  There are 612 Holding records for Selkirk.  I also processed the borrower records, resulting in 86 borrower records being added.  I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’ and the only other additional field is in the Holding records for ‘Subject’, which will eventually be merged with our ‘Genre’ when this feature becomes available.

Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project.  I recommended that the filenames shouldn’t have spaces in them as these can be troublesome on some operating systems and that we’d want a character to use as a delimiter between the parts of the filename that wouldn’t appear elsewhere in the filename so it’s easy to split up the filename.  I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included.  Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.

Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’) but if we’re using a specific delimiting character to separate parts of the filename then using these characters wouldn’t be necessary and I suggested it would be better to not use ‘L’ as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’ which might confuse future human users.

Instead I suggested using a ‘-’ in place of spaces and a ‘_’ as a delimiter and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filename – no apostrophes, commas, colons, semi-colons, ampersands etc.  We should also make sure the ‘-’ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office.  This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.

Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case.  Following these guidelines the filenames would look something like this:

  • jpg
  • dumfries-presbytery_2_3v.jpg
  • standrews-ul_9_300r.jpg
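These guidelines are simple enough to check automatically.  The Python sketch below is only an illustration: the function names are made up, and the three-part structure (library, ledger number, page number) is inferred from the example filenames above.

```python
import re

# Lower-case letters, digits and '-' within each part, '_' between parts.
VALID_FILENAME = re.compile(r"[a-z0-9-]+_[a-z0-9-]+_[a-z0-9-]+\.[a-z0-9]+")


def slugify_part(text):
    """Lower-case a name, turn spaces into '-' and drop anything else."""
    text = re.sub(r"\s+", "-", text.strip().lower())
    return re.sub(r"[^a-z0-9-]", "", text)


def build_image_filename(library, ledger, page, extension="jpg"):
    """Assemble a filename such as 'dumfries-presbytery_2_3v.jpg'."""
    return f"{slugify_part(library)}_{ledger}_{str(page).lower()}.{extension}"


def is_valid_filename(filename):
    """Check a filename against the naming guidelines."""
    return VALID_FILENAME.fullmatch(filename) is not None
```

A check like this would catch a fancy dash pasted in from Word, as anything other than a plain ‘-’ fails the pattern.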

In addition to the Books and Borrowing project I worked on a number of other projects this week.  I gave Matthew Creasy some further advice on using forums in his new project website, and the ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.

I also worked a bit more on using dates from the OED data in the Historical Thesaurus.  Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract these dates so that we could use them to update the dates associated with the lexemes in the HT.  I needed to extract the quotation dates as these have ‘ante’ and ‘circa’ notes, plus labels.  I noted that in addition to ‘a’ and ‘c’ a question mark is also used, sometimes with an ‘a’ or ‘c’ and sometimes without.  I decided to process things as follows:

  • ?a will just be ‘a’
  • ?c will just be ‘c’
  • ? without an ‘a’ or ‘c’ will be ‘c’.

I also noticed that a date may sometimes be a range (e.g. 1795-8) so I needed to include a second date column in my data structure to accommodate this.  I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary depending on the date – sometimes the content is ‘OE’ and other times ‘lOE’ or ‘eOE’.  I decided to process any OE dates for a lexeme as being 650 and to have only one OE date stored, so as to align with how OE dates are stored in the HT database (we don’t differentiate between dates for OE words).
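The rules above can be summarised in a short Python sketch.  It assumes each date arrives as a plain string such as ‘?a1400’, ‘c1500’, ‘1795-8’ or ‘lOE’, which may not be how the OED XML actually encodes things, and the function name is hypothetical:

```python
import re

OE_LABELS = {"OE", "eOE", "lOE"}
OE_YEAR = 650  # all Old English dates are stored as 650
DATE_PATTERN = re.compile(r"(\?)?([ac])?\s*(\d{3,4})(?:-(\d{1,4}))?")


def parse_oed_date(raw):
    """Return a dict with 'prefix', 'start' and 'end' for one date string."""
    raw = raw.strip()
    if raw in OE_LABELS:
        return {"prefix": None, "start": OE_YEAR, "end": None}
    match = DATE_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"unrecognised date: {raw}")
    question, letter, start, end = match.groups()
    # '?a' becomes 'a', '?c' becomes 'c' and a bare '?' becomes 'c'.
    prefix = letter if letter else ("c" if question else None)
    end_year = None
    if end:
        if len(end) < len(start):
            # An abbreviated range end such as '1795-8' means 1798.
            end = start[: len(start) - len(end)] + end
        end_year = int(end)
    return {"prefix": prefix, "start": int(start), "end": end_year}
```

Storing only one OE date per lexeme would then be handled separately, for example by skipping any further date that comes back as 650 once one has been recorded.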

While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted.  This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table.  This will no doubt mean we’ll need to update our online OED table, but as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes would be retained without issue (hopefully), but any matching processes we performed would need to be done again for the new lexemes.

I set my extraction script running on the OED XML files on Wednesday and processing took a long time.  The script didn’t complete until sometime during Friday night, but after it had finished it had processed 238,699 categories, 754,285 lexemes, generating 3,893,341 date rows.  It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.

I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project.  The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category.  Variant spellings are also removed (these were all tagged with <form> and I removed all of these).  I also created a new field to store only the ‘NN_’ tagged words and remove all others.

The scripts generated three datasets, which I saved as spreadsheets for Fraser.  The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria. The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a postagged word fully matches the HT category heading.  The final (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a postagged word fully matches a lexeme in the HT category.  For this table I’ve also added in the full list of lexemes for each HT category too.

I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU and arranged for a PhD student to get access to the TextGrid files that were generated for the audio records for the SCOTS Corpus project.

Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data.  This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.

I have the data that was extracted from Thomas’s script last July as two XML files (dost.xml and snd.xml) and I looked through these for the examples Ann had sent.  The entry for snd13897 contains the following URLs:




The first is the ID for the main entry and the other two are child entries.  If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged.  But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content).  The pulling in of data into a webpage in the V3 site uses URLs stored in a table linked to entry IDs.  These were generated from the URLs in the entries in the XML file (see the <url> tags above).  For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sndns2217.  But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.

The entry for dost16606 contains the following URLs:



(in addition to headword URLs).  Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content).  As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.

What we need to do is remove the child entries that still exist as separate entries in the data.  To do this I could write a script that would go through each entry in the dost.xml and snd.xml files.  It would then pick out every <url> that is not the same as the entry ID and search the file to see if any entry exists with this ID.  If it does then presumably this is a duplicate that should then be deleted.  I’m waiting to hear back from the DSL people to see how we should proceed with this.
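
A minimal sketch of how such a check might look is given below.  It is written in Python purely for illustration, and the XML structure is an assumption: the element name 'entry', the 'id' attribute and the wrapping 'entries' element are hypothetical stand-ins, as the real layout of dost.xml and snd.xml is not shown here.  Only the IDs in the sample come from the example above.

```python
import xml.etree.ElementTree as ET


def find_duplicate_children(xml_text):
    """Return (parent_id, child_id) pairs where a child URL listed in one
    entry also exists as a separate entry in the same file."""
    root = ET.fromstring(xml_text)
    # Assumption: each entry is an <entry id="..."> element.
    entries = root.findall(".//entry")
    entry_ids = {entry.get("id") for entry in entries}

    duplicates = []
    for entry in entries:
        parent_id = entry.get("id")
        # Assumption: <url> elements are direct children of the entry.
        for url in entry.findall("url"):
            child_id = (url.text or "").strip()
            # Ignore the entry's own ID; flag any other URL that is also an entry ID.
            if child_id and child_id != parent_id and child_id in entry_ids:
                duplicates.append((parent_id, child_id))
    return duplicates


# Hypothetical sample mirroring the SND example discussed above.
sample = """<entries>
  <entry id="snd13897">
    <url>snd13897</url><url>snds3788</url><url>sndns2217</url>
  </entry>
  <entry id="sndns2217"><url>sndns2217</url></entry>
</entries>"""

print(find_duplicate_children(sample))
```

The function only reports candidate duplicates and deletes nothing, which seems the safer option while the DSL people decide how to proceed.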

As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.

Week Beginning 1st June 2020

During week 11 of Lockdown I continued to work on the Books and Borrowing project, but also spent a fair amount of time catching up with other projects that I’d had to put to one side due to the development of the Books and Borrowing content management system.  This included reading through the proposal documentation for Jennifer Smith’s follow-on funding application for SCOSYA and writing a new version of the Data Management Plan based on this updated documentation, as well as making some changes to the ‘export data for print publication’ facility for Carole Hough’s REELS project.  I also spent some time creating a new export facility to format the place-name elements and any associated place-names for print publication too.

During this week a number of SSL certificates expired for a bunch of websites, which meant browsers were displaying scary warning messages when people visited the sites.  I had to spend a bit of time tracking these down and passing the details over to Arts IT Support for them to fix as it is not something I have access rights to do myself.  I also liaised with Mike Black to migrate some websites over from the server that houses many project websites to a new server.  This is because the old server is running out of space and is getting rather temperamental and freeing up some space should address the issue.

I also made some further tweaks to Paul Malgrati’s interactive map of Burns’ Suppers and created a new WordPress-powered project website for Matthew Creasy’s new ‘Scottish Cosmopolitanism at the Fin de Siècle’ project.  This included the usual choosing a theme, colour schemes and fonts, adding in header images and footer logos and creating initial versions of the main pages of the site.  I’d also received a query from Jane Stuart-Smith about the audio recordings in the SCOTS Corpus so I did a bit of investigation about that.

Fraser Dallachy had got back to me with some further tasks for me to carry out on the processing of dates for the Historical Thesaurus, and I had intended to spend some time on this towards the end of the week, but when I began to look into this I realised that the scripts I’d written to process the old HT dates (comprising 23 different fields) and to generate the new, streamlined date system that uses a related table with just 6 fields were sitting on my PC in my office at work.  Usually all the scripts I work on are located on a server, meaning I can easily access them from anywhere by connecting to the server and downloading them.  However, sometimes I can’t run the scripts on the server as they may need to be left running for hours (or sometimes days) if they’re processing large amounts of data or performing intensive tasks on the data.  In these cases the scripts run directly on my office PC, and this was the situation with the dates script.  I realised I would need to get into my office at work to retrieve the scripts, so I put in a request to be allowed into work.  Staff are not currently allowed to just go into work – instead you need to get approval from your Head of School and then arrange a time that suits security.  Thankfully it looks like I’ll be able to go in early next week.

Other than these issues, I spent my time continuing to work for the Books and Borrowing project.  On Tuesday we had a Zoom call with all six members of the core project team, during which I demonstrated the CMS as it currently stands.  This gave me an opportunity to demonstrate the new Author association facilities I had created last week.  The demonstration all went very smoothly and I think the team are happy with how the system works, although no doubt once they actually begin to use it there will be bugs to fix and workflows to tweak.  I also spent some time before the meeting testing the system again, and fixing some issues that were not quite right with the author system.

I spent the remainder of my time on the project completing work on the facility to add, edit and view book holding records directly via the library page, as opposed to doing so whilst adding / editing a borrowing record.  I also implemented a similar facility for borrowers as well.  Next week I will begin to import some of the sample data from various libraries into the system and will allow the team to access the system to test it out.

Week Beginning 16th September 2019

I spent some time this week investigating the final part of the SCOSYA online resource that I needed to implement: A system whereby researchers could request access to the full audio dataset and a member of the team could approve the request and grant the person access to a facility where the required data could be downloaded.  Downloads would be a series of large ZIP files containing WAV files and accompanying textual data.  As we wanted to restrict access to legitimate users only I needed to ensure that the ZIP files were not directly web accessible, but were passed through to a web accessible location on request by a PHP script.

I created a test version using a 7.5Gb ZIP file that had been created a couple of months ago for the project’s ‘data hack’ event.  This version can be set up to store the ZIP files in a non-web accessible directory and then grab a file and pass it through to the browser on request.  It will be possible to add user authentication to the script to ensure that it can only be executed by a registered user.  The actual location of the ZIP files is never divulged so neither registered nor unregistered users will ever be able to directly link to or download the files (other than via the authenticated script).
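
The pass-through idea can be sketched as follows.  The test version described above was a PHP script, so this Python version only illustrates the general technique and is not the actual code.  The function name, its parameters and the 'user_is_approved' flag are all hypothetical.

```python
import os
import tempfile


def stream_file(base_dir, filename, user_is_approved, chunk_size=1024 * 1024):
    """Yield the contents of a file stored outside the web root in chunks,
    but only for users who have been granted access."""
    if not user_is_approved:
        raise PermissionError("user has not been granted access")
    # Keep only the final path component so a request such as
    # '../../secret' cannot escape the storage directory.
    safe_name = os.path.basename(filename)
    path = os.path.join(base_dir, safe_name)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Small demonstration using a temporary directory as the 'non-web' store.
with tempfile.TemporaryDirectory() as store:
    with open(os.path.join(store, "sample.zip"), "wb") as out:
        out.write(b"example data")
    received = b"".join(stream_file(store, "sample.zip", True, chunk_size=5))
    print(received)
```

Reading in chunks means the whole file never has to be held in memory, which matters when the files run to several gigabytes.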

This all sounds promising but I realised that there are some serious issues with this approach.  HTTP as used by web pages to transfer files is not really intended for downloading huge files and using this web-based method to download massive zip files is just not going to work very well.  The test ZIP file I used was about 7.5Gb in size (roughly the size of a DVD), but the actual ZIP files are likely to be much larger than this – with the full dataset taking up about 180Gb.  Even using my desktop PC on the University network it’s taken roughly 30 minutes to download the 7.5Gb file.  Using an external network would likely take a lot longer and bigger files are likely to be pretty unmanageable for people to download.
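
To put rough numbers on this, the snippet below scales the observed download time up to the full dataset.  It assumes the transfer rate stays constant at the rate seen in the test and that 1Gb is 1024Mb; both are simplifications, so the results are ballpark figures only.

```python
def estimated_minutes(size_gb, sample_gb=7.5, sample_minutes=30):
    """Scale the observed download time linearly to another file size."""
    return size_gb / sample_gb * sample_minutes


# Observed rate in megabytes per second (assuming 1Gb = 1024Mb).
rate_mb_per_second = 7.5 * 1024 / (30 * 60)
print(round(rate_mb_per_second, 1))  # roughly 4.3

# Time to download the full 180Gb dataset at that rate, in hours.
print(estimated_minutes(180) / 60)  # 12.0
```

So even on the University network the full dataset would take around twelve hours to download at the observed rate.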

It’s also likely that a pretty small number of researchers will be requesting the data, and if this is the case then perhaps it’s not such a good idea to take up 180Gb of web server space (plus the overheads of backups) to store data that is seldom going to be accessed, especially if this is simply replicating data that is already taking up a considerable amount of space on the shared network drive.  180Gb is probably more web space than is used by most other Critical Studies websites combined.  After discussing this issue with the team, we decided that we would not set up such a web-based resource to access the data, but would instead send ZIP files on request to researchers using the University’s transfer service, which allows files of up to 20Gb to be sent to both internal and external email addresses.  We’ll need to see how this approach works out, but I think it’s a better starting point than setting up our own online system.

I also spent some further time on the SCOSYA project this week implementing some changes to both the experts and the public atlases based on feedback from the team.  This included changing the default map position and zoom level, replacing some of the colours used for map markers and menu items, tweaking the layout of the transcriptions, ensuring certain words in story titles can appear in bold (as opposed to the whole title being bold as was previously the case) and removing descriptions from the list of features found in the ‘Explore’ menu in the public atlas.  I also added a bit of code to ensure that internal links from story pages to other parts of the public atlas would work (previously they weren’t doing anything because only the part after the hash was changing).  I also ensured that the experts atlas side panel resizes to fit the content whenever an additional attribute is added or removed.

Also this week I finally found a bit of time to fix the map on the advanced search page of the SCOTS Corpus website.  This map was previously powered by Google Maps, but they have now removed free access to the Google Maps service (you now need to provide a credit card and get billed if your usage goes over a certain number of free hits a month).  As we hadn’t updated the map or provided such details Google broke the map, covering it with warning messages and removing our custom map styles.  I have now replaced the Google Maps version with a map created using the free to use Leaflet.js mapping library (as I’m using for SCOSYA) and a free map tileset from OpenStreetMap.  Other than that it works in exactly the same way as the old Google Map.  The new version is now live here: https://www.scottishcorpus.ac.uk/advanced-search/.

Also this week I upgraded all of the WordPress sites I manage, engaged in some App Store duties and had a further email conversation with Marc Alexander about how dates may be handled in the Historical Thesaurus.  I also engaged in a long email conversation with Heather Pagan of the Anglo-Norman Dictionary about accessing the dictionary data.  Heather has now managed to access one of the servers that the dictionary website runs on and we’re now trying to figure out exactly where the ‘live’ data is located so that I can work with it.  I also fixed a couple of issues with the widgets I’d created last week for the GlasgowMedHums project (some test data was getting pulled into them) and tweaked a couple of pages.  The project website is launching tomorrow so if anyone wants to access it they can do so here: https://glasgowmedhums.ac.uk/

Finally, I continued to work on the new API for the Dictionary of the Scots Language, implementing the bibliography search for the ‘v2’ API.  This version of the API uses data extracted from the original API, and the test website I’ve set up to connect to it should be identical to the live site, but connects to the ‘v2’ API to get all of its data and in no way connects to the old, undocumented API.  API calls to search the bibliographies (both a predictive search used for displaying the auto-complete results and to populate a full search results page), and to display an individual bibliography are now available and I’ve connected the test site to these API calls, so staff can search for bibliographies here.

Whilst investigating how to replicate the original API I realised that the bibliography search on the live site is actually a bit broken.  The ‘Full Text’ search simply doesn’t work, but instead just does the same as a search for authors and titles (in fact the original API doesn’t even include a ‘full text’ option).  Also, results only display authors, so for records with no author you get some pretty unhelpful results.  I did consider adding in a full-text search, but as bibliographies contain little other than authors and titles there didn’t seem much point, so instead I’ve removed the option.  As the search is primarily set up as an auto-complete, which is set up to match words in authors or titles that begin with the characters that are being typed (i.e. a wildcard search such as ‘wild*’) and the full search results page only gets displayed if someone ignores the auto-complete list of results and manually presses the ‘Search’ button, I’ve made the full search results page always work as a ‘wild*’ search too.  So typing ‘aber’ into the search box and pressing ‘Search’ will bring up a list of all bibliographies with titles / authors featuring a word beginning with these characters.  With the previous version this wasn’t the case – you had to add a ‘*’ after ‘aber’ otherwise the full search results page would match ‘aber’ exactly and find nothing.  I’ve updated the help text on the bibliography search page to explain this a bit.
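
The ‘begins with’ behaviour can be illustrated with a short sketch.  This is not the API’s actual code: the function name, the record fields and the sample bibliography records below are all invented for the example.

```python
import re


def prefix_search(term, records):
    """Return records where any word in the author or title begins with
    the search term, treating 'aber' and 'aber*' identically."""
    term = term.rstrip("*").lower()
    if not term:
        return []
    matches = []
    for record in records:
        text = f"{record.get('author', '')} {record.get('title', '')}".lower()
        # Split into words so that the match is anchored to word starts.
        words = re.findall(r"\w+", text)
        if any(word.startswith(term) for word in words):
            matches.append(record)
    return matches


# Invented sample records, including one with no author.
records = [
    {"author": "", "title": "Aberdeen Burgh Records"},
    {"author": "Smith, J.", "title": "Old Abernethy Charters"},
    {"author": "Brown, A.", "title": "A Glossary"},
]

for record in prefix_search("aber", records):
    print(record["title"])
```

Because a trailing ‘*’ is stripped before matching, a user who still types ‘aber*’ out of habit gets the same results as one who types ‘aber’.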

The full search results (and the results side panel) in the new version now include titles as well as authors, which makes things clearer and I’ve also made the search results numbering appear at the top of the corresponding result text rather than on the last line.  This is also the case for entry searches too.  Once the test site has been fully tested and approved we should be able to replace the live site with the new site (ensuring all WordPress content from the live site is carried over, of course).  Doing so will mean the old server containing the original API can (once we’re confident all is well) be switched off.  There is still the matter of implementing the bibliography search for the V3 data, but as mentioned previously this will probably be best tackled once we have sorted out the issues with the data and we are getting ready to launch the new version.

Week Beginning 26th August 2019

I focussed on the SCOSYA project for the first few days of this week.  I need to get everything ready to launch by the end of September and there is an awful lot still left to do, so this is really my priority at the moment.  I’d noticed over the weekend that the story pane wasn’t scrolling properly on my iPad when the length of the slide was longer than the height of the atlas.  In such cases the content was just getting cut off and you couldn’t scroll down to view the rest or press the navigation buttons.  This was weird as I thought I’d fixed this issue before.  I spent quite a bit of time on Monday investigating the issue, which has resulted in me having to rewrite a lot of the slide code.  After much investigation I reckoned that this was an intermittent fault caused by the code returning a negative value for the height of the story pane instead of its real height.  When the user presses the button to load a new slide the code pulls the HTML content of the slide in and immediately displays it.  After that another part of the code then checks the height of the slide to see if the new contents make the area taller than the atlas, and if so the story area is then resized.  The loading of the HTML using jQuery’s html() function should be ‘synchronous’ – i.e. the following parts of code should not execute before the loading of the HTML is completed.  But sometimes this wasn’t the case – the new slide contents weren’t being displayed before the check for the new slide height was being run, meaning the slide height check was giving a negative value (no contents minus the padding round the slide).  The slide contents then displayed but as the code thought the slide height was less than the atlas it was not resizing the slide, even when it needed to.  It is a bit of a weird situation as according to the documentation it shouldn’t ever happen.  
I’ve had to put a short ‘timeout’ into the script as a work-around – after the slide loads the code waits for half a second before checking for the slide height and resizing, if necessary.  This seems to be working but it’s still annoying to have to do this.  I tested this out on my Android phone and on my desktop Windows PC with the browser set to a narrow height and all seemed to be working.  However, when I got home I tested the updated site out on my iPad and it still wasn’t working, which was infuriating as it was working perfectly on other touchscreens.

In order to fix the issue I needed to entirely change how the story pane works.  Previously the story pane was just an HTML area that I’d added to the page and then styled to position within the map, but there were clearly some conflicts with the mapping library Leaflet when using this approach.  The story pane was positioned within the map area and mouse actions that Leaflet picks up (scrolling and clicking for zoom and pan) were interfering with regular mouse actions in the HTML story area (clicking on links, scrolling HTML areas).  I realised that scrolling within the menu on the left of the map was working fine on the iPad so I investigated how this differed from the story pane on the right.  It turned out that the menu wasn’t just a plain HTML area but was instead created by a plugin for Leaflet that extends Leaflet’s ‘Control’ options (used for buttons like ‘+/-‘ and the legend).  Leaflet automatically prevents the map’s mouse actions from working within its control areas, which is why scrolling in the left-hand menu worked.  I therefore created my own Leaflet plugin for the story pane, based on the menu plugin.  Using this method to create the story area thankfully worked on my iPad, but it did unfortunately take several hours to get things working, which was time I should ideally have been spending on the Experts interface.  It needed to be done, though, as we could hardly launch an interface that didn’t work on iPads.

I also had to spend some further time this week making some more tweaks to the story interface that the team had suggested such as changing the marker colour for the ‘Home’ maps, updating some of the explanatory text and changing the pop-up text on the ‘Home’ map to add in buttons linking through to the stories.  The team also wanted to be able to have blank maps in the stories, to make users focus on the text in the story pane rather than getting confused by all of the markers.  Having blank maps for a story slide wasn’t something the script was set up to expect, and although it was sort of working, if you navigated from a map with markers to a blank map and then back again the script would break, so I spent some time fixing this.  I also managed to find a bit of time to start on the experts interface, although less time than I had hoped.  For this I’ve needed to take elements from the atlas I’d created for staff use, but adapt them to incorporate changes that I’d introduced for the public atlas.  This has basically meant starting from scratch and introducing new features one by one.  So far I have the basic ‘Home’ map showing locations and the menu working.  There is still a lot left to do.

I spent the best part of two days this week working on the front-end for the 18th Century Borrowing pilot project for Matthew Sangster.  I wrote a little document that detailed all of the features I was intending to develop and sent this to Matt so he could check to see if what I’m doing met his expectations.  I spent the rest of the time working on the interface, and made some pretty good progress.  So far I’ve made an initial interface for the website (which is just temporary and any aspect of which can be changed as required), I’ve written scripts to generate the student forename / surname and professor title / surname columns to enable searching by surname, and I’ve created thumbnails of the images.  The latter was a bit of a nightmare as previously I’d batch rotated the images 90 degrees clockwise as the manuscripts (as far as I could tell) were written in landscape format but the digitised images were portrait, meaning everything was on its side.

However, I did this using the Windows image viewer, which gives the option of applying the rotation to all images in a folder.  What I didn’t realise is that the image viewer doesn’t update the metadata embedded in the images, and this information is used by browsers to decide which way round to display the images.  I ended up in a rather strange situation where the images looked perfect on my Windows PC, and also when opened directly within the browser, but when embedded in an HTML page they appeared on their side.  It took a while to figure out why this was happening, but once I did I regenerated the thumbnails using the command-line ImageMagick tool instead, which I set to wipe the image metadata as well as rotating the images, which seemed to work.  That is until I realised that Manuscript 6 was written in portrait not landscape so I had to repeat the process again but miss out Manuscript 6.  I have since realised that all the batch processing of images I did to generate tiles for the zooming and panning interface is also now going to be wrong for all landscape images and I’m going to have to redo all of this too.

Anyway, I also made the facility where a user can browse the pages of the manuscripts, enabling them to select a register, view the thumbnails of each page contained therein and then click through to view all of the records on the page.  This ‘view records’ page has both a text and image view.  The former displays all of the information about each record on the page in a tabular manner, including links through to the GUL catalogue and the ESTC.  The latter presents the image in a zoomable / pannable manner, but as mentioned earlier, the bloody image is on its side for any manuscript written in a landscape way and I still need to fix this, as the following screenshot demonstrates:

Also this week I spent a further bit of time preparing for my PDR session that I will be having next week, spoke to Wendy Anderson about updates to the SCOTS Corpus advanced search map that I need to fix, fixed an issue with the Medical Humanities Network website, made some further tweaks to the RNSN song stories and spoke to Ann Ferguson at the DSL about the bibliographical data that needs to be incorporated into the new APIs.  Another pretty busy week, all things considered.


Week Beginning 23rd July 2018

I continued with the group statistics feature for the SCOSYA project this week.  Last week Gary had let me know that he was experiencing issues when using the feature with a large group he had created, so I did some checking of functionality.  I created a group with 140 locations in it and tried out the feature with a variety of searches on a variety of devices, operating systems and browsers but didn’t encounter any issues.  Thankfully it turned out that Gary needed to clear his browser’s cache, and with that done the feature worked perfectly for him.  Gary had also reported an issue with the data export facility I created a while back for the project team to use.  It was working fine if limits on the returned data were included, but gave nothing but a blank page when all the data was requested.  After a bit of investigation I reached the conclusion that it must be some kind of limit imposed on the server, and a quick check with Chris revealed that when the script returned all of the data it was exceeding a memory limit.  When Chris increased the limit the script began to work perfectly again.

In addition to these investigations I added a couple of new pieces of functionality to the group statistics feature.  I added in the option to show or hide locations that are not part of your selected group, allowing the user to cut down on the clutter and focus on the locations that they are particularly interested in.  I also added in an option to download the data relating specifically to the user’s selected locations, rather than for all locations.  This meant updating the project’s API to allow any number of locations to be included in the GET request sent to the server.  Unfortunately this uncovered another server setting that was preventing certain requests working.  With many locations selected the URL sent to the API is very long, and in such cases the request was not fully getting through to my API scripts but was instead getting blocked by the server.  Rather than the request being processed, the API’s default index page was displaying, but without the CSS file properly loading.  With shorter URLs the request got through fine.  I checked with Chris and a setting on the server was limiting URL parameters to 512 characters in length.  Chris increased this and the request got through and returned the required data.  With this issue out of the way the ‘download group data’ feature worked properly.  I had been making these changes on a temporary version of the atlas in the CMS, but with everything in place I moved my temporary version over to the main atlas, and all seems to be working well.

I had a few meetings this week.  The first was with someone from a start-up company who are wanting to develop some kind of transcription service.  We talked about the SCOTS corpus and its time-aligned transcriptions of audio files.  I’m not sure how much help I really was, however, as what they really need is a tool to create such transcriptions rather than publish them, and the SCOTS project used a different tool to do this called PRAAT.  The guy is going to meet with Jane Stuart-Smith who should be able to give more information on this side of things, and also with Wendy Anderson who knows a bit more about the history of the SCOTS project than I do, so maybe these subsequent meetings will be more useful.  I also met with Ewa Wanat, a PhD student in English Language, who is wanting to put together an app about rhythm and metre in English.  I gave her some advice about the sorts of tools she could use to develop the app and showed her the ‘English Metre’ app I created last year.  She already has a technical partner in mind for her project so probably won’t need me to do the actual work, but I think I was able to give her some useful advice.  I also met with Scott Spurlock from Theology, for whom I will be creating a crowdsourcing tool that will be used to transcribe some records of the Church of Scotland.  There has been a bit of a delay in getting the images for the project, and Scott hasn’t decided what URL he would like for the project, but once these things are sorted I’ll be able to start to work developing the tool, hopefully using some existing technologies.

Before I went away on holiday the SLD people were in touch to say that the Android version of the Scots Dictionary for Schools app had been taken down, and the person with the account details had retired without passing the account details on.  We tried various approaches to get access to the account but in the end it looked like the only thing to do would be to create a new account and republish the app.  Thomas Widmann set up the account just before I went away and I said I’d sort out the technical side of things when I got back to the office.  On Friday this week I tackled this task.  As I suspected, it took rather a long time to get all of the technologies up to date again.  I don’t develop apps all that often and it seems that every time I come to develop a new one (or create a new version of an old one) the software and methodologies needed to publish an app have all changed.  It took most of the morning to install the necessary software updates, and a fair bit of the afternoon to figure out how the new workflow for publishing an app would work.  However, I got there in the end and by the end of the day the new version was available for download (for free) via the Google Play store.  You can access the dictionary app here:  https://play.google.com/store/apps/details?id=com.sld.ssd2

I’m on holiday on Monday to Wednesday next week, so next week’s report should be rather shorter.

Week Beginning 21st May 2018

I spent most of this week working on the new timeline features for the Historical Thesaurus.  Marc, Fraser and I had a useful meeting on Wednesday where we discussed some final tweaks to the mini-timelines and the category page in general, and also discussed some future updates to the sparklines.

I made the mini-timelines slightly smaller than they were previously, and Marc changed the colours used for them.  I also updated the script that generates the category page content via an AJAX call so that an additional ‘sort by’ option could be passed to it.  I then implemented sorting options that matched up with those available through the full Timeline feature, namely sorting by first attested date, alphabetically, and length of use.  I also updated this script to allow users to control whether the mini-timelines appear on the page or not.  With these options available via the back-end script I then set up the choices to be stored as a session variable, meaning the user’s choices are ‘remembered’ as they navigate throughout the site and can be applied automatically to the data.
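
The three sorting options can be sketched as below.  The real script’s language and data structures are not shown here, so the field names, the option names and the placeholder words and dates are all assumptions made for the example, as is the decision to list the longest-used words first when sorting by length of use.

```python
def sort_words(words, sort_by="date"):
    """Sort word records by first attested date, alphabetically,
    or by length of use (longest first, which is an assumption)."""
    if sort_by == "alpha":
        return sorted(words, key=lambda w: w["word"].lower())
    if sort_by == "length":
        return sorted(words, key=lambda w: w["last"] - w["first"], reverse=True)
    # Default: order by first attested date.
    return sorted(words, key=lambda w: w["first"])


# Placeholder records, not real Historical Thesaurus data.
words = [
    {"word": "gamma", "first": 1500, "last": 1600},
    {"word": "alpha", "first": 1700, "last": 2000},
    {"word": "beta", "first": 1200, "last": 1250},
]

for option in ("date", "alpha", "length"):
    print(option, [w["word"] for w in sort_words(words, option)])
```

Storing the chosen option in the session would then just mean passing the remembered value as 'sort_by' on each subsequent request.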

While working on the sorting options I noticed that the alphabetical ordering of the main timeline didn’t properly order ashes and thorns – e.g. words beginning with these were appearing at the end of the list when ordered alphabetically.  I fixed this so that for ordering purposes an ash is considered ‘ae’ and a thorn ‘th’.  This doesn’t affect how words are displayed, just how they are ordered.
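
A minimal sketch of this kind of ordering fix, using a separate sort key so that the displayed form of each word is untouched.  The function name and the sample words are illustrative only.

```python
def alphabetical_key(word):
    """Build a key for ordering only: ash is treated as 'ae' and
    thorn as 'th'.  The word itself is never modified."""
    return word.lower().replace("æ", "ae").replace("þ", "th")


words = ["thing", "æsc", "bread", "þorn", "apple"]

# A plain sort places words beginning with ash or thorn at the end.
print(sorted(words))
# → ['apple', 'bread', 'thing', 'æsc', 'þorn']

# Sorting with the key files them under 'ae' and 'th' instead.
print(sorted(words, key=alphabetical_key))
# → ['æsc', 'apple', 'bread', 'thing', 'þorn']
```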

We also decided at the meeting that we would move the thesaurus sites that were on a dedicated (but old) server (namely HT, Mapping Metaphor, Thesaurus of Old English and a few others) to a more centrally hosted server that is more up to date.  This switch would allow these sites to be made available via HTTPS as opposed to HTTP and will free up the old server for us to use for other things, such as some potential corpus based resources.  Chris migrated the content over and after we’d sorted a couple of initial issues with the databases all of the sites appear to be working well.  It is also a really good thing to have the sites available via HTTPS.  We are also now considering setting up a top-level ‘.ac.uk’ address for the HT and spent some time making a case for this.

A fairly major feature I added to the HT this week was a ‘menu’ section for main categories, which contains some additional options, such as the options to change the sorting of the category pages and turn the mini-timelines on and off.  For the button to open the section I decided to use the ‘hamburger’ icon, which Marc favoured, rather than a cog, which I was initially thinking of using, because a cog suggests managing options whereas this section contains both options and additional features.  I initially tried adding the drop-down section as near to the icon as possible, but I didn’t like the way it split up the category information, so instead I set it to appear beneath the part of speech selection.  I think this will be ok as it’s not hugely far away from the icon.  I did wonder whether instead I should have a section that ‘slides up’ above the category heading, but decided this was a bad idea as if the user has the heading at the very top of the screen it might not be obvious that anything has happened.

The new section contains buttons to open the ‘timeline’ and ‘cite’ options.  I’ve expanded the text to read ‘Timeline visualization’ and ‘Cite this category’ respectively.  Below these buttons there are the options to sort the words.  Selecting a sort option reloads the content of the category pane (maincat and subcats), while keeping the drop-down area open.  Your choice is ‘remembered’ for the duration of your session, so you don’t have to keep changing the ordering as you navigate about.  Changing to another part of speech or to a different category closes the drop-down section.  I also updated the ‘There are xx words’ text to make it clearer how the words are ordered if the drop-down section is not open.

Below the sorting option is a further option that allows you to turn on or off the mini-timelines.  As with the sorting option, your choice is ‘remembered’.  I also added some tooltip text to the ‘hamburger’ icon, as I thought it was useful to have some detail about what the button does.

I then updated the main timeline so that the default sorting option aligns itself with the choice you made on the category page.  E.g. If you’ve ordered the category by ‘length of use’ then the main timeline will be ordered this way too when you open it.  I also set things up so that if you change the ordering via the main timeline pop-up then the ordering of the category will be updated to reflect your choice when you close the popup, although Fraser didn’t like this so I’ll probably remove this feature next week.  Here’s how the new category page looks with the options menu opened:

I spent some more time on the REELS project this week, as Eila had got back to me with some feedback about the front-end.  This included changing the ‘Other’ icon, which Eila didn’t like.  I wasn’t too keen on it either, so I was happy to change it.  I now use a sort of archway instead of the tall, thin monument, which I think works better.  I also removed non-Berwickshire parishes from the Advanced Search page, tweaked some of the site text and also fixed the search for element language, which I had inadvertently broken when changing the way date searches worked last week.

Also this week I fixed an issue with the SCOTS corpus, which was giving 403 errors instead of playing the audio and video files, and was giving no results on the Advanced Search page.  It turned out that this was being caused by a security patch that had been installed on the server recently, which was blocking legitimate requests for data.  I was also in touch with Scot Spurlock about his crowdsourcing project, which looks to be going ahead in some capacity, although not with the funding that was initially hoped for.

Finally, I had received some feedback from Faye Hammill and her project partners about the data management plan I’d written for her project.  I responded to some queries and finalised some other parts of the plan, sending off a rather extensive list of comments to her on Friday.

Week Beginning 19th March 2018

With the strike action over (for now, at least) I returned to a full week of work, and managed to tackle a few items that had been pending for a while.  I’d been asked to write a Technical Plan for an AHRC application for Faye Hammill in English Literature, but since then the changeover from four-page, highly structured Technical Plans to two-page more free-flowing Data Management Plans has taken place.  This was a good opportunity to write an AHRC Data Management Plan, and after following the advice on the AHRC website (http://www.ahrc.ac.uk/documents/guides/data-management-plan/) and consulting the additional documentation on the DCC’s DMPonline tool (https://dmponline.dcc.ac.uk/) I managed to write a plan that covered all of the points.  There are still some areas where I need further input from Faye, but we do at least have a first draft now.

I also created a project website for Anna McFarlane’s British Academy funded project.  The website isn’t live yet, so I can’t include the URL here, but Anna is happy with how it looks, which is good.  After sorting that out I then returned to the REELS project.  I created the endpoints in the API that would allow the various browse facilities we had agreed upon to function, and then built these features in the front-end.  It’s now possible to (for example) list all sources and see which has the most place-names associated with it, or bring up a list of all of the years in which historical forms were first attested.

I spent quite a bit of time this week working on the extraction of words and their thematic headings from EEBO for the Linguistic DNA project.  Before the strike I’d managed to write a script that went through a single file and counted up all of the occurrences of words, parts of speech and associated thematic headings, but I was a little confused that there appeared to be thematic heading data in column 6 and also column 10 of the data files.  Fraser looked into this and figured out that the most likely thematic heading appeared in column 10, while other possible ones appeared in column 6.  This was a rather curious way to structure the data, but once I knew about it I could set my script to focus on column 10, as we’re only interested in the most likely thematic heading.

I updated my script to insert data into a database rather than just hold things temporarily in an array, and I also wrapped the script in another function that then applied the processing to every file in a directory rather than just a single file.  With this in place I set the script running on the entire EEBO directory.  I was unsure whether running this on my desktop PC would be fast enough, but thankfully the entire dataset was processed in just a few hours.
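
As an illustration only, the directory-wide counting described above could be sketched in Python along these lines.  This is not the actual script: the tab-separated layout, the column positions for the lemma and part of speech, and the function names are all assumptions, and only the use of ‘column 10’ for the most likely thematic heading comes from the text (taken here as index 9, counting from zero).

```python
import csv
from collections import Counter
from pathlib import Path

# Hypothetical 0-based column positions. Only "column 10" for the most likely
# thematic heading is taken from the text; the other two are placeholders.
LEMMA_COL = 2
POS_COL = 3
HEADING_COL = 9


def count_file(path):
    """Count (lemma, pos, heading) combinations in one tab-separated file."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= HEADING_COL:
                continue  # skip short or malformed rows
            counts[(row[LEMMA_COL], row[POS_COL], row[HEADING_COL])] += 1
    return counts


def count_directory(directory):
    """Apply count_file to every file in a directory, keyed by filename."""
    return {
        path.name: count_file(path)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Keeping the per-file function separate from the directory wrapper mirrors the approach described above, where the single-file processing was wrapped in another function rather than rewritten.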

My script finished processing all 14,590 files that I had copied from the J drive to my local PC, resulting in a whopping 70,882,064 rows entered into my database.  Everything seemed to be going very well, but Fraser wasn’t sure I had all of the files, and he was correct.  Having checked the J drive, there were 25,368 items, so when I had copied the files across the process must have silently failed at some point.  And even more annoyingly it didn’t fail in an orderly manner.  E.g. the earliest file I have on my PC is A00018 while there are several earlier ones on the J drive.
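
A silent copy failure like this can be caught by comparing the two directory listings.  The following is a minimal Python sketch of that check, not something that was run at the time; the function name and the directory arguments are hypothetical.

```python
from pathlib import Path


def missing_files(source_dir, copy_dir):
    """Return the names of files present in source_dir but absent from copy_dir."""
    source = {p.name for p in Path(source_dir).iterdir() if p.is_file()}
    copied = {p.name for p in Path(copy_dir).iterdir() if p.is_file()}
    return sorted(source - copied)
```

Comparing by name only would not spot a file that was partially copied, so a size or checksum comparison would be needed for that case.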

I copied all of the files over again and decided that rather than dropping the database and starting from scratch I’d update my script to check to see whether a file had already been processed, meaning that only the missing 10,000 or so would be dealt with.  However, in order to do this the script would need to query a 70 million row database for the ‘filename’ column, which didn’t have an index.  I began the process of creating an index, but indexing 70 million rows took a long time – several hours, in fact.  I almost gave up and inserted all the data again from scratch, but the thing is I knew I would need this index in order to query the data anyway, so I decided to persevere.  Thankfully the index finally finished building and I could then run my script to insert the missing 10,000 files, a process that took a bit longer as the script now had to query the database and also update the index as well as insert the data.  But finally all 25,368 files were processed, resulting in 103,926,008 rows in my database.
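
The ‘skip anything already processed’ check can be sketched as follows.  This is a hedged illustration using Python’s built-in sqlite3 module with an in-memory database, which is a stand-in rather than the database actually used; the table name, column names other than ‘filename’, and the sample row are all made up for the example.

```python
import sqlite3


def ensure_filename_index(conn):
    """Index the filename column so per-file lookups avoid a full table scan."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_filename ON occurrences (filename)"
    )


def already_processed(conn, filename):
    """Return True if at least one row exists for this filename."""
    row = conn.execute(
        "SELECT 1 FROM occurrences WHERE filename = ? LIMIT 1", (filename,)
    ).fetchone()
    return row is not None


def files_still_to_process(conn, filenames):
    """Keep only the filenames that have no rows in the table yet."""
    return [name for name in filenames if not already_processed(conn, name)]


# Hypothetical table layout and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrences "
    "(filename TEXT, lemma TEXT, pos TEXT, heading TEXT, freq INTEGER)"
)
conn.execute("INSERT INTO occurrences VALUES ('A00018', 'christ', 'NP1', 'H1', 3)")
ensure_filename_index(conn)
todo = files_still_to_process(conn, ["A00001", "A00018"])
```

Here `todo` contains only ‘A00001’, as ‘A00018’ already has rows.  Without the index each lookup would have to scan the whole table, which is why the index was worth the wait despite the build time.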

The script and the data are currently located on my desktop PC, but if Fraser and Marc want to query it I’ll need to get this migrated to a web server of some sort, so I contacted Chris about this.  Chris said he’d sort a temporary solution out for me, which is great.  I then set to work writing another script that would extract summary information for the thematic headings and insert this into another table.  After running the script this table now contains a total count of each word / part of speech / thematic heading across the entire EEBO collection.  Where a lemma appears with multiple parts of speech these are treated as separate entities and are not added together.  For example, ‘AA Creation NN1’ has a total count of 4609 while ‘AA Creation NN2’ has a total count of 19, and these are separate rows in the table.
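
The summary step amounts to grouping by lemma, part of speech and heading and summing the counts.  A minimal sketch, again using sqlite3 as a stand-in with hypothetical table names, column names and sample values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-file table; the real schema is not described in the text.
conn.execute(
    "CREATE TABLE occurrences "
    "(filename TEXT, lemma TEXT, pos TEXT, heading TEXT, freq INTEGER)"
)
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?, ?, ?, ?)",
    [
        ("A00001", "creation", "NN1", "H1", 4),
        ("A00002", "creation", "NN1", "H1", 6),
        ("A00002", "creation", "NN2", "H1", 1),
    ],
)

# One summary row per lemma / part of speech / heading combination, so the
# same lemma with different parts of speech stays in separate rows.
conn.execute(
    "CREATE TABLE summary AS "
    "SELECT lemma, pos, heading, SUM(freq) AS total "
    "FROM occurrences GROUP BY lemma, pos, heading"
)
summary = conn.execute(
    "SELECT lemma, pos, heading, total FROM summary ORDER BY lemma, pos"
).fetchall()
```

Including the part of speech in the GROUP BY is what keeps the NN1 and NN2 rows apart rather than adding them together.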

Whilst working with the data I noticed that a significant amount of it is unusable.  Of the almost 104 million rows of data, over 20 million have been given the heading ‘04:10’ and a lot of these are words that probably could have been cleaned up before the data was fed into the tagger.  A lot of these are mis-classified words that have an asterisk or a dash at the start.  If the asterisk / dash had been removed then the word could have been successfully tagged.  E.g. there are 88 occurrences of ‘*and’ that have been given the heading ‘04:10’ and part of speech ‘FO’.  Basically about a fifth of the dataset has been assigned an unusable thematic heading, and much of this is data that could have been useful if it had been pre-processed a little more thoroughly.
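
The sort of pre-processing that might have helped is very simple.  The sketch below strips leading asterisks and dashes from a token before it would be passed to a tagger; it is an assumption about what clean-up would be appropriate, not a description of how the data was actually prepared, and the function name is hypothetical.

```python
import re

# Matches one or more asterisks or hyphens at the very start of a token.
LEADING_MARKERS = re.compile(r"^[*\-]+")


def strip_leading_markers(token):
    """Remove leading asterisks / dashes, leaving the rest of the token intact."""
    return LEADING_MARKERS.sub("", token)
```

With this, ‘*and’ becomes ‘and’, while a hyphen in the middle of a word is left alone.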

Anyway, after tallying up the frequencies across all texts I then wrote a script to query this table and extract a ‘top 10’ list of lemma / pos combinations for each of the 3,972 headings that are used.  The output has one row per heading and a column for each of the top 10 (or fewer if there are fewer than 10).  This currently has the lemma, then the pos in brackets and the total frequency across all 25,000 texts after a bar, as follows: christ (NP1) | 1117625.  I’ve sent this to Fraser and once he gets back to me I’ll proceed further.
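
For illustration, the ‘top 10’ extraction and the output format could be sketched like this in Python.  The function name and the shape of the input rows (lemma, pos, heading, total) are assumptions; only the ‘lemma (pos) | frequency’ format is taken from the text.

```python
from collections import defaultdict


def top_n_per_heading(summary_rows, n=10):
    """Group (lemma, pos, heading, total) rows by heading and keep the n most frequent."""
    by_heading = defaultdict(list)
    for lemma, pos, heading, total in summary_rows:
        by_heading[heading].append((lemma, pos, total))

    output = {}
    for heading, entries in by_heading.items():
        # Highest total first; headings with fewer than n entries keep them all.
        entries.sort(key=lambda entry: entry[2], reverse=True)
        output[heading] = [
            f"{lemma} ({pos}) | {total}" for lemma, pos, total in entries[:n]
        ]
    return output
```

Each heading maps to a list of formatted cells, which could then be written out as one row per heading.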

In addition to the above big tasks, I also dealt with a number of smaller issues.  Thomas Widmann of SLD had asked me to get some DSL data from the API for him, so I sent that on to him.  I updated the ‘favicon’ for the SPADE website, I fixed a couple of issues for the Medical Humanities Network website, and I dealt with a couple of issues with legacy websites:  For SWAP I deleted the input forms as these were sending spam to Carole.  I also fixed an encoding issue with the Emblems websites that had crept in when the sites had been moved to a new server.

I also heard this week that IT Services are going to move all project websites to HTTPS from HTTP.  This is really good news as Google has started to rank plain HTTP sites lower than HTTPS sites, plus Firefox and Chrome give users warnings about HTTP websites.  Chris wanted to try migrating one of my sites to HTTPS and we did this for the Scots Corpus.  There were some initial problems with the certificate not working for the ‘www’ subdomain but Chris quickly fixed this and everything appeared to be working fine.  Unfortunately, although everything was fine within the University network, the University’s firewall was blocking HTTPS requests from external users, meaning no-one outside of the University network could access the site.  Thankfully someone contacted Wendy about this and Chris managed to get the firewall updated.

I also did a couple of tasks for the SCOSYA project, and spoke to Gary about the development of the front-end, which I think is going to need to start soon.  Gary is going to try and set up a meeting with Jennifer about this next week.  On Friday afternoon I attended a workshop about digital editions that Sheila Dickson in German had organised.  There were talks about the Cullen project, the Curious Travellers project, and Sheila’s Magazin zur Erfahrungsseelenkunde project.  It was really interesting to hear about these projects and their approaches to managing transcriptions.