Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects of the search would still be slow.  Facetted searching would still require several further database queries, as would extracting all of the fields needed to display the search results, and it seemed inadvisable to try to build all of this functionality myself when an existing package like Solr could already do it all.

Solr is considerably faster than the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously, something that would be hopelessly slow if I attempted to create it directly with the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3,325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

"London",2211,
"Edinburgh",119,
"Dublin",100,
"Paris",30,
"Edinburgh; London",16,
"Cambridge",4,
"Eton",3,
"Oxford",3,
"The Hague",3,
"Naples",2,
"Rome",2,
"Berlin",1,
"Glasgow",1,
"Lausanne",1,
"Venice",1,
"York",1

Format:

"8vo",842,
"octavo",577,
"4to",448,
"quarto",433,
"4to.",88,
"8vo., plates, port., maps.",88,
"folio",76,
"duodecimo",67,
"Folio",33,
"12mo",19,
"8vo.",17,
"8vo., plates: maps.",16

Borrower gender:

"Male",3128,
"Unknown",109,
"Female",64,
"Unclear",2

These would then allow me to build in options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database, it is likely that such an approach would be much slower and would take me considerable time to develop.  It seems much more sensible to use an existing solution if at all possible.
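
To give a concrete sense of how little work this requires on our side, here’s a minimal sketch of the sort of faceted query described above, written in Python against a local Solr instance.  The field names come from the JSON structure shown later in this post; the core name (‘bnb’) and the exact query parameters are my assumptions for illustration rather than the precise setup I’m using.

import requests

# A sketch of a faceted Solr query: find borrowings whose standardised title
# contains 'Roman' and ask Solr to break the matches down by other fields.
solr_select = "http://localhost:8983/solr/bnb/select"  # assumed core name

params = {
    "q": "standardisedtitle:Roman",
    "rows": 0,                                   # we only want the counts here
    "facet": "true",
    "facet.field": ["pubplaces", "formats", "bgenders"],
    "facet.mincount": 1,                         # omit values with no matches
    "wt": "json",
}

response = requests.get(solr_select, params=params).json()
facets = response["facet_counts"]["facet_fields"]

# Each facet comes back as a flat [value, count, value, count, ...] list.
for field, flat in facets.items():
    print(field)
    for value, count in zip(flat[::2], flat[1::2]):
        print(f'  "{value}",{count}')

Passing several fields to facet.field means all of the breakdowns above come back from a single request.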

In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON returned from the API is rather more complicated than we need purely for search purposes and includes a lot of data that isn’t really needed, as it contains everything required to display the full borrowing record.  I therefore worked out a simpler JSON structure that contains only the fields we would either want to search or would need for a simplified search results page.  Here’s an example:

{
  "bnid": 1379,
  "lid": 6,
  "slug": "glasgow-university",
  "lname": "Glasgow University Library",
  "rid": 2,
  "rname": "3",
  "syear": 1760,
  "eyear": 1765,
  "rtype": "Student",
  "pid": 107,
  "fnum": "4r",
  "transcription": "Euseb: Eclesiastical History",
  "bday": 17,
  "bmonth": 9,
  "byear": 1760,
  "rday": 1,
  "rmonth": 10,
  "ryear": 1760,
  "borrowed": "1760-09-17",
  "returned": "1760-10-01",
  "bdayofweek": "Wednesday",
  "rdayofweek": "Wednesday",
  "originaltitle": "",
  "standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned historiographers, Eusebius, Socrates, and Evagrius.",
  "brids": ["1"],
  "bfnames": ["Charles"],
  "bsnames": ["Wilson"],
  "bfullnames": ["Charles Wilson"],
  "boccs": ["University Student", "Education"],
  "bgenders": ["Male"],
  "aids": ["74"],
  "asnames": ["Eusebius of Caesarea"],
  "afullnames": [" Eusebius of Caesarea"],
  "beids": ["88"],
  "edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned historiographers, Eusebius, Socrates, and Evagrius."],
  "estcs": ["R21513"],
  "langs": ["English"],
  "pubplaces": ["London"],
  "formats": ["folio"]
}

I wrote a script that exports an individual JSON file like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than on the server, to avoid overloading the server.  I then created a Solr Core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data they can hold (e.g. some fields can hold multiple values, such as borrower occupations; some are text strings, some are integers and some are dates).  I then ran the Solr script that ingests the data.
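
For anyone wondering what ‘creating a Core and specifying a schema’ involves in practice, here’s a hedged sketch of the general shape of it, using Solr’s Schema API and JSON update handler over HTTP from Python.  The core name, the choice of field types and the filename are assumptions for illustration; the real schema defines every field in the example record above.

import json
import requests

solr = "http://localhost:8983/solr/bnb"  # assumed core name

# Define a few representative fields via Solr's Schema API.  Multi-valued
# fields (e.g. borrower occupations) are flagged with multiValued=True.
fields = [
    {"name": "bnid", "type": "pint", "stored": True},
    {"name": "standardisedtitle", "type": "text_general", "stored": True},
    {"name": "borrowed", "type": "pdate", "stored": True},
    {"name": "boccs", "type": "string", "multiValued": True, "stored": True},
]
for field in fields:
    requests.post(f"{solr}/schema", json={"add-field": field}).raise_for_status()

# Ingest one exported JSON document and commit it so it becomes searchable.
with open("borrowing_1379.json") as f:  # hypothetical filename
    doc = json.load(f)
requests.post(f"{solr}/update?commit=true", json=[doc]).raise_for_status()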

It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected.  These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.

Some issues encountered include:

  1. Empty fields in the data resulted in no value for the corresponding JSON field (e.g. "bday": followed by nothing), which invalidated the JSON structure. I needed to update the data export script to ensure such empty fields were omitted.
  2. Solr’s date type requires a full date (e.g. 1792-02-16), so partial dates (e.g. 1792) failed. I ended up reverting to an integer field for returned dates, as these are generally much vaguer, and generating placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field has to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one, as I would have expected an integer ID to be allowed, and it took some time to work out why my nice integer ID was failing.
  4. I realised more fields should be added to the JSON output as I went on and therefore had to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors).
  5. Certain characters appearing in the text fields caused the import to break. For example, double quotes needed to be converted to the entity ‘&quot;’ as their appearance in the JSON invalidated the structure.  I therefore updated the translation, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names (see the sketch after this list).
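
Incidentally, both the empty-field issue (1) and the double-quote issue (5) largely disappear if each record is built as a data structure and serialised with a proper JSON encoder rather than assembled as a string.  A minimal Python sketch of the idea, using field names from the example record above (the real export script works differently, and the surname here is invented purely to show the escaping):

import json

# Build one export record, drop fields with no value, and let the JSON
# encoder handle the escaping of any embedded double quotes.
record = {
    "bnid": 1379,
    "transcription": "Euseb: Eclesiastical History",
    "bday": None,                      # an empty field in the database
    "bsnames": ['Wilson "Junior"'],    # hypothetical surname containing quotes
}

clean = {key: value for key, value in record.items() if value not in (None, "")}
print(json.dumps(clean, ensure_ascii=False, indent=2))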

However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well.  And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9: https://solr.apache.org/downloads.html) to be installed on a server.  This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and, as far as I know, it fires up a Java-based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs, a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions for doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API), so we could limit access to the IP address of that server, or, if Solr is going to be installed on the same server, limiting access to localhost could work.

In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud).  Once Solr is running we’d need a Core to be created.  I have the schema file the core would require and can give instructions about setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.

One downside to using Solr is that it is a separate system from the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently: exporting the data takes at least an hour, and transferring the files to the server for ingest also takes a long time (uploading hundreds of thousands of small files to a server can take hours; zipping them up, uploading the zip file and extracting it is also slow).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.

Other than the above B&B work I spent a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week, and I began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried when an advanced ‘quotes only’ search is performed.  So for example in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.
Sic dreich wark. . . . For lang I tholed an’ fendit.
Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.
And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.
It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.
Driche and sair yer pain.
And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.
See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.
</field>

The way Solr handles returning snippets is described here: https://solr.apache.org/guide/8_7/highlighting.html.  The size of the snippet is set by the hl.fragsize parameter, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”  We don’t currently override this default, so roughly 100 characters is what we get per snippet (it can extend beyond this to ensure complete words are displayed).

The hl.snippets parameter specifies the maximum number of highlighted snippets returned per entry, and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed, and this is because the maximum number of snippets has been reached – ‘dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
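
For reference, here’s a hedged sketch of roughly what such a highlighted query looks like from Python, showing where hl.fragsize and hl.snippets fit in.  The field name comes from the block above; the core name (‘dsl’) and the rest of the parameters are assumptions for illustration.

import requests

# A sketch of a 'quotes only' search with highlighting enabled.
params = {
    "q": "searchtext_onlyquotes:dreich",
    "hl": "true",
    "hl.fl": "searchtext_onlyquotes",
    "hl.fragsize": 100,   # approximate snippet length in characters (the default)
    "hl.snippets": 10,    # maximum number of snippets returned per entry
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/dsl/select", params=params).json()

# The highlighting section maps each matching document ID to its snippets.
for entry_id, highlights in response["highlighting"].items():
    for snippet in highlights.get("searchtext_onlyquotes", []):
        print(entry_id, snippet)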

As the quotations are stored as one massive block of text rather than as individual quotations, the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

which actually comprises almost the entire text of the first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”


includes the last word of the second quote, all of the third quote and some of the fourth (which doesn’t actually contain ‘dreich’ but ‘dreigher’, which is not highlighted).

So while the snippets may look as though they correspond to individual quotes, this is absolutely not the case: the highlighted word is generally positioned around the middle of roughly 100 characters of text that can span several quotations.  It also means that it is not currently possible to limit a search to two terms appearing within a single quotation, because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on, such as part of speech.  This will ensure that snippets only ever feature text from the quotation in question and that Boolean searches are limited to text within individual quotations.  However, it is a major change, it will require some time and experimentation to get working correctly and it may introduce other unforeseen issues.
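
To give a sense of what that restructuring might look like, here’s a sketch of a possible per-quotation Solr document.  All of the field names and values here are hypothetical – the final structure will need to be worked out through experimentation – but it shows how each quotation would carry its own entry reference and date:

{
  "quoteid": "snd-dreich-001",
  "entryid": "snd/dreich",
  "headword": "Dreich",
  "pos": "adj.",
  "quotedate": 1898,
  "quotetext": "I think you will say yourself it is a dreich business."
}

Grouping the results back into entries could then be handled either in our API or possibly with Solr’s own result grouping features – another thing that will need experimentation.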

I will need to change the way the search data is stored in Solr and how the data is generated for ingest.  The display of the search results will also need to be reworked, as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry, as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work, as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry.  I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.

The DSL team has proposed that a date search could be provided as a filter on the search results page and we would certainly be able to do this, and incorporate other filters such as part of speech in future.  This is facetted searching, and it’s the kind of thing you see in online shops: you view the search results and then see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results each filter applies to.  The good news is that Solr has this kind of faceting built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.

Week Beginning 7th November 2022

I participated in an event about Digital Humanities in the College of Arts that Luca had organised on Monday, at which I discussed the Books and Borrowing project.  It was a good event and I hope there will be more like it in future.  Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/) which is an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project.  I will be very interested to hear how Curious Travellers gets on with the tool in future.  Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before that allows you to upload a text and then access many analysis and visualisation tools.  It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.

Also this week I had to prepare for and participate in a candidate shortlisting session for a new systems developer post in the College of Arts, and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host.  We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily.  As yet we’ve still not heard anything further from IT Services, so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of and whether they concern the code or the underlying server infrastructure.  Hopefully we’ll hear more soon.

The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project.  One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that.  Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.

I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box.  This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page.  The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover.  For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records.  Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.

The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page.  The only difference is that additional information about the library, register and page the borrowing record appears on is included at the top of the record.  These appear as links, and if you press on the page link this will open the page centred on the selected borrowing record.  For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:

The non-date search also works, but is currently a bit too slow.  For example a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long.  Currently non-date quick searches do a very simple find and replace to highlight the matched text in all relevant fields.  This makes the matched text upper case, but I don’t intend to leave it like this.  You can also search for things like ESTC numbers.

However, there are several things I’m not especially happy about:

  1. The speed issue: the current approach is just too slow
  2. Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed or by borrower name) many more fields in addition to borrowing ID would need to be returned, potentially for thousands of records, and this is going to be too slow with the current data structure.
  3. The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this.  Should we have a more compact view of results?  If so, what data should be displayed?  The difficulty is that if we omit the only field that contains the user’s search term the results are potentially going to be very confusing.
  4. This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores:  You see the search results and then there are checkboxes that allow you to narrow down the results.  For example, we could have checkboxes containing all occupations in the results allowing the user to select one or more.  Or we have checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
  5. Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing the occupations or genders of the borrowers in the search results.  I feel that we need some sort of summary information, as the results themselves are just too detailed to easily get an overall picture of.

I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon).  It does a lot of the things I’d like to implement (graphs, facetted search results), and it does it all very speedily with a pleasing interface.  I think we could learn a lot from this.

Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/), a free search platform that is much faster than a traditional relational database and provides options for facetted searching.  We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it.  Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data.  But if we are potentially going to use Solr then we would need to install it on a server at Stirling.  Stirling’s IT people might be OK with this (they did allow us to set up an IIIF server for our images, after all) but we’d need to check.  I should have a better idea as to whether Solr is what we need by the end of next week, all being well.

Also this week I spent some time working on the Speech Star project.  I updated the database to highlight key segments in the ‘target’ field which had been highlighted in the original spreadsheet version of the data by surrounding the segment with bar characters.  I’d suggested this as when exporting data from Excel to a CSV file all Excel formatting such as bold text is lost, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field.  This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier but I got there in the end.  After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:

I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.

Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.

Week Beginning 31st October 2022

I spent a lot of the week continuing to work on the Books and Borrowing front end.  To begin with I worked on the ‘borrowers’ tab in the ‘library’ page and created an initial version of it.  Here’s an example of how it looks:

As with books, the page lists borrowers alphabetically, in this case by borrower surname.  Letter tabs and counts of the number of borrowers with surnames beginning with the letter appear at the top and you can select a letter to view all borrowers with surnames beginning with the letter.  I had to create a couple of new fields in the borrower table to speed the querying up, saving the initial letter of each borrower’s surname and a count of their borrowings.

The display of borrowers is similar to the display of books, with each borrower given a box that you can press on to highlight.  Borrower ID appears in the top right and each borrower’s full name appears as a green title.  The name is listed as it would be read, but this could be updated if required.  I’m not sure where the ‘other title’ field would go if we did this, though – presumably something like ‘Macdonald, Mr Archibald of Sanda’.

The full information about a borrower is listed in the box, including additional fields and normalised occupations.  Cross references to other borrowers also appear.  As with the ‘Books’ tab, much of this data will be linked to search results once I’ve created the search options (e.g. press on an occupation to view all borrowers with this occupation, press on the number of borrowings to view the borrowings) but this is not in place yet.  You can also change the view from ‘surname’ to ‘top 100 borrowers’, which lists the top 100 most prolific borrowers (or less if there are less than 100 borrowers in the library).  As with the book tab, a number appears at the top left of each record to show the borrower’s place on the ‘hitlist’ and the number of borrowings is highlighted in red to make it easier to spot.

I also fixed some issues with the book and author caches that were being caused by spaces at the start of fields and author surnames beginning with a non-capitalised letter (e.g. ‘von’) which was messing things up as the cache generation script was previously only matching upper case, meaning ‘v’ wasn’t getting added to ‘V’.  I’ve regenerated the cache to fix this.

I then decided to move onto the search rather than the ‘Facts & figures’ tab as I reckoned this should be prioritised.  I began work on the quick search initially, and I’m still very much in the middle of this.  The quick search has to search an awful lot of fields and to do this several different queries need to be run.  I’ll need to see how this works in terms of performance as I fear the ‘quick’ search risks being better named the ‘slow’ search.

We’ve stated that users will be able to search for dates in the quick search and these need to be handled differently.  For now the API checks to see whether the passed search string is a date by running a pattern match on the string.  This converts all numbers in the string into an ‘X’ character and then checks to see whether the resulting string matches a valid date form.  For the API I’m using a bar character (|) to designate a ranged date and a dash to designate a division between day, month and year.  I can’t use a slash (/) as the search string is passed in the URL and slashes have meaning in URLs.  For info, here are the valid date string patterns:

"XXXX","XXXX-XX","XXXX-XX-XX","XXXX|XXXX","XXXX|XXXX-XX","XXXX|XXXX-XX-XX","XXXX-XX|XXXX","XXXX-XX|XXXX-XX","XXXX-XX|XXXX-XX-XX","XXXX-XX-XX|XXXX","XXXX-XX-XX|XXXX-XX","XXXX-XX-XX|XXXX-XX-XX"

So for example, if someone searches for ‘1752’ or ‘1752-03’ or ‘1752-02|1755-07-22’ the system will recognise these as a date search and process them accordingly.  I should point out that I can and probably will get people to enter dates in a more typical way in the front-end, using slashes between day, month and year and a dash between ranged dates (e.g. ‘1752/02-1755/07/22’) but I’ll convert these before passing the string to the API in the URL.
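
Here’s a minimal Python sketch of the pattern-matching idea described above (the real check lives in the API code, so treat the function name and details as illustrative):

import re

# The valid date-string shapes, with X standing in for any digit.
VALID_PATTERNS = {
    "XXXX", "XXXX-XX", "XXXX-XX-XX",
    "XXXX|XXXX", "XXXX|XXXX-XX", "XXXX|XXXX-XX-XX",
    "XXXX-XX|XXXX", "XXXX-XX|XXXX-XX", "XXXX-XX|XXXX-XX-XX",
    "XXXX-XX-XX|XXXX", "XXXX-XX-XX|XXXX-XX", "XXXX-XX-XX|XXXX-XX-XX",
}

def is_date_search(term: str) -> bool:
    # Replace every digit with 'X' and see whether the resulting shape
    # matches one of the recognised date forms.
    return re.sub(r"[0-9]", "X", term) in VALID_PATTERNS

print(is_date_search("1752-03"))             # True
print(is_date_search("1752-02|1755-07-22"))  # True
print(is_date_search("Xenophon"))            # False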

I have the query running to search the dates, and this in itself was a bit complicated to generate, as including a month or a month and a day in a ranged query changes the way the query needs to work.  E.g. if the user searches for ‘1752-1755’ then we need to return all borrowing records with a borrowed year of ‘1752’ or later and ‘1755’ or earlier.  However, if the query is ‘1752/06-1755/03’ then the query can’t just be ‘all borrowing records with a borrowed year of 1752 or later and a borrowed month of 06 or later and a borrowed year of 1755 or earlier and a borrowed month of 03 or earlier’, as this would return no results: it would be looking for borrowings with a borrowed month of 06 or later and also 03 or earlier.  Instead the query needs to find borrowing records that (have a borrowed year of 1752 and a borrowed month of 06 or later, or a borrowed year later than 1752) and (have a borrowed year of 1755 and a borrowed month of 03 or earlier, or a borrowed year earlier than 1755).
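
Expressed as a condition on a single borrowing record, the logic for the 1752/06-1755/03 example looks something like this sketch (simplified: it ignores day-level precision and hard-codes the boundaries, whereas the real query builds the equivalent SQL dynamically):

def in_range(byear: int, bmonth: int) -> bool:
    # Range 1752/06 to 1755/03: a record matches if it is after the start
    # boundary and before the end boundary, with the month only checked
    # in the boundary years themselves.
    after_start = byear > 1752 or (byear == 1752 and bmonth >= 6)
    before_end = byear < 1755 or (byear == 1755 and bmonth <= 3)
    return after_start and before_end

print(in_range(1752, 7))   # True: July 1752 falls within the range
print(in_range(1753, 1))   # True: any month in an intermediate year matches
print(in_range(1755, 4))   # False: April 1755 is after the end of the range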

I also have the queries running that search for all necessary fields that aren’t dates.  This currently requires five separate queries to be run to check fields like author names, borrower occupations, book edition fields such as ESTC etc.  The queries currently return a list of borrowing IDs, and this is as far as I’ve got.  I’m wondering now whether I should create a cached table for the non-date data queried by the quick search, consisting of a field for the borrowing ID and a field for the term that needs to be searched, with each borrowing having many rows depending on the number of terms they have (e.g. a row for each occupation of every borrower associated with the borrowing, a row for each author surname, a row for each forename, a row for ESTC).  This should make things much speedier to search, but will take some time to generate.  I’ll continue to investigate this next week.
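
The cached table idea would boil down to generating one row per searchable term, something like the following sketch (the data structure and field names are hypothetical stand-ins for the real database):

# A sketch of generating quick-search cache rows: one (borrowing ID, term)
# pair per searchable value associated with the borrowing.
def cache_rows(borrowing):
    rows = []
    for occupation in borrowing["occupations"]:
        rows.append((borrowing["id"], occupation))
    for author in borrowing["authors"]:
        rows.append((borrowing["id"], author["surname"]))
        rows.append((borrowing["id"], author["forename"]))
    for estc in borrowing["estcs"]:
        rows.append((borrowing["id"], estc))
    return rows

# Each pair would be inserted into an indexed table so that the quick search
# becomes a single query against the term column, returning borrowing IDs.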

Also this week I updated the structure of the Speech Star database to enable each prompt to have multiple sounds etc.  I had to update the non-disordered page and the child error page to work with the new structure, but it seems to be working.  I also had to update the ‘By word’ view, as previously sound, articulation and position were listed underneath the word and above the table.  As these fields may now be different for each record in the table I’ve removed the list and have instead added the data as columns to the table.  This does mean that the table now contains a lot of identical data in many of the rows.

I then added in tooltip / help text containing information about what the error types mean in the child speech error database.  On the ‘By Error Type’ page the descriptions currently appear as small text to the right of the error type title.  On the ‘By Word’ page the error type column has an ‘i’ icon after the error type.  Hovering over or pressing on this displays a tooltip with the error description, as you can see in the following screenshot:

I also updated the layout of the video popups to split the metadata across two columns and also changed the order of the errors on the ‘By error type’ page so that the /r/ errors appear in the correct alphabetical order for ‘r’ rather than appearing first due to the text beginning with a slash.  With this all in place I then replicated the changes on the version of the site that is going to be available via the Seeing Speech URL.

Kirsteen McCue contacted me last week to ask for advice on a British Academy proposal she’s putting together and after asking some questions about the project I wrote a bit of text about the data and its management for her.  I also sorted out my flights and accommodation for the workshop I’m attending in Zurich in January and did a little bit of preparation for a session on Digital Humanities that Luca Guariento has organised for next Monday.  I’ll be discussing a couple of projects at this event.  I also exported all of the Speak For Yersel survey data and sent this to a researcher who is going to do some work with the data and fixed an issue that had cropped up with the Place-names of Kirkcudbright website.  I also spent a bit of time on DSL duties this week, helping with some user account access issues and discussing how links will be added from entries to the lengthy essay on the Scots Language that we host on the site.


Week Beginning 24th October 2022

I returned to work this week after having a lovely week’s holiday in the Lake District.  I spent most of the week working for the Books and Borrowing project.  I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system.  This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS.  Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’.  Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS.  For example in Volume 1 we have ‘1010199l’ followed by ‘1010203l’ followed by ‘1010205l’ and then ‘1010205r’.  This appears to be correct, as the missing pages don’t contain borrowing records.  However, I still needed to figure out how to match up images and page records.  As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image.  There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image.  This left joining up the page records in the CMS as the best option, and I wrote a script to join the page records, rename them to remove the ‘L’ and ‘R’ suffixes, move all borrowing records across, renumber their page order and then delete the now-empty pages.  Thankfully it all seemed to work well.  I also uploaded the images for the final register from the Royal High School, which was a much more straightforward process as all image files matched references already stored in the CMS for each page.

I then returned to the development of the front-end for the project.  When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load.  After a bit of investigation I realised that this was caused by new line characters appearing in the JSON data for the map, which was invalidating the file structure.  These had been added in via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map.  I needed to update the script that generated the JSON data to ensure that new line characters were stripped out of the data, and after that the maps loaded again.

Before I went on holiday I’d created a browse page for library books that split the library’s books up based on the initial letter of their titles.  The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue due to it containing many more books than the other libraries (more than 8,500).  Project Co-I Matt Sangster suggested that we should omit some registers from the data as their contents (including book records) are not likely to be worked on during the course of the project.  However, I decided to just leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code.  The book data for a library is associated directly with the library record and not the specific registers and all the queries would need to be rewritten to check which registers a book appears in.  I reckon that if these registers are not going to be tackled by the project it might be better to just delete them, not just to make the system work better but to avoid confusing users with messy data, but I decided to leave everything as it is for now.

This week I added in two further ways of browsing books in a selected library:  By author and by most borrowed.  A drop-down list featuring the three browse options appears at the top of the ‘Books’ page now, and I’ve added in a title and explanatory paragraph about the list type. The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a list of initial letter tabs featuring the initial letter of the author’s surname and a count of the number of books that have an author with a surname beginning with this letter.  Note that any books that don’t have an associated author do not appear in this list.  I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load in.  Note also that if a book has multiple authors then the book will appear multiple times – once for each author.  Here’s a screenshot of how the interface currently looks:

The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books.  The ordering of the records is by author surname then forename then author ID then book title.  This means two authors with the same name will still appear as separate headings with their books ordered alphabetically.  However, this has also uncovered some issues with duplicate author records.

Getting this browse list working actually took a huge amount of effort due to the complex way we store authors.  In our system an author can be associated with any one of four levels of book record (work / edition / holding / item) and an author associated at a higher level needs to cascade down to the lower-level book records.  Running queries directly on this structure proved to be too resource-intensive and slow, so instead I wrote a script to generate cached data about authors.  This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record.  It then stores a reference to the ID of the author, the holding record and the initial letter of the author’s surname in a new table that is much more efficient to query.  This then gets used to generate the letter tabs with the book counts and to work out which books to return when an author surname beginning with a letter is selected.
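
In outline, the cache generation works something like the following sketch.  The data structures and names are hypothetical stand-ins for the real CMS tables; the point is just the cascade from the four levels down to a single set of authors per holding:

# A sketch of cascading author associations down to each holding record.
# 'author_links' maps each level name to {record_id: [author_ids]}.
def authors_for_holding(holding, author_links):
    authors = set()
    authors.update(author_links["work"].get(holding["work_id"], []))
    authors.update(author_links["edition"].get(holding["edition_id"], []))
    authors.update(author_links["holding"].get(holding["id"], []))
    for item_id in holding["item_ids"]:
        authors.update(author_links["item"].get(item_id, []))
    return authors

# Each (author ID, holding ID, initial letter of surname) triple is then
# written to a small cache table that the letter tabs and book lists query.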

However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes or additions to authors made in the CMS will not be directly reflected in the library books tab.  This is also true of the ‘browse books by title’ lists I previously created.  I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised the title must have been changed in the CMS since I last generated the cached data.

The ‘most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed.  Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view.  I did consider whether to have tabs allowing you to view all of the books by number of borrowings, but I wasn’t really sure how useful this would be.  In terms of the display of the ‘top 100’ the books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see.  I’ve also added in a number to the top-left of the book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:

I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot).  Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.

Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023.  As it’s the first ‘in person’ DH conference to be held in Europe since 2019 I suspect there will be a huge number of paper submissions, so we’ll just need to see if it gets accepted or not.  Also for Speak For Yersel I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas.  The biggest issue here would be generating the data about the areas:  settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files.  It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region.  This might be a couple of weeks of effort for each area.  It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface.  We’ll see how this develops.  I also wrote a script to export the survey data for further analysis.

Another project I spent some time on this week was Speech Star.  For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me.  I imported all of the data into the same database as is used for the non-disordered speech database and have added a flag that decides which content is displayed in which page.  I removed ‘accent’ as a filter option (as all speakers are from the same area) and have added in ‘error type’.  Currently the ‘age’ filter defaults to the age group 0-17 as I wasn’t sure how this filter should work, as all speakers are children.

The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab.  In the new page these tabs are for ‘Error type’ and ‘word’.  I also added in ‘phonemic target’ and ‘phonetic production’ as new columns in the table as I thought it would be useful to include these, and I updated the video pop-up for both the new page and the non-disordered page to bring it into line with the popup for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below.  I’ve ensured this is exactly the same for the ‘multiple video’ display too.  At the moment the metadata all just appears on one long line (other than speaker ID, sex and age) so the full width of the popup is used, but we might change this to a two-column layout.

Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet and I therefore replaced the data.  However, I spotted something relating to the way I structure the data that might be an issue.  I’d noticed a typo in the earlier spreadsheet (there is a ‘helicopter’ and a ‘helecopter’) and I fixed it, but I forgot to fix it before uploading the newer file.   Each prompt is only stored once in the database, even if it is used by multiple speakers so I was going to go into the database and remove the ‘helecopter’ prompt row that didn’t need to be generated and point the speaker to the existing ‘helicopter’ prompt.  However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’.  I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database and here the sound is indeed ‘l’.  It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with.  I’m going to have to update the structure next week.

Also this week I responded to a request for advice from David Featherstone in Geography who is putting together some sort of digitisation project.  I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary.  She wondered whether I had the original XML and I explained that there was no original XML.  The original data was stored in an ancient Foxpro database that ran from a CD.  When I created the original School Dictionary app I managed to find a way to extract the data and I saved it as two CSV files – one English-Scots the other Scots-English.  I then ran a script to convert this into JSON which is what the original app uses.  I gave Pauline a link to download all of the data for the app, including both English and Scots JSON files and the sound files and I also uploaded the English CSV file in case this would be more useful.

That’s all for this week.  Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.

Week Beginning 10th October 2022

I spent quite a bit of time finishing things off for the Speak For Yersel project.  I created a stats page for the project team to access.  The page allows you to specify a ‘from’ and ‘to’ date (it defaults to showing stats from the end of May to the end of the current day).  If you want a specific day you can enter the same date in ‘from’ and ‘to’ (e.g. ‘2022-10-04’ will display stats for everyone who registered on the Tuesday after the launch).

The stats relate to users registered in the selected period rather than answers submitted in the selected period. If a person registered in the selected period then all of their answers are included in the figures, whether they were submitted in the period or not. If a person registered outside of the selected period but submitted answers during the selected period these are not included.

The stats display the total number of users registered in the selected period, split into the number who chose a location in Scotland and those who selected elsewhere.  Then the total number of survey answers submitted by these two groups are shown, divided into separate sections for the five surveys.  I might need to update the page to add more in at a later date.  For example, one thing that isn’t shown is the number of people who completed each survey as opposed to only answering a few questions.  Also, I haven’t included stats about the quizzes or activities yet, but these could be added.

I also worked on an abstract about the project for the Digital Humanities 2023 conference.  In preparation for this I extracted all of the text relating to the project from this blog as a record of the development of the project.  It’s more than 21,000 words long and covers everything from our first team discussions about potential approaches in September last year through to the launch of the site last week.  I then went through this and pulled out some of the more interesting sections relating to the generation of the maps, the handling of user submissions and the automatic generation of quiz answers based on submitted data.  I sent this to Jennifer for feedback and then wrote a second version.  Hopefully it will be accepted for the conference, but even if it’s not I’ll hopefully be able to go as the DH conference is always useful to attend.

Also this week I attended a talk about a lemmatiser for Anglo-Norman that some researchers in France have developed using the Anglo-Norman dictionary.  It was a very interesting talk and included a demonstration of the corpus that had been constructed using the tool.  I’m probably going to be working with the team at some point later on, sending them some data from the underlying XML files of the Anglo-Norman Dictionary.

I also replaced the Seeing Speech videos with a new set that Eleanor Lawson had generated, which were mirrored to match the videos we’re producing for the Speech Star project, and investigated how I will get to Zurich for a thesaurus-related workshop in January.

I spent the rest of the week working on the Books and Borrowing project, working on the ‘books’ tab in the library page.  I’d started on the API endpoint for this last week, which returned all books for a library and then processed them.  This was required as books have two title fields (standardised and original title), either of which may be blank, so in order to sort the books by title the records first need to be returned to see which ‘title’ field to use.  Ordering by number of borrowings and by author also requires all books to be returned and processed.  This works fine for smaller libraries (e.g. Chambers has 961 books) but returning all books for a large library like St Andrews, which has more than 8,500 books, was taking a long time and resulting in a JSON file that was over 6MB in size.

I created an initial version of the ‘books’ page using this full dataset, with tabs across the top for each initial letter of the title (browsing by author and number of borrowings is still to do) and a count of the number of books in each tab also displayed.  Book records are then displayed in a similar manner to how they appear in the ‘page’ view, but with some additional data, namely total counts of the number of borrowings for the book holding record and counts of borrowings of individual items (if applicable).  These will eventually be linked to the search.

The page looked pretty good and worked pretty well, but was very inefficient as the full JSON file needed to be generated and passed to the browser every time a new letter was selected.  Instead I updated the underlying database to add two new fields to the book holding table.  The first stores the initial letter of the title (standardised if present, original if not) and the second stores a count of the total number of borrowings for the holding record.  I wrote a couple of scripts to add this data in, and these will need to be run periodically to refresh the cached fields as they do not otherwise get updated when changes are made in the CMS.  Having these fields in place means the scripts will be able to pinpoint and return subsets of the books in the library at the database query level rather than returning all data and then processing it.  This makes things much more efficient as less data is being processed at any one time.

I still need to add in facilities to browse the books by initial letter of the author’s surname and also facilities to list books by the number of borrowings, but for now you can at least browse books alphabetically by title.  Unfortunately for large libraries there is still a lot of data to process even when only dealing with specific initial letters.  For example, there are 1063 books beginning with ‘T’ in St Andrews so the returned data still takes quite a few seconds to load in.

That’s all for this week.  I’ll be on holiday next week so there won’t be a further report until the week after that.


Week Beginning 19th September 2022

It was a four-day week this week due to the Queen’s funeral on Monday.  I divided my time for the remaining four days over several projects.  For Speak For Yersel I finally tackled the issue of the way maps are loaded.  The system had been developed for a map to be loaded afresh every time data is requested, with any existing map destroyed in the process.  This worked fine when the maps didn’t contain demographic filters, as generally each map only needed to be loaded once and then never changed until an entirely new map was needed (e.g. for the next survey question).  However, I was then asked to incorporate demographic filters (age groups, gender, education level), with new data requested based on the option the user selected.  This all went through the same map loading function, which still destroyed and reinitiated the entire map on each request.  This worked, but wasn’t ideal, as it meant the map reset to its default view and zoom level whenever you changed an option, map tiles were reloaded from the server unnecessarily and if the user was in ‘full screen’ mode they were booted out of it as the full-screen map no longer existed.  For some time I’ve been meaning to redevelop this to address these issues, but I’ve held off as there were always other things to tackle and I was worried about essentially ripping apart the code and having to rebuild fundamental aspects of it.  This week I finally plucked up the courage to delve into the code.

I created a test version of the site so as to not risk messing up the live version and managed to develop an updated method of loading the maps.  This method initiates the map only once when a page is first loaded rather than destroying and regenerating the map every time a new question is loaded or demographic data is changed.  This means the number of map tile loads is greatly reduced as the base map doesn’t change until the user zooms or pans.  It also means the location and zoom level a user has left the map on stay the same when the data is changed.  For example, if they’re interested in Glasgow and are zoomed in on it they can quickly flick between different demographic settings and the map will stay zoomed in on Glasgow rather than resetting each time.  Also, if you’re viewing the map in full-screen mode you can now change the demographic settings without the resource exiting out of full-screen mode.

All worked very well, with the only issue being that the transitions between survey questions and quiz questions weren’t as smooth as with the older method.  Previously the map scrolled up and was then destroyed, then a new map was created and the data was loaded into the area before it smoothly scrolled down again.  For various technical reasons this no longer worked quite as well.  The map area still scrolls up and down, but the new data only populates the map as the map area scrolls down, meaning for a brief second you can still see the data and legend for the previous question before it switches to the new data.  However, I spent some further time investigating this issue and managed to fix it, with different fixes required for the survey and the quiz.  I also noticed a bug whereby the map would increase in size to fit the available space but the map layers and data were not extending properly into the newly expanded area.  This is a known issue with Leaflet maps that have their size changed dynamically and there’s actually a Leaflet function that sorts it – I just needed to call map.invalidateSize(); and the map worked properly again.  Of course it took a bit of time to figure this simple fix out.

I also made some further updates to the site.  Based on feedback about the difficulty some people are having in keeping track of which surveys they’ve done, I updated the site to log when the user completes a survey.  Now when the user goes to the survey index page a count of the number of surveys they’ve completed is displayed in the top right and a green tick has been added to the button of each survey they have completed.  Also, when they reach the ‘what next’ page for a survey a count of their completed surveys is shown.  This should make it much easier for people to track what they’ve done.  I also made a few small tweaks to the data at the request of Jennifer, and created a new version of the animated GIF that has speech bubbles, as the bubble for Shetland needed its text changed.  As I didn’t have the files available I took the opportunity to regenerate the GIF using a larger map, as the older version looked quite fuzzy on a high-definition screen like an iPad.  I kept the region outlines on as well to tie it in better with our interactive maps.  Also, the font used in the new version is now the ‘Baloo’ font we use for the site.  I stored all of the individual frames both as images and as PowerPoint slides so I can change them if required.  For future reference, I created the animated GIF using https://ezgif.com/maker with a delay setting of 150 between slides, crossfade on and a fader delay of 8.

Also this week I researched an issue with the Scots Thesaurus that was causing the site to fail to load.  The WordPress options table had become corrupted and unreadable and needed to be replaced with a version from the backups, which thankfully fixed things.  I also did my expenses from the DHC in Sheffield, which took longer than I thought it would, and made some further tweaks to the Kozeluch mini-site on the Burns C21 website.  This included regenerating the data from a spreadsheet via a script I’d written and tweaking the introductory text.  I also responded to a request from Fraser Dallachy to regenerate some data that a script I’d previously written had outputted.  I also began writing a requirements document for the redevelopment of the place-names project front-ends to make them more ‘map first’.

I also did a bit more work for Speech Star, making some changes to the database of non-disordered speech and moving the ‘child speech error database’ to a new location.  I also met with Luca to have a chat about the BOSLIT project, its data, the interface and future plans.  We had a great chat and I then spent a lot of Friday thinking about the project and formulating some feedback that I sent in a lengthy email to Luca, Lorna Hughes and Kirsteen McCue on Friday afternoon.

Week Beginning 12th September 2022

I spent a bit of time this week going through my notes from the Digital Humanities Congress last week and writing last week’s lengthy post.  I also had my PDR session on Friday and I needed to spend some time preparing for this, writing all of the necessary text and then attending the session.  It was all very positive and it was a good opportunity to talk to my line manager about my role.  I’ve been in this job for ten years this month and have been writing these blog posts every working week for those ten years, which I think is quite an achievement.

In terms of actual work on projects, it was rather a bitty week, with my time spread across lots of different projects.  On Monday I had a Zoom call for the VariCS project, a phonetics project in collaboration with Strathclyde that I’m involved with.  The project is just starting up and this was the first time the team had all met.  We mainly discussed setting up a web presence for the project and I gave some advice on how we could set up the website, the URL and such things.  In the coming weeks I’ll probably get something set up for the project.

I then moved onto another Burns-related mini-project that I worked on with Kirsteen McCue many months ago – a digital edition of Koželuch’s settings of Robert Burns’s Songs for George Thomson.  We’re almost ready to launch this now and this week I created a page for an introductory essay, migrated a Word document to WordPress to fill the page, including adding in links and tweaking the layout to ensure things like quotes displayed properly.  There are still some further tweaks that I’ll need to implement next week, but we’re almost there.

I also spent some time tweaking the Speak For Yersel website, which is now publicly accessible (https://speakforyersel.ac.uk/) but still not quite finished.  I created a page for a video tour of the resource and made a few tweaks to the layout, such as checking the consistency of font sizes used throughout the site.  I also made some updates to the site text and added in some lengthy static content to the site in the form of a teachers’ FAQ and a ‘more information’ page.  I also changed the order of some of the buttons shown after a survey is completed to hopefully make it clearer that other surveys are available.

I also did a bit of work for the Speech Star project.  There had been some issues with the Central Scottish Phonetic Features MP4s playing audio only on some operating systems and the replacements that Eleanor had generated worked for her but not for me.  I therefore tried uploading them to and re-downloading them from YouTube, which thankfully seemed to fix the issue for everyone.  I then made some tweaks to the interfaces of the two project websites.  For the public site I made some updates to ensure the interface looked better on narrow screens, including changing the appearance of the ‘menu’ button and making the logo and site header font smaller so they take up less space.  I also added an introductory video to the homepage.

For the Books and Borrowing project I processed the images for another library register.  This didn’t go entirely smoothly.  I had been sent 73 images and these were all upside down so needed rotating.  It then transpired that I should have been sent 273 images so needed to chase up the missing ones.  Once I’d been sent the full set I was then able to generate the page images for the register, upload the images and associate them with the records.

I then moved on to setting up the front-end for the Ayr Place-names website.  In the process of doing so I became aware that one of the NLS map layers that all of our place-name projects use had stopped working.  It turned out that the NLS had migrated this map layer to a third party map tile service (https://www.maptiler.com/nls/) and the old URLs these sites were still using no longer worked.  I had a very helpful chat with Chris Fleet at NLS Maps about this and he explained the situation.  I was able to set up a free account with the maptiler service and update the URLs in four place-names websites that referenced the layer (https://berwickshire-placenames.glasgow.ac.uk/, https://kcb-placenames.glasgow.ac.uk/, https://ayr-placenames.glasgow.ac.uk and https://comparative-kingship.glasgow.ac.uk/scotland/).  I’ll need to ensure this is also done for the two further place-names projects that are still in development (https://mull-ulva-placenames.glasgow.ac.uk and https://iona-placenames.glasgow.ac.uk/).

I managed to complete the work on the front-end for the Ayr project, which was mostly straightforward as it was just adapting what I’d previously developed for other projects.  The thing that took the longest was getting the parish data and the locations where the parish three-letter acronyms should appear, but I was able to get this working thanks to the notes I’d made the last time I needed to deal with parish boundaries (as documented here: https://digital-humanities.glasgow.ac.uk/2021-07-05/).  After discussions with Thomas Clancy about the front-end I decided that it would be a good idea to redevelop the map-based interface to display all of the data on the map by default and to incorporate all of the search and browse options within the map itself.  This would be a big change, and it’s one I had been thinking of implementing anyway for the Iona project, but I’ll try and find some time to work on this for all of the place-name sites over the coming months.

Finally, I had a chat with Kirsteen McCue and Luca Guariento about the BOSLIT project.  This project is taking the existing data for the Bibliography of Scottish Literature in Translation (available on the NLS website here: https://data.nls.uk/data/metadata-collections/boslit/) and creating a new resource from it, including visualisations.  I offered to help out with this and will be meeting with Luca to discuss things further, probably next week.

Week Beginning 5th September 2022

I attended the Digital Humanities Congress in Sheffield this week (https://www.dhi.ac.uk/dhc2022/ ), which meant travelling down on the Wednesday and attending the event on the Thursday and Friday (there were also some sessions on Saturday but I was unable to stay for those).  It was an excellent conference featuring some great speakers and plenty of exciting research and I’ll give an overview of the sessions I attended here.  The event kicked off with a plenary session by Marc Alexander, who as always was an insightful and entertaining speaker.  His talk was about the analysis of meaning at different scales, using the Hansard corpus as his main example and thinking about looking at the data from a distance (macro), close up (micro) but also the stuff in the middle that often gets overlooked, which he called the meso.  The Hansard corpus is a record of what has been said in parliament and it began in 1803 and currently runs up to 2003/5 and consists of 7.6 million speeches and 1.6 billion words, all of which have been tagged for part of speech and also semantics and it can be accessed at https://www.english-corpora.org/hansard/.  Marc pointed out that the corpus is not a linguistic transcript as it can be a summary rather than the exact words – it’s not verbatim but substantially so and doesn’t include things like interruptions and hesitations.  The corpus was semantically tagged using the SAMUELS tagger which annotates the texts using data from the Historical Thesaurus.

Marc gave some examples of analysis at different scales.  For micro analysis he looked at the use of ‘draconian’ and how this word does not appear much in the corpus until the 1970s.  He stated that we can use word vectors and collocates at this level of analysis, for example looking at the collocates of ‘privatisation’ after the 1970s, showing the words that appear most frequently are things like rail, electricity, British etc., but there are also words such as ‘proceeds’ and ‘proposals’, ‘opposed’ and ‘botched’.  Marc pointed out that ‘botched’ is a word we mostly all know but would not use ourselves.  This is where semantic collocates come in useful – grouping words by their meaning and being able to search for the meanings of words rather than individual forms.  For example, it’s possible to look at speeches by women MPs in the 1990s and find the most common concepts they spoke about, which were things like ‘child’, ‘mother’ and ‘parent’.  Male MPs on the other hand talked about things like ‘peace treaties’ and ‘weapons’.  Hansard doesn’t explicitly state the sex of the person so this is based on the titles that are used.
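
Collocate extraction of this kind is straightforward to sketch with NLTK’s collocation finders.  The following is just a toy illustration using a handful of invented tokens rather than the Hansard corpus itself, ranked by raw frequency for simplicity (PMI or log-likelihood would be more usual on a real corpus):

# Toy sketch of collocate extraction with NLTK, using invented tokens rather
# than the Hansard corpus; ranked by raw frequency for simplicity.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
tokens = ("the privatisation of rail was opposed while the privatisation proceeds "
          "were spent and the botched privatisation of electricity was debated").split()
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# Print the five most frequent bigrams in this tiny sample
for bigram in finder.nbest(measures.raw_freq, 5):
    print(bigram)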

At the Macro level Marc discussed words for ‘uncivilised’ and 2046 references to ‘uncivil’.  At different periods in time there are different numbers of terms available to mean this concept.  The number of words that are available for a concept can show how significant a concept is at different time periods.  It’s possible with Hansard to identify places and also whether a term is used to refer to the past or present, so we can see what places appear near an ‘uncivilised’ term (Ireland, Northern Ireland, India, Russia and Scotland most often).  Also in the past ‘uncivilised’ was more likely to be used to refer to some past time whereas in more recent years it tends to be used to refer to the present.

Marc then discussed some of the limitations of the Hansard corpus.  It is not homogenous but is discontinuous and messy.  It’s also a lot bigger in recent times than historically – 30 million words a year now but much less in the past.  Also until 1892 it was written in the third person.

Marc then discussed the ‘meso’ level.  He discussed how the corpus was tagged by meaning using a hierarchical system with just 26 categories at the top level, so it’s possible to aggregate results.  We can use this to find which concepts are discussed the least often in Hansard, such as the supernatural, textiles and clothing and plants.  We can also compare this with other semantically tagged corpora such as SEEBO and compare the distributions of concepts.  There is a similar distribution but a different order.  Marc discussed concepts that are ‘semantic scaffolds’ vs ‘semantic content’.  He concluded by discussing sortable tables and how we tend to focus on the stuff at the top and the bottom and ignore the middle, but that it is here that some of the important things may reside.

The second session I attended featured four short papers.  The first discussed linked ancient world data and research into the creation of and use of this data.  Linked data is of course RDF triples, consisting of a subject, predicate and object, for example ‘Medea was written by Euripides’.  It’s a means of modelling complex relationships and reducing ambiguity by linking through URIs (e.g. to make it clear we’re talking about ‘Medea’ the play rather than a person).  However, there are barriers to use.  There are lots of authority files and vocabularies and some modelling approaches are incomplete.  Also, modelling uncertainty can be difficult and there is a reliance on external resources.  The speaker discussed LOUD data (Linked, Open, Usable Data) and conducted a survey of linked data use in ancient world studies, consisting of 212 participants and 16 in-depth interviews.  The speaker came up with five principles:

Transparency (openness, availability of export options, documentation); Extensibility (current and future integration based on existing infrastructure); Intuitiveness (making it easy for users to do what they need to do); Reliability (the tool / data does what it says it does consistently – this is a particular problem as SPARQL endpoints for RDF data can become unreachable as servers get overloaded); Sustainability (continued functionality of the resource).

The speaker concluded by stating that human factors are also important in the use of the data, such as collaboration and training, and also building and maintaining a community that can lead to new collaborations and sustainability.
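
To illustrate the ‘Medea was written by Euripides’ example above, a triple like this can be built in a few lines with the rdflib Python library.  The namespace and URIs here are invented for the sketch; a real project would point at established authority files instead:

# Minimal sketch of the 'Medea was written by Euripides' triple using rdflib.
# The namespace and URIs are invented; a real project would use established
# authority files so that 'Medea' unambiguously means the play.
from rdflib import Graph, Namespace
EX = Namespace("http://example.org/ancient/")
g = Graph()
g.add((EX.Medea_play, EX.writtenBy, EX.Euripides))  # subject, predicate, object
print(g.serialize(format="turtle"))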

The second speaker discussed mimesis and the importance of female characters in Dutch literary fiction, comparing female characters in novels in the 1960s with those in the 2010s to see if literature is reflecting changes in society (mimesis).  The speaker developed software to enable the automatic extraction of social networks from literary texts and wanted to investigate how each character’s social network changed in novels as the role of women changed in society.  The idea was that female characters would be stronger and more central in the second period.  The data consisted of a corpus of 170 Dutch novels from 2013 and 152 Dutch novels from the 1960s.  Demographic information on 2136 characters was manually compiled and a comprehensive network analysis was semi-automatically generated, with character identification and gender resolution based on first names and pronouns.  Centrality scores were computed from the network diagrams to demonstrate how central a character was.  The results showed that the data for the two time periods was the same on various metrics, with a 60/40 split of male to female characters in both periods.  The speaker referred to this as ‘the golden mean of patriarchy’ where there are two male characters for every female one.  The speaker stated that only one metric had a statistically significant result and that was network centrality, which for all characters regardless of gender increased between the time periods.  The speaker stated that this was due to a broader cultural trend towards more ‘relational’ novels with more of a focus on relations.
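
Centrality scores of this sort are easy to compute with the networkx library.  This is a toy sketch with an invented five-character network, not the Dutch novel data or the speaker’s own software:

# Sketch of computing character centrality with networkx, using an invented
# five-character network rather than the Dutch novel data.
import networkx as nx
G = nx.Graph()
G.add_edges_from([
    ("Anna", "Bram"), ("Anna", "Cees"), ("Anna", "Dirk"),
    ("Bram", "Cees"), ("Dirk", "Els"),
])
degree = nx.degree_centrality(G)            # share of possible connections each character has
betweenness = nx.betweenness_centrality(G)  # how often a character sits on shortest paths
for name in G.nodes:
    print(f"{name}: degree={degree[name]:.2f}, betweenness={betweenness[name]:.2f}")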

The third speaker discussed a quantitative analysis of digital scholarly editions as part of the ‘C21 Editions’ project.  The research engaged with 50 scholars who have produced digital editions, produced a white paper on the state of the art of digital editions and also generated a visualisation of a catalogue of digital editions.  The research brought together two existing catalogues of digital editions.  One site (https://v3.digitale-edition.de/) contains 714 digital editions whereas the other one (https://dig-ed-cat.acdh.oeaw.ac.at/) contains 316.

The fourth speaker presented the results of a case study of big data using a learner corpus.  The speaker pointed out that language-based research is changing due to the scale of the data, for example in digital communication such as Twitter, the digitisation of information such as Google Books and the capabilities of analytical tools such as Python and R.  The speaker used a corpus of essays written by non-native English speakers as they were learning English.  It contains more than 1 million texts by more than 100,000 learners from more than 100 countries, with many proficiency levels.  The speaker was interested in lexical diversity in different tasks.  He created a sub-corpus as only 20% of nationalities have more than 100 learners.  He also had to strip out non-English text and remove duplicate texts.  He then identified texts that were about the same topic using topic modelling, identifying keywords such as cities, weather and sports.  The corpus is available here: https://corpus.mml.cam.ac.uk/
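
Topic modelling of the kind used to group the essays can be sketched with gensim’s LDA implementation.  The documents below are invented stand-ins for the learner corpus:

# Toy sketch of topic modelling with gensim's LDA, standing in for the much
# larger learner corpus described in the talk.
from gensim import corpora
from gensim.models import LdaModel
docs = [
    ["city", "traffic", "buildings", "city", "transport"],
    ["rain", "weather", "cold", "snow", "weather"],
    ["football", "sports", "team", "match", "sports"],
    ["city", "weather", "rain", "buildings"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=1)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)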

After the break I attended a session about mapping and GIS, consisting of three speakers.  The first was about the production of a ‘deep map’ of the European Grand Tour, looking specifically at the islands of Sicily and Cyprus and the identity of places mentioned in the tours.  These were tours by mostly Northern European aristocrats to Southern European countries beginning at the end of the 17th Century.  Richard Lassels’ Voyage of Italy in 1670 was one of the first.  Surviving data the project analysed included diaries and letters full of details of places and the reasons for the visit, which might have been to learn about art or music, observe the political systems, visit universities and see the traces of ancient cultures.  The data included descriptions of places and of routes taken, people met, plus opinions and emotions.  The speaker stated that the subjectivity in the data is an area previously neglected but is important in shaping the identity of a place.  The speaker stated that deep mapping (see for example http://wp.lancs.ac.uk/lakesdeepmap/the-project/gis-deep-mapping/) incorporates all of this in a holistic approach.  The speaker’s project was interested in creating deep maps of Sicily and Cyprus to look at the development of a European identity forged by Northern Europeans visiting Southern Europe – what did the travellers bring back?  And what influence did they leave behind?  Sicily and Cyprus were chosen because they were less visited and are both islands with significant Greek and Roman histories.  They also had a different political situation at the time, with Cyprus under the control of the Ottoman empire.  The speaker discussed the project’s methodology, consisting of the selection of documents (18th century diaries of travellers interested in Classical times), looking at discussions of architecture, churches, food and accommodation.  Adjectives were coded and the data was plotted using ArcGIS.  Itineraries were plotted on a map, with different coloured lines showing routes and places marked.  Eventually the project will produce an interactive web-based map but for now it just runs in ArcGIS.

The second paper in the session discussed using GIS to illustrate and understand the influence of St Æthelthryth of Ely, a 7th century saint whose cult was one of the most enduring of the Middle Ages.  The speaker was interested in plotting the geographical reach of the cult, looking at why it lasted so long, what its impact was and how DH tools could help with the research.  The speaker stated that GIS tools have been increasingly used since the mid-2000s but are still not used much in medieval studies.  The speaker created a GIS database consisting of 600 datapoints for things like texts, calendars, images and decorations and looked at how the cult expanded throughout the 10th and 11th centuries.  This was due to reforming bishops arriving from France after the Viking pillages, bringing Benedictine rule.  Local saints were used as role models and Ely was transformed.  The speaker stated that one problem with using GIS for historical data is that time is not easy to represent.  He created a sequence of maps to show the increases in land holding from 950 to 1066 and further development in the later middle ages as influence was moving to parish churches.  He mapped parish churches that were dedicated to the saint or had images of her, showing the change in distribution over time.  Clusters and patterns emerged showing four areas.  The speaker plotted these in different layers that could be turned on and off, and also overlaid the Gough Map (one of the earliest maps of Britain – http://www.goughmap.org/map/) as a vector layer.  He also overlaid the itineraries of early kings to show different routes, and possible pilgrimage routes emerged.

The final paper looked at plotting the history of the holocaust through holocaust literature and mapping, looking to narrate history through topography, noting the time and place of events specifically in the Warsaw ghetto and creating an atlas of holocaust literature (https://nplp.pl/kolekcja/atlas-zaglady/).  The resource consists of three interconnected modules focussing on places, people and events, with data taken from the diaries of 17 notable individuals, such as the composer who was the subject of the film ‘The Pianist’.  The resource features maps of the ghetto with areas highlighted depending on the data the user selects.  Places are sometimes vague – they can be an exact address, a street or an area.  There were also major mapping challenges as modern Warsaw is completely different to how it was during the war and the boundaries of the ghetto changed massively during the course of the war.  The memoirs also sometimes gave false addresses, such as intersections of streets that never crossed.  At the moment the resource is still a pilot which took a year to develop, but it will be broadened out (with about 10 times more data) to include memoirs written after the events and translations into English.

The final session of the day was another plenary, given by the CEO of ‘In the room’ (see https://hereintheroom.com/) who discussed the fascinating resource the company has created.  It presents interactive video encounters using AI to enable users to ask spoken questions and for the system to pick out and play video clips that closely match the user’s topic.  It began as a project at the National Holocaust Centre and Museum near Nottingham, which organised events where school children could meet survivors of the holocaust.  The question arose as to how to keep these encounters going after the last survivors are no longer with us.  The initial project recorded hundreds of answers to questions with videos responding to actual questions by users.  Users reacted as if they were encountering a real person.  The company was then set up to make a web-enabled version of the tool and to make it scalable.  The tool responds to the ‘power of parasocial relationships’ (one sided relationships) and the desire for personalised experiences.

This led to the development of conversational encounters with 11 famous people, where the user can ask questions by voice, the AI matches the intent and plays the appropriate video clips.  One output was an interview with Nile Rodgers in collaboration with the National Portrait Gallery to create a new type of interactive portrait experience.  Nile answered about 350 questions over two days and the result (see the link above) was a big success, with fans reporting feeling nervous when engaging with the resource.  There are also other possible uses for the technology in education, retail and healthcare (for example a database of 12,000 answers to questions about mental health).

The system can suggest follow-up questions to help the educational learning experience and when analysing a question the AI uses confidence levels.  If the level is too low then a default response is presented.  The system can work in different languages, with the company currently working with Munich holocaust survivors.  The company is also working with universities to broaden access to lecturers.  Students felt they got to know the lecturer better as if really interacting with them.  A user survey suggested that 91% of 18-23 year olds believed the tool would be useful for learning.  As a tool it can help to immediately identify which content is relevant and as it is asynchronous the answers can be found at any time.  The speaker stated that conversational AI is growing and is not going away – audiences will increasingly expect such interactions in their lives.

The second day began with another parallel session.  The first speaker in the session I chose discussed how a ‘National Collection’ could be located through audience research.  The speaker discussed a project that used geographical information such as where objects were made, where they reside, the places they depict or describe and brought all this together on maps of locations where participants live.  Data was taken from GLAMs (Galleries, Libraries, Archives, Museums) and Historic Environment and the project looked at how connections could be drawn between these.  The project looked at a user-centred approach when creating a map interface – looking at the importance of local identity in understanding audience motivations.  The project conducted quantitative research (a user survey) and qualitative research (focus groups) and devised ‘pretotypes’ as focus group stimulus.

The idea was to create a map of local cultural heritage objects similar to https://astreetnearyou.org that displays war records related to a local neighbourhood.  Objects that might have no interest to a person become interesting due to their location in their neighbourhood.  The system created was based on the Pelagios methodology (https://pelagios.org/) and used a tool called locolligo (https://github.com/docuracy/Locolligo) to convert CSV data into JSONLD data.

The second speaker was a developer of DH resources who has worked at Sheffield’s DHI for 20 years.  His talk discussed how best to manage the technical aspects of DH projects in future.  He pointed out that his main task as a developer is to get data online for the public to use and that the interfaces are essentially the same.  He pointed out that these days we mostly get our online content through ‘platforms’ such as Twitter, Tiktok and Instagram.  There has been much ‘web consolidation’ away from individual websites to these platforms.  However, this hasn’t happened in DH, which is still very much about individual websites, discrete interfaces and individual voices.  But this leads to a problem with maintenance of the resources.  The speaker mentioned the AHDS service that used to be a repository for Arts and Humanities data, but this closed in 2008.  The speaker also talked about FAIR data (Findable, Accessible, Interoperable, Reusable) and how depositing data in an institutional repository doesn’t really fit into this.  Generally what gets deposited is just a dataset.  Project websites generally contain a lot of static ancillary pages and these can be migrated to a modern CMS such as WordPress, but what about the record level data?  Generally all DH websites have a search form, search results and records.  The underlying structure is generally the same too – a data store, a back end and a front end.  These days at DHI the front end is often built using React.js or Angular, with Elasticsearch as the data store and Symfony as the back end.  The speaker is interested in how to automatically generate a DH interface for a project’s data, such as generating the search form by indexing the data.  The generated front-end can then be customised but the data should need minimal interpretation by the front-end.

The third speaker in the session discussed exploratory notebooks for cultural heritage datasets, specifically Jupyter notebooks used with datasets at the NLS, which can be found here: https://data.nls.uk/tools/jupyter-notebooks/.  The speaker stated that the NLS aims to have digitised a third of its 31 million objects by 2025 and has developed a data foundry to make data available to researchers. Data have to be open, transparent (i.e. include provenance) and practical (i.e. in usable file formats).  Jupyter Notebooks allow people to explore and analyse the data without requiring any coding ability.  Collections can be accessed as data and there are tutorials on things like text analysis.  The notebooks use Python and the NLTK (https://www.nltk.org/) and the data has been cleaned and standardised, and is available in various forms such as lemmatised, normalised, stemmed.  The notebooks allow for data analysis, summary and statistics such as lexical diversity in the novels of Lewis Grassic Gibbon over time.  The notebooks launched in September 2020.  The notebooks can also be run in the online service Binder (https://mybinder.org/).
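
Lexical diversity of the sort the notebooks calculate is essentially a type-token ratio, which only takes a few lines of Python with NLTK.  This is just an illustrative sketch rather than the NLS notebook code itself:

# Minimal sketch of a lexical diversity (type-token ratio) calculation, in the
# spirit of the NLS notebooks but not taken from them.  Requires the NLTK
# 'punkt' tokenizer data to have been downloaded.
from nltk.tokenize import word_tokenize
def lexical_diversity(text: str) -> float:
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0
sample = "The sun rose over the grey fields and the grey town beyond the fields."
print(round(lexical_diversity(sample), 3))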

After the break there was another parallel session and the one I attended mostly focussed on crowdsourcing.  The first talk was given remotely by a speaker based in Belgium and discussed the Europeana photography collection, which currently holds many millions of items including numerous virtual exhibitions, for example one on migration (https://www.europeana.eu/en/collections/topic/128-migration).  This allows you to share your migration story including adding family photos.  The photo collection’s search options include a visual similarity search that uses AI to perform pattern matching, but there have been mixed results.  Users can also create their own galleries and the project organised a ‘subtitle-a-thon’ which encouraged users to create subtitles for videos in their own languages.  There is also a related project (https://www.citizenheritage.eu/) to engage with the public.

The second speaker discussed ‘computer vision and the history of printing’ and the amazing work of the Visual Geometry Group at the University of Oxford (https://www.robots.ox.ac.uk/~vgg/).  The speaker discussed a ‘computer vision pipeline’ through which images were extracted from a corpus and clustered by similarity and uniqueness.  The first step was to extract illustrations from pages of text using an object detection model.  This used the EfficientDet object detector (https://towardsdatascience.com/a-thorough-breakdown-of-efficientdet-for-object-detection-dc6a15788b73) which was trained on the Microsoft Common Objects in Context (COCO) dataset, which has labelled objects for 328,000 images.  Some 3609 illustrated pages were extracted, although there were some false positives, such as bleed-through, printers’ marks and turned-up pages.  Images were then passed through image segmentation where every pixel was annotated to identify text blocks, initials etc.  The segmentation model used was Mask R-CNN (https://github.com/matterport/Mask_RCNN) and a study of image pretraining for historical document image analysis can be found here: https://arxiv.org/abs/1905.09113.

The speaker discussed image matching versus image classification and the VGG Image Search Engine (VISE, https://www.robots.ox.ac.uk/~vgg/software/vise/) that is image matching and can search and identify geometric features, matching features regardless of rotation and skewing (but it breaks with flipping or warping).

All of this was used to perform a visual analysis of chapbooks printed in Scotland to identify illustrations that are ‘the same’.  There is variation in printing, corrections in pen etc but the thresholds for ‘the same’ depends on the purpose.  The speaker mentioned that image classification using deep learning is different – it can be used to differentiate images of cats and dogs, for example.
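
The sort of geometric feature matching VISE performs (tolerant of rotation and skew, but not flipping or warping) can be illustrated with OpenCV’s ORB features.  This is a generic sketch rather than the VGG tool, and the image file names are placeholders:

# Generic sketch of geometric feature matching with OpenCV's ORB features,
# illustrating the kind of matching VISE performs.  The two file names are
# placeholders, not real project images.
import cv2
img_a = cv2.imread("illustration_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("illustration_b.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=1000)
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)
# Hamming distance suits ORB's binary descriptors; crossCheck keeps mutual best matches
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
print(len(matches), "candidate matches; lower distances are stronger")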

The final speaker in the session followed on very nicely from the previous speaker, as his research was using many of the same tools.  This project was looking at image recognition using images from the Protestant Reformation to discover how and where illustrations were used by both Protestants and Catholics during the period, looking specifically at printing, counterfeiting and illustrations of Martin Luther.  The speaker discussed his previous project, which was called Ornamento and looked at 160,000 distinct editions – some 70 million pages – and extracted 5.7 million illustrations.  This used Google Books and PDFs as source material.  It identified illustrations and their coordinates on the page and then classified the illustrations, for example borders, devices, head pieces, music.  These were preprocessed and the results were put in a database.  So for example it was possible to say that a letter appeared in 86 books in five different places.  There was also a nice tool for comparing images, such as using a slider.

For the current project the researcher aimed to identify anonymous books by use of illustrated letters.  For example, the tool was able to identify 16 books that were all produced in the same workshop in Leipzig.  The project looked at printing in the Holy Roman Empire from 1450-1600, religious books and only those with illustrations, so a much smaller project than the previous one.

The final parallel session of the day had two speakers.  The first discussed how historical text collections can be unlocked by the use of AI.  This project looked at the 80 editions of the Encyclopaedia Britannica that have been digitised by the NLS.  AI was to be used to group similar articles and detect how articles have changed using machine learning.  The process included information extraction, the creation of ontologies and knowledge graphs and deep transfer learning (see https://towardsdatascience.com/what-is-deep-transfer-learning-and-why-is-it-becoming-so-popular-91acdcc2717a).  The plan was to detect, classify and extract all of the ‘terms’ in the data.  Terms could either be articles (1-2 paragraphs in length) or topics (several pages).  The project used the Defoe Python library (https://github.com/alan-turing-institute/defoe) to read XML, ingest the text and perform NLP preprocessing.  The system was set up to detect when articles and topics began and ended to store page coordinates for such breaks, although headers changed over the editions which made this trickier.  The project then created an EB ontology and knowledge graph, which is available at https://francesnlp.github.io/EB-ontology/doc/index-en.html.  The EB knowledge graph RDF then allowed querying such as looking at ‘science’ as a node and seeing how this connected across all editions.  The graph at the above URL contains the data from the first 8 editions.
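
Querying a knowledge graph like this for a node such as ‘science’ is typically done with SPARQL.  The sketch below uses rdflib with an invented file name and predicate, since I haven’t checked the actual property names used in the EB ontology:

# Sketch of querying a knowledge graph with SPARQL via rdflib.  The file name
# and the 'relatedTo' predicate are invented; the real EB ontology uses its
# own property names.
from rdflib import Graph
g = Graph()
g.parse("eb_knowledge_graph.ttl", format="turtle")  # placeholder file
query = """
    PREFIX ex: <http://example.org/eb/>
    SELECT ?term WHERE { ex:science ex:relatedTo ?term . }
"""
for row in g.query(query):
    print(row.term)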

The second paper discussed a crowdsourcing project called ‘operation war diary’ that was a collaboration between The National Archives, Zooniverse and the Imperial War Museum (https://www.operationwardiary.org/).  The presenter had been tasked with working with the crowdsourced data in order to produce something from it but the data was very messy.  The paper discussed how to deal with uncertainty in crowdsourced data, looking at ontological uncertainty, aleatory uncertainty and epistemic uncertainty.  The speaker discussed the differences between accuracy and precision – how a cluster of results can be precise (grouped closely together) but wrong.  The researcher used OpenRefine (https://openrefine.org/) to work with the data in order to produce clusters of placenames, resulting in 26910 clusters from 500,000 datapoints.  She also looked at using ‘nearest neighbour’ and Levenshtein but there were issues with false positives (e.g. ‘Trench A’ and ‘Trench B’ are only one character apart but are clearly not the same).  The researcher also discussed the outcomes of the crowdsourcing project, stating that only 10% of the 900,000 records were completed.  Many pages were skipped, with people stopping at the ‘boring bits’.  The speaker stated that ‘History will never be certain, but we can make it worse’ which I thought was a good quote.  She suggested that crowdsourcing outputs should be weighted in favour of the volunteers who did the most.  The speaker also pointed out that there are currently no available outputs from the project, and it was hampered by being an early Zooniverse project before the tool was well established.  During the discussion after the talk someone suggested that data could have different levels of acceptability like food nutrition labels.  It was also mentioned that representing uncertainty in visualisations is an important research area, and that visualisations can help identify anomalies.  Another comment was that crowdsourcing doesn’t save money and time – managers and staff are needed and in many cases the work could be done better by a paid team in the same time.  The important reason to choose crowdsourcing is to democratise data, not to save money.
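
The ‘Trench A’ / ‘Trench B’ problem is easy to demonstrate with a plain Levenshtein distance function, as in this small sketch using only the Python standard library:

# Demonstration of why raw Levenshtein distance gives false positives when
# clustering place names: 'Trench A' and 'Trench B' are one edit apart but are
# different places, while a genuine variant spelling is just as close.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
print(levenshtein("Trench A", "Trench B"))        # 1, but not the same place
print(levenshtein("Armentieres", "Armentières"))  # 1, a genuine spelling variant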

The final session of the day was another plenary.  This discussed different scales of analysis in digital projects and was given by the PI of the ‘Living with Machines’ project (https://livingwithmachines.ac.uk/).  The speaker stated that English Literature mostly focussed on close reading while DH mostly looked at distant reading.  She stated that scalable reading was like archaeology – starting with an aerial photo at the large scale to observe patterns then moving to excavation of a specific area, then iterating again.  The speaker had previously worked on the Tudor Networks of Power project (https://tudornetworks.net/) which has a lovely high-level visualisation of the project’s data.  It dealt with around 130,000 letters.  Next came the Networking Archives project, which doesn’t appear to be online but has some information here: https://networkingarchives.github.io/blog/about/.  This project dealt with 450,000 letters.  Then came ‘Living with Machines’ which is looking at even larger corpora.  How to move through different scales of analysis is an interesting research question.  The Tudor project used the Tudor State Papers from 1509 to 1603 and dealt with 130,000 letters and 20,000 people.  The top-level interface facilitated discovery rather than analysis.  The archive is dominated by a small number of important people that can be discovered via centrality and betweenness – the number of times a person’s record is intersected.  When looking at the network for a person you can then compare this to people with similar network profiles, like a fingerprint.  By doing so it is possible to identify one spy and then see if others with a similar profile may also have been spies.  But actual deduction requires close reading so iteration is crucial.  The speaker also mentioned the ‘meso’ scale – the data in the middle.  The resource enables researchers to identify who was at the same location at the same time – people who never corresponded with each other but may have interacted in person.

The Networking Archives project used the Tudor State Papers but also brought in the data from EMLO.  The distribution was very similar with 1-2 highly connected people and most people only having a few connections.  The speaker discussed the impact of missing data.  We can’t tell how much data is already missing from the archive, but we can tell what impact it might have by progressively removing more data from what we do have.  Patterns in the data are surprisingly robust even when 60-70% of the data has been removed, and when removing different types of data such as folios or years.  The speaker also discussed ‘ego networks’ that show the shared connections two people have – the people in the middle between two figures.

The speaker then discussed the ‘Living with Machines’ project, which is looking at the effects of mechanisation from 1780 to 1920, looking at newspapers, maps, books, census records and journals.  It is a big project with 28 people in the project team.  The project is looking at language model predictions using the BERT language model (https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270) and using Word2Vec to cluster words into topics (https://www.tensorflow.org/tutorials/text/word2vec).  One example was looking at the use of the terms ‘man’, ‘machine’ and ‘slave’ to see where they are interchangeable.
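
Clustering words by the contexts they appear in, as the project does with Word2Vec for terms like ‘man’, ‘machine’ and ‘slave’, can be sketched with gensim.  The sentences below are invented, not the project’s newspaper corpus:

# Toy sketch of training Word2Vec with gensim and inspecting the neighbours of
# a term, in the spirit of the project's 'man' / 'machine' / 'slave' comparison.
from gensim.models import Word2Vec
sentences = [
    ["the", "machine", "worked", "in", "the", "mill"],
    ["the", "man", "worked", "in", "the", "mill"],
    ["the", "machine", "replaced", "the", "man"],
    ["the", "engine", "drove", "the", "machine"],
] * 50  # repeat the tiny corpus so the model has something to learn from
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20, seed=1)
print(model.wv.most_similar("machine", topn=3))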

The speaker ended with a discussion of how many different types of data can come together for analysis.  She mentioned a map reader system that could take a map, cut it into squares and then be trained to recognise rail infrastructure in the squares.  Rail network data  can then be analysed alongside census data to see how the proximity to stations affects the mean number of servants per household, which was fascinating to hear about.

And that was the end of the conference for me, as I was unable to attend the sessions on the Saturday.  It was an excellent event, I learnt a great deal about new research and technologies and I’m really glad I was given the opportunity to attend.

Before travelling to the event on Wednesday this was just a regular week for me and I’ll give a summary of what I worked on now.  For the SpeechStar project I updated the database of normative speech on the Seeing Speech version of the site to include the child speech videos, as previously these had only been added to the other site.  I also changed all occurrences of ‘Normative’ to ‘non-disordered’ throughout the sites and added video playback speed options to the new Central Scottish phonetic features videos.

I also continued to process library registers and their images for the Books and Borrowing project.  I processed three registers from the Royal High School, each of which required different amounts of processing and different methods.  This included renaming images, adding in missing page records, creating entire new runs of page records, uploading hundreds of images and changing the order of certain page records.  I also wrote a script to identify which page records still did not have associated image files after the upload, as each of the registers is missing some images.

For the Speak For Yersel project I arranged for more credit to be added to our mapping account in case we get lots of hits when the site goes live, and I made various hopefully final tweaks to text throughout the site.  I also gave some advice to students working on the migration of the Scottish Theatre Studies journal, spoke to Thomas Clancy about further work on the Ayr Place-names project and fixed a minor issue with the DSL.

Week Beginning 29th August 2022

I divided my time between a number of different projects this week.  For Speak For Yersel I replaced the ‘click’ transcripts with new versions that incorporated shorter segments and more highlighted words.  As the segments were now different I also needed to delete all existing responses to the ‘click’ activity.  I then completed the activity once for each speaker to test things out, and all seems to work fine with the new data.  I also changed the pop-up ‘percentage clicks’ text to ‘% clicks occurred here’, which is more accurate than the previous text, which suggested it was the percentage of respondents.  I also fixed an issue with the map height being too small on the ‘where do you think this speaker is from’ quiz and ensured the page scrolls to the correct place when a new question is loaded.  I also removed the ‘tip’ text from the quiz intros and renamed the ‘where do you think this speaker is from’ map buttons on the map intro page.  I’d also been asked to trim down the number of ‘translation’ questions from the ‘I would never say that’ activity so I removed some of those.  I then changed and relocated the ‘heard in films and TV’ explanatory text and removed the question mark from the ‘where do you think the speaker is from’ quiz intro page.

Mary had encountered a glitch with the transcription popups, whereby the page would flicker and jump about when certain popups were hovered over.  This was caused by the page height increasing to accommodate the pop-up, causing a scrollbar to appear in the browser, which changed the position of the cursor and made the pop-up disappear; this removed the scrollbar again and caused a glitchy loop.  I increased the height of the page for this activity so the scrollbar issue is no longer encountered, and I also made the popups a bit wider so they don’t need to be as long.  Mary also noticed that some of the ‘all over Scotland’ dynamically generated map answers seemed to be incorrect.  After some investigation I realised that this was a bug that had been introduced when I added in the ‘I would never say that’ quizzes on Friday.  A typo in the code meant that the 60% threshold for correct answers in each region was being used rather than ‘100 divided by the number of answer options’.  Thankfully once identified this was easy to fix.

I also participated in a Zoom call for the project this week to discuss the launch of the resource with the University’s media people.  It was agreed that the launch will be pushed back to the beginning of October as this should be a good time to get publicity.  Finally for the project this week I updated the structure of the site so that the ‘About’ menu item could become a drop-down menu, and I created placeholder pages for three new pages that will be added to this menu for things like FAQs.

I also continued to work on the Books and Borrowing project this week.  On Friday last week I didn’t quite get to finish a script to merge page records for one of the St Andrews registers as it needed further testing on my local PC before I ran it on the live data.  I tackled this issue first thing on Monday and it was a task I had hoped would only take half an hour or so.  Unfortunately things did not go well and it took most of the morning to sort out.  I initially attempted to run things on my local PC to test everything out, but I forgot to update the database connection details.  Usually this wouldn’t be an issue as generally the databases I work with use ‘localhost’ as a connection URL, so the Stirling credentials would have been wrong for my local DB and the script would have just quit, but Stirling (where the system is hosted) uses a full URL instead of ‘localhost’.  This meant that even though I had a local copy of the database on my PC and the scripts were running on a local server set up on my PC the scripts were in fact connecting to the real database at Stirling.  This meant the live data was being changed.  I didn’t realise this as the script was running and as it was taking some time I cancelled it, meaning the update quit halfway through changing borrowing records and deleting page records in the CMS.

I then had to write a further script to delete all of the page and borrowing records for this register from the Stirling server and reinstate the data from my local database.  Thankfully this worked ok.  I then ran my test script on the actual local database on my PC and the script did exactly what I wanted it to do, namely:

Iterate through the pages and for each odd numbered page move the records on these to the preceding even numbered page, and at the same time regenerate the ‘page order’ for each record so they follow on from the existing records.  Then the even page needs its folio number updated to add in the odd number (e.g. so folio number 2 becomes ‘2-3’) and generate an image reference based on this (e.g. UYLY207-2_2-3).  Then delete the odd page record and after all that is done regenerate the ‘next’ and ‘previous’ page links for all pages.
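
In outline, the logic is something like the following simplified Python sketch.  The real script runs against the project’s CMS database and also deals with borrowing records and image uploads; the sketch just works over in-memory page data and, importantly, assumes the folio numbers are tidy:

# Simplified sketch of the page-merge logic over in-memory page data.  The real
# script works against the CMS database and also handles image uploads; this
# version assumes tidy folio numbers, which, as it turned out, the real data
# did not have.
def merge_pages(pages):
    pages = sorted(pages, key=lambda p: p["folio"])
    merged = []
    for page in pages:
        if page["folio"] % 2 == 1 and merged and merged[-1]["folio"] == page["folio"] - 1:
            even = merged[-1]
            even["records"].extend(page["records"])                   # move records to the even page
            even["folio_label"] = f"{even['folio']}-{page['folio']}"  # e.g. 2 becomes '2-3'
            even["image"] = f"UYLY207-2_{even['folio_label']}"        # e.g. UYLY207-2_2-3
        else:
            page.setdefault("folio_label", str(page["folio"]))
            merged.append(page)
    for i, page in enumerate(merged):                                 # regenerate order and links
        page["order"] = i + 1
        page["prev"] = merged[i - 1]["folio_label"] if i > 0 else None
        page["next"] = merged[i + 1]["folio_label"] if i < len(merged) - 1 else None
    return merged
pages = [{"folio": n, "records": [f"rec{n}"]} for n in range(2, 8)]
for p in merge_pages(pages):
    print(p["folio_label"], p["records"], p["prev"], p["next"])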

This all worked so I ran the script on the server and updated the live data.  However, I then noticed that there are gaps in the folio numbers and this has messed everything up.  For example, folio number 314 isn’t followed by 315 but 320.  320 isn’t an odd number so it doesn’t get joined to 314.  All subsequent page joins are then messed up.  There are also two ‘350’ pages in the CMS and two images that reference 350.  We have UYLY207-2_349-350 and also UYLY207-2_350-351.  There might be other situations where the data isn’t uniform too.

I therefore had to use my ‘delete and reinsert’ script again to revert to the data prior to the update as my script wasn’t set up to work with pages that don’t just increment their folio number by 1 each time.  After some discussion with the RA I updated the script again so that it would work with the non-uniform data and thankfully all worked fine after that.  Later in the week I also found some time to process two further St Andrews registers that needed their pages and records merged, and thankfully these went much smoother.

I also worked on the Speech Star project this week.  I created a new page on both of the project’s sites (which are not live yet) for viewing videos of Central Scottish phonetic features.  I also replaced the temporary logos used on the sites with the finalised logos that had been designed by a graphic designer.  However, the new logo only really works well on a white background as the white cut-out round the speech bubble into the star becomes the background colour of the header.  The blue that we’re currently using for the site header doesn’t work so well with the logo colours.  Also, the graphic designer had proposed using a different font for the site and I decided to make a new interface for the site, which you can see below.  I’m still waiting for feedback to see whether the team prefer this to the old interface (a screenshot of which you can see on this page: https://digital-humanities.glasgow.ac.uk/2022-01-17/) but I personally think it looks a lot better.

I also returned to the Burns Manuscript database that I’d begun last week.  I added a ‘view record’ icon to each row which if pressed on opens a ‘card’ view of the record on its own page.  I also added in the search options, which appear in a section above the table.  By default, the section is hidden and you can show/hide it by pressing on a button.  Above this I’ve also added in a placeholder where some introductory text can go.  If you open the ‘Search options’ section you’ll find text boxes where you can enter text for year, content, properties and notes.  For year you can either enter a specific year or a range.  The other text fields are purely free-text at the moment, so no wildcards.  I can add these in but I think it would just complicate things unnecessarily.  On the second row are checkboxes for type, location name and condition.  You can select one or more of each of these.

The search options are linked by AND, and the checkbox options are linked internally by OR.  For example, filling in ‘1780-1783’ for year and ‘wrapper’ for properties will find all rows with a date between 1780 and 1783 that also have ‘wrapper’ somewhere in their properties.  If you enter ‘work’ in content and select ‘Deed’ and ‘Fragment’ as types you will find all rows that are either ‘Deed’ or ‘Fragment’ and have ‘work’ in their content.
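
The underlying logic is simple enough to sketch in a few lines.  The real site queries its database rather than filtering rows like this, and the field names are simplified:

# Small sketch of the search logic: the free-text fields and the year range are
# combined with AND, while the options ticked within a checkbox group are
# combined with OR.  The field names and rows here are simplified examples.
def matches(row, year_from=None, year_to=None, content="", types=None):
    if year_from is not None and row["year"] < year_from:
        return False
    if year_to is not None and row["year"] > year_to:
        return False
    if content and content.lower() not in row["content"].lower():
        return False
    if types and row["type"] not in types:   # OR within the checkbox group
        return False
    return True                              # every supplied criterion passed (AND)
rows = [
    {"year": 1781, "type": "Deed", "content": "Notes on work owed"},
    {"year": 1782, "type": "Fragment", "content": "A torn wrapper"},
    {"year": 1790, "type": "Deed", "content": "Record of work"},
]
print([r for r in rows if matches(r, 1780, 1783, "work", {"Deed", "Fragment"})])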

If a search option is entered and you press the ‘Search’ button the page will reload with the search options open, and the page will scroll down to this section.  Any rows matching your criteria will be displayed in the table below this.  You can also clear the search by pressing on the ‘Clear search options’ button.  In addition, if you’re looking at search results and you press on the ‘view record’ button the ‘Return to table’ button on the ‘card’ view will reload the search results.  That’s this mini-site completed now, pending feedback from the project team, and you can see a screenshot of the site with the search box open below:

Also this week I’d arranged an in-person coffee and catch up with the other College of Arts developers.  We used to have these meetings regularly before Covid but this was the first time since then that we’d all met up.  It was really great to chat with Luca Guariento, Stevie Barrett and David Wilson again and to share the work we’d been doing since we last met.  Hopefully we can meet again soon.

Finally this week I helped out with a few WordPress questions from a couple of projects and I also had a chance to update all of the WordPress sites I manage (more than 50) to the most recent version.

Week Beginning 25th July 2022

I was on holiday for most of the previous two weeks, working two days during this period.  I’ll also be on holiday again next week, so I’ve had quite a busy time getting things done.  Whilst I was away I dealt with some queries from Joanna Kopaczyk about the Future of Scots website.  I also had to investigate a request to fill in timesheets for my work on the Speak For Yersel project, as apparently I’d been assigned to the project as ‘Directly incurred’ when I should have been ‘Directly allocated’.  Hopefully we’ll be able to get me reclassified but this is still in-progress.  I also fixed a couple of issues with the facility to export data for publication for the Berwickshire place-name project for Carole Hough, and fixed an issue with an entry in the DSL, which was appearing in the wrong place in the dictionary.  It turned out that the wrong ‘url’ tag had been added to the entry’s XML several years ago and since then the entry was wrongly positioned.  I fixed the XML and this sorted things.  I also responded to a query from Geert of the Anglo-Norman Dictionary about Aberystwyth’s new VPN and whether this would affect his access to the AND.  I also investigated an issue Simon Taylor was having when logging into a couple of our place-names systems.

On the Monday I returned to work I launched two new resources for different projects.  For the Books and Borrowing project I published the Chambers Library Map (https://borrowing.stir.ac.uk/chambers-library-map/) and reorganised the site menu to make space for the new page link.  The resource has been very well received and I’m pretty pleased with how it’s turned out.  For the Seeing Speech project I launched the new Gaelic Tongues resource (https://www.seeingspeech.ac.uk/gaelic-tongues/) which has received a lot of press coverage, which is great for all involved.

I spent the rest of the week dividing my time primarily between three projects:  Speak For Yersel, Books and Borrowing and Speech Star.  For Books and Borrowing I continued processing the backlog of library register image files that has built up.  There were about 15 registers that needed to be processed, and each needed to be handled in a different way.  This included nine registers from Advocates Library that had been digitised by the NLS, for which I needed to batch process the images to rename them, delete blank pages, create page records in the CMS and then tweak the automatically generated folio numbers to account for discrepancies in the handwritten page number in the images.  I also processed a register for the Royal High School, which involved renaming the images so they match up with image numbers already assigned to page records in the CMS, inserting new page records and updating the ‘next’ and ‘previous’ links for pages for which new images had been uncovered and generating new page records for many tens of new pages that follow on from the ones that have already been created in the CMS.  I also uploaded new images for the Craigston register and created a new register including all page records and associated image URLs for a further register for Aberdeen.  I still have some further RHS registers to do and a few from St Andrews, but these will need to wait until I’m back from my holiday.

For Speech Star I downloaded a ZIP containing 500 new ultrasound MP4 videos.  I then had to process them to generate ‘poster’ images for each video (these are images that get displayed before the user chooses to play the video).  I then had to replace the existing normalised speech database with data from a new spreadsheet that included these new videos plus updates to some of the existing data.  This included adding a few new fields and changing the way the age filter works, as much of the new data is for child speakers who have specific ages in months and years, and these all need to be added to a new ‘under 18’ age group.
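
Generating poster images from a batch of MP4s is the sort of thing that can be scripted with ffmpeg.  The following is a rough sketch of that kind of batch job rather than the exact process I used, and the folder names are placeholders:

# Sketch of batch-generating poster images by grabbing a single frame from each
# MP4 with ffmpeg.  The folder names are placeholders and ffmpeg needs to be on
# the system path.
import pathlib
import subprocess
video_dir = pathlib.Path("ultrasound_videos")   # placeholder
poster_dir = pathlib.Path("posters")            # placeholder
poster_dir.mkdir(exist_ok=True)
for video in sorted(video_dir.glob("*.mp4")):
    poster = poster_dir / (video.stem + ".jpg")
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", "00:00:01",        # seek one second in to avoid a black first frame
        "-i", str(video),
        "-frames:v", "1",         # write a single frame
        str(poster),
    ], check=True)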

For Speak For Yersel I had an awful lot to do.  I started with a further large-scale restructuring of the website following feedback from the rest of the team.  This included changing the site menu order, adding in new final pages to the end of surveys and quizzes and changing the text of buttons that appear when displaying the final question.

I then developed the map filter options for age and education for all of the main maps.  This was a major overhaul of the maps.  I removed the slide up / slide down of the map area when an option is selected as this was a bit long and distracting.  Now the map area just updates (although there is a bit of a flicker as the data gets replaced).  The filter options unfortunately make the options section rather big, which is going to be an issue on a small screen.  On my mobile phone the options section takes up 100% of the width and 80% of the height of the map area unless I press the ‘full screen’ button.  However, I figured out a way to ensure that the filter options section scrolls if the content extends beyond the bottom of the map.

I also realised that if you’re in full screen mode and you select a filter option the map exits full screen as the map section of the page reloads.  This is very annoying, but I may not be able to fix it as it would mean completely changing how the maps are loaded.  This is because such filters and options were never intended to be included in the maps and the system was never developed to allow for this.  I’ve had to somewhat shoehorn in the filter options and it’s not how I would have done things had I known from the beginning that these options were required.  However, the filters work and I’m sure they will be useful.  I’ve added in filters for age, education and gender, as you can see in the following screenshot:

I also updated the ‘Give your word’ activity that asks to identify younger and older speakers to use the new filters too.  The map defaults to showing ‘all’ and the user then needs to choose an age.  I’m still not sure how useful this activity will be as the total number of dots for each speaker group varies considerably, which can easily give the impression that more of one age group use a form compared to another age group purely because one age group has more dots overall.  The questions don’t actually ask anything about geographical distribution so having the map doesn’t really serve much purpose when it comes to answering the question.  I can’t help but think that just presenting people with percentages would work better, or some other sort of visualisation like a bar graph or something.

I then moved on to working on the quiz for ‘she sounds really clever’ and so far I have completed both the first part of the quiz (questions about ratings in general) and the second part (questions about listeners from a specific region and their ratings of speakers from regions).  It’s taken a lot of brain-power to get this working as I decided to make the system work out the correct answer and to present it as an option alongside randomly selected wrong answers.  This has been pretty tricky to implement (especially as depending on the question the ‘correct’ answer is either the highest or the lowest) but will make the quiz much more flexible – as the data changes so will the quiz.
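
The core of it is working out which region has the highest (or lowest) average rating and then padding this out with randomly chosen wrong answers, something like the following simplified sketch (the ratings are invented):

# Simplified sketch of building a quiz question from the ratings data: find the
# region with the highest (or lowest) average rating, then mix it with randomly
# chosen wrong answers.  The ratings below are invented.
import random
ratings = {
    "Glasgow": [62, 70, 58],
    "Edinburgh": [55, 61],
    "Aberdeen": [48, 52, 50],
    "Dundee": [66, 64],
}
def build_question(ratings, want_highest=True, num_options=4):
    averages = {region: sum(vals) / len(vals) for region, vals in ratings.items()}
    correct = max(averages, key=averages.get) if want_highest else min(averages, key=averages.get)
    wrong = random.sample([r for r in averages if r != correct], num_options - 1)
    options = wrong + [correct]
    random.shuffle(options)
    return correct, options
correct, options = build_question(ratings, want_highest=True)
print("Options:", options, "Correct:", correct)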

Part one of the quiz page itself is pretty simple.  There is the usual section on the left with the question and the possible answers.  On the right is a section containing a box to select a speaker and the rating sliders (readonly).  When you select a speaker the sliders animate to their appropriate location.  I decided to not include the map or the audio file as these didn’t really seem necessary for answering the questions, they would clutter up the screen and people can access them via the maps page anyway (well, once I move things from the ‘activities’ section).  Note that the user’s answers are stored in the database (the region selected and whether this was the correct answer at the time).  Part two of the quiz features speaker/listener true/false questions and this also automatically works out the correct answer (currently based on the 50% threshold).  Note that where there is no data for a listener rating a speaker from a region the rating defaults to 50.  We should ensure that we have at least one rating for a listener in each region before we let people answer these questions.  Here is a screenshot of part one of the quiz in action, with randomly selected ‘wrong’ answers and a dynamically outputted ‘right’ answer:

I also wrote a little script to identify duplicate lexemes in categories in the Historical Thesaurus as it turns out there are some occasions where a lexeme appears more than once (with different dates) and this shouldn’t happen.  These will need to be investigated and the correct dates will need to be established.
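
Spotting these is a simple grouping job.  The real check runs against the Historical Thesaurus database, but the idea is along these lines (with invented rows):

# Sketch of spotting duplicate lexemes within a category by counting
# (category, lexeme) pairs.  The real check runs against the Historical
# Thesaurus database; these rows are invented.
from collections import Counter
rows = [
    {"catid": 101, "lexeme": "steed", "dates": "1400-1600"},
    {"catid": 101, "lexeme": "steed", "dates": "1450-1700"},  # duplicate with different dates
    {"catid": 101, "lexeme": "charger", "dates": "1500-1800"},
    {"catid": 202, "lexeme": "steed", "dates": "1400-1600"},  # same lexeme, different category: fine
]
counts = Counter((row["catid"], row["lexeme"]) for row in rows)
for (catid, lexeme), n in counts.items():
    if n > 1:
        print(f"Category {catid}: '{lexeme}' appears {n} times")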

I will be on holiday again next week so there won’t be another post until the week after I’m back.