Week Beginning 21st November 2022

I participated in the UCU strike action on Thursday and Friday this week, so it was a three-day week for me.  I spent much of this time researching RDF, SKOS and OWL semantic web technologies in an attempt to understand them better in the hope that they might be of some use for future thesaurus projects.  I’ll be giving a paper about this at a workshop in January so I’m not going to say too much about my investigations here.  There is a lot to learn about, however, and I can see me spending quite a lot more time on this in the coming weeks.

Other than this I returned to working on the Anglo-Norman Dictionary.  I added in a line of text that had somehow been omitted from one of the Textbase XML files and added in a facility to enable project staff to delete an entry from the dictionary.  In reality this just deactivates the entry, removing it from the front-end but still keeping a record of it in the database in case the entry needs to be reinstated.  I also spoke to the editor about some proposed changes to the dictionary management system and began to think about how these new features will function and how they will be developed.

For the Books and Borrowing project I had a chat with IT Services at Stirling about setting up an Apache Solr system for the project.  It’s looking like we will be able to proceed with this option, which will be great.  I also had a chat with Jennifer Smith about the new Speak For Yersel project areas.  It looks like I’ll be creating the new resources around February next year.  I also fixed an issue with the Place-names of Iona data export tool and discussed a new label that will be applied to data for the ‘About’ box for entries in the Dictionaries of the Scots Language.

I also prepared for next week’s interview panel and engaged in a discussion with IT Services about the future of the servers that are hosted in the College of Arts.

Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow.  Facetted searching would still require several other database queries to be executed, as would extracting all of the fields necessary to display the search results, and it seemed inadvisable to try to create all of this functionality myself when an existing package like Solr could already do it all.

Solr is considerably faster than the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously, which would be hopelessly slow if I attempted to create something comparable directly with the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

"London", 2211
"Edinburgh", 119
"Dublin", 100
"Paris", 30
"Edinburgh; London", 16
"Cambridge", 4
"Eton", 3
"Oxford", 3
"The Hague", 3
"Naples", 2
"Rome", 2
"Berlin", 1
"Glasgow", 1
"Lausanne", 1
"Venice", 1
"York", 1

Format:

"8vo", 842
"octavo", 577
"4to", 448
"quarto", 433
"4to.", 88
"8vo., plates, port., maps.", 88
"folio", 76
"duodecimo", 67
"Folio", 33
"12mo", 19
"8vo.", 17
"8vo., plates: maps.", 16

Borrower gender:

"Male", 3128
"Unknown", 109
"Female", 64
"Unclear", 2

These would then allow me to build in options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database, it is likely that such an approach would be much slower and would take time to develop.  It seems much more sensible to use an existing solution if this is going to be possible.
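To give a concrete (and purely illustrative) idea of what such a facetted query looks like, the request below assumes a local Solr core named ‘borrowings’ on the default port and uses field names from the simplified record structure described later in this post; the rows=0 parameter asks Solr to return only the facet counts rather than the documents themselves:

http://localhost:8983/solr/borrowings/select?q=standardisedtitle:Roman&rows=0&facet=true&facet.field=pubplaces&facet.field=formats&facet.field=bgenders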

In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes and included a lot of data that isn’t really needed either, as the returned data contains everything required to display the full borrowing record.  I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could use in a simplified search results page.  Here’s an example:

{
  "bnid": 1379,
  "lid": 6,
  "slug": "glasgow-university",
  "lname": "Glasgow University Library",
  "rid": 2,
  "rname": "3",
  "syear": 1760,
  "eyear": 1765,
  "rtype": "Student",
  "pid": 107,
  "fnum": "4r",
  "transcription": "Euseb: Eclesiastical History",
  "bday": 17,
  "bmonth": 9,
  "byear": 1760,
  "rday": 1,
  "rmonth": 10,
  "ryear": 1760,
  "borrowed": "1760-09-17",
  "returned": "1760-10-01",
  "bdayofweek": "Wednesday",
  "rdayofweek": "Wednesday",
  "originaltitle": "",
  "standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius.",
  "brids": ["1"],
  "bfnames": ["Charles"],
  "bsnames": ["Wilson"],
  "bfullnames": ["Charles Wilson"],
  "boccs": ["University Student", "Education"],
  "bgenders": ["Male"],
  "aids": ["74"],
  "asnames": ["Eusebius of Caesarea"],
  "afullnames": [" Eusebius of Caesarea"],
  "beids": ["88"],
  "edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius."],
  "estcs": ["R21513"],
  "langs": ["English"],
  "pubplaces": ["London"],
  "formats": ["folio"]
}

I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server.  I then created a Solr Core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations, some fields are text strings, some are integers, some are dates). I then ran the Solr script that ingests the data.
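To give a rough idea of what this involves, here is an illustrative excerpt showing how single-valued and multi-valued fields of different types can be declared in a Solr managed-schema file.  This is not the exact schema I used, and the types chosen (particularly for the date field) are assumptions; the field names correspond to the JSON structure above:

<field name="bnid" type="string" indexed="true" stored="true" required="true"/>
<field name="transcription" type="text_general" indexed="true" stored="true"/>
<field name="standardisedtitle" type="text_general" indexed="true" stored="true"/>
<field name="byear" type="pint" indexed="true" stored="true"/>
<field name="borrowed" type="pdate" indexed="true" stored="true"/>
<!-- multi-valued fields: a borrowing can involve several borrowers, occupations, authors etc. -->
<field name="boccs" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="pubplaces" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="formats" type="string" indexed="true" stored="true" multiValued="true"/>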

It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected.  These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.

Some of the issues encountered include the following (a rough sketch of how the export script can guard against them appears after this list):

  1. Empty fields in the data resulting in no data for the corresponding JSON field (e.g. “bday”: <nothing here> ) which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included.
  2. Solr’s date fields requiring a full date (e.g. 1792-02-16), meaning partial dates (e.g. 1792) failed. I ended up reverting to an integer field for returned dates, as these are generally much more vague, and generating placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed and it took some time to investigate why my nice integer ID was failing.
  4. Realising more fields should be added to the JSON output as I went on and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors )
  5. Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quot;’ as their appearance in the JSON caused the structure to be invalid.  I therefore updated the transcription, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names.
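For illustration only (the real export script runs against our database and handles many more fields than this), here is a minimal Python sketch of the kinds of guard the export needs: skipping empty fields, padding partial borrowed dates with placeholder values, treating the ID as a string and, as an alternative to entity-encoding the quotes, letting a JSON library handle the escaping:

import json

def clean_record(raw):
    """Build the simplified dict for one borrowing record, applying the guards described above."""
    record = {}
    for key, value in raw.items():
        # Issue 1: skip empty values entirely rather than emitting e.g. '"bday": ' with nothing after it
        if value is None or value == "" or value == []:
            continue
        record[key] = value
    # Issue 3: Solr's required id field must be a string, not an integer
    record["id"] = str(raw["bnid"])
    # Issue 2: pad a year-only borrowed date with a placeholder month and day
    if "byear" in record and "bmonth" not in record:
        record["borrowed"] = f"{record['byear']}-01-01"
    return record

def export_record(raw, path):
    # Issue 5: json.dumps escapes any embedded double quotes automatically
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(clean_record(raw), ensure_ascii=False))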

However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well.  And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9, https://solr.apache.org/downloads.html) to be installed on a server.  This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and, as far as I know, it fires up a Java-based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs, a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions for doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API), so we could limit access to the IP address of the server the API runs on, or, if Solr is going to be installed on the same server, limiting access to localhost could work.

In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud).  Once Solr is running we’d need a Core to be created.  I have the schema file the core would require and can give instructions about setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.
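For reference, and assuming a default single-node installation with a core named ‘borrowings’ (the name and paths here are illustrative), the commands involved would be along these lines, run from the Solr installation directory:

# create the core
bin/solr create -c borrowings

# add the field definitions (via the Schema API or by editing the core's managed-schema file)

# ingest the exported JSON files
bin/post -c borrowings /path/to/exported-json/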

One downside to using Solr is that it is a separate system from the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently, as exporting the data takes at least an hour and transferring the files to the server for ingest will also take a long time (uploading hundreds of thousands of small files to a server can take hours; zipping them up, uploading the zip file and extracting it also takes a long time).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.

Other than the above B&B work I spent a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week.  I also began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when an advanced ‘quotes only’ search is performed.  So for example, in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.

Sic dreich wark. . . . For lang I tholed an’ fendit.

Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.

And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.

It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.

Driche and sair yer pain.

And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.

See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.

</field>

The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize variable, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”.  We don’t currently override this default so 100 characters is what we use per snippet (roughly – it can extend more than this to ensure complete words are displayed).

The hl.snippets variable specifies the maximum number of highlighted snippets that are returned per entry and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed and this is because the maximum number of snippets has been reached.  ‘Dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
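For reference, a highlighting request of this kind uses parameters along the following lines (illustrative and simplified; the actual query includes other parameters, and we currently just rely on the default fragment size):

q=searchtext_onlyquotes:dreich
hl=true
hl.fl=searchtext_onlyquotes
hl.fragsize=100
hl.snippets=10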

As the quotations block of text is just one massive block and isn’t split into individual quotations, the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

which actually comprises the text from almost the entire first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”

includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but ‘dreigher’, which is not highlighted).

So essentially, while the snippets may look like they correspond to individual quotes, this is absolutely not the case: the highlighted word is generally positioned around the middle of roughly 100 characters of text that can span several quotations.  It also means that it is currently not possible to limit a search to two terms appearing within a single quotation, because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on, such as part of speech.  This will ensure that snippets will in future only feature text from the quotation in question and that Boolean searches will be limited to text within individual quotations.  However, it is a major change that will require some time and experimentation to get working correctly, and it may introduce other unforeseen issues.

I will need to change the way the search data is stored in Solr and how the data is generated for ingest into Solr.  The display of the search results will need to be reworked, as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry, as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work, as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry.  I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.
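To make the proposed restructuring a little more concrete, a quotation-level Solr document might look something like the sketch below (field names and values are purely illustrative, not final):

{
  "id": "snd-dreich-q1",
  "entryid": "snd/dreich",
  "headword": "Dreich adj.",
  "quote": "I think you will say yourself it is a dreich business.",
  "quoteyear": 1900
}

Solr’s built-in result grouping could then gather quotations back under their parent entries: a query against documents structured like this could include parameters such as group=true&group.field=entryid&group.limit=10 so that results come back grouped by entry, with a capped number of quotations per entry.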

The DSL team has proposed that a date search could be provided as a filter on the search results page and we would certainly be able to do this, and incorporate other filters such as POS in future.  This is something called ‘facetted searching’ and it’s the kind of thing you see in online shops:  you view the search results then you see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results the filter applies to.  The good news is that Solr has these kind of faceting options built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.

Week Beginning 7th November 2022

On Monday I participated in an event about Digital Humanities in the College of Arts that Luca had organised, at which I discussed the Books and Borrowing project.  It was a good event and I hope there will be more like it in future.  Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/), an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project.  I will be very interested to hear how Curious Travellers gets on with the tool in future.  Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before, which allows you to upload a text and then access many analysis and visualisation tools.  It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.

Also this week I had to prepare for and participate in a candidate shortlisting session for a new systems developer post in the College of Arts, and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host.  We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily.  As yet we’ve still not heard anything further from IT Services, so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of, or whether they relate to the code or the underlying server infrastructure.  Hopefully we’ll hear more soon.

The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project.  One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that.  Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.

I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box.  This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page.  The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover.  For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records.  Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.

The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page.  The only difference is that additional information about the library, register and page the borrowing record appears on is displayed at the top of the record.  These appear as links, and if you press on the page link this will open the page centred on the selected borrowing record.  For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:

The non-date search also works, but is currently a bit too slow.  For example, a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long.  Currently non-date quick searches do a very simple find and replace to highlight the matched text in all relevant fields.  This currently makes the matched text upper case, but I don’t intend to leave it like this.  You can also search for things like ESTC numbers.

However, there are several things I’m not especially happy about:

  1. The speed issue: the current approach is just too slow.
  2. Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed or by borrower name) many more fields in addition to the borrowing ID would need to be returned, potentially for thousands of records, and this is going to be too slow with the current data structure.
  3. The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s a bit hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this.  Should we have a more compact view of results?  If so, what data should be displayed?  The difficulty is that if we omit a field that happens to be the only field containing the user’s search term it’s potentially going to be very confusing.
  4. This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores: you see the search results and then there are checkboxes that allow you to narrow down the results.  For example, we could have checkboxes containing all occupations in the results, allowing the user to select one or more, or checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
  5. Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing the occupations or genders of the borrowers in the search results etc.  I feel that we need some sort of summary information, as the results themselves are just too detailed to easily give an overall picture.

I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon).  It does a lot of the things I’d like to implement (graphs, facetted search results), it does it all very speedily with a pleasing interface, and I think we could learn a lot from it.

Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/), a free search platform that is much faster than a traditional relational database and provides options for facetted searching.  We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it.  Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data.  But if we are potentially going to use Solr then we would need to install it on a server at Stirling.  Stirling’s IT people might be ok with this (they did allow us to set up an IIIF server for our images, after all) but we’d need to check.  I should have a better idea as to whether Solr is what we need by the end of next week, all being well.

Also this week I spent some time working on the Speech Star project.  I updated the database to highlight key segments in the ‘target’ field which had been highlighted in the original spreadsheet version of the data by surrounding the segment with bar characters.  I’d suggested this as when exporting data from Excel to a CSV file all Excel formatting such as bold text is lost, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field.  This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier but I got there in the end.  After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:

I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.

Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.

Week Beginning 31st October 2022

I spent a lot of the week continuing to work on the Books and Borrowing front end.  To begin with I worked on the ‘borrowers’ tab in the ‘library’ page and created an initial version of it.  Here’s an example of how it looks:

As with books, the page lists borrowers alphabetically, in this case by borrower surname.  Letter tabs and counts of the number of borrowers with surnames beginning with the letter appear at the top and you can select a letter to view all borrowers with surnames beginning with the letter.  I had to create a couple of new fields in the borrower table to speed the querying up, saving the initial letter of each borrower’s surname and a count of their borrowings.

The display of borrowers is similar to the display of books, with each borrower given a box that you can press on to highlight.  Borrower ID appears in the top right and each borrower’s full name appears as a green title.  The name is listed as it would be read, but this could be updated if required.  I’m not sure where the ‘other title’ field would go if we did this, though – presumably something like ‘Macdonald, Mr Archibald of Sanda’.

The full information about a borrower is listed in the box, including additional fields and normalised occupations.  Cross references to other borrowers also appear.  As with the ‘Books’ tab, much of this data will be linked to search results once I’ve created the search options (e.g. press on an occupation to view all borrowers with this occupation, press on the number of borrowings to view the borrowings) but this is not in place yet.  You can also change the view from ‘surname’ to ‘top 100 borrowers’, which lists the top 100 most prolific borrowers (or less if there are less than 100 borrowers in the library).  As with the book tab, a number appears at the top left of each record to show the borrower’s place on the ‘hitlist’ and the number of borrowings is highlighted in red to make it easier to spot.

I also fixed some issues with the book and author caches that were being caused by spaces at the start of fields and author surnames beginning with a non-capitalised letter (e.g. ‘von’) which was messing things up as the cache generation script was previously only matching upper case, meaning ‘v’ wasn’t getting added to ‘V’.  I’ve regenerated the cache to fix this.

I then decided to move onto the search rather than the ‘Facts & figures’ tab as I reckoned this should be prioritised.  I began work on the quick search initially, and I’m still very much in the middle of this.  The quick search has to search an awful lot of fields, and to do this several different queries need to be run.  I’ll need to see how this works in terms of performance as I fear the ‘quick’ search risks being better named the ‘slow’ search.

We’ve stated that users will be able to search for dates in the quick search, and these need to be handled differently.  For now the API checks to see whether the passed search string is a date by running a pattern match on the string.  This converts all digits in the string into an ‘X’ character and then checks to see whether the resulting string matches a valid date form.  For the API I’m using a bar character (|) to designate a ranged date and a dash to designate a division between day, month and year.  I can’t use a slash (/) as the search string is passed in the URL and slashes have meaning in URLs.  For info, here are the valid date string patterns:

"XXXX", "XXXX-XX", "XXXX-XX-XX", "XXXX|XXXX", "XXXX|XXXX-XX", "XXXX|XXXX-XX-XX", "XXXX-XX|XXXX", "XXXX-XX|XXXX-XX", "XXXX-XX|XXXX-XX-XX", "XXXX-XX-XX|XXXX", "XXXX-XX-XX|XXXX-XX", "XXXX-XX-XX|XXXX-XX-XX"

So for example, if someone searches for ‘1752’ or ‘1752-03’ or ‘1752-02|1755-07-22’ the system will recognise these as a date search and process them accordingly.  I should point out that I can and probably will get people to enter dates in a more typical way in the front-end, using slashes between day, month and year and a dash between ranged dates (e.g. ‘1752/02-1755/07/22’) but I’ll convert these before passing the string to the API in the URL.
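As an illustration of the approach (the real check lives in the API code, so this is just a sketch of the logic rather than the actual implementation), the pattern match can be as simple as replacing every digit with an ‘X’ and testing the result against the list of valid forms:

import re

VALID_DATE_PATTERNS = {
    "XXXX", "XXXX-XX", "XXXX-XX-XX",
    "XXXX|XXXX", "XXXX|XXXX-XX", "XXXX|XXXX-XX-XX",
    "XXXX-XX|XXXX", "XXXX-XX|XXXX-XX", "XXXX-XX|XXXX-XX-XX",
    "XXXX-XX-XX|XXXX", "XXXX-XX-XX|XXXX-XX", "XXXX-XX-XX|XXXX-XX-XX",
}

def is_date_search(term):
    # Replace every digit with 'X' and see whether the result is a recognised date form
    return re.sub(r"[0-9]", "X", term) in VALID_DATE_PATTERNS

# e.g. is_date_search("1752-02|1755-07-22") -> True; is_date_search("Xenophon") -> False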

I have the query running to search the dates, and this in itself was a bit complicated to generate, as including a month or a month and a day in a ranged query changes the way the query needs to work.  E.g. if the user searches for ‘1752-1755’ then we need to return all borrowing records with a borrowed year of ‘1752’ or later and ‘1755’ or earlier.  However, if the query is ‘1752/06-1755/03’ then the query can’t simply be ‘all borrowing records with a borrowed year of 1752 or later and a borrowed month of 06 or later and a borrowed year of 1755 or earlier and a borrowed month of 03 or earlier’, as this would return no results: it would be looking for borrowings with a borrowed month that is both ‘06’ or later and ‘03’ or earlier.  Instead the query needs to find borrowing records that either have a borrowed year of 1752 and a borrowed month of ‘06’ or later, or have a borrowed year later than 1752; and that either have a borrowed year of 1755 and a borrowed month of ‘03’ or earlier, or have a borrowed year earlier than 1755.
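Expressed as a simplified, illustrative SQL condition over hypothetical byear / bmonth columns, the ‘1752/06-1755/03’ example becomes:

WHERE ((byear = 1752 AND bmonth >= 6) OR byear > 1752)
  AND ((byear = 1755 AND bmonth <= 3) OR byear < 1755)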

I also have the queries running that search for all necessary fields that aren’t dates.  This currently requires five separate queries to be run to check fields like author names, borrower occupations, book edition fields such as ESTC etc.  The queries currently return a list of borrowing IDs, and this is as far as I’ve got.  I’m wondering now whether I should create a cached table for the non-date data queried by the quick search, consisting of a field for the borrowing ID and a field for the term that needs to be searched, with each borrowing having many rows depending on the number of terms they have (e.g. a row for each occupation of every borrower associated with the borrowing, a row for each author surname, a row for each forename, a row for ESTC).  This should make things much speedier to search, but will take some time to generate.  I’ll continue to investigate this next week.
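If I do go down the cached-table route, the structure itself would be very simple – something along these lines (illustrative MySQL, with both columns indexed so lookups stay fast):

CREATE TABLE quick_search_terms (
  borrowing_id INT NOT NULL,
  term VARCHAR(255) NOT NULL,
  INDEX (borrowing_id),
  INDEX (term)
);
-- one row per searchable value, e.g. each borrower occupation, author surname, forename or ESTC number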

Also this week I updated the structure of the Speech Star database to enable each prompt to have multiple sounds etc.  I had to update the non-disordered page and the child error page to work with the new structure, but it seems to be working.  I also had to update the ‘By word’ view as previously sound, articulation and position were listed underneath the word and above the table.  As these fields may now be different for each record in the table I’ve removed the list and have instead added the data as columns to the table.  This does however mean that the table contains a lot of identical data for many of the rows now.

I then added in tooltip / help text containing information about what the error types mean in the child speech error database.  On the ‘By Error Type’ page the descriptions currently appear as small text to the right of the error type title.  On the ‘By Word’ page the error type column has an ‘i’ icon after the error type.  Hovering over or pressing on this displays a tooltip with the error description, as you can see in the following screenshot:

I also updated the layout of the video popups to split the metadata across two columns and also changed the order of the errors on the ‘By error type’ page so that the /r/ errors appear in the correct alphabetical order for ‘r’ rather than appearing first due to the text beginning with a slash.  With this all in place I then replicated the changes on the version of the site that is going to be available via the Seeing Speech URL.

Kirsteen McCue contacted me last week to ask for advice on a British Academy proposal she’s putting together and after asking some questions about the project I wrote a bit of text about the data and its management for her.  I also sorted out my flights and accommodation for the workshop I’m attending in Zurich in January and did a little bit of preparation for a session on Digital Humanities that Luca Guariento has organised for next Monday.  I’ll be discussing a couple of projects at this event.  I also exported all of the Speak For Yersel survey data and sent this to a researcher who is going to do some work with the data and fixed an issue that had cropped up with the Place-names of Kirkcudbright website.  I also spent a bit of time on DSL duties this week, helping with some user account access issues and discussing how links will be added from entries to the lengthy essay on the Scots Language that we host on the site.


Week Beginning 24th October 2022

I returned to work this week after having a lovely week’s holiday in the Lake District.  I spent most of the week working for the Books and Borrowing project.  I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system.  This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS.  Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’.  Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS.  For example, in Volume 1 we have ‘1010199l’ followed by ‘1010203l’ followed by ‘1010205l’ and then ‘1010205r’.  This seems to be quite correct as the missing pages don’t contain borrowing records.  However, I still needed to figure out how to match up images and page records.  As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image.  There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image.  This left joining up the page records in the CMS as the best option, and I wrote a script to join the page records, rename them to remove the ‘L’ and ‘R’ affixes, move all borrowing records across, renumber their page order and then delete the now empty pages.  Thankfully it all seemed to work well.  I also uploaded the images for the final register from the Royal High School, which thankfully was a much more straightforward process as all image files matched references already stored in the CMS for each page.

I then returned to the development of the front-end for the project.  When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load.  After a bit of investigation I realised that this was caused by new line characters appearing in the JSON data for the map, which was invalidating the file structure.  These had been added in via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map.  I needed to update the script that generated the JSON data to ensure that new line characters were stripped out of the data, and after that the maps loaded again.

Before I went on holiday I’d created a browse page for library books that split the library’s books up based on the initial letter of their titles.  The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue due to it containing many more books than the other libraries (more than 8,500).  Project Co-I Matt Sangster suggested that we should omit some registers from the data as their contents (including book records) are not likely to be worked on during the course of the project.  However, I decided to just leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code.  The book data for a library is associated directly with the library record and not the specific registers and all the queries would need to be rewritten to check which registers a book appears in.  I reckon that if these registers are not going to be tackled by the project it might be better to just delete them, not just to make the system work better but to avoid confusing users with messy data, but I decided to leave everything as it is for now.

This week I added in two further ways of browsing books in a selected library:  By author and by most borrowed.  A drop-down list featuring the three browse options appears at the top of the ‘Books’ page now, and I’ve added in a title and explanatory paragraph about the list type. The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a list of initial letter tabs featuring the initial letter of the author’s surname and a count of the number of books that have an author with a surname beginning with this letter.  Note that any books that don’t have an associated author do not appear in this list.  I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load in.  Note also that if a book has multiple authors then the book will appear multiple times – once for each author.  Here’s a screenshot of how the interface currently looks:

The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books.  The ordering of the records is by author surname then forename then author ID then book title.  This means two authors with the same name will still appear as separate headings with their books ordered alphabetically.  However, this has also uncovered some issues with duplicate author records.

Getting this browse list working actually took a huge amount of effort due to the complex way we store authors.  In our system an author can be associated with any one of four levels of book record (work / edition / holding / item) and an author associated at a higher level needs to cascade down to lower level book records.  Running queries directly on this structure proved to be too resource intensive and slow so instead I wrote a script to generate cached data about authors.  This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record.  It then stores a reference to the ID of the author, the holding record and the initial letter of the author’s surname in a new table that is much more efficient to reference.  This then gets used to generate the letter tabs with the number of book counts and to work out which books to return when an author surname beginning with a letter is selected.

However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes / additions to authors made in the CMS will not be directly reflected in the library books tab.  This is also true of the ‘browse books by title’ lists I previously created.  I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised it was because the title must have been changed in the CMS since I last generated the cached data.

The ’most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed.  Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view.  I did consider whether to have tabs allowing you to view all of the books by number of borrowings, but I wasn’t really sure how useful this would be.  In terms of the display of the ‘top 100’ the books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see.  I’ve also added in a number to the top-left of the book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:

I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot).  Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.

Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023.  As it’s the first ‘in person’ DH conference to be held in Europe since 2019 I suspect there will be a huge number of paper submissions, so we’ll just need to see if it gets accepted or not.  Also for Speak For Yersel I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas.  The biggest issue here would be generating the data about the areas:  settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files.  It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region.  This might be a couple of weeks of effort for each area.  It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface.  We’ll see how this develops.  I also wrote a script to export the survey data for further analysis.

Another project I spent some time on this week was Speech Star.  For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me.  I imported all of the data into the same database as is used for the non-disordered speech database and have added a flag that decides which content is displayed in which page.  I removed ‘accent’ as a filter option (as all speakers are from the same area) and have added in ‘error type’.  Currently the ‘age’ filter defaults to the age group 0-17 as I wasn’t sure how this filter should work, as all speakers are children.

The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab.  In the new page these tabs are for ‘Error type’ and ‘word’.  I also added in ‘phonemic target’ and ‘phonetic production’ as new columns in the table as I thought it would be useful to include these, and I updated the video pop-up for both the new page and the non-disordered page to bring it into line with the popup for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below.  I’ve ensured this is exactly the same for the ‘multiple video’ display too.  At the moment the metadata all just appears on one long line (other than speaker ID, sex and age) so the full width of the popup is used, but we might change this to a two-column layout.

Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet and I therefore replaced the data.  However, I spotted something relating to the way I structure the data that might be an issue.  I’d noticed a typo in the earlier spreadsheet (there is a ‘helicopter’ and a ‘helecopter’) and had fixed it, but I forgot to apply the same fix before uploading the newer file.  Each prompt is only stored once in the database, even if it is used by multiple speakers, so I was going to go into the database, remove the ‘helecopter’ prompt row that didn’t need to be generated and point the speaker to the existing ‘helicopter’ prompt.  However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’.  I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database and here the sound is indeed ‘l’.  It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with.  I’m going to have to update the structure next week.

Also this week I responded to a request for advice from David Featherstone in Geography who is putting together some sort of digitisation project.  I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary.  She wondered whether I had the original XML and I explained that there was no original XML.  The original data was stored in an ancient Foxpro database that ran from a CD.  When I created the original School Dictionary app I managed to find a way to extract the data and I saved it as two CSV files – one English-Scots the other Scots-English.  I then ran a script to convert this into JSON which is what the original app uses.  I gave Pauline a link to download all of the data for the app, including both English and Scots JSON files and the sound files and I also uploaded the English CSV file in case this would be more useful.

That’s all for this week.  Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.

Week Beginning 10th October 2022

I spent quite a bit of time finishing things off for the Speak For Yersel project.  I created a stats page for the project team to access.  The page allows you to specify a ‘from’ and ‘to’ date (it defaults to showing stats from the end of May to the end of the current day).  If you want a specific day you can enter the same date in ‘from’ and ‘to’ (e.g. ‘2022-10-04’ will display stats for everyone who registered on the Tuesday after the launch).

The stats relate to users registered in the selected period rather than answers submitted in the selected period. If a person registered in the selected period then all of their answers are included in the figures, whether they were submitted in the period or not. If a person registered outside of the selected period but submitted answers during the selected period these are not included.

The stats display the total number of users registered in the selected period, split into the number who chose a location in Scotland and those who selected elsewhere.  Then the total number of survey answers submitted by these two groups are shown, divided into separate sections for the five surveys.  I might need to update the page to add more in at a later date.  For example, one thing that isn’t shown is the number of people who completed each survey as opposed to only answering a few questions.  Also, I haven’t included stats about the quizzes or activities yet, but these could be added.

I also worked on an abstract about the project for the Digital Humanities 2023 conference.  In preparation for this I extracted all of the text relating to the project from this blog as a record of the development of the project.  It’s more than 21,000 words long and covers everything from our first team discussions about potential approaches in September last year through to the launch of the site last week.  I then went through this and pulled out some of the more interesting sections relating to the generation of the maps, the handling of user submissions and the automatic generation of quiz answers based on submitted data.  I sent this to Jennifer for feedback and then wrote a second version.  Hopefully it will be accepted for the conference, but even if it’s not I’ll hopefully be able to go as the DH conference is always useful to attend.

Also this week I attended a talk about a lemmatiser for Anglo-Norman that some researchers in France have developed using the Anglo-Norman dictionary.  It was a very interesting talk and included a demonstration of the corpus that had been constructed using the tool.  I’m probably going to be working with the team at some point later on, sending them some data from the underlying XML files of the Anglo-Norman Dictionary.

I also replaced the Seeing Speech videos with a new set that Eleanor Lawson had generated, mirrored to match the videos we’re producing for the Speech Star project, and investigated how I will get to Zurich for a thesaurus-related workshop in January.

I spent the rest of the week working on the Books and Borrowing project, working on the ‘books’ tab in the library page.  I’d started on the API endpoint for this last week, which returned all books for a library and then processed them.  This was required as books have two title fields (standardised and original title), either one of which may be blank, so to order the books by title the records first need to be returned to see which ‘title’ field to use.  Also, ordering by number of borrowings and by author requires all books to be returned and processed.  This works fine for smaller libraries (e.g. Chambers has 961 books) but returning all books for a large library like St Andrews that has more than 8,500 books was taking a long time, and resulting in a JSON file that was over 6MB in size.

I created an initial version of the ‘books’ page using this full dataset, with tabs across the top for each initial letter of the title (browsing by author and number of borrowings is still to do) and a count of the number of books in each tab also displayed.  Book records are then displayed in a similar manner to how they appear in the ‘page’ view, but with some additional data, namely total counts of the number of borrowings for the book holding record and counts of borrowings of individual items (if applicable).  These will eventually be linked to the search.

The page looked pretty good and worked pretty well, but was very inefficient as the full JSON file needed to be generated and passed to the browser every time a new letter was selected.  Instead I updated the underlying database to add two new fields to the book holding table.  The first stores the initial letter of the title (standardised if present, original if not) and the second stores a count of the total number of borrowings for the holding record.  I wrote a couple of scripts to add this data in, and these will need to be run periodically to refresh these cached fields as they do not otherwise get updated when changes are made in the CMS.  Having these fields in place means the scripts will be able to pinpoint and return subsets of the books in the library at the database query level rather than returning all data and then subsequently processing it.  This makes things much more efficient as less data is being processed at any one time.

I still need to add in facilities to browse the books by initial letter of the author’s surname and also facilities to list books by the number of borrowings, but for now you can at least browse books alphabetically by title.  Unfortunately for large libraries there is still a lot of data to process even when only dealing with specific initial letters.  For example, there are 1063 books beginning with ‘T’ in St Andrews so the returned data still takes quite a few seconds to load in.

That’s all for this week.  I’ll be on holiday next week so there won’t be a further report until the week after that.


Week Beginning 3rd October 2022

The Speak For Yersel project launched this week and is now available to use here: https://speakforyersel.ac.uk/.  It’s been a pretty intense project to work on and has required much more work than I’d expected, but I’m very happy with the end result.  We didn’t get as much media attention as we were hoping for, but social media worked out very well for the project and in the space of a week we’d had more than 5,000 registered users completing thousands of survey questions.  I spent some time this week tweaking things after the launch.  For example, I hadn’t added the metadata tags required by Twitter and Facebook / WhatsApp to nicely format links to the website (for example the information detailed here https://developers.facebook.com/docs/sharing/webmasters/) and it took a bit of time to add these in with the correct content.
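For anyone who needs to do the same, these are the Open Graph and Twitter Card meta tags that go in the page’s head; the values below are illustrative rather than the exact ones I used:

<meta property="og:title" content="Speak For Yersel" />
<meta property="og:description" content="A crowdsourced linguistic survey of Scotland" />
<meta property="og:image" content="https://speakforyersel.ac.uk/path/to/share-image.png" />
<meta property="og:url" content="https://speakforyersel.ac.uk/" />
<meta name="twitter:card" content="summary_large_image" />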

I also gave some advice to Anja Kuschmann at Strathclyde about applying for a domain for the new VARICS project I’m involved with and investigated a replacement batch of videos that Eleanor had created for the Seeing Speech website.  I’ll need to wait until she gets back to me with files that match the filenames used on the existing site before I can take this further, though.  I also fixed an issue with the Berwickshire place-names website, which had lost its additional CSS, and investigated a problem with the domain for the Uist Saints website that has still unfortunately not been resolved.

Other than these tasks I spent the rest of the week continuing to develop the front-end for the Books and Borrowing project.  I completed an initial version of the ‘page’ view, including all three views (image, text and image and text).  I added in a ‘jump to page’ feature, allowing you (as you might expect) to jump directly to any page in the register when viewing a page.  I also completed the ‘text’ view of the page, which now features all of the publicly accessible data relating to the records – borrowing records, borrowers, book holding and item records and any associated book editions and book works, plus associated authors.  There’s an awful lot of data and it took quite a lot of time to think about how best to lay it all out (especially taking into consideration screens of different sizes), but I’m pretty happy with how this first version looks.

Currently the first thing you see for a record is the transcribed text, which is big and green.  Then all fields relating to the borrowing appear under this.  The record number as it appears on the page plus the record’s unique ID are displayed in the top right for reference (and citation).  Then follows a section about the borrower, with the borrower’s name in green (I’ve used this green to make all of the most important bits of text stand out from the rest of the record but the colour may be changed in future).  Then follows the information about the book holding and any specific volumes that were borrowed.  If there is an associated site-wide book edition record (or records) these appear in a dark grey box, together with any associated book work record (although there aren’t many of these associations yet).  If there is a link to a library record this appears as a button on the right of the record.  Similarly, if there’s an ESTC and / or other authority link for the edition these appear to the right of the edition section.

Authors now cascade down through the data as we initially planned.  If there's an author associated with a work it is automatically associated with and displayed alongside the edition and holding.  If there's an author associated with an edition but not a work it is then associated with the holding.  If a book at a specific level has an author specified then this replaces any cascading author from this point downwards in the sequence.  One thing that isn't in place yet is the links from information to search results, as I haven't developed the search yet, but eventually things like borrower name, author and book title will be links allowing you to search directly for the items.
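
The cascade itself is simple enough to express in code.  Here's a minimal sketch of the logic described above; the type and field names are illustrative rather than the actual data model:

```typescript
// Sketch of the author cascade: an author set at a lower level (holding)
// overrides one inherited from the edition, which in turn overrides one
// inherited from the work. All type and field names are illustrative.
interface Author { name: string }
interface Work { authors: Author[] }
interface Edition { authors: Author[]; work?: Work }
interface Holding { authors: Author[]; editions: Edition[] }

function authorsForEdition(edition: Edition): Author[] {
  if (edition.authors.length > 0) return edition.authors;
  return edition.work?.authors ?? [];
}

function authorsForHolding(holding: Holding): Author[] {
  if (holding.authors.length > 0) return holding.authors;
  // Otherwise inherit from the associated edition(s), which may themselves
  // have inherited their authors from a work.
  return holding.editions.flatMap(authorsForEdition);
}
```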

One other thing I've added in is the option to highlight a record.  Press anywhere in a record and it is highlighted in yellow.  Press again to reset it.  This can be quite useful if there are certain records you're interested in as you scroll through a page containing lots of records.  You can highlight as many records as you want.  It's possible that we may add other functionality to this, e.g. the option to download the data for selected records.  Here's a screenshot of the text view of the page:
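
The highlighting is nothing more than a class toggle on click.  Here's a minimal sketch, assuming each record is rendered with a 'record' class and a matching CSS rule (both illustrative):

```typescript
// Toggle a yellow highlight on a record when it is clicked; clicking again
// removes it. Assumes each record is rendered with the class "record" and a
// CSS rule such as `.record.highlighted { background-color: #fff3b0; }`.
document.querySelectorAll<HTMLElement>('.record').forEach((record) => {
  record.addEventListener('click', () => {
    record.classList.toggle('highlighted');
  });
});
```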

I also completed the 'image and text' view.  This works best on a large screen (i.e. not a mobile phone, although it is just about possible to use it on one, as I did test this out).  The image takes up about 60% of the screen width and the text takes up the remaining 40%.  The height of the records section is fixed to the height of the image area and is scrollable, so you can scroll down the records whilst still viewing the image (rather than the whole page scrolling and the image disappearing off the screen).  I think this view works really well: the records are still perfectly usable in the more confined area and it's great to be able to compare the image and the text side by side.  Here's a screenshot of the same page when viewing both text and image:

I tested the new interface out with registers from all of our available libraries and everything is looking good to me.  Some registers don't have images yet, so I added in a check for this to ensure that the image views and page thumbnails don't appear for such registers.  After that I moved onto developing the interface to browse book holdings when viewing a library.  I created an API endpoint for returning all of the data associated with holding records for a specified library.  This includes all of the book holding data, information about each of the book items associated with the holding record (including the number of borrowing records for each), the total number of borrowing records for the holding, any associated book edition and book work records (and there may be multiple editions associated with each holding) plus any authors associated with the book.  Authors cascade down through the record as they do when viewing borrowing records in the page.  This is a gigantic amount of information, especially as libraries may have many thousands of book holding records.  The API call loads pretty rapidly for smaller libraries (e.g. Chambers Library with 961 book holding records) but for larger ones (e.g. St Andrews with over 8,500 book holding records) the API call takes too long to return the data (in the latter case it takes about a minute and returns a JSON file that's over 6Mb in size).  The problem is the data needs to be returned in full in order to do things like order it by largest number of borrowings.  Clearly dynamically generating the data each time is going to be too slow, so instead I am going to investigate caching the data.  For example, that 6Mb JSON file can just sit there as an actual file rather than being generated each time.  I will write a script to regenerate the cached files and I can run this whenever data gets updated (or maybe once a week whilst the project is still active).  I'll continue to work on this next week.
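
The caching approach I have in mind is very simple: fetch each library's holdings data once, write the JSON out as a static file and point the front-end at that file.  Here's a rough sketch; the endpoint URL and output path are placeholders rather than the project's real ones:

```typescript
// Regenerate cached JSON files of holdings data for each library so the
// front-end can fetch a static file instead of waiting on a slow API call.
// The endpoint URL and output path are illustrative. Requires Node 18+ for
// the global fetch.
import { writeFile, mkdir } from 'node:fs/promises';

async function cacheHoldings(libraryIds: number[]): Promise<void> {
  await mkdir('cache/holdings', { recursive: true });
  for (const id of libraryIds) {
    const response = await fetch(`https://example.org/api/library/${id}/holdings`);
    const data = await response.json();
    await writeFile(`cache/holdings/${id}.json`, JSON.stringify(data));
    console.log(`Cached holdings for library ${id}`);
  }
}

cacheHoldings([1, 2, 3]).catch(console.error);
```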

Week Beginning 26th September 2022

I spent most of my time this week getting back into the development of the front-end for the Books and Borrowing project.  It’s been a long time since I was able to work on this due to commitments to other projects and also due to there being a lot more for me to do than I was expecting regarding processing images and generating associated data in the project’s content management system over the summer.  However, I have been able to get back into the development of the front-end this week and managed to make some pretty good progress.  The first thing I did was to make some changes to the ‘libraries’ page based on feedback I received ages ago from the project’s Co-I Matt Sangster.  The map of libraries used clustering to group libraries that are close together when the map is zoomed out, but Matt didn’t like this.  I therefore removed the clusters and turned the library locations back into regular individual markers.  However, it is now rather difficult to distinguish the markers for a number of libraries.  For example, the markers for Glasgow and the Hunterian libraries (back when the University was still on the High Street) are on top of each other and you have to zoom in a very long way before you can even tell there are two markers there.

I also updated the tabular view of libraries.  Previously the library name was a button that when clicked on opened the library’s page.  Now the name is text and there are two buttons underneath.  The first one opens the library page while the second pans and zooms the map to the selected library, whilst also scrolling the page to the top of the map.  This uses Leaflet’s ‘flyTo’ function which works pretty well, although the map tiles don’t quite load in fast enough for the automatic ‘zoom out, pan and zoom in’ to proceed as smoothly as it ought to.
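
For anyone interested, the 'show on map' button is essentially just a scroll followed by a call to Leaflet's flyTo.  A minimal sketch, with an illustrative element ID and zoom level:

```typescript
// 'Show on map' behaviour: scroll to the map, then pan/zoom smoothly to the
// selected library using Leaflet's flyTo. The element ID is illustrative.
import * as L from 'leaflet';

function showLibraryOnMap(map: L.Map, lat: number, lng: number): void {
  const mapElement = document.getElementById('library-map');
  mapElement?.scrollIntoView({ behavior: 'smooth' });
  map.flyTo([lat, lng], 15); // zoom level 15 is an arbitrary close-up
}
```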

After that I moved onto the library page, which previously just displayed the map and the library name. I updated the tabs for the various sections to display the number of registers, books and borrowers that are associated with the library.  The Introduction page also now features the information recorded about the library that has been entered into the CMS.  This includes location information, dates, links to the library etc.  Beneath the summary info there is the map, and beneath this is a bar chart showing the number of borrowings per year at the library.  Beneath the bar chart you can find the longer textual fields about the library such as descriptions and sources.  Here’s a screenshot of the page for St Andrews:

I also worked on the ‘Registers’ tab, which now displays a tabular list of the selected library’s registers, and I also ensured that when you select one of the tabs other than ‘Introduction’ the page automatically scrolls down to the top of the tabs to avoid the need to manually scroll past the header image (but we still may make this narrower eventually).  The tabular list of registers can be ordered by any of the columns and includes data on the number of pages, borrowers, books and borrowing records featured in each.

When you open a register the information about it is displayed (e.g. descriptions, dates, stats about the number of books etc referenced in the register) and large thumbnails of each page together with page numbers and the number of records on each page are displayed.  The thumbnails are rather large and I could make them smaller, but doing so would mean that all the pages end up looking the same – beige rectangles.  The thumbnails are generated on the fly by the IIIF server and the first time a register is loaded it can take a while for the thumbnails to load in.  However, generated thumbnails are then cached on the server so subsequent page loads are a lot quicker.  Here’s a screenshot of a register page for St Andrews:
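
The thumbnails are standard IIIF Image API requests that ask the server for a version of each page image constrained to a maximum size.  Something along these lines produces the URLs; the server base URL and identifier are placeholders:

```typescript
// Build a IIIF Image API thumbnail URL for a page image. The server base URL
// and image identifier are placeholders; `!w,h` asks the IIIF server for an
// image scaled to fit within the given bounds while preserving aspect ratio.
function iiifThumbnail(baseUrl: string, identifier: string, maxSize = 300): string {
  return `${baseUrl}/${encodeURIComponent(identifier)}/full/!${maxSize},${maxSize}/0/default.jpg`;
}

// e.g. iiifThumbnail('https://iiif.example.org/iiif/2', 'register-1-page-1')
```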

One thing I also did was write a script to add in a new 'pageorder' field to the 'page' database table.  I then wrote a script that generated the page order for every page in every register in the system.  This picks out the page that has no preceding page and iterates through the pages based on the 'next page' ID.  Previously pages in lists were ordered by their auto-incrementing ID, but this meant that if new pages needed to be inserted for a register they ended up stuck at the end of the list, even though the 'next' and 'previous' links worked successfully.  This new 'pageorder' field ensures lists of pages are displayed in the proper order.  I've updated the CMS to ensure this new field is used when viewing a register, although I haven't yet updated the CMS to regenerate the 'pageorder' for a register if new pages are added out of sequence.  For now if this happens I'll need to manually run my script again to update things.
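
The ordering script treats a register's pages as a linked list.  Here's a sketch of the core logic, with illustrative field names:

```typescript
// Generate a sequential 'pageorder' for a register by treating its pages as a
// linked list: start from the page that has no preceding page (no other page
// points to it via 'next'), then follow the 'next page' IDs.
interface Page { id: number; nextPageId: number | null; pageOrder?: number }

function assignPageOrder(pages: Page[]): Page[] {
  const byId = new Map(pages.map((p) => [p.id, p]));
  const hasPredecessor = new Set(pages.map((p) => p.nextPageId).filter((id) => id !== null));
  let current = pages.find((p) => !hasPredecessor.has(p.id)) ?? null;
  let order = 1;
  while (current) {
    current.pageOrder = order++;
    current = current.nextPageId !== null ? byId.get(current.nextPageId) ?? null : null;
  }
  return pages;
}
```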

Anyway, back to the front-end: the new 'pageorder' is used in the list of pages mentioned above so the thumbnails are displayed in the correct order.  I may add pagination to this page, as all of the thumbnails are currently on one page and it can take a while to load, although these days people seem to prefer having long pages rather than having data split over multiple pages.

The final section I worked on was the page for viewing an actual page of the register, and this is still very much in progress.  You can open a register page by pressing on its thumbnail and currently you can navigate through the register using the ‘next’ and ‘previous’ buttons or return to the list of pages.  I still need to add in a ‘jump to page’ feature here too.  As discussed in the requirements document, there will be three views of the page: Text, Image and Text and Image side-by-side.  Currently I have implemented the image view only.  Pressing on the ‘Image view’ tab opens a zoomable / pannable interface through which the image of the register page can be viewed.  You can also make this interface full screen by pressing on the button in the top right.  Also, if you’re viewing the image and you use the ‘next’ and ‘previous’ navigation links you will stay on the ‘image’ tab when other pages load.  Here’s a screenshot of the ‘image view’ of the page:

Also this week I wrote a three-page requirements document for the redevelopment of the front-ends for the various place-names projects I’ve created using the system originally developed for the Berwickshire place-names project which launched back in 2018.  The requirements document proposes some major changes to the front-end, moving to an interface that operates almost entirely within the map and enabling users to search and browse all data from within the map view rather than having to navigate to other pages.  I sent the document off to Thomas Clancy, for whom I’m currently developing the systems for two place-names projects (Ayr and Iona) and I’ll just need to wait to hear back from him before I take things further.

I also responded to a query from Marc Alexander about the number of categories in the Thesaurus of Old English, investigated a couple of server issues that were affecting the Glasgow Medical Humanities site, removed all existing place-name elements from the Iona place-names CMS so that the team can start afresh and responded to a query from Eleanor Lawson about the filenames of video files on the Seeing Speech site.  I also made some further tweaks to the Speak For Yersel resource ahead of its launch next week.  This included adding survey numbers to the survey page, updating the navigation links and writing a script that purges a user and all related data from the system.  I ran this to remove all of my test data from the system.  If we do need to delete a user in future (either because their data is clearly spam or a malicious attempt to skew the results, or because a user has asked us to remove their data) I can run this script again.  I also ran through every single activity on the site to check everything was working correctly.  The only thing I noticed was that I hadn't updated the script to remove the flags for completed surveys when a user logs out, meaning that after logging out and creating a new user the ticks for completed surveys were still displaying.  I fixed this.
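
The purge script itself boils down to a handful of deletes wrapped in a transaction so that a spam or test account leaves nothing behind.  A rough sketch, again in TypeScript with mysql2 and with entirely illustrative table names rather than the real schema:

```typescript
// Sketch of the user-purge script: remove a user and all of their survey
// answers inside a transaction. Table and column names are illustrative,
// not the project's actual schema.
import mysql from 'mysql2/promise';

async function purgeUser(userId: number): Promise<void> {
  const conn = await mysql.createConnection({
    host: 'localhost', user: 'sfy', password: '***', database: 'sfy',
  });
  try {
    await conn.beginTransaction();
    await conn.execute('DELETE FROM answers WHERE user_id = ?', [userId]);
    await conn.execute('DELETE FROM survey_completions WHERE user_id = ?', [userId]);
    await conn.execute('DELETE FROM users WHERE id = ?', [userId]);
    await conn.commit();
  } catch (err) {
    await conn.rollback();
    throw err;
  } finally {
    await conn.end();
  }
}
```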

I also fixed a few issues with the Burns mini-site about Kozeluch, including updating the table sort options, which had stopped working correctly when I added a new column to the table last week, and fixing some typos in the introductory text.  I also had a chat with the editor of the Anglo-Norman Dictionary about future developments and responded to a query from Ann Ferguson about the DSL bibliographies.  Next week I will continue with the B&B developments.

Week Beginning 19th September 2022

It was a four-day week this week due to the Queen's funeral on Monday.  I divided my time for the remaining four days over several projects.  For Speak For Yersel I finally tackled the issue of the way maps are loaded.  The system had been developed for a map to be loaded afresh every time data is requested, with any existing map destroyed in the process.  This worked fine when the maps didn't contain demographic filters, as generally each map only needed to be loaded once and then never changed until an entirely new map was needed (e.g. for the next survey question).  However, I was then asked to incorporate demographic filters (age groups, gender, education level), with new data requested based on the option the user selected.  This all went through the same map loading function, which still destroyed and reinitiated the entire map on each request.  This worked, but wasn't ideal, as it meant the map reset to its default view and zoom level whenever you changed an option, map tiles were reloaded from the server unnecessarily and if the user was in 'full screen' mode they were booted out of this as the full screen map no longer existed.  For some time I've been meaning to redevelop this to address these issues, but I've held off as there were always other things to tackle and I was worried about essentially ripping apart the code and having to rebuild fundamental aspects of it.  This week I finally plucked up the courage to delve into the code.

I created a test version of the site so as to not risk messing up the live version and managed to develop an updated method of loading the maps.  This method initialises the map only once when a page is first loaded rather than destroying and regenerating the map every time a new question is loaded or demographic data is changed.  This means the number of map tile loads is greatly reduced as the base map doesn't change until the user zooms or pans.  It also means the location and zoom level a user has left the map on stay the same when the data is changed.  For example, if they're interested in Glasgow and are zoomed in on it they can quickly flick between different demographic settings and the map will stay zoomed in on Glasgow rather than resetting each time.  Also, if you're viewing the map in full-screen mode you can now change the demographic settings without the resource exiting out of full screen mode.
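
In outline, the new approach keeps a single map instance and only swaps out the data layer when a new question or filter is requested.  Here's a minimal Leaflet sketch of the pattern; the element ID, tile layer, data URL and styling are all illustrative:

```typescript
// Initialise the Leaflet map once and swap only the data layer when a new
// question or demographic filter is chosen, so zoom, position and full-screen
// state are preserved. The data URL and styling are illustrative.
import * as L from 'leaflet';

const map = L.map('survey-map').setView([57.5, -4.0], 6);
L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png', {
  attribution: '&copy; OpenStreetMap contributors',
}).addTo(map);

let dataLayer: L.GeoJSON | null = null;

async function loadQuestionData(url: string): Promise<void> {
  const geojson = await (await fetch(url)).json();
  if (dataLayer) {
    map.removeLayer(dataLayer); // drop the old answers, keep the base map
  }
  dataLayer = L.geoJSON(geojson).addTo(map);
}
```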

All worked very well, with the only issue being that the transitions between survey questions and quiz questions weren't as smooth as with the older method.  Previously the map scrolled up and was then destroyed, then a new map was created and the data was loaded into the area before it smoothly scrolled down again.  For various technical reasons this no longer worked quite as well.  The map area still scrolls up and down, but the new data only populates the map as the map area scrolls down, meaning for a brief second you can still see the data and legend for the previous question before it switches to the new data.  However, I spent some further time investigating this issue and managed to fix it, with different fixes required for the survey and the quiz.  I also noticed a bug whereby the map would increase in size to fit the available space but the map layers and data were not extending properly into the newly expanded area.  This is a known issue with Leaflet maps that have their size changed dynamically and there's actually a Leaflet function that sorts it – I just needed to call map.invalidateSize(); and the map worked properly again.  Of course it took a bit of time to figure this simple fix out.

I also made some further updates to the site.  Based on feedback about the difficulty some people were having keeping track of which surveys they'd done, I updated the site to log when the user completes a survey.  Now when the user goes to the survey index page a count of the number of surveys they've completed is displayed in the top right and a green tick has been added to the button of each survey they have completed.  Also, when they reach the 'what next' page for a survey a count of their completed surveys is also shown.  This should make it much easier for people to track what they've done.  I also made a few small tweaks to the data at the request of Jennifer, and created a new version of the animated GIF that has speech bubbles, as the bubble for Shetland needed its text changed.  As I didn't have the files available I took the opportunity to regenerate the GIF, using a larger map, as the older version looked quite fuzzy on a high definition screen like an iPad.  I kept the region outlines on as well to tie it in better with our interactive maps.  Also the font used in the new version is now the 'Baloo' font we use for the site.  I stored all of the individual frames both as images and as PowerPoint slides so I can change them if required.  For future reference, I created the animated GIF using https://ezgif.com/maker with a delay of 150 between slides, crossfade on and a fader delay of 8.

Also this week I researched an issue with the Scots Thesaurus that was causing the site to fail to load.  The WordPress options table had become corrupted and unreadable and needed to be replaced with a version from the backups, which thankfully fixed things.  I also did my expenses from the DHC in Sheffield, which took longer than I thought it would, and made some further tweaks to the Kozeluch mini-site on the Burns C21 website.  This included regenerating the data from a spreadsheet via a script I'd written and tweaking the introductory text.  I also responded to a request from Fraser Dallachy to regenerate some data that a script I'd previously written had output.  I also began writing a requirements document for the redevelopment of the place-names project front-ends to make them more 'map first'.

I also did a bit more work for Speech Star, making some changes to the database of non-disordered speech and moving the ‘child speech error database’ to a new location.  I also met with Luca to have a chat about the BOSLIT project, its data, the interface and future plans.  We had a great chat and I then spent a lot of Friday thinking about the project and formulating some feedback that I sent in a lengthy email to Luca, Lorna Hughes and Kirsteen McCue on Friday afternoon.

Week Beginning 12th September 2022

I spent a bit of time this week going through my notes from the Digital Humanities Congress last week and writing last week’s lengthy post.  I also had my PDR session on Friday and I needed to spend some time preparing for this, writing all of the necessary text and then attending the session.  It was all very positive and it was a good opportunity to talk to my line manager about my role.  I’ve been in this job for ten years this month and have been writing these blog posts every working week for those ten years, which I think is quite an achievement.

In terms of actual work on projects, it was rather a bitty week, with my time spread across lots of different projects.  On Monday I had a Zoom call for the VariCS project, a phonetics project in collaboration with Strathclyde that I’m involved with.  The project is just starting up and this was the first time the team had all met.  We mainly discussed setting up a web presence for the project and I gave some advice on how we could set up the website, the URL and such things.  In the coming weeks I’ll probably get something set up for the project.

I then moved onto another Burns-related mini-project that I worked on with Kirsteen McCue many months ago – a digital edition of Koželuch’s settings of Robert Burns’s Songs for George Thomson.  We’re almost ready to launch this now and this week I created a page for an introductory essay, migrated a Word document to WordPress to fill the page, including adding in links and tweaking the layout to ensure things like quotes displayed properly.  There are still some further tweaks that I’ll need to implement next week, but we’re almost there.

I also spent some time tweaking the Speak For Yersel website, which is now publicly accessible (https://speakforyersel.ac.uk/) but still not quite finished.  I created a page for a video tour of the resource and made a few tweaks to the layout, such as checking the consistency of font sizes used throughout the site.  I also made some updates to the site text and added in some lengthy static content to the site in the form of a teachers' FAQ and a 'more information' page.  I also changed the order of some of the buttons shown after a survey is completed to hopefully make it clearer that other surveys are available.

I also did a bit of work for the Speech Star project.  There had been some issues with the Central Scottish Phonetic Features MP4s playing audio only on some operating systems and the replacements that Eleanor had generated worked for her but not for me.  I therefore tried uploading them to and re-downloading them from YouTube, which thankfully seemed to fix the issue for everyone.  I then made some tweaks to the interfaces of the two project websites.  For the public site I made some updates to ensure the interface looked better on narrow screens, changing the appearance of the 'menu' button and making the logo and site header font smaller so they take up less space.  I also added an introductory video to the homepage.

For the Books and Borrowing project I processed the images for another library register.  This didn’t go entirely smoothly.  I had been sent 73 images and these were all upside down so needed rotating.  It then transpired that I should have been sent 273 images so needed to chase up the missing ones.  Once I’d been sent the full set I was then able to generate the page images for the register, upload the images and associate them with the records.

I then moved on to setting up the front-end for the Ayr Place-names website.  In the process of doing so I became aware that one of the NLS map layers that all of our place-name projects use had stopped working.  It turned out that the NLS had migrated this map layer to a third party map tile service (https://www.maptiler.com/nls/) and the old URLs these sites were still using no longer worked.  I had a very helpful chat with Chris Fleet at NLS Maps about this and he explained the situation.  I was able to set up a free account with the maptiler service and update the URLs in four place-names websites that referenced the layer (https://berwickshire-placenames.glasgow.ac.uk/, https://kcb-placenames.glasgow.ac.uk/, https://ayr-placenames.glasgow.ac.uk and https://comparative-kingship.glasgow.ac.uk/scotland/).  I'll need to ensure this is also done for the two further place-names projects that are still in development (https://mull-ulva-placenames.glasgow.ac.uk and https://iona-placenames.glasgow.ac.uk/).
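
The fix in each site is just a small change to the Leaflet tile layer definition so that it points at the MapTiler-hosted tiles.  A sketch of the sort of thing involved, with a placeholder tileset path and key rather than the real values:

```typescript
// Swap the old NLS tile URL for the new MapTiler-hosted one in each site's
// Leaflet config. The tileset path and key below are placeholders; the real
// values come from the MapTiler account set up for the NLS layers.
import * as L from 'leaflet';

const nlsHistoric = L.tileLayer(
  'https://api.maptiler.com/tiles/EXAMPLE-NLS-TILESET/{z}/{x}/{y}.jpg?key=YOUR_KEY',
  { attribution: 'National Library of Scotland', maxZoom: 18 }
);

function addHistoricLayer(map: L.Map): void {
  nlsHistoric.addTo(map);
}
```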

I managed to complete the work on the front-end for the Ayr project, which was mostly straightforward as it was just adapting what I'd previously developed for other projects.  The thing that took the longest was getting the parish data and the locations where the parish three-letter acronyms should appear, but I was able to get this working thanks to the notes I'd made the last time I needed to deal with parish boundaries (as documented here: https://digital-humanities.glasgow.ac.uk/2021-07-05/).  After discussions with Thomas Clancy about the front-end I decided that it would be a good idea to redevelop the map-based interface to display all of the data on the map by default and to incorporate all of the search and browse options within the map itself.  This would be a big change, and it's one I had been thinking of implementing anyway for the Iona project, but I'll try and find some time to work on this for all of the place-name sites over the coming months.

Finally, I had a chat with Kirsteen McCue and Luca Guariento about the BOSLIT project.  This project is taking the existing data for the Bibliography of Scottish Literature in Translation (available on the NLS website here: https://data.nls.uk/data/metadata-collections/boslit/) and creating a new resource from it, including visualisations.  I offered to help out with this and will be meeting with Luca to discuss things further, probably next week.