Week Beginning 21st November 2022

I participated in the UCU strike action on Thursday and Friday this week, so it was a three-day week for me.  I spent much of this time researching RDF, SKOS and OWL semantic web technologies in an attempt to understand them better in the hope that they might be of some use for future thesaurus projects.  I’ll be giving a paper about this at a workshop in January so I’m not going to say too much about my investigations here.  There is a lot to learn about, however, and I can see me spending quite a lot more time on this in the coming weeks.

Other than this I returned to working on the Anglo-Norman Dictionary.  I added in a line of text that had somehow been omitted from one of the Textbase XML files and added in a facility to enable project staff to delete an entry from the dictionary.  In reality this just deactivates the entry, removing it from the front-end but still keeping a record of it in the database in case the entry needs to be reinstated.  I also spoke to the editor about some proposed changes to the dictionary management system and begin to think about how these new features will function and how they will be developed.

For the Books and Borrowing project I had a chat with IT Services at Stirling about setting up an Apache Solr system for the project.  It’s looking like we will be able to proceed with this option, which will be great.  I also had a chat with Jennifer Smith about the new Speak For Yersel project areas.  It looks like I’ll be creating the new resources around February next year.  I also fixed an issue with the Place-names of Iona data export tool and discussed a new label that will be applied to data for the ‘About’ box for entries in the Dictionaries of the Scots Language.

I also prepared for next week’s interview panel and engaged in a discussion with IT Services about the future of the servers that are hosted in the College of Arts.

Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow.  Facetted searching would still require several other database queries to be executed, as would extracting all of the fields that would be necessary to display the search results and it seemed inadvisable to try and create all of this functionality myself when an existing package like Solr could already do it all.

Solr is considerably faster than using the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously which would be hopelessly slow if I attempted to create something comparable directly with the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

“London”,2211,

“Edinburgh”,119,

“Dublin”,100,

“Paris”,30,

“Edinburgh; London”,16,

“Cambridge”,4,

“Eton”,3,

“Oxford”,3,

“The Hague”,3,

“Naples”,2,

“Rome”,2,

“Berlin”,1,

“Glasgow”,1,

“Lausanne”,1,

“Venice”,1,

“York”,1

 

Format:

“8vo”,842,

“octavo”,577,

“4to”,448,

“quarto”,433,

“4to.”,88,

“8vo., plates, port., maps.”,88,

“folio”,76,

“duodecimo”,67,

“Folio”,33,

“12mo”,19,

“8vo.”,17,

“8vo., plates: maps.”,16

 

Borrower gender:

“Male”,3128,

“Unknown”,109,

“Female”,64,

“Unclear”,2

These would then allow me to build in the options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database it is likely that such an approach would be much slower and would take me time to develop.  It seems much more sensible to use an existing solution if this is going to be possible.

In my experiments with Solr on my laptop I Initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes and includes a lot of data that’s not really needed either, as the returned data contains everything that’s needed to display the full borrowing record.  I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could be used in a simplified search results page.  Here’s an example:

{

“bnid”: 1379,

“lid”: 6,

“slug”: “glasgow-university”,

“lname”: “Glasgow University Library”,

“rid”: 2,

“rname”: “3”,

“syear”: 1760,

“eyear”: 1765,

“rtype”: “Student”,

“pid”: 107,

“fnum”: “4r”,

“transcription”: “Euseb: Eclesiastical History”,

“bday”: 17,

“bmonth”: 9,

“byear”: 1760,

“rday”: 1,

“rmonth”: 10,

“ryear”: 1760,

“borrowed”: “1760-09-17”,

“returned”: “1760-10-01”,

“bdayofweek”: “Wednesday”,

“rdayofweek”: “Wednesday”,

“originaltitle”: “”,

“standardisedtitle”: “Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius.”,

“brids”: [“1”],

“bfnames”: [“Charles”],

“bsnames”: [“Wilson”],

“bfullnames”: [“Charles Wilson”],

“boccs”: [“University Student”, “Education”],

“bgenders”: [“Male”],

“aids”: [“74”],

“asnames”: [“Eusebius of Caesarea”],

“afullnames”: [” Eusebius of Caesarea”],

“beids”: [“88”],

“edtitles”: [“Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius.”],

“estcs”: [“R21513”],

“langs”: [“English”],

“pubplaces”: [“London”],

“formats”: [“folio”]

}

I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server.  I then created a Solr Core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations, some fields are text strings, some are integers, some are dates). I then ran the Solr script that ingests the data.

It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected.  These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.

Some issues encountered include:

  1. Empty fields in the data resulting in no data for the corresponding JSON field (e.g. “bday”: <nothing here> ) which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included.
  2. Solr’s date structure requiring a full date (e.g. 1792-02-16) and partial dates (e.g. 1792) therefore failing. I ended up reverting to an integer field for returned dates as these are generally much more vague and having to generate placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed and it took some time to investigate why my nice integer ID was failing.
  4. Realising more fields should be added to the JSON output as I went on and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors )
  5. Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quote;’ as their appearance in the JSON caused the structure to be invalid.  I therefore updated the translation, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names.

However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well.  And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9 https://solr.apache.org/downloads.html) to be installed on a server.  This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and as far as I know it fires up a Java based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions about doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API) so we could limit access to the IP address of this server,  or if Solr is going to be installed on the same server then limiting access to localhost could work.

In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud).  Once Solr is running we’d need a Core to be created.  I have the schema file the core would require and can give instructions about setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.

One downside to using Solr is it is a separate system to the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently as exporting the data takes at least an hour, then transferring the files to the server for ingest will take a long time (uploading hundreds of thousands of small files to a server can take hours.  Zipping them up then uploading the zip file and extracting the file also takes a long time).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.

Other than the above B&B work I did spent a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week.  I also began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when an advanced search ‘quotes only’ search is performed.  So for example in a search for ‘driech’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name=”searchtext_onlyquotes”>I think you will say yourself it is a dreich business.

Sic dreich wark. . . . For lang I tholed an’ fendit.

Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.

And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.

It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.

Driche and sair yer pain.

And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.

See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.

</field>

The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize variable, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”.  We don’t currently override this default so 100 characters is what we use per snippet (roughly – it can extend more than this to ensure complete words are displayed).

The hl.snippets variable specifies the maximum number of highlighted snippets that are returned per entry and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed and this is because the maximum number of snippets has been reached.  ‘Dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.

As the quotations block of text is just one massive block and isn’t split into individual quotations the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich Adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

Which actually comprises the text from almost the entire first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”

 

Includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but ‘dreigher’ which is not highlighted).

So essentially while the snippets may look like they correspond to individual quotes this is absolutely not the case and the highlighted word is generally positioned around the middle of around 100 characters of text that can include several quotations.  It also means that it is not possible to limit a search to two terms that appear within one single quotation at the moment because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on such as part of speech.  This will ensure snippets will in future only feature text from the quotation in question and will ensure that Boolean searches will be limited to text within individual queries.  However, it is a major change and it will require some time and experimentation to get working correctly and it may introduce other unforeseen issues.

I will need to change the way the search data is stored in Solr and I will need to change how the data is generated for ingest into Solr.  The display of the search results will need to be reworked as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry.  I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.

The DSL team has proposed that a date search could be provided as a filter on the search results page and we would certainly be able to do this, and incorporate other filters such as POS in future.  This is something called ‘facetted searching’ and it’s the kind of thing you see in online shops:  you view the search results then you see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results the filter applies to.  The good news is that Solr has these kind of faceting options built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.

Week Beginning 7th November 2022

I participated in an event about Digital Humanities in the College of Arts that Luca had organised on Monday, at which I discussed the Books and Borrowing project.  It was a good event and I hope there will be more like it in future.  Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/) which is an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project.  I will be very interested to hear how Curious Travellers gets on with the tool in future.  Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before that allows you to upload a text and then access many analysis and visualisation tools.  It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.

Also this week I had to prepare for and participate a candidate shortlisting session for a new systems developer post in the College of Arts and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host.  We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily.  As of yet we’ve still not heard anything further from IT Services so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of and whether these issues relate to the code or the underlying server infrastructure.  Hopefully we’ll hear more soon.

The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project.  One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that.  Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.

I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box.  This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page.  The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover.  For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records.  Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.

The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page.  The only difference is additional information about the library, register and page the borrowing record appears on can be found at the top of the record.  These appear as links and if you press on the page link this will open the page centred on the selected borrowing record.  For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:

The non-date search also works, but is currently a bit too slow.  For example a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long.  Currently non-date quick searches do a very simple find and replace to highlight the matched text in all relevant fields.  This currently makes the matched text upper case, but I don’t intend to leave it like this.  You can also search for things like the ESTC too.

However, there are several things I’m not especially happy about:

  1. The speed issue: the current approach is just too slow
  2. Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed, by borrower name)  many more fields in addition to borrowing ID would need to be returned, potentially for thousands of records and this is going to be too slow with the current data structure
  3. The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s a bit hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this.  Should we have a more compact view of results?  If so what data should be displayed?  The difficulty is if we omit a field that is the only field that includes the user’s search term it’s potentially going to be very confusing
  4. This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores:  You see the search results and then there are checkboxes that allow you to narrow down the results.  For example, we could have checkboxes containing all occupations in the results allowing the user to select one or more.  Or we have checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
  5. Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing occupations or gender of the borrowings in the search results etc.  I feel that we need some sort of summary information as the results themselves are just too detailed to easily get an overall picture of.

I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon) it does a lot of the things I’d like to implement (graphs, facetted search results) and it does it all very speedily with a pleasing interface and I think we could learn a lot from this.

Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/) which is a free search platform that is much faster than a traditional relational database and provides options for facetted searching.  We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it.  Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data.  But if we are potentially going to use Solr then we would need to install it on a server at Stirling.  Stirling’s IT people might be ok with this (they did allow us to set up a IIIF server for our images, after all) but we’d need to check.  I should have a better idea as to whether Solr is what we need by the end of next week, all being well.

Also this week I spent some time working on the Speech Star project.  I updated the database to highlight key segments in the ‘target’ field which had been highlighted in the original spreadsheet version of the data by surrounding the segment with bar characters.  I’d suggested this as when exporting data from Excel to a CSV file all Excel formatting such as bold text is lost, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field.  This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier but I got there in the end.  After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:

I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.

Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.

Week Beginning 31st October 2022

I spent a lot of the week continuing to work on the Books and Borrowing front end.  To begin with I worked on the ‘borrowers’ tab in the ‘library’ page and created an initial version of it.  Here’s an example of how it looks:

As with books, the page lists borrowers alphabetically, in this case by borrower surname.  Letter tabs and counts of the number of borrowers with surnames beginning with the letter appear at the top and you can select a letter to view all borrowers with surnames beginning with the letter.  I had to create a couple of new fields in the borrower table to speed the querying up, saving the initial letter of each borrower’s surname and a count of their borrowings.

The display of borrowers is similar to the display of books, with each borrower given a box that you can press on to highlight.  Borrower ID appears in the top right and each borrower’s full name appears as a green title.  The name is listed as it would be read, but this could be updated if required.  I’m not sure where the ‘other title’ field would go if we did this, though – presumably something like ‘Macdonald, Mr Archibald of Sanda’.

The full information about a borrower is listed in the box, including additional fields and normalised occupations.  Cross references to other borrowers also appear.  As with the ‘Books’ tab, much of this data will be linked to search results once I’ve created the search options (e.g. press on an occupation to view all borrowers with this occupation, press on the number of borrowings to view the borrowings) but this is not in place yet.  You can also change the view from ‘surname’ to ‘top 100 borrowers’, which lists the top 100 most prolific borrowers (or less if there are less than 100 borrowers in the library).  As with the book tab, a number appears at the top left of each record to show the borrower’s place on the ‘hitlist’ and the number of borrowings is highlighted in red to make it easier to spot.

I also fixed some issues with the book and author caches that were being caused by spaces at the start of fields and author surnames beginning with a non-capitalised letter (e.g. ‘von’) which was messing things up as the cache generation script was previously only matching upper case, meaning ‘v’ wasn’t getting added to ‘V’.  I’ve regenerated the cache to fix this.

I then decided to move onto the search rather than the ‘Facts & figures’ tab as I reckoned this should be prioritised.  I began work on the quick search initially, and I’m still very much in the middle of this.  The quick search has to search an awful lot of and to do this several different queries need to be run.  I’ll need to see how this works in terms of performance as I fear the ‘quick’ search risks being better named the ‘slow’ search.

We’ve stated that users will be able to search for dates in the quick search and these need to be handled differently.  For now the API checks to see whether the passed search string is a date by running a pattern match on the string.  This converts all numbers in the string into an ‘X’ character and then checks to see whether the resulting string matches a valid date form.  For the API I’m using a bar character (|) to designate a ranged date and a dash to designate a division between day, month and year.  I can’t use a slash (/) as the search string is passed in the URL and slashes have meaning in URLs.  For info, here are the valid date string patterns:

“XXXX”,”XXXX-XX”,”XXXX-XX-XX”,”XXXX|XXXX”,”XXXX|XXXX-XX”,”XXXX|XXXX-XX-XX”,”XXXX-XX|XXXX”,”XXXX-XX|XXXX-XX”,”XXXX-XX|XXXX-XX-XX”,”XXXX-XX-XX|XXXX”,”XXXX-XX-XX|XXXX-XX”,”XXXX-XX-XX|XXXX-XX-XX”

So for example, if someone searches for ‘1752’ or ‘1752-03’ or ‘1752-02|1755-07-22’ the system will recognise these as a date search and process them accordingly.  I should point out that I can and probably will get people to enter dates in a more typical way in the front-end, using slashes between day, month and year and a dash between ranged dates (e.g. ‘1752/02-1755/07/22’) but I’ll convert these before passing the string to the API in the URL.

I have the query running to search the dates, and this in itself was a bit complicated to generate as including a month or a month and a day in a ranged query changes the way the query needs to work.  E.g. if the user searches for ‘1752-1755’ then we need to return all borrowing records with a borrowed year of ‘1752’ or later and ‘1755’ or earlier.  However, if the query is ‘1752/06-1755-03’ then the query can’t just be ‘all borrowed records with a borrowed year of ‘1752’ or later and a borrowed month of ‘06’ or later and a borrowed year of ‘1755’ or earlier and a borrowed month of ‘03’ or earlier as this would return no results.  This is because the query is looking to return borrowings with a borrowed month of ‘06’ or later and also ‘03’ or earlier.  Instead the query needs to find borrowing records that have a borrowed year of 1752 AND a borrowed month of ‘06’ or later OR have a borrowed year later than 1752 AND have a borrowed year of 1755 AND a borrowed month of ‘03’ or earlier OR have a borrowed year earlier than 1755.

I also have the queries running that search for all necessary fields that aren’t dates.  This currently requires five separate queries to be run to check fields like author names, borrower occupations, book edition fields such as ESTC etc.  The queries currently return a list of borrowing IDs, and this is as far as I’ve got.  I’m wondering now whether I should create a cached table for the non-date data queried by the quick search, consisting of a field for the borrowing ID and a field for the term that needs to be searched, with each borrowing having many rows depending on the number of terms they have (e.g. a row for each occupation of every borrower associated with the borrowing, a row for each author surname, a row for each forename, a row for ESTC).  This should make things much speedier to search, but will take some time to generate.  I’ll continue to investigate this next week.

Also this week I updated the structure of the Speech Star database to enable each prompt to have multiple sounds etc.  I had to update the non-disordered page and the child error page to work with the new structure, but it seems to be working.  I also had to update the ‘By word’ view as previously sound, articulation and position were listed underneath the word and above the table.  As these fields may now be different for each record in the table I’ve removed the list and have instead added the data as columns to the table.  This does however mean that the table contains a lot of identical data for many of the rows now.

I then added in tooptip / help text containing information about what the error types mean in the child speech error database.  On the ‘By Error Type’ page the descriptions currently appear as small text to the right of the error type title.  On the ‘By Word’ page  the error type column has an ‘i’ icon after the error type.  Hovering over or pressing on this displays a tooltip with the error description, as you can see in the following screenshot:

I also updated the layout of the video popups to split the metadata across two columns and also changed the order of the errors on the ‘By error type’ page so that the /r/ errors appear in the correct alphabetical order for ‘r’ rather than appearing first due to the text beginning with a slash.  With this all in place I then replicated the changes on the version of the site that is going to be available via the Seeing Speech URL.

Kirsteen McCue contacted me last week to ask for advice on a British Academy proposal she’s putting together and after asking some questions about the project I wrote a bit of text about the data and its management for her.  I also sorted out my flights and accommodation for the workshop I’m attending in Zurich in January and did a little bit of preparation for a session on Digital Humanities that Luca Guariento has organised for next Monday.  I’ll be discussing a couple of projects at this event.  I also exported all of the Speak For Yersel survey data and sent this to a researcher who is going to do some work with the data and fixed an issue that had cropped up with the Place-names of Kirkcudbright website.  I also spent a bit of time on DSL duties this week, helping with some user account access issues and discussing how links will be added from entries to the lengthy essay on the Scots Language that we host on the site.

 

Week Beginning 24th October 2022

I returned to work this week after having a lovely week’s holiday in the Lake District.  I spent most of the week working for the Books and Borrowing project.  I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system.  This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS.   Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’.  Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS.  For example in Volume 1 we have ‘1010199l’ followed by ‘1010203l´followed by ‘1010205l’ and then ‘1010205r’.  This seems to be quite correct as the missing pages don’t contain borrowing records.  However, I still needed to figure out how to match up images and page records.  As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image.  There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image.  This then left joining up the page records in the CMS as the best option and I wrote a script to join the page records, rename them to remove the ‘L’ and ‘R’ affixes, moving all borrowing records across and renumbering their page order and then deleting the now empty pages.  Thankfully it all seemed to work well.  I also uploaded the images for the final register from the Royal High School, which thankfully was a much more straightforward process as all image files matched references already stored in the CMS for each page.

I then returned to the development of the front-end for the project.  When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load.  After a bit of investigation I realised that this was caused by new line characters appearing in the JSON data for the map, which was invalidating the file structure.  These had been added in via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map.  I needed to update the script that generated the JSON data to ensure that new line characters were stripped out of the data, and after that the maps loaded again.

Before I went on holiday I’d created a browse page for library books that split the library’s books up based on the initial letter of their titles.  The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue due to it containing many more books than the other libraries (more than 8,500).  Project Co-I Matt Sangster suggested that we should omit some registers from the data as their contents (including book records) are not likely to be worked on during the course of the project.  However, I decided to just leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code.  The book data for a library is associated directly with the library record and not the specific registers and all the queries would need to be rewritten to check which registers a book appears in.  I reckon that if these registers are not going to be tackled by the project it might be better to just delete them, not just to make the system work better but to avoid confusing users with messy data, but I decided to leave everything as it is for now.

This week I added in two further ways of browsing books in a selected library:  By author and by most borrowed.  A drop-down list featuring the three browse options appears at the top of the ‘Books’ page now, and I’ve added in a title and explanatory paragraph about the list type. The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a list of initial letter tabs featuring the initial letter of the author’s surname and a count of the number of books that have an author with a surname beginning with this letter.  Note that any books that don’t have an associated author do not appear in this list.  I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load in.  Note also that if a book has multiple authors then the book will appear multiple times – once for each author.  Here’s a screenshot of how the interface currently looks:

The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books.  The ordering of the records is by author surname then forename then author ID then book title.  This means two authors with the same name will still appear as separate headings with their books ordered alphabetically.  However, this has also uncovered some issues with duplicate author records.

Getting this browse list working actually took a huge amount of effort due to the complex way we store authors.  In our system an author can be associated with any one of four levels of book record (work / edition / holding / item) and an author associated at a higher level needs to cascade down to lower level book records.  Running queries directly on this structure proved to be too resource intensive and slow so instead I wrote a script to generate cached data about authors.  This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record.  It then stores a reference to the ID of the author, the holding record and the initial letter of the author’s surname in a new table that is much more efficient to reference.  This then gets used to generate the letter tabs with the number of book counts and to work out which books to return when an author surname beginning with a letter is selected.

However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes / additions to authors made in the CMS will not be directly reflected in the library books tab.  This is also true of the ‘browse books by title’ lists I previously created too.  I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised it’s because the title must have been changed in the CMS since I last generated the cached data.

The ’most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed.  Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view.  I did consider whether to have tabs allowing you to view all of the books by number of borrowings, but I wasn’t really sure how useful this would be.  In terms of the display of the ‘top 100’ the books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see.  I’ve also added in a number to the top-left of the book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:

I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot).  Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.

Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023.  As it’s the first ‘in person’ DH conference to be held in Europe since 2019 I suspect there will be a huge number of paper submissions, so we’ll just need to see if it gets accepted or not.  Also for Speak For Yersel I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas.  The biggest issue here would be generating the data about the areas:  settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files.  It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region.  This might be a couple of weeks of effort for each area.  It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface.  We’ll see how this develops.  I also wrote a script to export the survey data for further analysis.

Another project I spent some time on this week was Speech Star.  For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me.  I imported all of the data into the same database as is used for the non-disordered speech database and have added a flag that decides which content is displayed in which page.  I removed ‘accent’ as a filter option (as all speakers are from the same area) and have added in ‘error type’.  Currently the ‘age’ filter defaults to the age group 0-17 as I wasn’t sure how this filter should work, as all speakers are children.

The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab.  In the new page these tabs are for ‘Error type’ and ‘word’.  I also added in ‘phonemic target’ and ‘phonetic production’ as new columns in the table as I thought it would be useful to include these, and I updated the video pop-up for both the new page and the non-disordered page to bring it into line with the popup for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below.  I’ve ensured this is exactly the same for the ‘multiple video’ display too.  At the moment the metadata all just appears on one long line (other than speaker ID, sex and age) so the full width of the popup is used, but we might change this to a two-column layout.

Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet and I therefore replaced the data.  However, I spotted something relating to the way I structure the data that might be an issue.  I’d noticed a typo in the earlier spreadsheet (there is a ‘helicopter’ and a ‘helecopter’) and I fixed it, but I forgot to fix it before uploading the newer file.   Each prompt is only stored once in the database, even if it is used by multiple speakers so I was going to go into the database and remove the ‘helecopter’ prompt row that didn’t need to be generated and point the speaker to the existing ‘helicopter’ prompt.  However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’.  I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database and here the sound is indeed ‘l’.  It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with.  I’m going to have to update the structure next week.

Also this week I responded to a request for advice from David Featherstone in Geography who is putting together some sort of digitisation project.  I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary.  She wondered whether I had the original XML and I explained that there was no original XML.  The original data was stored in an ancient Foxpro database that ran from a CD.  When I created the original School Dictionary app I managed to find a way to extract the data and I saved it as two CSV files – one English-Scots the other Scots-English.  I then ran a script to convert this into JSON which is what the original app uses.  I gave Pauline a link to download all of the data for the app, including both English and Scots JSON files and the sound files and I also uploaded the English CSV file in case this would be more useful.

That’s all for this week.  Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.

Week Beginning 26th September 2022

I spent most of my time this week getting back into the development of the front-end for the Books and Borrowing project.  It’s been a long time since I was able to work on this due to commitments to other projects and also due to there being a lot more for me to do than I was expecting regarding processing images and generating associated data in the project’s content management system over the summer.  However, I have been able to get back into the development of the front-end this week and managed to make some pretty good progress.  The first thing I did was to make some changes to the ‘libraries’ page based on feedback I received ages ago from the project’s Co-I Matt Sangster.  The map of libraries used clustering to group libraries that are close together when the map is zoomed out, but Matt didn’t like this.  I therefore removed the clusters and turned the library locations back into regular individual markers.  However, it is now rather difficult to distinguish the markers for a number of libraries.  For example, the markers for Glasgow and the Hunterian libraries (back when the University was still on the High Street) are on top of each other and you have to zoom in a very long way before you can even tell there are two markers there.

I also updated the tabular view of libraries.  Previously the library name was a button that when clicked on opened the library’s page.  Now the name is text and there are two buttons underneath.  The first one opens the library page while the second pans and zooms the map to the selected library, whilst also scrolling the page to the top of the map.  This uses Leaflet’s ‘flyTo’ function which works pretty well, although the map tiles don’t quite load in fast enough for the automatic ‘zoom out, pan and zoom in’ to proceed as smoothly as it ought to.

After that I moved onto the library page, which previously just displayed the map and the library name. I updated the tabs for the various sections to display the number of registers, books and borrowers that are associated with the library.  The Introduction page also now features the information recorded about the library that has been entered into the CMS.  This includes location information, dates, links to the library etc.  Beneath the summary info there is the map, and beneath this is a bar chart showing the number of borrowings per year at the library.  Beneath the bar chart you can find the longer textual fields about the library such as descriptions and sources.  Here’s a screenshot of the page for St Andrews:

I also worked on the ‘Registers’ tab, which now displays a tabular list of the selected library’s registers, and I also ensured that when you select one of the tabs other than ‘Introduction’ the page automatically scrolls down to the top of the tabs to avoid the need to manually scroll past the header image (but we still may make this narrower eventually).  The tabular list of registers can be ordered by any of the columns and includes data on the number of pages, borrowers, books and borrowing records featured in each.

When you open a register the information about it is displayed (e.g. descriptions, dates, stats about the number of books etc referenced in the register) and large thumbnails of each page together with page numbers and the number of records on each page are displayed.  The thumbnails are rather large and I could make them smaller, but doing so would mean that all the pages end up looking the same – beige rectangles.  The thumbnails are generated on the fly by the IIIF server and the first time a register is loaded it can take a while for the thumbnails to load in.  However, generated thumbnails are then cached on the server so subsequent page loads are a lot quicker.  Here’s a screenshot of a register page for St Andrews:

One thing I also did was write a script to add in a new ‘pageorder’ field to the ‘page’ database table.  I then wrote a script that generated the page order for every page in every register in the system.  This picks out the page that has no preceding page and iterates through pages based on the ‘next page’ ID.  Previously pages in lists were ordered by their auto-incrementing ID, but this meant that if new pages needed to be inserted for a register they ended up stuck at the end of the list, even though the ‘next’ and ‘previous’ links worked successfully.  This new ‘pageorder’ field ensures lists of pages are displayed in the proper order.  I’ve updated the CMS to ensure this new field is used when viewing a register, although I haven’t as of yet updated the CMS to regenerate the ‘pageorder’ for a register if new pages are added out of sequence.  For now if this happens I’ll need to manually run my script again to update things.

Anyway, back to the front-end:  The new ‘pageorder’ is used in the list of pages mentioned above so the thumbnails get displaying in the correct order.  I may add pagination to this page, as all of the thumbnails are currently on one page and it can take a while to load, although these days people seem to prefer having long pages rather than having data split over multiple pages.

The final section I worked on was the page for viewing an actual page of the register, and this is still very much in progress.  You can open a register page by pressing on its thumbnail and currently you can navigate through the register using the ‘next’ and ‘previous’ buttons or return to the list of pages.  I still need to add in a ‘jump to page’ feature here too.  As discussed in the requirements document, there will be three views of the page: Text, Image and Text and Image side-by-side.  Currently I have implemented the image view only.  Pressing on the ‘Image view’ tab opens a zoomable / pannable interface through which the image of the register page can be viewed.  You can also make this interface full screen by pressing on the button in the top right.  Also, if you’re viewing the image and you use the ‘next’ and ‘previous’ navigation links you will stay on the ‘image’ tab when other pages load.  Here’s a screenshot of the ‘image view’ of the page:

Also this week I wrote a three-page requirements document for the redevelopment of the front-ends for the various place-names projects I’ve created using the system originally developed for the Berwickshire place-names project which launched back in 2018.  The requirements document proposes some major changes to the front-end, moving to an interface that operates almost entirely within the map and enabling users to search and browse all data from within the map view rather than having to navigate to other pages.  I sent the document off to Thomas Clancy, for whom I’m currently developing the systems for two place-names projects (Ayr and Iona) and I’ll just need to wait to hear back from him before I take things further.

I also responded to a query from Marc Alexander about the number of categories in the Thesaurus of Old English, investigated a couple of server issues that were affecting the Glasgow Medical Humanities site, removed all existing place-name elements from the Iona place-names CMS so that the team can start afresh and responded to a query from Eleanor Lawson about the filenames of video files on the Seeing Speech site.  I also made some further tweaks to the Speak For Yersel resource ahead of its launch next week.  This included adding survey numbers to the survey page and updating the navigation links and writing a script that purges a user and all related data from the system.  I ran this to remove all of my test data from the system.  If we do need to delete a user in future (either because their data is clearly spam or a malicious attempt to skew the results, or because a user has asked us to remove their data) I can run this script again.  I also ran through every single activity on the site to check everything was working correctly.  The only thing I noticed is that I hadn’t updated the script to remove the flags for completed surveys when a user logs out, meaning after logging out and creating a new user the ticks for completed surveys were still displaying.  I fixed this.

I also fixed a few issues with the Burns mini-site about Kozeluch, including updating the table sort options which had stopped working correctly when I added a new column to the table last week and fixing some typos with the introductory text.  I also had a chat with the editor of the Anglo-Norman Dictionary about future developments and responded to a query from Ann Ferguson about the DSL bibliographies.  Next week I will continue with the B&B developments.

Week Beginning 22nd August 2022

I continued to spend a lot of my time working on the Speak For Yersel project this week.  We had a team meeting on Monday at which we discussed the outstanding tasks and particularly how I was going to tackle converting the quiz questions into dynamic answers.  Previously the quiz question answers were static, which will not work well as the maps the users will reference in order to answer a question are dynamic, meaning the correct answer may evolve over time.  I had proposed a couple of methods that we could use to ensure that the answers were dynamically generated based on the currently available data and we finalised our approach today.

Although I’d already made quite a bit of progress with my previous test scripts, there was still a lot to do to actually update the site.  I needed to update the structure of the database, the script that outputs the data for use in the site, the scripts that handle the display of questions and the evaluation of answers, and the scripts that store a user’s selected answers.

Changes to the database allow for dynamic quiz questions to be stored (non-dynamic ones have fixed ‘answer options’ but dynamic ones don’t).  Changes also allow for references to the relevant answer option of the survey question the quiz question is about to be stored (e.g. that the quiz is about the ‘mother’ map and specifically about the use of ‘mam’).  I made significant updates to the script that outputs data for use in the site to integrate the functions from my earlier test script that calculated the correct answer.  I updated these functions to change the logic somewhat.  They now only use ‘method 1’ as mentioned in an earlier post.  This method also now has a built-in check to filter out regions that have the highest percentage of usage but only a limited amount of data.  Currently this is set to a minimum of 10 answers for the option in question (e.g. ‘mam’) rather than total number of answers in a region.  Regions are ordered by their percentage usage (highest first) and the script iterates down through the regions and will pick as ‘correct’ the first one that has at least 10 answers.  I’ve also added in a contingency in cases where none of the regions have at least 10 answers (currently the case for the ‘rocket’ question).  In such cases the region marked as ‘correct’ will be the one that has the highest raw count of answers for the answer option rather than the highest percentage.

With the ‘correct’ region picked out the script then picks out all other regions where the usage percentage is at least 10% lower than the correct percentage.  This is to ensure that there isn’t an ‘incorrect’ answer that is too similar to the ‘correct’ one.  If this results in less than three regions (as regions are only returned if they have clicks for the answer option) then the system goes through the remaining regions and adds these in with a zero percentage.  These ‘incorrect’ regions are then shuffled and three are picked out at random.  The ‘correct’ answer is then added to these three and the options are shuffled again to ensure the ‘correct’ option is randomly positioned.  The dynamically generated output is then plugged into the output script that the website uses.

I then updated the front-end to work with this new data.  This also required me to create a new database table to hold the user’s answers, storing the region the user presses on and whether their selection was correct, along with the question ID and the person ID.  Non-dynamic answers store the ID of the ‘answer option’ that the user selected, but these dynamic questions don’t have static ‘answer options’ so the structure needed to be different.

I then implemented the dynamic answers for the ‘most of Scotland’ questions.  For these questions the script needs to evaluate whether a form is used throughout Scotland or not.  The algorithm gets all of the answer options for the survey question (e.g. ‘crying’ and ‘greetin’) and for each region works out the percentage of responses for each option.  The team had previously suggested a fixed percentage threshold of 60%, but I reckoned it might be better for the threshold to change depending on how many answer options there are.  Currently I’ve set the threshold to be 100 divided by the number of options.  So where there are two options the threshold is 50%.  Where there are four options (e.g. the ‘wean’ question) the threshold is 25% (i.e. if 25% or more of the answers in a region are for ‘wean’ it is classed as present in the region).  Where there are three options (e.g. ‘clap’) the threshold is 33%.  Where there are 5 options (e.g. ‘clarty’) the threshold is 20%.

The algorithm counts the number of regions that meet the threshold, and if the number is 8 or more then the term is considered to be found throughout Scotland and ‘Yes’ is the correct answer.  If not then ‘No’ is the correct answer.  I also had to update the way answers are stored in the database so these yes/no answers can be saved (as they have no associated region like the other questions).

I then moved onto tackling the non-standard (in terms of structure) questions to ensure they are dynamically generated as well.  These were rather tricky to do as they each had to be handled differently as they were asking different things of the data (e.g. a question like ‘What are you likely to call the evening meal if you live in Tayside and Angus (Dundee) and didn’t go to Uni?’).  I also made the ‘Sounds about right’ quiz dynamic.

I then moved onto tackling the ‘I would never say that’ quiz, which has been somewhat tricky to get working as the structure of the survey questions and answers is very different.  Quizzes for the other surveys involved looking at a specific answer option but for this survey the answer options are different rating levels that each need to be processed and handled differently.

For this quiz for each region the system returns the number of times each rating level has been selected and works out the percentages for each.  It then adds the ‘I’ve never heard this’ and ‘people elsewhere say this’ percentages together as a ‘no’ percentage and adds the ‘people around me say this’ and ‘I’d say this myself’ percentages together as a ‘yes’ percentage.  Currently there is no weighting but we may want to consider this (e.g. ‘I’d say this’ would be worth more than ‘people around me’).

With these ratings stored the script handled question types differently.  For the ‘select a region’ type of question the system works in a similar way to the other quizzes:  It sorts the regions by ‘yes’ percentage with the biggest first.  It then iterates through the regions and picks as the correct answer the first it comes to where the total number of responses for the region is the same or greater than the minimum allowed (currently set to 10).  Note that this is different to the other quizzes where this check for 10 is made against the specific answer option rather than the number of responses in the region as a whole.

If no region passes the above check then the region with the highest ‘yes’ percentage without a minimum allowed check is chosen as the correct answer.  The system then picks out all other regions with data where the ‘yes’ percentage is at least 10% lower than the correct answer, adds in regions with no data if less than three have data, shuffles the regions and picks out three.  These are then added to the ‘correct’ region and the answers are shuffled again.

I changed the questions that had an ‘all over Scotland’ answer option so that these are now ‘yes/no’ questions, e.g. ‘Is ‘Are you wanting to come with me?’ heard throughout most of Scotland?’.  For these questions the system uses 8 regions as the threshold, as with the other quizzes.  However, the percentage threshold for ‘yes’ is fixed.  I’ve currently set this to 60% (i.e. at least 60% of all answers in a region are either ‘people around me say this’ or ‘I’d say this myself’).  There is currently no minimum number of responses limit for this question type, so a region with 1 single answer that’s ‘people around me say this’ will have a 100% ‘yes’ and the region will included.  This is also the case for the ‘most of Scotland’ questions in the other quizzes, as we may need to tweak this.

As we’re using percentages rather than exact number of dots the questions can sometimes be a bit tricky.  For example the first question currently has Glasgow as the correct answer because all but two of the markers in this region are ‘people around me say this’ or ‘I’d say this myself’.  But if you turn off the other two categories and just look at the number of dots you might surmise that the North East is the correct answer as there are more dots there, even though proportionally fewer of them are the high ratings.  I don’t know if we can make it clearer that we’re asking which region has proportionally more higher ratings without confusing people further, though.

I also spent some time this week working on the Book and Borrowing project.  I had to make a few tweaks to the Chambers map of borrowers to make the map work better on smaller screens.  I ensured that both the ‘Map options’ section on the left and the ‘map legend’ on the right are given a fixed height that is shorter than the map and the areas become scrollable, as I’d noticed that on short screens both these areas could end up longer than the map and therefore their lower parts were inaccessible.  I’ve also added a ‘show/hide’ button to the map legend, enabling people to hide the area if it obscures their view of the map.

I also sent on some renamed library register files from St Andrews to Gerry for him to align with existing pages in the CMS, replaced some of the page images for the Dumfries register and renamed and uploaded images for a further St Andrews register that already existed in the CMS, ensuring the images became associated with the existing pages.

I started to work on the images for another St Andrews register that already exists in the system, but for this one the images are a double page spread so I need to merge two pages into one in the CMS.  The script needs to find all odd numbered pages then move the records on these to the preceding even numbered page, and at the same time regenerate the ‘page order’ for each record so they follow on from the existing records.  Then the even page needs its folio number updated to add in the odd number (e.g. so folio number 2 becomes ‘2-3’.  Then I need to delete the odd page record and after all that is done I need to regenerate the ‘next’ and ‘previous’ page links for all pages.  I completed everything except the final task, but I really need to test the script out on a version of the database running on my local PC first, as if anything goes wrong data could very easily be lost. I’ll need to tackle this next week as I ran out of time this week.

I also participated in our six-monthly formal review meeting for the Dictionaries of the Scots Language where we discussed our achievements in the past six months and our plans for the next.  I also made some tweaks to the DSL website, such as splitting up the ‘Abbreviations and symbols’ buttons into two separate links, updating the text found on a couple of the old maps pages and considering future changes to the bibliography XSLT to allow links in the ‘oral sources’

Finally this week I made a start on the Burns manuscript database for Craig Lamont.  I wrote a script that extracts the data from Craig’s spreadsheet and imports it into an online database.  We will be able to rerun this whenever I’m given a new version of the spreadsheet.  I then created an initial version of a front-end for the database within the layout for the Burns Correspondence and Poetry site. Currently the front-end only displays the data in one table with columns for type, date, content, physical properties, additional notes and locations.  The latter contains the location name, shelfmark (if applicable) and condition (if applicable) for all locations associated with a record, each on a separate line with the location name in bold.  Currently it’s possible to order the columns by clicking on them.  Clicking a second time reverses the order.  I haven’t had a chance to create any search or filter options yet but I’m intending to continue with this next week.

Week Beginning 15th August 2022

I spent the majority of the week continuing to work on the Speak For Yersel resource, working through a lengthy document of outstanding tasks that need to be completed before the site is launched in September.  First up was the output for the ‘Where do you think is speaker is from’ click activity.  The page features some explanatory text and a drop-down through which you can select a speaker.  When this is clicked on the user is presented with the option to play the audio file and can view the transcript.

I decided to make the transcript chunks visible with a green background that’s slightly different from the colour of the main area.  I thought it would be useful for people to be able to tell which of the ‘bigger’ words was part of which section, as it may well be that the word that caused a user to ‘click’ a section is not the word that we’ve picked for the section.  For example, in the Glasgow transcript ‘water’ is the chosen word for one section but I personally I clicked this section because ‘hands’ was pronounced ‘honds’.  Another reason to make the chunks visible is because I’ve managed to set up the transcript to highlight the appropriate section as the audio plays.  Currently the section that is playing is highlighted in white and this really helps to get your eye in whilst listening to the audio.

In terms of resizing the ‘bigger’ words, I chose the following as a starting point:  Less than 5% of the clicks: word is bold but not bigger (default font size for the transcript area is currently 16pt); 5-9%: 20pt;  10-14%: 25pt; 15-19%: 30pt; 20-29%: 35pt; 30-49%: 40pt; 50-74%: 45pt; 75% or more: 50pt.

I’ve also given the ‘bigger’ word a tooltip that displays the percentage of clicks responsible for its size as I thought this might be useful for people to see.  We will need to change the text, though.  Currently it says something like ‘15% of respondents clicked in this section’ but it’s actually ‘15% of all clicks for this transcript were made in this section’ which is a different thing, but I’m not sure how best to phrase it.  Where there is a pop-up for a word it appears in the blue font and the pop-up text contains the text that the team has specified.  Where the pop-up word is also the ‘bigger’ word (most but not always the case) then the percentage text also appears in the popup, below the text.  Here’s a screenshot of how the feature currently looks:

I then moved onto the ‘I would never say that’ activities.  This is a two-part activity, with the first part involving the user dragging and dropping sentences into either a ‘used in Scots’ or ‘isn’t used in Scots’ column and then checking their answers.  The second part has the user translating a Scots sentence into Standard English by dragging and dropping possible words into a sentence area.  My first task was to format the data used for the activity, which involved creating a suitable data structure in JSON and then migrating all of the data into this structure from a Word document.  With this in place I then began to create the front-end.  I’d created similar drag and drop features before (including for another section of the current resource) and therefore used the same technologies:  The jQuery UI drag and drop library (https://jqueryui.com/draggable/).  This allowed me to set up two areas where buttons could be dropped and then create a list of buttons that could be dragged.  I then had to work on the logic for evaluating the user’s answers.  This involved keeping a tally of the number of buttons that had been dropped into one or other of the boxes (which also had to take into consideration that the user can drop a button back in the original list) and when every button has been placed in a column a ‘check answers’ button appears.  On pressing, the code then fixes the draggable buttons in place and compares the user’s answers with the correct answers, adding a ‘tick’ or ‘cross’ to each button and giving an overall score in the middle.  There are multiple stages to this activity so I also had to work on the logic for loading a new set of sentences with their own introductory text, or moving onto part two of the activity if required.  Below is a screenshot of part 1 with some of the buttons dragged and dropped:

Part two of the activity involved creating a sentence by choosing words to add.  The original plan was to have the user click on a word to add it to the sentence, or click on the word in the sentence to remove it if required.  I figured that using a drag and drop method and enabling the user to move words around the sentence after they have dropped them if required would be more flexible and would fit in better with the other activities in the site.  I was just going to use the same drag and drop library that I’d used for part one, but then I spotted a further jQuery interaction called sortable that allowed for connected lists (https://jqueryui.com/sortable/#connect-lists).  This allowed items within a list to be sortable, but also for items to be dragged and dropped from one list to another.  This sounded like the ideal solution, so I set about investigating its usage.

It took some time to style the activity to ensure that empty lists were still given space on the screen, and to ensure the word button layout worked properly, but after that the ‘sentence builder’ feature worked very well – the user could move words between the sentence area and the ‘list of possible words’ area and rearrange their order as required.  I set up the code to ensure a ‘check answers’ button appeared when at least one word had been added to the sentence (disappearing again if the user removes all words).  When the ‘check answers’ button is pressed the code grabs the content of the buttons in the sentence area in the order the buttons have been added and creates a sentence from the text.  It then compares this to one of the correct sentences (of which there may be more than one).  If the answer is correct a ‘tick’ is added after the sentence and if it’s wrong a ‘cross’ is added.  If there are multiple correct answers the other correct possibilities are displayed, and if the answer was wrong all correct answers are displayed.  Then it’s on to the next sentence, or the final evaluation.  Here’s a screenshot of part 2 with some words dragged and dropped:

Whilst working on part two it became clear that the ‘sortable’ solution I’d developed for part 2 worked better than the other draggable method I’d used for part one.  This is because the ‘sortable’ solution uses HTML lists and ‘snaps’ each item into place whereas the previous method just leaves the draggable item wherever the user drops it (so long as it’s in the confines of the droppable box).  This means things can look a bit messy.  I therefore revisited part 1 and replaced the method.  This took a bit of time to implement as I had to rework a lot of the logic, but I think it was worth it.

Also this week I spent a bit of time working for the Dictionaries of the Scots Language.  I had a conversation with Pauline Graham about the workflow for updates to the online data.  I also investigated a couple of issues with entries for Ann Fergusson.  One entry (sleesh) wasn’t loading as there were spaces in the entry’s ‘slug’.  Spaces in URLs can cause issues and this is what was happening with this entry.  I updated the URL information in the database so that ‘sleesh_n1 and v’ has been changed to ‘sleesh_n1_and_v’ and this has fixed the issue.  I also updated the XML in the online system so the first URL is now <url>sleesh_n1_and_v</url>.  I checked the online database and thankfully no other entries have a space in their ‘slug’ so this issue doesn’t affect anything else.  The second issue related to an entry that doesn’t appear in the online database.  It was not in the data I was sent and wasn’t present in several previous versions of the data, so in this case something must have happened prior to the data getting sent to me.  I also had a conversation about the appearance of yogh characters in the site.

I also did a bit more work for the Books and Borrowing project this week.  I added two further library registers from the NLS to our system.  This means there should now only be one further register to come from the NLS, which is quite a relief as each register takes some time to process.  I also finally got round to processing the four registers for St Andrews, which had been on my ‘to do’ list since late July.  It was very tricky to rename the images into a format that we can use on the server because the lack of trailing zeros meant a script to batch process the images loaded them in the wrong order.  The was made worse because rather than just being numbered sequentially the image filenames were further split into ‘parts’.  For example, the images beginning ‘UYLY 207 11 Receipt book part 11’ were being processed before images beginning ‘UYLY 207 11 Receipt book part 2’ as programming languages when ordering strings consider 11, 12 etc to come before 2.  This was also then happening within each ‘part’, e.g. ‘UYLY207 15 part 43_11.jpg’ was coming before ‘UYLY207 15 part 43_2.jpg’.  It took most of the morning to sort this out, but I was then able to upload the images to the server and I create new registers, generate pages and associate images for the two new registers (207-11 and 207-15).

However, the other two registers already exist in the CMS as page records with associated borrowing records.  Each image of the register is an open spread showing two borrowing pages and we had previously decided that I should run a script to merge pages in the CMS and then associate the merged record with one of the page images.  However, I’m afraid this is going to need some manual intervention.  Looking at the images for 206-1 and comparing them to the existing page records for this register, it’s clear that there are many blank pages in the two-page spreads that have not been replicated in the CMS.  For example, page 164 in the CMS is for ‘Profr Spens’.  The corresponding image (in my renamed images) is ‘UYLY206-1_00000084.jpg’.  The data is on the right-hand page and the left-hand page is blank.  But in the CMS the preceding page is for ‘Prof. Brown’, which is on the left-hand page of the preceding image.  If I attempted to automatically merge these two page records into one this would therefore result in an error.

I’m afraid what I need is for someone who is familiar with the data to look through the images and the pages and create a spreadsheet noting which pages correspond to which image.  Where multiple pages correspond to one page I can then merge the records.  So for example: Pages 159 (id 1087) and 160 (ID 1088) are found on image UYLY206-1_00000082.jpg.  Page 161 (1089) corresponds to UYLY206-1_00000083.jpg.  The next page in the CMS is 164 (1090) and this corresponds to UYLY206-1_00000084.jpg. So a spreadsheet could have two columns:

Page ID                 Image

1087                       UYLY206-1_00000082.jpg

1088                       UYLY206-1_00000082.jpg

1089                       UYLY206-1_00000083.jpg

1900                       UYLY206-1_00000084.jpg

Also, the page numbers in the CMS don’t tally with the handwritten page numbers in the images (e.g. the page record 1089 mentioned above has page 161 but the image has page number 162 written on it).  And actually, the page numbers would need to include two pages, e.g. 162-163.  Ideally whoever is going to manually create the spreadsheet could add new page numbers as a further column and I could then fix these when I process the spreadsheet too.  This task is still very much in progress.

Also for the project this week I created a ‘full screen’ version of the Chambers map that will be pulled into an iframe on the Edinburgh University Library website when they create an online exhibition based on our resource.

Finally this week I helped out Sofia from the Iona Place-names project who as luck would have it was also wanting help with embedding a map in an iframe.  As I’d already done some investigation about this very issue for the Chambers map I was able to easily set this up for Sofia.

 

Week Beginning 8th August 2022

I should have been back at work on Monday this week, after having a lovely holiday last week.  Unfortunately I began feeling unwell over the weekend and ended up off sick on Monday and Tuesday.  I had a fever and a sore throat and needed to sleep most of the time, but it wasn’t Covid as I tested negative.  Thankfully I began feeling more normal again on Tuesday and by Wednesday I was well enough to work again.

I spent the majority of the rest of the week working on the Speak For Yersel project.  On Wednesday I moved the ‘She sounds really clever’ activities to the ‘maps’ page, as we’d decided that these ‘activities’ really were just looking at the survey outputs and so fitting in better on the ‘maps’ page.  I also updated some of the text on the ‘about’ and ‘home’ pages and updated the maps to change the gender labels, expanding ‘F’ and ‘M’ and replacing ‘NB’ with ‘other’ as this is a more broad option that better aligns with the choices offered during sign-up.  I also an option to show and hide the map filters that defaults to ‘hide’ but remembers the users selection when other map options are chosen.  I added titles to the maps on the ‘Maps’ page and made some other tweaks to the terminology used in the maps.

On Wednesday we had a meeting to discuss the outstanding tasks still left for me to tackle.  This was a very useful meeting and we managed to make some good decisions about how some of the larger outstanding areas will work.  We also managed to get confirmation from Rhona Alcorn of the DSL that we will be able to embed the old web-based version of the Schools Dictionary app for use with some of our questions, which is really great news.

One of the outstanding tasks was to investigate how the map-based quizzes could have their answer options and the correct answer dynamically generated.  This was never part of the original plan for the project, but it became clear that having static answers to questions (e.g. where do people use ‘ginger’ for ‘soft drink’) wasn’t going to work very well when the data users are looking at is dynamically generated and potentially changing all the time – we would be second guessing the outputs of the project rather than letting the data guide the answers.  As dynamically generating answers wasn’t part of the plan and would be pretty complicated to develop this has been left as a ‘would be nice if there’s time’ task, but at our call it was decided that this should now become a priority.  I therefore spent most of Thursday investigating this issue and came up with two potential methods.

The first method looks at each region individually to compare the number of responses for each answer option in the region.  It counts the number of responses for each answer option and then generates a percentage of the total number of responses in the area.  So for example:

North East (Aberdeen)

Mother: 12 (8%)

Maw: 4 (3%)

Mam: 73 (48%)

Mammy: 3 (2%)

Mum: 61 (40%)

So of the 153 current responses in Aberdeen, 73 (48%) were ‘Mam’.  The method then compares the percentages for the particular answer option across all regions to pick out the highest percentage.  The advantage of this approach is that by looking at percentages any differences caused by there being many more respondents in one region over another are alleviated.  If we look purely at counts then a region with a large number of respondents (as with Aberdeen at the moment) will end up with an unfair advantage, even for answer options that are not chosen the most.  E.g. ‘Mother’ has 12 responses, which is currently by far the most in any region, but taken as a percentage it’s roughly in line with other areas.

But there are downsides.  Any region where the option has been chosen but the total number of responses is low will end up with a large percentage.  For example, both Inverness and Dumfries & Galloway currently only have two respondents, but in each case one of these was for ‘Mam’, meaning they pass Aberdeen and would be considered the ‘correct’ answer with 50% each.  If we were to use this method then I would have to put something in place to disregard small samples.  Another downside is that as far as users are concerned they are simply evaluating dots on a map, so perhaps we shouldn’t be trying to address the bias of some areas having more respondents than others because users themselves won’t be addressing this.

This then led me to develop method 2, which only looks at the answer option in question (e.g. ‘Mam’) rather than the answer option within the context of other answer options.  This method takes a count of the number of responses for the answer option in each region and for the number generates a percentage of the total number of answers for the option across Scotland.  So for ‘Mam’ the counts and percentages are as follows:

Ayrshire

1 (1%)

Fife

2 (2%)

Glasgow

2 (2%)

North East (Aberdeen)

73 (84%)

Stirling and Falkirk

2 (2%)

Lothian (Edinburgh)

1 (1%)

Tayside and Angus (Dundee)

4 (5%)

Dumfries and Galloway

1 (1%)

Highlands (Inverness)

1 (1%)

Across Scotland there are currently a total of 87 responses where ‘Mam’ was chosen and 73 of these (84%) were in Aberdeen.  As I say, this simple solution probably mirrors how a user will analyse the map – they will see lots of dots in Aberdeen and select this option.  However, it completely ignores the context of the chosen answer.  For example, if we get a massive rush of users from Glasgow (say 2000) and 100 of these choose ‘Mam’ then Glasgow ends up being the correct answer (beating Aberdeen’s 73), even though as a proportion of all chosen answers in Glasgow 100 is only 5% (the other 1900 people will have chosen other options), meaning it would be a pretty unpopular choice compared to the 48% who chose ‘Mam’ over other options in Aberdeen as mentioned near the start.  But perhaps this is a nuance that users won’t consider anyway.

This latter issue became more apparent when I looked at the output for the use of ‘rocket’ to mean ‘stupid’.  The simple count method has Aberdeen with 45% of the total number of ‘rocket’ responses, but if you look at the ‘rocket’ choices in Aberdeen in context you see that only 3% of respondents in this region selected this option.

There are other issues we will need to consider too.  Some questions currently have multiple regions linked in the answers (e.g. lexical quiz question 4 ‘stour’ has answers ‘Edinburgh and Glasgow’, ‘Shetland and Orkney’ etc.)  We need to decide whether we still want this structure.  This is going to be tricky to get working dynamically as the script would have to join two regions with the most responses together to form the ‘correct’ answer and there’s no guarantee that these areas would be geographically next to each other.  We should perhaps reframe the question; we could have multiple buttons that are ‘correct’ and ask something like ‘stour is used for dust in several parts of Scotland.  Can you pick one?’  Or I guess we could ask the user to pick two.

We also need to decide how to handle the ‘heard throughout Scotland’ questions (e.g. lexical question 6 ‘is greetin’ heard throughout most of Scoatland’).  We need to define what we mean by ‘most of Scotland’.  We need to define this in a way that can be understood programmatically, but thinking about it, we probably also need to better define what we mean by this for users too.  If you don’t know where most of the population of Scotland is situated and purely looked at the distribution of ‘greetin’ on the map you might conclude that it’s not used throughout Scotland at all, but only in the central belt and up the East coast.  But returning to how an algorithm could work out the correct answer for this question:  We need to set thresholds for whether an option is used throughout most of Scotland or not.  Should the algorithm only look at certain regions?  Should it count the responses in each region and consider it in use in the region if (for example) 50% or more respondents chose the option?  The algorithm could then count the number of regions that meet this threshold compared to the total number of regions and if (for example) 8 out of our 14 regions surpass the threshold the answer could be deemed ‘true’.  The problem is humans can look at a map and quickly estimate an answer but an algorithm needs more rigid boundaries.

Also, question 15 of the ‘give your word’ quiz asks about the ‘central belt’ but we need to define what regions make this up.  Is it just Glasgow and Lothian (Edinburgh), for example?  We also might need to clarify this for users too.  The ‘I would never say that’ quiz has several questions where one possible answer is ‘All over Scotland’.  If we’re dynamically ascertaining the correct answer then we can’t guarantee that this answer will be one that comes up.  Also, ‘All over Scotland’ may in fact be the correct answer for questions that we haven’t considered this to be an answer for.  What should be do about this?  Two possibilities: Firstly, the code for ascertaining the correct answer (for all of the map-based quizzes) also has a threshold that when reached would mean the correct answer is ‘All over Scotland’ and this option would then be included in the question.  This could use the same logic as ‘heard throughout Scotland’ yes/no questions that I mentioned above.  Secondly, we could reframe the questions that currently have an ‘All over Scotland’ answer option to be the same as the ‘heard throughout Scotland yes/no questions as found in the lexical quiz and we don’t bother to try and work out whether an ‘all over Scotland’ option needs to be added to any of the other questions.

I also realised that we may end up with a situation where more than one region has a similar number of markers, meaning the system will still easily be able to ascertain which is correct, but users might struggle.  Do we need to consider this eventuality?  I could for example add in a check to see whether any other regions have a similar score to the ‘correct’ one and ensure any that are too close never get picked as the randomly generated ‘wrong’ answer options.  Linked to this: we need to consider whether it is acceptable that the ‘wrong’ answer options will always be randomly generated. The options will be different each time a user loads the quiz question and if they are entirely random this means the question may sometimes be very easy and other times very hard.  Do I need to update the algorithm to add some sort of weighting to how the ‘wrong’ options are chosen?  This will need further discussion with the team next week.

I decided to move onto some of the other outstanding tasks and to leave the dynamically generated map answers issue until Jennifer and Mary are back next week.  I managed to complete the majority of minor updates to the site that were still outstanding during this time, such as updating introductory and explanatory text for the surveys, quizzes and activities, removing or rearranging questions, rewording answers, reinstating the dictionary based questions and tweaking the colour and justification of some of the site text.

This leaves several big issues left to tackle before the end of the month including  dynamically generating answers for quiz questions, developing the output for the ‘click’ activity and developing the interactive activities for ‘I would never say that’.  It’s going to be a busy few weeks.

Also this week I continued to process the data for the Books and Borrowing project.  This included uploading images for one more Advocates library register from the NLS, including generating pages, associating images and fixing the page numbering to align with the handwritten numbers.  I also received images for a second register for Haddington library from the NLS, and I needed some help with this as we already have existing pages for this register in the CMS, but the number of images received didn’t match.  Thankfully the RA Kit Baston was able to look over the images and figure out what needed to be done, which included inserting new pages in the CMS and then me writing a script to associate images with records.  I also added two missing pages to the register for Dumfries Presbytery and added in a missing image for Westerkirk library.

Finally, I tweaked the XSLT for the Dictionaries of the Scots Language bibliographies to ensure the style guide reference linked to the most recent version.

Week Beginning 25th July 2022

I was on holiday for most of the previous two weeks, working two days during this period.  I’ll also be on holiday again next week, so I’ve had quite a busy time getting things done.  Whilst I was away I dealt with some queries from Joanna Kopaczyk about the Future of Scots website.  I also had to investigate a request to fill in timesheets for my work on the Speak For Yersel project, as apparently I’d been assigned to the project as ‘Directly incurred’ when I should have been ‘Directly allocated’.  Hopefully we’ll be able to get me reclassified but this is still in-progress.  I also fixed a couple of issues with the facility to export data for publication for the Berwickshire place-name project for Carole Hough, and fixed an issue with an entry in the DSL, which was appearing in the wrong place in the dictionary.  It turned out that the wrong ‘url’ tag had been added to the entry’s XML several years ago and since then the entry was wrongly positioned.  I fixed the XML and this sorted things.  I also responded to a query from Geert of the Anglo-Norman Dictionary about Aberystwyth’s new VPN and whether this would affect his access to the AND.  I also investigated an issue Simon Taylor was having when logging into a couple of our place-names systems.

On the Monday I returned to work I launched two new resources for different projects.  For the Books and Borrowing project I published the Chambers Library Map (https://borrowing.stir.ac.uk/chambers-library-map/) and reorganised the site menu to make space for the new page link.  The resource has been very well received and I’m pretty pleased with how it’s turned out.  For the Seeing Speech project I launched the new Gaelic Tongues resource (https://www.seeingspeech.ac.uk/gaelic-tongues/) which has received a lot of press coverage, which is great for all involved.

I spent the rest of the week dividing my time primarily between three projects:  Speak For Yersel, Books and Borrowing and Speech Star.  For Books and Borrowing I continued processing the backlog of library register image files that has built up.  There were about 15 registers that needed to be processed, and each needed to be handled in a different way.  This included nine registers from Advocates Library that had been digitised by the NLS, for which I needed to batch process the images to rename them, delete blank pages, create page records in the CMS and then tweak the automatically generated folio numbers to account for discrepancies in the handwritten page number in the images.  I also processed a register for the Royal High School, which involved renaming the images so they match up with image numbers already assigned to page records in the CMS, inserting new page records and updating the ‘next’ and ‘previous’ links for pages for which new images had been uncovered and generating new page records for many tens of new pages that follow on from the ones that have already been created in the CMS.  I also uploaded new images for the Craigston register and created a new register including all page records and associated image URLs for a further register for Aberdeen.  I still have some further RHS registers to do and a few from St Andrews, but these will need to wait until I’m back from my holiday.

For Speech Star I downloaded a ZIP containing 500 new ultrasound MP4 videos.  I then had to process them to generate ‘poster’ images for each video (these are images that get displayed before the user chooses to play the video).  I then had to replace the existing normalised speech database with data from a new spreadsheet that included these new videos plus updates to some of the existing data.  This included adding a few new fields and changing the way the age filter works, as much of the new data is for child speakers who have specific ages in months and years, and these all need to be added to a new ‘under 18’ age group.

For Speak For Yersel I had an awful lot to do.  I started with a further large-scale restructuring of the website following feedback from the rest of the team.  This included changing the site menu order, adding in new final pages to the end of surveys and quizzes and changing the text of buttons that appear when displaying the final question.

I then developed the map filter options for age and education for all of the main maps.  This was a major overhaul of the maps.  I removed the slide up / slide down of the map area when an option is selected as this was a bit long and distracting.  Now the map area just updates (although there is a bit of a flicker as the data gets replaced).  The filter options unfortunately make the options section rather big, which is going to be an issue on a small screen.  On my mobile phone the options section takes up 100% of the width and 80% of the height of the map area unless I press the ‘full screen’ button.  However, I figured out a way to ensure that the filter options section scrolls if the content extends beyond the bottom of the map.

I also realised that if you’re in full screen mode and you select a filter option the map exits full screen as the map section of the page reloads.  This is very annoying, but I may not be able to fix it as it would mean completely changing how the maps are loaded.  This is because such filters and options were never intended to be included in the maps and the system was never developed to allow for this.  I’ve had to somewhat shoehorn in the filter options and it’s not how I would have done things had I known from the beginning that these options were required.  However, the filters work and I’m sure they will be useful.  I’ve added in filters for age, education and gender, as you can see in the following screenshot:

I also updated the ‘Give your word’ activity that asks to identify younger and older speakers to use the new filters too.  The map defaults to showing ‘all’ and the user then needs to choose an age.  I’m still not sure how useful this activity will be as the total number of dots for each speaker group varies considerably, which can easily give the impression that more of one age group use a form compared to another age group purely because one age group has more dots overall.  The questions don’t actually ask anything about geographical distribution so having the map doesn’t really serve much purpose when it comes to answering the question.  I can’t help but think that just presenting people with percentages would work better, or some other sort of visualisation like a bar graph or something.

I then moved on to working on the quiz for ‘she sounds really clever’ and so far I have completed both the first part of the quiz (questions about ratings in general) and the second part (questions about listeners from a specific region and their ratings of speakers from regions).  It’s taken a lot of brain-power to get this working as I decided to make the system work out the correct answer and to present it as an option alongside randomly selected wrong answers.  This has been pretty tricky to implement (especially as depending on the question the ‘correct’ answer is either the highest or the lowest) but will make the quiz much more flexible – as the data changes so will the quiz.

Part one of the quiz page itself is pretty simple.  There is the usual section on the left with the question and the possible answers.  On the right is a section containing a box to select a speaker and the rating sliders (readonly).  When you select a speaker the sliders animate to their appropriate location.  I decided to not include the map or the audio file as these didn’t really seem necessary for answering the questions, they would clutter up the screen and people can access them via the maps page anyway (well, once I move things from the ‘activities’ section).  Note that the user’s answers are stored in the database (the region selected and whether this was the correct answer at the time).  Part two of the quiz features speaker/listener true/false questions and this also automatically works out the correct answer (currently based on the 50% threshold).  Note that where there is no data for a listener rating a speaker from a region the rating defaults to 50.  We should ensure that we have at least one rating for a listener in each region before we let people answer these questions.  Here is a screenshot of part one of the quiz in action, with randomly selected ‘wrong’ answers and a dynamically outputted ‘right’ answer:

I also wrote a little script to identify duplicate lexemes in categories in the Historical Thesaurus as it turns out there are some occasions where a lexeme appears more than once (with different dates) and this shouldn’t happen.  These will need to be investigated and the correct dates will need to be established.

I will be on holiday again next week so there won’t be another post until the week after I’m back.