Week Beginning 21st November 2022

I participated in the UCU strike action on Thursday and Friday this week, so it was a three-day week for me.  I spent much of this time researching RDF, SKOS and OWL semantic web technologies in an attempt to understand them better in the hope that they might be of some use for future thesaurus projects.  I’ll be giving a paper about this at a workshop in January so I’m not going to say too much about my investigations here.  There is a lot to learn about, however, and I can see me spending quite a lot more time on this in the coming weeks.

Other than this I returned to working on the Anglo-Norman Dictionary.  I added in a line of text that had somehow been omitted from one of the Textbase XML files and added in a facility to enable project staff to delete an entry from the dictionary.  In reality this just deactivates the entry, removing it from the front-end but still keeping a record of it in the database in case the entry needs to be reinstated.  I also spoke to the editor about some proposed changes to the dictionary management system and began to think about how these new features will function and how they will be developed.

For the Books and Borrowing project I had a chat with IT Services at Stirling about setting up an Apache Solr system for the project.  It’s looking like we will be able to proceed with this option, which will be great.  I also had a chat with Jennifer Smith about the new Speak For Yersel project areas.  It looks like I’ll be creating the new resources around February next year.  I also fixed an issue with the Place-names of Iona data export tool and discussed a new label that will be applied to data for the ‘About’ box for entries in the Dictionaries of the Scots Language.

I also prepared for next week’s interview panel and engaged in a discussion with IT Services about the future of the servers that are hosted in the College of Arts.

Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow.  Facetted searching would still require several other database queries to be executed, as would extracting all of the fields that would be necessary to display the search results and it seemed inadvisable to try and create all of this functionality myself when an existing package like Solr could already do it all.

Solr is considerably faster than using the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously which would be hopelessly slow if I attempted to create something comparable directly with the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

“London”,2211,

“Edinburgh”,119,

“Dublin”,100,

“Paris”,30,

“Edinburgh; London”,16,

“Cambridge”,4,

“Eton”,3,

“Oxford”,3,

“The Hague”,3,

“Naples”,2,

“Rome”,2,

“Berlin”,1,

“Glasgow”,1,

“Lausanne”,1,

“Venice”,1,

“York”,1

 

Format:

“8vo”,842,

“octavo”,577,

“4to”,448,

“quarto”,433,

“4to.”,88,

“8vo., plates, port., maps.”,88,

“folio”,76,

“duodecimo”,67,

“Folio”,33,

“12mo”,19,

“8vo.”,17,

“8vo., plates: maps.”,16

 

Borrower gender:

“Male”,3128,

“Unknown”,109,

“Female”,64,

“Unclear”,2

These would then allow me to build in the options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database it is likely that such an approach would be much slower and would take me time to develop.  It seems much more sensible to use an existing solution if this is going to be possible.
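As a rough illustration of the faceted query described above, here is a minimal Python sketch that builds the query string for a Solr select request and tidies up the facet counts that come back.  The field names (standardisedtitle, pubplaces, formats, bgenders) are taken from the simplified JSON structure I use for the Solr data, but the helper functions themselves are hypothetical and no request is actually sent.  By default Solr returns each facet as a flat list of alternating values and counts, which is what the second function assumes.

```python
from urllib.parse import urlencode


def build_facet_query(term, facet_fields, rows=10):
    """Build the query string for a Solr /select request with field facets."""
    params = [
        ("q", f"standardisedtitle:{term}"),
        ("rows", rows),
        ("facet", "true"),
        # Only return facet values that actually occur in the results.
        ("facet.mincount", 1),
    ]
    # One facet.field parameter is added per field we want a breakdown for.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


def parse_facet_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


query_string = build_facet_query("Roman", ["pubplaces", "formats", "bgenders"])
genders = parse_facet_counts(["Male", 3128, "Unknown", 109, "Female", 64, "Unclear", 2])
```

Each pair could then be rendered as a checkbox with its count, and ticking a checkbox would add a filter to the query to narrow the results.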

In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes and included a lot of data that’s not really needed either, as the returned data contains everything that’s needed to display the full borrowing record.  I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could be used in a simplified search results page.  Here’s an example:

{
  "bnid": 1379,
  "lid": 6,
  "slug": "glasgow-university",
  "lname": "Glasgow University Library",
  "rid": 2,
  "rname": "3",
  "syear": 1760,
  "eyear": 1765,
  "rtype": "Student",
  "pid": 107,
  "fnum": "4r",
  "transcription": "Euseb: Eclesiastical History",
  "bday": 17,
  "bmonth": 9,
  "byear": 1760,
  "rday": 1,
  "rmonth": 10,
  "ryear": 1760,
  "borrowed": "1760-09-17",
  "returned": "1760-10-01",
  "bdayofweek": "Wednesday",
  "rdayofweek": "Wednesday",
  "originaltitle": "",
  "standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned historiographers, Eusebius, Socrates, and Evagrius.",
  "brids": ["1"],
  "bfnames": ["Charles"],
  "bsnames": ["Wilson"],
  "bfullnames": ["Charles Wilson"],
  "boccs": ["University Student", "Education"],
  "bgenders": ["Male"],
  "aids": ["74"],
  "asnames": ["Eusebius of Caesarea"],
  "afullnames": ["Eusebius of Caesarea"],
  "beids": ["88"],
  "edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned historiographers, Eusebius, Socrates, and Evagrius."],
  "estcs": ["R21513"],
  "langs": ["English"],
  "pubplaces": ["London"],
  "formats": ["folio"]
}

I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server.  I then created a Solr Core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations, some fields are text strings, some are integers, some are dates). I then ran the Solr script that ingests the data.

It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected.  These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.

Some issues encountered include:

  1. Empty fields in the data resulting in no data for the corresponding JSON field (e.g. “bday”: <nothing here> ) which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included.
  2. Solr’s date structure requiring a full date (e.g. 1792-02-16) and partial dates (e.g. 1792) therefore failing. I ended up reverting to an integer field for returned dates as these are generally much more vague and having to generate placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed and it took some time to investigate why my nice integer ID was failing.
  4. Realising more fields should be added to the JSON output as I went on and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors )
  5. Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quot;’ as their appearance in the JSON caused the structure to be invalid.  I therefore updated the transcription, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names.
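Several of the issues listed above come down to how the JSON is written out, so here is a minimal Python sketch of one way an export step could guard against them.  This is not my actual export script: the function name and the sample record are made up for illustration, and instead of converting double quotes to an entity it relies on the JSON serialiser to escape them, which is an alternative approach.  It also assumes the Solr unique key field is called ‘id’.

```python
import json


def build_solr_doc(record):
    """Return a JSON string for one borrowing record, ready for ingest."""
    doc = {}
    for key, value in record.items():
        # Leave out fields with no data so a key is never written without a value.
        if value is None:
            continue
        doc[key] = value
    # Solr's unique key must be a string, so the integer BNID is converted here.
    doc["id"] = str(record["bnid"])
    # json.dumps escapes any double quotes inside string values automatically.
    return json.dumps(doc, ensure_ascii=False)


# Hypothetical sample record with an empty day field and quotes in a text field.
sample = {
    "bnid": 1379,
    "bday": None,
    "byear": 1760,
    "transcription": 'A "quoted" title',
    "bfullnames": ["Charles Wilson"],
}
exported = build_solr_doc(sample)
```

Because the serialiser handles the escaping for every string field, borrower names containing double quotes would be covered without needing to treat each field separately.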

However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well.  And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9 https://solr.apache.org/downloads.html) to be installed on a server.  This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and as far as I know it fires up a Java based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions about doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API) so we could limit access to the IP address of this server,  or if Solr is going to be installed on the same server then limiting access to localhost could work.

In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud).  Once Solr is running we’d need a Core to be created.  I have the schema file the core would require and can give instructions about setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.

One downside to using Solr is it is a separate system to the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently as exporting the data takes at least an hour, then transferring the files to the server for ingest will take a long time (uploading hundreds of thousands of small files to a server can take hours.  Zipping them up then uploading the zip file and extracting the file also takes a long time).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.

Other than the above B&B work I did spend a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week.  I also began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when a ‘quotes only’ advanced search is performed.  So for example in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.

Sic dreich wark. . . . For lang I tholed an’ fendit.

Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.

And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.

It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.

Driche and sair yer pain.

And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.

See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.

</field>

The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize variable, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”.  We don’t currently override this default so 100 characters is what we use per snippet (roughly – it can extend more than this to ensure complete words are displayed).

The hl.snippets variable specifies the maximum number of highlighted snippets that are returned per entry and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed and this is because the maximum number of snippets has been reached.  ‘Dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
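To make the two settings concrete, here is a minimal Python sketch of the highlighting parameters as they might be passed to Solr.  The hl, hl.fl, hl.fragsize and hl.snippets parameters are standard Solr highlighting options and searchtext_onlyquotes is our existing field, but the helper function is hypothetical and the DSL site does not necessarily build its requests this way.

```python
from urllib.parse import urlencode


def build_highlight_params(term, field="searchtext_onlyquotes", fragsize=100, snippets=10):
    """Build the query string for a quotation search with highlighted snippets."""
    return urlencode({
        "q": f"{field}:{term}",
        "hl": "true",
        # The field the snippets are generated from.
        "hl.fl": field,
        # Approximate snippet size in characters (100 is also Solr's default).
        "hl.fragsize": fragsize,
        # Maximum number of snippets returned per entry.
        "hl.snippets": snippets,
    })


params = build_highlight_params("dreich")
```

Raising the snippets value would return more matches for entries like ‘Dreich adj’, at the cost of longer result listings.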

As the quotations block of text is just one massive block and isn’t split into individual quotations the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich Adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

Which actually comprises the text from almost the entire first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”

 

Includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but ‘dreigher’ which is not highlighted).

So essentially while the snippets may look like they correspond to individual quotes this is absolutely not the case and the highlighted word is generally positioned around the middle of around 100 characters of text that can include several quotations.  It also means that it is not possible to limit a search to two terms that appear within one single quotation at the moment because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on such as part of speech.  This will ensure snippets will in future only feature text from the quotation in question and will ensure that Boolean searches will be limited to text within individual quotations.  However, it is a major change and it will require some time and experimentation to get working correctly and it may introduce other unforeseen issues.

I will need to change the way the search data is stored in Solr and I will need to change how the data is generated for ingest into Solr.  The display of the search results will need to be reworked as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry.  I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.
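As a sketch of the alternative ranking mentioned above, the following Python groups quotation-level hits back into entries, orders the entries by the number of matching quotations and caps how many quotations are shown for each.  The function name, the ‘entry_id’ and ‘quote’ keys and the sample hits are all placeholders for illustration rather than anything that exists in the DSL system yet.

```python
from collections import defaultdict


def group_by_entry(hits, max_per_entry=3):
    """Group quotation hits by entry and rank entries by number of hits."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["entry_id"]].append(hit["quote"])
    # Entries with the most matching quotations come first.
    ranked = sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)
    return [
        {"entry_id": entry_id, "total": len(quotes), "shown": quotes[:max_per_entry]}
        for entry_id, quotes in ranked
    ]


# Placeholder hits: entry-a has four matching quotations, entry-b has one.
hits = [
    {"entry_id": "entry-b", "quote": "quote 1"},
    {"entry_id": "entry-a", "quote": "quote 2"},
    {"entry_id": "entry-a", "quote": "quote 3"},
    {"entry_id": "entry-a", "quote": "quote 4"},
    {"entry_id": "entry-a", "quote": "quote 5"},
]
results = group_by_entry(hits)
```

Keeping the total alongside the capped list would let the results page show something like ‘3 of 4 quotations’ with a link to view the rest.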

The DSL team has proposed that a date search could be provided as a filter on the search results page and we would certainly be able to do this, and incorporate other filters such as POS in future.  This is something called ‘facetted searching’ and it’s the kind of thing you see in online shops:  you view the search results then you see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results the filter applies to.  The good news is that Solr has these kinds of faceting options built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.

Week Beginning 7th November 2022

I participated in an event about Digital Humanities in the College of Arts that Luca had organised on Monday, at which I discussed the Books and Borrowing project.  It was a good event and I hope there will be more like it in future.  Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/) which is an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project.  I will be very interested to hear how Curious Travellers gets on with the tool in future.  Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before that allows you to upload a text and then access many analysis and visualisation tools.  It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.

Also this week I had to prepare for and participate in a candidate shortlisting session for a new systems developer post in the College of Arts and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host.  We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily.  As of yet we’ve still not heard anything further from IT Services so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of and whether these issues relate to the code or the underlying server infrastructure.  Hopefully we’ll hear more soon.

The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project.  One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that.  Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.

I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box.  This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page.  The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover.  For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records.  Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.

The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page.  The only difference is additional information about the library, register and page the borrowing record appears on can be found at the top of the record.  These appear as links and if you press on the page link this will open the page centred on the selected borrowing record.  For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:

The non-date search also works, but is currently a bit too slow.  For example a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long.  Currently non-date quick searches do a very simple find and replace to highlight the matched text in all relevant fields.  This currently makes the matched text upper case, but I don’t intend to leave it like this.  You can also search for things like the ESTC too.

However, there are several things I’m not especially happy about:

  1. The speed issue: the current approach is just too slow
  2. Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed, by borrower name)  many more fields in addition to borrowing ID would need to be returned, potentially for thousands of records and this is going to be too slow with the current data structure
  3. The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s a bit hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this.  Should we have a more compact view of results?  If so what data should be displayed?  The difficulty is if we omit a field that is the only field that includes the user’s search term it’s potentially going to be very confusing
  4. This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores:  You see the search results and then there are checkboxes that allow you to narrow down the results.  For example, we could have checkboxes containing all occupations in the results allowing the user to select one or more.  Or we have checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
  5. Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing occupations or gender of the borrowings in the search results etc.  I feel that we need some sort of summary information as the results themselves are just too detailed to easily get an overall picture of.

I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon).  It does a lot of the things I’d like to implement (graphs, facetted search results) and it does it all very speedily with a pleasing interface.  I think we could learn a lot from this.

Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/) which is a free search platform that is much faster than a traditional relational database and provides options for facetted searching.  We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it.  Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data.  But if we are potentially going to use Solr then we would need to install it on a server at Stirling.  Stirling’s IT people might be ok with this (they did allow us to set up a IIIF server for our images, after all) but we’d need to check.  I should have a better idea as to whether Solr is what we need by the end of next week, all being well.

Also this week I spent some time working on the Speech Star project.  I updated the database to highlight key segments in the ‘target’ field which had been highlighted in the original spreadsheet version of the data by surrounding the segment with bar characters.  I’d suggested this as when exporting data from Excel to a CSV file all Excel formatting such as bold text is lost, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field.  This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier but I got there in the end.  After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:
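For anyone curious, the splitting logic described above can be sketched in a few lines of Python.  This is an illustrative version rather than the actual extraction code, and the function name, CSS class name and sample value are made up.  Splitting on the bar character means every odd-numbered piece was enclosed in bars, which is what allows multiple highlighted segments in one field.

```python
import html


def highlight_segments(target, css_class="key-segment"):
    """Wrap each bar-delimited segment of the target text in a styled span."""
    parts = target.split("|")
    # An even number of parts means an unmatched bar, so leave the text unstyled.
    if len(parts) % 2 == 0:
        return html.escape(target)
    output = []
    for index, part in enumerate(parts):
        escaped = html.escape(part)
        if index % 2 == 1:
            # Odd-numbered parts sat between a pair of bars in the spreadsheet.
            output.append(f'<span class="{css_class}">{escaped}</span>')
        else:
            output.append(escaped)
    return "".join(output)


marked_up = highlight_segments("|s|ta|r|")
```

The CSS class can then be styled however is needed, for example as bold red text.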

I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.

Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.

Week Beginning 31st October 2022

I spent a lot of the week continuing to work on the Books and Borrowing front end.  To begin with I worked on the ‘borrowers’ tab in the ‘library’ page and created an initial version of it.  Here’s an example of how it looks:

As with books, the page lists borrowers alphabetically, in this case by borrower surname.  Letter tabs and counts of the number of borrowers with surnames beginning with the letter appear at the top and you can select a letter to view all borrowers with surnames beginning with the letter.  I had to create a couple of new fields in the borrower table to speed the querying up, saving the initial letter of each borrower’s surname and a count of their borrowings.

The display of borrowers is similar to the display of books, with each borrower given a box that you can press on to highlight.  Borrower ID appears in the top right and each borrower’s full name appears as a green title.  The name is listed as it would be read, but this could be updated if required.  I’m not sure where the ‘other title’ field would go if we did this, though – presumably something like ‘Macdonald, Mr Archibald of Sanda’.

The full information about a borrower is listed in the box, including additional fields and normalised occupations.  Cross references to other borrowers also appear.  As with the ‘Books’ tab, much of this data will be linked to search results once I’ve created the search options (e.g. press on an occupation to view all borrowers with this occupation, press on the number of borrowings to view the borrowings) but this is not in place yet.  You can also change the view from ‘surname’ to ‘top 100 borrowers’, which lists the top 100 most prolific borrowers (or fewer if there are fewer than 100 borrowers in the library).  As with the book tab, a number appears at the top left of each record to show the borrower’s place on the ‘hitlist’ and the number of borrowings is highlighted in red to make it easier to spot.

I also fixed some issues with the book and author caches that were being caused by spaces at the start of fields and author surnames beginning with a non-capitalised letter (e.g. ‘von’) which was messing things up as the cache generation script was previously only matching upper case, meaning ‘v’ wasn’t getting added to ‘V’.  I’ve regenerated the cache to fix this.
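The fix amounts to normalising each name before working out its initial letter.  Here is a minimal Python sketch of that idea; the real cache generation script is not written like this and the surnames in the example are made up.

```python
from collections import Counter


def initial_letter_counts(surnames):
    """Count surnames per initial letter, ignoring leading spaces and case."""
    counts = Counter()
    for surname in surnames:
        # Remove stray spaces at the start (and end) of the field.
        cleaned = surname.strip()
        if cleaned:
            # Upper-casing means 'von ...' is counted under 'V' rather than 'v'.
            counts[cleaned[0].upper()] += 1
    return counts


# Made-up surnames: one with a leading space, one starting with a lower-case letter.
letter_counts = initial_letter_counts([" Vane", "von Example", "Wilson", ""])
```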

I then decided to move onto the search rather than the ‘Facts & figures’ tab as I reckoned this should be prioritised.  I began work on the quick search initially, and I’m still very much in the middle of this.  The quick search has to search an awful lot of fields and to do this several different queries need to be run.  I’ll need to see how this works in terms of performance as I fear the ‘quick’ search risks being better named the ‘slow’ search.

We’ve stated that users will be able to search for dates in the quick search and these need to be handled differently.  For now the API checks to see whether the passed search string is a date by running a pattern match on the string.  This converts all numbers in the string into an ‘X’ character and then checks to see whether the resulting string matches a valid date form.  For the API I’m using a bar character (|) to designate a ranged date and a dash to designate a division between day, month and year.  I can’t use a slash (/) as the search string is passed in the URL and slashes have meaning in URLs.  For info, here are the valid date string patterns:

“XXXX”,”XXXX-XX”,”XXXX-XX-XX”,”XXXX|XXXX”,”XXXX|XXXX-XX”,”XXXX|XXXX-XX-XX”,”XXXX-XX|XXXX”,”XXXX-XX|XXXX-XX”,”XXXX-XX|XXXX-XX-XX”,”XXXX-XX-XX|XXXX”,”XXXX-XX-XX|XXXX-XX”,”XXXX-XX-XX|XXXX-XX-XX”

So for example, if someone searches for ‘1752’ or ‘1752-03’ or ‘1752-02|1755-07-22’ the system will recognise these as a date search and process them accordingly.  I should point out that I can and probably will get people to enter dates in a more typical way in the front-end, using slashes between day, month and year and a dash between ranged dates (e.g. ‘1752/02-1755/07/22’) but I’ll convert these before passing the string to the API in the URL.
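The pattern matching and the front-end conversion described above can be sketched in Python as follows.  The API itself is not written in Python and the function names here are hypothetical, but the twelve patterns are the ones listed above, generated from the three single-date forms.

```python
import re

# The three single-date forms: year, year-month, year-month-day.
SINGLE_DATE_FORMS = ["XXXX", "XXXX-XX", "XXXX-XX-XX"]

# Every single form plus every start|end combination gives the twelve valid patterns.
VALID_PATTERNS = set(SINGLE_DATE_FORMS) | {
    f"{start}|{end}" for start in SINGLE_DATE_FORMS for end in SINGLE_DATE_FORMS
}


def is_date_search(search):
    """Replace every digit with 'X' and check the result against the valid patterns."""
    return re.sub(r"\d", "X", search) in VALID_PATTERNS


def to_api_format(user_input):
    """Convert front-end dates (slashes, dash for ranges) to the API form."""
    # The range dash must become a bar before slashes are turned into dashes.
    return user_input.replace("-", "|").replace("/", "-")


api_string = to_api_format("1752/02-1755/07/22")
```

One quirk of this approach is that a search for a literal string of capital Xs in a valid shape would also be treated as a date, so a real implementation may want an extra check that the original string contains digits.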

I have the query running to search the dates, and this in itself was a bit complicated to generate as including a month or a month and a day in a ranged query changes the way the query needs to work.  E.g. if the user searches for ‘1752-1755’ then we need to return all borrowing records with a borrowed year of ‘1752’ or later and ‘1755’ or earlier.  However, if the query is ‘1752/06-1755/03’ then the query can’t just be ‘all borrowing records with a borrowed year of 1752 or later and a borrowed month of 06 or later and a borrowed year of 1755 or earlier and a borrowed month of 03 or earlier’ as this would return no results.  This is because the query is looking to return borrowings with a borrowed month of ‘06’ or later and also ‘03’ or earlier.  Instead the query needs to find borrowing records that (have a borrowed year of 1752 AND a borrowed month of ‘06’ or later, OR have a borrowed year later than 1752) AND (have a borrowed year of 1755 AND a borrowed month of ‘03’ or earlier, OR have a borrowed year earlier than 1755).
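To show the difference between the two conditions, here is a small Python sketch that applies each of them to a single borrowing record.  The real query runs in the database rather than in Python, and the function names are made up for illustration.

```python
def naive_in_range(byear, bmonth, start, end):
    """The tempting but wrong condition: year and month are tested independently."""
    return start[0] <= byear <= end[0] and start[1] <= bmonth <= end[1]


def in_range(byear, bmonth, start, end):
    """The corrected condition, where start and end are (year, month) pairs."""
    start_year, start_month = start
    end_year, end_month = end
    # On or after the start: same year and late enough month, or any later year.
    after_start = (byear == start_year and bmonth >= start_month) or byear > start_year
    # On or before the end: same year and early enough month, or any earlier year.
    before_end = (byear == end_year and bmonth <= end_month) or byear < end_year
    return after_start and before_end


start, end = (1752, 6), (1755, 3)
# A borrowing in December 1753 is clearly inside the range.
naive_result = naive_in_range(1753, 12, start, end)
correct_result = in_range(1753, 12, start, end)
```

The naive version rejects December 1753 because no month can be both 06 or later and 03 or earlier, while the corrected version accepts it.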

I also have the queries running that search for all necessary fields that aren’t dates.  This currently requires five separate queries to be run to check fields like author names, borrower occupations, book edition fields such as ESTC etc.  The queries currently return a list of borrowing IDs, and this is as far as I’ve got.  I’m wondering now whether I should create a cached table for the non-date data queried by the quick search, consisting of a field for the borrowing ID and a field for the term that needs to be searched, with each borrowing having many rows depending on the number of terms they have (e.g. a row for each occupation of every borrower associated with the borrowing, a row for each author surname, a row for each forename, a row for ESTC).  This should make things much speedier to search, but will take some time to generate.  I’ll continue to investigate this next week.
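
The cached table idea could look something like the following sketch.  Again this is only an illustration under stated assumptions: the table name `quick_search_cache`, its columns and the two function names are hypothetical, SQLite stands in for whatever database the project uses, and the search is a case-insensitive exact match on the term, which may not be what the quick search ends up doing.

```python
import sqlite3


def build_cache(conn, terms_by_borrowing):
    """(Re)generate the cache: one row per borrowing ID and searchable term.

    terms_by_borrowing maps a borrowing ID to every term that should find it
    (author surnames and forenames, borrower occupations, ESTC and so on).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quick_search_cache "
        "(borrowing_id INTEGER NOT NULL, term TEXT NOT NULL)"
    )
    # The index on the term is what should make lookups fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_quick_search_term "
        "ON quick_search_cache (term)"
    )
    conn.execute("DELETE FROM quick_search_cache")
    conn.executemany(
        "INSERT INTO quick_search_cache (borrowing_id, term) VALUES (?, ?)",
        [
            (borrowing_id, term.lower())
            for borrowing_id, terms in terms_by_borrowing.items()
            for term in terms
        ],
    )
    conn.commit()


def quick_search(conn, search_string):
    """Return the IDs of borrowings with a term matching the search string."""
    rows = conn.execute(
        "SELECT DISTINCT borrowing_id FROM quick_search_cache "
        "WHERE term = ? ORDER BY borrowing_id",
        (search_string.lower(),),
    )
    return [row[0] for row in rows]
```

With a table like this the five separate queries would be replaced by a single indexed lookup, at the cost of having to regenerate the cache whenever the underlying data changes.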

Also this week I updated the structure of the Speech Star database to enable each prompt to have multiple sounds etc.  I had to update the non-disordered page and the child error page to work with the new structure, but it seems to be working.  I also had to update the ‘By word’ view as previously sound, articulation and position were listed underneath the word and above the table.  As these fields may now be different for each record in the table I’ve removed the list and have instead added the data as columns to the table.  This does however mean that the table contains a lot of identical data for many of the rows now.

I then added in tooltip / help text containing information about what the error types mean in the child speech error database.  On the ‘By Error Type’ page the descriptions currently appear as small text to the right of the error type title.  On the ‘By Word’ page the error type column has an ‘i’ icon after the error type.  Hovering over or pressing on this displays a tooltip with the error description, as you can see in the following screenshot:

I also updated the layout of the video popups to split the metadata across two columns.  I also changed the order of the errors on the ‘By Error Type’ page so that the /r/ errors appear in the correct alphabetical position for ‘r’ rather than appearing first due to the text beginning with a slash.  With this all in place I then replicated the changes on the version of the site that is going to be available via the Seeing Speech URL.
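
The ordering fix for the /r/ errors amounts to sorting on the title with any leading slash ignored.  A minimal sketch of that idea in Python, with a hypothetical function name and no claim that this is how the site does it:

```python
def error_type_sort_key(title: str) -> str:
    """Sort key that ignores leading slashes and letter case."""
    # '/' sorts before all letters, so without this a title beginning
    # with '/r/' would always come first in the list.
    return title.lstrip("/").lower()


def sort_error_types(titles):
    """Return the error type titles in alphabetical order for display."""
    return sorted(titles, key=error_type_sort_key)
```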

Kirsteen McCue contacted me last week to ask for advice on a British Academy proposal she’s putting together, and after asking some questions about the project I wrote a bit of text about the data and its management for her.  I also sorted out my flights and accommodation for the workshop I’m attending in Zurich in January and did a little bit of preparation for a session on Digital Humanities that Luca Guariento has organised for next Monday.  I’ll be discussing a couple of projects at this event.  I also exported all of the Speak For Yersel survey data and sent it to a researcher who is going to do some work with it, and I fixed an issue that had cropped up with the Place-names of Kirkcudbright website.  I also spent a bit of time on DSL duties this week, helping with some user account access issues and discussing how links will be added from entries to the lengthy essay on the Scots Language that we host on the site.