Week Beginning 16th January 2023

I divided my time primarily between the Anglo-Norman Dictionary and Books and Borrowing this week.  For the AND I implemented a new ‘citation editing’ feature that I’d written the specification for before Christmas.  This new feature allows an editor to bring up a list of all of the citations for a source text (similar to how this page in the front-end works: https://anglo-norman.net/search/citation/null/null/A-N_Falconry) and then either manually edit the XML for one or more citations or apply a batch edit to any selected citations.  The batch edit enables the citation’s date, source text reference and/or location reference to be changed, potentially updating the XML for thousands of entries in one process.  It took a fair amount of time to implement the feature and then further time to test it.  This was especially important as I didn’t want to risk an error corrupting thousands of dictionary entries.  I set up a version of the AND system and database on my laptop so I could work on the new code there without risk to the live site.

The new feature works pretty much exactly as I’d specified in the document I wrote before Christmas, but one difference is that I realised we already had a page in the Dictionary Management System that listed all sources – the ‘Browse Sources’ page.  Rather than have an entirely new ‘Edit citations’ page that would also begin by listing the sources, I decided to update the existing ‘Browse Sources’ page.  This page still features the same tabular view of the source texts, but the buttons beside each text now include ‘Edit citations’.  Pressing on this opens the ‘Edit citations’ page for the source in question.  By default this lists all citations for the source ordered by headword.  Where an entry has more than one citation for a source these appear in the order they are found in the entry.  At the top of the page there is a button you can press to change the sorting to location in the source text.  This sorts the citations by the contents of the <loc> tag, displaying the headword for each entry alongside the citation.  Note that this sorting currently doesn’t order things the way a human reader would expect: the field can contain a mixture of numbers and text, so it is sorted as text, which means numbers are compared character by character and all of the ones come before all of the twos, etc. (e.g. 1, 10 and 1002 all come before 2).  I’ll need to investigate whether I can do something about this, maybe next week.
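For what it’s worth, a ‘natural’ comparison is one possible fix here.  A minimal JavaScript sketch, assuming the sort could be done with a locale-aware collator (the example locations are invented):

// Sketch: a 'natural' sort for mixed text/number <loc> values.
// Intl.Collator with the numeric option compares runs of digits as numbers,
// so '2' comes before '10' and '1002' rather than after them.
const collator = new Intl.Collator('en', { numeric: true, sensitivity: 'base' });

const locs = ['10', '1002', '2', '1', '12rb', '12va', 'prologue'];
locs.sort((a, b) => collator.compare(a, b));

console.log(locs); // ['1', '2', '10', '12rb', '12va', '1002', 'prologue']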

As my document had specified, you can batch edit and / or manually edit any listed citations.  Batch editing is controlled by the checkboxes beside each citation – any that are checked will have the batch edit applied to them.  The dark blue ‘Batch Edit Options’ section allows you to decide what details to change.  You can specify a new date (ideally using the date builder feature in the DMS to generate the required XML).  You can select a different siglum, which uses an autocomplete – start typing and select the matching siglum.  However, the problem with autocompletes is what happens if you manually edit or clear the field after selecting a value.  If you manually edit the text in this field after selecting a siglum the previously selected siglum will still be used, as it’s not the contents of the text field that are used in the edit but a hidden field containing the ‘slug’ of the selected siglum.  An existing siglum selected from the autocomplete should always be used here to avoid this issue.  You can also specify new contents for the <loc> tag.  Any combination of the three fields can be used – just leave the ones you don’t want to update blank.
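For illustration, here is a minimal sketch of the hidden ‘slug’ field pattern described above, together with one way the stale-slug behaviour could be avoided.  It assumes a jQuery UI-style autocomplete, and the element IDs and endpoint URL are invented, so the DMS’s actual widget will differ.

// Sketch: keep a hidden 'slug' field in sync with a siglum autocomplete.
// #siglum, #siglum-slug and the endpoint URL are hypothetical.
$('#siglum').autocomplete({
  source: '/dms/sigla/autocomplete',  // hypothetical endpoint returning {label, value} pairs
  minLength: 2,
  select: function (event, ui) {
    $('#siglum-slug').val(ui.item.value);  // the machine-readable slug the batch edit uses
    $('#siglum').val(ui.item.label);       // keep the human-readable label visible
    return false;
  }
}).on('input', function () {
  // if the editor retypes or clears the text, discard the previously selected slug
  $('#siglum-slug').val('');
});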

To manually edit one or more citations you can press the ‘Edit’ button beside the citation.  This displays a text area with the current XML for the citation in it.  You can edit this XML as required, but the editors will need to be careful to ensure the updated XML is valid or things might break.  The ‘Edit’ button changes to a ‘Cancel Edit’ button when the text area opens.  Pressing on this removes the text area.  Any changes you made to the XML in the text area will be lost and pressing the ‘Edit’ button again will reopen the text area with a fresh version of the citation’s XML.

It is possible to combine manual and batch edits, but manual edits are applied first, meaning that if you manually edit some information that is also to be batch edited the batch edit will replace the manual edit for that information.  E.g. if you manually edit the <quotation> and the <loc> and you also batch edit the <loc>, the quotation and loc fields will be replaced with your manual edit first and then the loc field will be overwritten with your batch edit.  Here’s a screenshot of the citation editor page, with one manual edit section open:

Once the necessary batch / manual changes have been made, pressing the ‘Edit Selected Citations’ button at the bottom of the page submits the data, at which point the edits will be made.  This doesn’t actually edit the live entry but takes the live entry XML, edits it and then creates a new Holding Area entry for each entry in question (Holding Area entries are temporary versions of entries stored in the DMS for checking before publication).  The process of making these holding area entries includes editing all relevant citations for each entry (e.g. the contents of each relevant <attestation> element) and checking and (if necessary) regenerating the ‘earliest date’ field for the entry, as this may have changed depending on the date information supplied.  After the script has run you can then find new versions of the entries in the Holding Area, where you can check the versions and either approve them, making them live, or delete them as required.  I’ll probably need to add in a ‘Delete all’ option to the Holding Area as currently entries that are to be deleted need to be individually deleted, which would be annoying if there’s an entire batch to remove.

Through the version on my laptop I fully tested the process out and it all worked fine.  I didn’t actually test publishing any live entries that have passed through the citation edit process, but I have previewed them in the holding area and all look fine.  Once the entries enter the holding area they should be structurally identical to entries that end up in the holding area from the ‘Upload’ facility so there shouldn’t be any issues in publishing them.

After that I uploaded the new code to the AND server and began testing and tweaking things there before letting the AND Editor Geert loose on the new system.  All seemed to work fine with his first updates, but then he noticed something a bit strange.  He’d updated the date for all citations in one source text, meaning more than 1000 citations needed to be updated.  However, the new date (1212) wasn’t getting applied to all of the citations, and somewhere down the list the existing date (1213) took over.

After much investigation it turned out the issue was caused by a server setting rather than any problem with my code.  The server has a setting that limits the number of variables that can be submitted from a form to 1000.  The batch edit was sending more variables than this, so only the first 1000 were getting through.  As the server cut off the extra input variables automatically and silently, my script was entirely unaware that there was any problem, hence the lack of visible errors.

I can’t change the server settings myself but I managed to get someone in IT Support to update it for me.  With the setting changed the form submitted, but unfortunately all it gave after submission was a blank page, so I had another issue to investigate.  It turned out to be an issue with the data: there were two citations in the batch that had no dateInfo tag.  When specifying a date the script expects to find an existing dateInfo tag that then gets replaced, and as it found no such tag the script quit with a fatal error.  I therefore updated the script so that it can deal with citations that have no existing dateInfo tag.  In such cases the script now inserts a new dateInfo element at the top of the <attestation> XML.  I also added a count of the number of new holding area entries the script generates so it’s easier to check whether any have somehow been lost during processing (which hopefully won’t happen now).

Whilst investigating this I also realised that when batch editing a date, any entry that has more than one citation being edited will end up with the same ID used for each <dateInfo> element.  An ID should be unique, and while this won’t really cause any issues when displaying the entries it might lead to errors or warnings in Oxygen.  I therefore updated the code to append the attestation ID to the supplied dateInfo ID when batch editing dates to ensure the uniqueness of each <dateInfo> ID.
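For illustration, here is a minimal Node.js sketch of the two fixes described above (inserting a <dateInfo> where none exists and appending the attestation ID to keep IDs unique), using the @xmldom/xmldom package.  The real DMS code will differ, and the attribute names beyond <attestation> and <dateInfo> are my assumptions.

// Sketch: apply a batch-edit date to every <attestation> in an entry's XML,
// inserting a <dateInfo> where none exists and keeping the IDs unique.
const { DOMParser, XMLSerializer } = require('@xmldom/xmldom');

function applyBatchDate(entryXml, newDateXml, dateInfoId) {
  const doc = new DOMParser().parseFromString(entryXml, 'text/xml');
  const attestations = doc.getElementsByTagName('attestation');

  for (let i = 0; i < attestations.length; i++) {
    const att = attestations[i];
    const newDate = new DOMParser().parseFromString(newDateXml, 'text/xml').documentElement;
    const imported = doc.importNode(newDate, true);

    // make the ID unique by appending the attestation's own ID (assumed attribute name)
    const attId = att.getAttribute('id') || String(i);
    imported.setAttribute('id', dateInfoId + '-' + attId);

    const existing = att.getElementsByTagName('dateInfo');
    if (existing.length > 0) {
      existing[0].parentNode.replaceChild(imported, existing[0]);  // replace the current date
    } else {
      att.insertBefore(imported, att.firstChild);                  // no date: insert at the top
    }
  }
  return new XMLSerializer().serializeToString(doc);
}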

With all of this in place the new feature was up and running and Geert was able to batch edit the citations for several source texts.  However, he sent me a panicked email on Saturday to say that after submitting an edit every single entry in the AND was displaying nothing other than the headword.  This was obviously a serious problem so I spent some time on Saturday investigating and fixing the issue.

The issue turned out to have nothing to do with my new system but was instead caused by one of the entry XML files that was updated through the citation editing system.  The entry in question was Assensement (https://anglo-norman.net/entry/assensement), which has an erroneous <label> element: <semantic value="=assentement?"/>.  This should not be a label, and attributes are not allowed to start with an equals sign.  I must have previously stripped out such errors from our list of labels, but when the entry was published the label was reintroduced.  The DTD dynamically pulls in the labels and these are then used when validating the XML, but as this list now included ‘=assentement?’ the DTD broke.  With the DTD broken the XSLT that transforms the entry XML into HTML wouldn’t run, meaning every single entry on the site failed to load.  Thankfully, after identifying the issue it was quick to fix: I simply deleted the erroneous label and things started working again, and Geert has updated the entry’s XML to remove the error.

For the Books and Borrowing project I had a Zoom call with project PI Katie and Co-I Matt on Monday to discuss the front-end developments and some of the outstanding tasks left to do.  The main one is to implement a genre classification system for books, and we now have a plan for how to deal with this.  Genres will be applied at work level and will then filter down to lower levels.  I also spent some time speaking to Stirling’s IT people about setting up a Solr instance for the project, as discussed in posts before Christmas.  Thankfully this was possible, and by the end of the week we had a Solr instance set up that I was able to query from a script on our server.  Next week I will begin to integrate Solr queries with the front-end that I’m working on.  I also generated spreadsheets containing all of the book edition and book work data that Matt had requested and engaged in email discussions with Matt and Katie about how we might automatically generate Book Work records from editions and amalgamate some of the many duplicate book edition records that Matt had discovered whilst looking through the data.

Also this week I made a small tweak to the Dictionaries of the Scots Language, replacing the ‘email’ option in the ‘Share’ icons with a different option as the original was no longer working.  I also had a chat with Jane Stuart-Smith about the website for the VARICS project; replied to a query from someone in Philosophy whose website was no longer working; replied to an email from someone who had read my posts about Solr and had some questions; and replied to Sara Pons-Sanz, the organiser of last week’s Zurich event, who was asking about the availability of some visualisations of the Historical Thesaurus data.  I was able to direct her to some visualisations I’d made a while back that we still haven’t made public (see https://digital-humanities.glasgow.ac.uk/2021-12-06/).

Next week I aim to focus on the development of the Books and Borrowing front-end and the integration of Solr into this.

Week Beginning 9th January 2023

I attended the workshop ‘The impact of multilingualism on the vocabulary and stylistics of Medieval English’ in Zurich this week.  The workshop ran on Tuesday and Wednesday and I travelled to Zurich with my colleagues Marc Alexander and Fraser Dallachy on Monday.  It was really great to travel to a workshop in a different country again as I’d not been abroad since before Lockdown.  I’d never been to Zurich before and it was a lovely city.  The workshop itself was great, with some very interesting papers and good opportunities to meet other researchers and discuss potential future projects.  I gave a paper on the Historical Thesaurus, its categories and data structures and how semantic web technologies may be used to more effectively structure, manage and share the Historical Thesaurus’s semantically arranged dataset.  It was a half-hour paper with 10 minutes for questions afterwards and it went pretty well.  The audience wasn’t especially technical and I’m not sure how interesting the topic was to most people, but it was well received and I’m glad I had the opportunity to both attend the event and to research the topic as I have greatly increased my knowledge of semantic web technologies such as RDF, graph databases and SPARQL, and as part of the research I managed to write a script that generated an RDF version of the complete HT category data, which may come in handy one day.

I got back home just before midnight on the Wednesday and returned to normal work first thing on Thursday.  This included submitting my expenses from the workshop and replying to a few emails that had come in regarding my office (it looks like the dry rot work is going to take a while to resolve and it also looks like I’ll have to share my temporary office) and attempting to set up web hosting for the VARICS project, which Arts IT Support seem reluctant to do.  I also looked into an issue with the DSL that Ann Ferguson had spotted and spoke to the IT people at Stirling about their current progress with setting up a Solr instance for the Books and Borrowing project.  I also replaced a selection of library register images with better versions for that project and arranged a meeting for next Monday with the project’s PI and Co-I to discuss progress with the front-end.

I spent most of Friday writing a Data Management Plan and attending a Zoom call for a new speech therapy project I’m involved with.  It’s an ESRC funding proposal involving Glasgow and Strathclyde and I’ll be managing the technical aspects.  We had a useful call and I managed to complete an initial version of the DMP that the PI is going to adapt if required.

Week Beginning 12th December 2022

There was a problem with the server on which a lot of our major sites such as the Historical Thesaurus and Seeing Speech are hosted that started on Friday and left all of the sites offline until Monday.  This was a really embarrassing and frustrating situation and I had to deal with lots of emails from users of the sites who were unable to access them.  As I don’t have command-line access to the servers all I could do was report the issue via our IT Helpdesk system.  Thankfully by mid-morning on Monday the sites were all back up again, but the incident raised serious issues about the state of Arts IT Support, who are massively understaffed at the moment.  Arts IT also refused to set up hosting for a project that we’re collaborating with Strathclyde University on, and in fact stated that they would not set up hosting for any further websites, which will have a massive negative impact on several projects that are still in the pipeline and ultimately means I will not be able to work on any new projects until this is resolved.  The PI for the new project with Strathclyde is Jane Stuart-Smith, and thankfully she was also not very happy with the situation.  We arranged a meeting with Liz Broe, who oversees Arts IT Support, to discuss the issues and had a good discussion about how we ended up in this state and how things will be resolved.  In the short-term some additional support is being drafted in from other colleges while new staff will be recruited in the medium term, and Liz has stated that hosting for new websites (including the Strathclyde one) will continue to be offered, which is quite a relief.

I also discovered this week that there has been a leak in 13 University Gardens and water has been pouring through my office.  I was already scheduled to be moved out of the building due to the dry rot that they’ve found all the way up the back wall (which my office is on) but this has made things a little more urgent.  I’m still generally working from home every day except Tuesday and apparently all my stuff has been moved to a different building, so I’ll just need to see how the process has gone when I’m back in the University next week.

In terms of actual work this week, I spent a bit more time writing my paper about the Historical Thesaurus and Semantic Web technologies for the workshop in January.  This is coming together now, although I still need to shape it into a presentation, which will take time.  I also spent some time working on the Speech Star project, updating the speech error database to fix a number of issues with the data that Eleanor had spotted and then adding in new error type descriptions for new error types that had been included.  I also added in some ancillary page content and had a chat with Eleanor about the database system the website uses.

I also spent some time working for the DSL this week.  Rhona had noted that when you perform a full text or quotation search (i.e. a search using Solr) with wildcards (e.g. chr*mas) the search results display entries with snippets that highlight the whole word where the search string occurred (e.g. ‘Christmas’).  However, when clicking through to the entry page such highlighting was not appearing, even though highlighting in the entry page does work when performing a search without wildcards.

Highlighting in the entry page was handled by a jQuery plugin, but this was not written to take wildcards into consideration and only works on full words.  I spent some time trying to figure out how to get wildcard highlighting working myself using regular expressions, but I find regular expressions pretty awful to work with – an ancient relic from the early days of computing – and although I managed to get something working it wasn’t ideal.  Thankfully I found an existing JavaScript library called mark.js (https://markjs.io/) that can handle wildcard highlighting and I was able to replace the existing plugin with this library and update the code to work with it.  I tested this out on our test DSL site and all seems to work well.  I haven’t updated the live site yet, though, as the DSL team need to test the new approach out more fully in case they encounter any problems with it.  I also noticed that there was an issue with the quotation search whereby if you returned to the search results from an entry by clicking on the ‘return to results’ button you got an empty page.  I fixed this in both our live and test sites.
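For reference, this is roughly the kind of call mark.js supports for wildcard terms; the container selector and class name here are my assumptions rather than the DSL site’s actual code.

// Sketch: highlight a wildcard search term (e.g. 'chr*mas') within an entry
// using mark.js. '.entry-content' and 'search-highlight' are hypothetical.
const instance = new Mark(document.querySelector('.entry-content'));

instance.unmark({
  done: function () {
    instance.mark('chr*mas', {
      wildcards: 'enabled',       // treat * and ? in the term as wildcards
      separateWordSearch: false,  // match the term as a whole rather than word by word
      className: 'search-highlight'
    });
  }
});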

I also spent some time working for the Anglo-Norman Dictionary this week.  I updated the citation search on the public website.  Previously the citation text was only added into the search results if you also searched for a specific form within a siglum, for example https://anglo-norman.net/search/citation/%22tout%22/null/A-N_Falconry, and other citation searches (e.g. just selecting a siglum and / or a siglum date) would only return the entries the siglum appeared in without the individual citations.  Now the citations appear in these searches too.  For example, all citations from A-N Falconry: https://anglo-norman.net/search/citation/null/null/A-N_Falconry and all citations where the citation date is 1400: https://anglo-norman.net/search/citation/null/1400.  This also means that when you view the citations by pressing on the ‘Search AND Citations’ button for a siglum in the bibliography you now see each citation for the listed entries.

I then spent most of a day thinking through all of the issues relating to the new ‘DMS citation search and edit’ feature that the editor wants me to implement and wrote an initial document detailing how the feature will work.  There has been quite a lot to think through and I thought it wise to document the feature rather than just launching into its creation without a clear plan.  I might have some time to start work on this next week as I’m working up to and including Thursday, but it depends how I get on with some other tasks I need to do for other projects.

Also this week I attended the Christmas lunch for the Books and Borrowing project in Edinburgh.  Unfortunately there was a train strike that day so I decided to get the bus through to Edinburgh.  The journey there was fine, taking about an hour and a half, but I got the 4pm bus on the way back and it was a nightmare, taking two hours and forty minutes.  I won’t be getting the bus between Glasgow and Edinburgh anywhere near rush hour ever again.

Week Beginning 21st November 2022

I participated in the UCU strike action on Thursday and Friday this week, so it was a three-day week for me.  I spent much of this time researching RDF, SKOS and OWL semantic web technologies in an attempt to understand them better in the hope that they might be of some use for future thesaurus projects.  I’ll be giving a paper about this at a workshop in January so I’m not going to say too much about my investigations here.  There is a lot to learn about, however, and I can see me spending quite a lot more time on this in the coming weeks.

Other than this I returned to working on the Anglo-Norman Dictionary.  I added in a line of text that had somehow been omitted from one of the Textbase XML files and added in a facility to enable project staff to delete an entry from the dictionary.  In reality this just deactivates the entry, removing it from the front-end but still keeping a record of it in the database in case the entry needs to be reinstated.  I also spoke to the editor about some proposed changes to the dictionary management system and began to think about how these new features will function and how they will be developed.

For the Books and Borrowing project I had a chat with IT Services at Stirling about setting up an Apache Solr system for the project.  It’s looking like we will be able to proceed with this option, which will be great.  I also had a chat with Jennifer Smith about the new Speak For Yersel project areas.  It looks like I’ll be creating the new resources around February next year.  I also fixed an issue with the Place-names of Iona data export tool and discussed a new label that will be applied to data for the ‘About’ box for entries in the Dictionaries of the Scots Language.

I also prepared for next week’s interview panel and engaged in a discussion with IT Services about the future of the servers that are hosted in the College of Arts.

Week Beginning 14th November 2022

I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.

I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records.  This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow.  Facetted searching would still require several other database queries to be executed, as would extracting all of the fields that would be necessary to display the search results and it seemed inadvisable to try and create all of this functionality myself when an existing package like Solr could already do it all.

Solr is considerably faster than using the database approach and its querying is much more flexible.  It also offers facetted search options that are returned pretty much instantaneously which would be hopelessly slow if I attempted to create something comparable directly with the database.  For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:

"London",2211
"Edinburgh",119
"Dublin",100
"Paris",30
"Edinburgh; London",16
"Cambridge",4
"Eton",3
"Oxford",3
"The Hague",3
"Naples",2
"Rome",2
"Berlin",1
"Glasgow",1
"Lausanne",1
"Venice",1
"York",1

 

Format:

"8vo",842
"octavo",577
"4to",448
"quarto",433
"4to.",88
"8vo., plates, port., maps.",88
"folio",76
"duodecimo",67
"Folio",33
"12mo",19
"8vo.",17
"8vo., plates: maps.",16

 

Borrower gender:

"Male",3128
"Unknown",109
"Female",64
"Unclear",2

These would then allow me to build in the options to refine the search results further by one (or more) of the above criteria.  Although it would be possible to build such a query mechanism myself using the database it is likely that such an approach would be much slower and would take me time to develop.  It seems much more sensible to use an existing solution if this is going to be possible.
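To illustrate the kind of request involved, here is a minimal sketch of a facetted Solr query using fetch.  The server URL and core name ‘bnb’ are hypothetical, and the field names (standardisedtitle, pubplaces, formats, bgenders) are the ones used in the simplified record structure shown later in this post.

// Sketch: a facetted Solr query for borrowings whose standardised title contains
// 'Roman', with counts broken down by publication place, format and borrower gender.
const params = new URLSearchParams({
  q: 'standardisedtitle:Roman',
  rows: '0',              // only the counts are wanted here
  facet: 'true',
  'facet.mincount': '1'
});
['pubplaces', 'formats', 'bgenders'].forEach(f => params.append('facet.field', f));

fetch('http://localhost:8983/solr/bnb/select?' + params.toString())
  .then(res => res.json())
  .then(data => {
    console.log(data.response.numFound);          // total matching records
    console.log(data.facet_counts.facet_fields);  // value/count pairs for each facet field
  });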

In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page.  This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes and included a lot of data that’s not really needed either, as the returned data contains everything required to display the full borrowing record.  I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could be used in a simplified search results page.  Here’s an example:

{
  "bnid": 1379,
  "lid": 6,
  "slug": "glasgow-university",
  "lname": "Glasgow University Library",
  "rid": 2,
  "rname": "3",
  "syear": 1760,
  "eyear": 1765,
  "rtype": "Student",
  "pid": 107,
  "fnum": "4r",
  "transcription": "Euseb: Eclesiastical History",
  "bday": 17,
  "bmonth": 9,
  "byear": 1760,
  "rday": 1,
  "rmonth": 10,
  "ryear": 1760,
  "borrowed": "1760-09-17",
  "returned": "1760-10-01",
  "bdayofweek": "Wednesday",
  "rdayofweek": "Wednesday",
  "originaltitle": "",
  "standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius.",
  "brids": ["1"],
  "bfnames": ["Charles"],
  "bsnames": ["Wilson"],
  "bfullnames": ["Charles Wilson"],
  "boccs": ["University Student", "Education"],
  "bgenders": ["Male"],
  "aids": ["74"],
  "asnames": ["Eusebius of Caesarea"],
  "afullnames": [" Eusebius of Caesarea"],
  "beids": ["88"],
  "edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned  historiographers, Eusebius, Socrates, and Evagrius."],
  "estcs": ["R21513"],
  "langs": ["English"],
  "pubplaces": ["London"],
  "formats": ["folio"]
}

I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records).  I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server.  I then created a Solr Core for the data and specified an appropriate schema.  This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations, some fields are text strings, some are integers, some are dates). I then ran the Solr script that ingests the data.
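For illustration, here is a minimal sketch of how a few of those fields could be declared through Solr’s Schema API.  The core name ‘bnb’ and the choice of field types are my assumptions; the field names come from the JSON structure above.

// Sketch: declaring a few fields via Solr's Schema API.
const fields = [
  { name: 'bnid', type: 'pint' },                        // borrowing ID
  { name: 'borrowed', type: 'pdate' },                   // full borrowed date
  { name: 'ryear', type: 'pint' },                       // returned year only
  { name: 'transcription', type: 'text_general' },       // free-text, tokenised
  { name: 'boccs', type: 'string', multiValued: true },  // borrower occupations
  { name: 'pubplaces', type: 'string', multiValued: true }
];

fetch('http://localhost:8983/solr/bnb/schema', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ 'add-field': fields })
})
  .then(res => res.json())
  .then(console.log);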

It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data.  I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected.  These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out.  As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.

Some issues encountered include:

  1. Empty fields in the data resulting in no data for the corresponding JSON field (e.g. “bday”: <nothing here> ) which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included.
  2. Solr’s date structure requiring a full date (e.g. 1792-02-16) and partial dates (e.g. 1792) therefore failing. I ended up reverting to an integer field for returned dates as these are generally much more vague and having to generate placeholder days and months where required for the borrowed date.
  3. Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed and it took some time to investigate why my nice integer ID was failing.
  4. Realising more fields should be added to the JSON output as I went on and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors )
  5. Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quote;’ as their appearance in the JSON caused the structure to be invalid.  I therefore updated the translation, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names.
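On the double-quote issue in point 5: a JSON serializer handles that escaping automatically, so an export along the lines of the following sketch would sidestep the need for entity conversion.  This is a minimal Node.js sketch, not the project’s actual export script, and the output path and record-building are invented for illustration.

// Sketch: writing one JSON file per borrowing record and letting JSON.stringify
// escape any embedded double quotes. Assumes a 'solr-export' directory exists.
const fs = require('fs');

function writeBorrowingRecord(record) {
  const json = JSON.stringify(record, null, 2);  // quotes inside values become \"
  fs.writeFileSync(`solr-export/bnid-${record.bnid}.json`, json, 'utf8');
}

writeBorrowingRecord({
  bnid: 1379,
  transcription: 'Euseb: Eclesiastical History',
  bsnames: ['Wilson'],
  originaltitle: 'A "quoted" title'  // safely escaped in the output file
});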

However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well.  And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.

The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling.  We would need the latest release of Solr (v9 https://solr.apache.org/downloads.html) to be installed on a server.  This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html).  Solr uses the Apache Lucene search library and as far as I know it fires up a Java based server called Jetty when it runs.  The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html

When Solr runs a web-based admin interface is available through which the system can be managed and the data can be queried.  This would need securing, and instructions about doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html

I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users.  Other than for testing purposes there should only be one script that connects to the Solr URL (our API) so we could limit access to the IP address of this server,  or if Solr is going to be installed on the same server then limiting access to localhost could work.

In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud).  Once Solr is running we’d need a Core to be created.  I have the schema file the core would require and can give instructions about setting this up.  I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.

One downside to using Solr is it is a separate system to the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process.  We won’t want to do this too frequently as exporting the data takes at least an hour, then transferring the files to the server for ingest will take a long time (uploading hundreds of thousands of small files to a server can take hours.  Zipping them up then uploading the zip file and extracting the file also takes a long time).  Then someone with command-line access to the server will need to run the command to ingest the data.  We’ll need to see if Stirling are prepared to do this for us.

Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B.  I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.

Other than the above B&B work I did spend a bit of time on other projects.  I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site.  I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page.  I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week.  I also began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.

Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.

Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when an advanced ‘quotes only’ search is performed.  So for example in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):

<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.
Sic dreich wark. . . . For lang I tholed an’ fendit.
Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.
And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.
It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.
Driche and sair yer pain.
And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.
See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.
</field>

The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize variable, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”.  We don’t currently override this default so 100 characters is what we use per snippet (roughly – it can extend more than this to ensure complete words are displayed).

The hl.snippets variable specifies the maximum number of highlighted snippets that are returned per entry and this is currently set to 10.  If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed and this is because the maximum number of snippets has been reached.  ‘Dreich’ actually occurs many more than 10 times in this entry.  We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
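For reference, here is a minimal sketch of the kind of Solr request involved, showing where hl.fragsize and hl.snippets fit in.  The core name ‘dsl’ and server URL are hypothetical; the field name is the one shown in the block above.

// Sketch: a quotes-only search with highlighting.
const params = new URLSearchParams({
  q: 'searchtext_onlyquotes:dreich',
  hl: 'true',
  'hl.fl': 'searchtext_onlyquotes',  // field(s) to generate snippets from
  'hl.fragsize': '100',              // approximate snippet size in characters (the default)
  'hl.snippets': '10'                // maximum number of snippets returned per entry
});

fetch('http://localhost:8983/solr/dsl/select?' + params.toString())
  .then(res => res.json())
  .then(data => console.log(data.highlighting));  // snippets keyed by document id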

As the quotations block of text is just one massive block and isn’t split into individual quotations the snippets don’t respect the boundaries between quotations.  So the first snippet for ‘Dreich Adj’ is:

“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”

Which actually comprises the text from almost the entire first two quotes, while the next snippet:

“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”

 

Includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but ‘dreigher’ which is not highlighted).

So essentially while the snippets may look like they correspond to individual quotes this is absolutely not the case and the highlighted word is generally positioned around the middle of around 100 characters of text that can include several quotations.  It also means that it is not possible to limit a search to two terms that appear within one single quotation at the moment because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.

I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over.  However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.

We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on such as part of speech.  This will ensure that snippets will in future only feature text from the quotation in question and that Boolean searches will be limited to text within individual quotations.  However, it is a major change and it will require some time and experimentation to get working correctly, and it may introduce other unforeseen issues.

I will need to change the way the search data is stored in Solr and I will need to change how the data is generated for ingest into Solr.  The display of the search results will need to be reworked as the search will now be based around quotations rather than entries.  I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use.  It is also likely that the current ranking of results will no longer work as individual quotations will be returned rather than entire entries.  The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry.  I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.
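To give a flavour of what this restructuring might involve, here is a rough sketch of an individual quotation document and an entry-grouped query using Solr’s result grouping.  Every field name and value here is hypothetical, as none of this has been designed yet.

// Sketch: one Solr document per quotation (field names and values invented).
const quotationDoc = {
  id: 'snd-dreich-00001',
  entryid: 'snd/dreich',
  headword: 'Dreich',
  quotetext: 'I think you will say yourself it is a dreich business.',
  quoteyear: 1900,  // illustrative value only
  pos: 'adj'
};

// Grouping results by entry so quotations from the same entry stay together.
const params = new URLSearchParams({
  q: 'quotetext:dreich',
  group: 'true',
  'group.field': 'entryid',
  'group.limit': '10'  // quotations returned per entry
});
console.log('/solr/dsl-quotes/select?' + params.toString());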

The DSL team has proposed that a date search could be provided as a filter on the search results page and we would certainly be able to do this, and incorporate other filters such as POS in future.  This is something called ‘facetted searching’ and it’s the kind of thing you see in online shops:  you view the search results then you see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results the filter applies to.  The good news is that Solr has these kind of faceting options built in (in fact it is used to power many online shops).  More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.

Week Beginning 7th November 2022

I participated in an event about Digital Humanities in the College of Arts that Luca had organised on Monday, at which I discussed the Books and Borrowing project.  It was a good event and I hope there will be more like it in future.  Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/) which is an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project.  I will be very interested to hear how Curious Travellers gets on with the tool in future.  Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before that allows you to upload a text and then access many analysis and visualisation tools.  It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.

Also this week I had to prepare for and participate in a candidate shortlisting session for a new systems developer post in the College of Arts, and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host.  We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily.  As yet we’ve still not heard anything further from IT Services, so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of and whether they concern the code or the underlying server infrastructure.  Hopefully we’ll hear more soon.

The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project.  One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that.  Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.

I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box.  This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page.  The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover.  For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records.  Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.

The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page.  The only difference is additional information about the library, register and page the borrowing record appears on can be found at the top of the record.  These appear as links and if you press on the page link this will open the page centred on the selected borrowing record.  For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:

The non-date search also works, but is currently a bit too slow.  For example a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long.  Currently non-date quick searches do a very simple find and replace to highlight the matched text in all relevant fields.  This currently makes the matched text upper case, but I don’t intend to leave it like this.  You can also search for things like ESTC numbers.

However, there are several things I’m not especially happy about:

  1. The speed issue: the current approach is just too slow
  2. Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed, by borrower name)  many more fields in addition to borrowing ID would need to be returned, potentially for thousands of records and this is going to be too slow with the current data structure
  3. The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s a bit hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this.  Should we have a more compact view of results?  If so what data should be displayed?  The difficulty is if we omit a field that is the only field that includes the user’s search term it’s potentially going to be very confusing
  4. This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores:  You see the search results and then there are checkboxes that allow you to narrow down the results.  For example, we could have checkboxes containing all occupations in the results allowing the user to select one or more.  Or we have checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
  5. Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing the occupations or genders of the borrowers in the search results etc.  I feel that we need some sort of summary information as the results themselves are just too detailed to easily get an overall picture of.

I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon) and it does a lot of the things I’d like to implement (graphs, facetted search results), all very speedily and with a pleasing interface, and I think we could learn a lot from this.

Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/), which is a free search platform that is much faster than a traditional relational database and provides options for facetted searching.  We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it.  Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data.  But if we are potentially going to use Solr then we would need to install it on a server at Stirling.  Stirling’s IT people might be ok with this (they did allow us to set up an IIIF server for our images, after all) but we’d need to check.  I should have a better idea as to whether Solr is what we need by the end of next week, all being well.

Also this week I spent some time working on the Speech Star project.  I updated the database to highlight key segments in the ‘target’ field; these had been highlighted in the original spreadsheet version of the data and are now marked by surrounding each segment with bar characters.  I’d suggested this approach as when exporting data from Excel to a CSV file all Excel formatting such as bold text is lost, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field.  This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier, but I got there in the end.  After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:
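As an aside, here is a minimal sketch of the kind of splitting involved, assuming the bars always arrive in pairs around each key segment; the class name is hypothetical.

// Sketch: wrap each bar-delimited key segment in a styled span. After splitting
// on '|' the odd-numbered chunks are the key segments.
function markKeySegments(target) {
  return target
    .split('|')
    .map((chunk, i) => (i % 2 === 1 ? `<span class="key-segment">${chunk}</span>` : chunk))
    .join('');
}

console.log(markKeySegments('|sh|ip'));
// -> '<span class="key-segment">sh</span>ip'
console.log(markKeySegments('c|a|tt|l|e'));
// -> 'c<span class="key-segment">a</span>tt<span class="key-segment">l</span>e'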

I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.

Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.

Week Beginning 31st October 2022

I spent a lot of the week continuing to work on the Books and Borrowing front end.  To begin with I worked on the ‘borrowers’ tab in the ‘library’ page and created an initial version of it.  Here’s an example of how it looks:

As with books, the page lists borrowers alphabetically, in this case by borrower surname.  Letter tabs and counts of the number of borrowers with surnames beginning with the letter appear at the top and you can select a letter to view all borrowers with surnames beginning with the letter.  I had to create a couple of new fields in the borrower table to speed the querying up, saving the initial letter of each borrower’s surname and a count of their borrowings.

The display of borrowers is similar to the display of books, with each borrower given a box that you can press on to highlight.  Borrower ID appears in the top right and each borrower’s full name appears as a green title.  The name is listed as it would be read, but this could be updated if required.  I’m not sure where the ‘other title’ field would go if we did this, though – presumably something like ‘Macdonald, Mr Archibald of Sanda’.

The full information about a borrower is listed in the box, including additional fields and normalised occupations.  Cross references to other borrowers also appear.  As with the ‘Books’ tab, much of this data will be linked to search results once I’ve created the search options (e.g. press on an occupation to view all borrowers with this occupation, press on the number of borrowings to view the borrowings) but this is not in place yet.  You can also change the view from ‘surname’ to ‘top 100 borrowers’, which lists the top 100 most prolific borrowers (or less if there are less than 100 borrowers in the library).  As with the book tab, a number appears at the top left of each record to show the borrower’s place on the ‘hitlist’ and the number of borrowings is highlighted in red to make it easier to spot.

I also fixed some issues with the book and author caches that were being caused by spaces at the start of fields and author surnames beginning with a non-capitalised letter (e.g. ‘von’) which was messing things up as the cache generation script was previously only matching upper case, meaning ‘v’ wasn’t getting added to ‘V’.  I’ve regenerated the cache to fix this.

I then decided to move onto the search rather than the ‘Facts & figures’ tab as I reckoned this should be prioritised.  I began work on the quick search initially, and I’m still very much in the middle of this.  The quick search has to search an awful lot of fields and to do this several different queries need to be run.  I’ll need to see how this works in terms of performance as I fear the ‘quick’ search risks being better named the ‘slow’ search.

We’ve stated that users will be able to search for dates in the quick search and these need to be handled differently.  For now the API checks to see whether the passed search string is a date by running a pattern match on the string.  This converts all numbers in the string into an ‘X’ character and then checks to see whether the resulting string matches a valid date form.  For the API I’m using a bar character (|) to designate a ranged date and a dash to designate a division between day, month and year.  I can’t use a slash (/) as the search string is passed in the URL and slashes have meaning in URLs.  For info, here are the valid date string patterns:

"XXXX", "XXXX-XX", "XXXX-XX-XX", "XXXX|XXXX", "XXXX|XXXX-XX", "XXXX|XXXX-XX-XX", "XXXX-XX|XXXX", "XXXX-XX|XXXX-XX", "XXXX-XX|XXXX-XX-XX", "XXXX-XX-XX|XXXX", "XXXX-XX-XX|XXXX-XX", "XXXX-XX-XX|XXXX-XX-XX"

So for example, if someone searches for ‘1752’ or ‘1752-03’ or ‘1752-02|1755-07-22’ the system will recognise these as a date search and process them accordingly.  I should point out that I can and probably will get people to enter dates in a more typical way in the front-end, using slashes between day, month and year and a dash between ranged dates (e.g. ‘1752/02-1755/07/22’) but I’ll convert these before passing the string to the API in the URL.
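A minimal sketch of this pattern-matching approach (the function names are mine, not the project’s):

// Sketch: recognise quick-search date strings via the digit-to-X pattern check.
const VALID_PATTERNS = [
  'XXXX', 'XXXX-XX', 'XXXX-XX-XX',
  'XXXX|XXXX', 'XXXX|XXXX-XX', 'XXXX|XXXX-XX-XX',
  'XXXX-XX|XXXX', 'XXXX-XX|XXXX-XX', 'XXXX-XX|XXXX-XX-XX',
  'XXXX-XX-XX|XXXX', 'XXXX-XX-XX|XXXX-XX', 'XXXX-XX-XX|XXXX-XX-XX'
];

// front-end form '1752/02-1755/07/22' -> API form '1752-02|1755-07-22'
function toApiDate(input) {
  return input.replace('-', '|').replace(/\//g, '-');
}

function isDateSearch(apiDate) {
  return VALID_PATTERNS.includes(apiDate.replace(/[0-9]/g, 'X'));
}

console.log(isDateSearch(toApiDate('1752/02-1755/07/22')));  // true
console.log(isDateSearch('Xenophon'));                       // false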

I have the query running to search the dates, and this in itself was a bit complicated to generate as including a month or a month and a day in a ranged query changes the way the query needs to work.  E.g. if the user searches for ‘1752-1755’ then we need to return all borrowing records with a borrowed year of ‘1752’ or later and ‘1755’ or earlier.  However, if the query is ‘1752/06-1755/03’ then the query can’t simply be ‘all borrowing records with a borrowed year of ‘1752’ or later and a borrowed month of ‘06’ or later and a borrowed year of ‘1755’ or earlier and a borrowed month of ‘03’ or earlier’, as this would return no results – it would be looking for borrowings with a borrowed month that is both ‘06’ or later and ‘03’ or earlier.  Instead the query needs to find borrowing records that either have a borrowed year of 1752 and a borrowed month of ‘06’ or later, or have a borrowed year later than 1752, and that either have a borrowed year of 1755 and a borrowed month of ‘03’ or earlier, or have a borrowed year earlier than 1755.
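The year/month logic can be boiled down to a small predicate.  Here is a sketch assuming the borrowed year and month are available as separate integer fields (the project’s actual query runs against the database, but the boolean structure is the same):

// Sketch: does a borrowing with year y and month m fall within 1752/06-1755/03?
function inRange(y, m, startY, startM, endY, endM) {
  const afterStart = y > startY || (y === startY && m >= startM);
  const beforeEnd  = y < endY   || (y === endY   && m <= endM);
  return afterStart && beforeEnd;
}

console.log(inRange(1752, 6, 1752, 6, 1755, 3));  // true  (start of the range)
console.log(inRange(1753, 1, 1752, 6, 1755, 3));  // true  (whole year lies inside the range)
console.log(inRange(1755, 4, 1752, 6, 1755, 3));  // false (month falls after the range ends)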

I also have the queries running that search for all necessary fields that aren’t dates.  This currently requires five separate queries to be run to check fields like author names, borrower occupations, book edition fields such as ESTC etc.  The queries currently return a list of borrowing IDs, and this is as far as I’ve got.  I’m wondering now whether I should create a cached table for the non-date data queried by the quick search, consisting of a field for the borrowing ID and a field for the term that needs to be searched, with each borrowing having many rows depending on the number of terms they have (e.g. a row for each occupation of every borrower associated with the borrowing, a row for each author surname, a row for each forename, a row for ESTC).  This should make things much speedier to search, but will take some time to generate.  I’ll continue to investigate this next week.

Also this week I updated the structure of the Speech Star database to enable each prompt to have multiple sounds etc.  I had to update the non-disordered page and the child error page to work with the new structure, but it seems to be working.  I also had to update the ‘By word’ view as previously sound, articulation and position were listed underneath the word and above the table.  As these fields may now be different for each record in the table I’ve removed the list and have instead added the data as columns to the table.  This does however mean that the table contains a lot of identical data for many of the rows now.

I then added in tooltip / help text containing information about what the error types mean in the child speech error database.  On the ‘By Error Type’ page the descriptions currently appear as small text to the right of the error type title.  On the ‘By Word’ page the error type column has an ‘i’ icon after the error type.  Hovering over or pressing on this displays a tooltip with the error description, as you can see in the following screenshot:

I also updated the layout of the video popups to split the metadata across two columns and also changed the order of the errors on the ‘By error type’ page so that the /r/ errors appear in the correct alphabetical order for ‘r’ rather than appearing first due to the text beginning with a slash.  With this all in place I then replicated the changes on the version of the site that is going to be available via the Seeing Speech URL.

Kirsteen McCue contacted me last week to ask for advice on a British Academy proposal she’s putting together and after asking some questions about the project I wrote a bit of text about the data and its management for her.  I also sorted out my flights and accommodation for the workshop I’m attending in Zurich in January and did a little bit of preparation for a session on Digital Humanities that Luca Guariento has organised for next Monday.  I’ll be discussing a couple of projects at this event.  I also exported all of the Speak For Yersel survey data and sent this to a researcher who is going to do some work with the data and fixed an issue that had cropped up with the Place-names of Kirkcudbright website.  I also spent a bit of time on DSL duties this week, helping with some user account access issues and discussing how links will be added from entries to the lengthy essay on the Scots Language that we host on the site.


Week Beginning 24th October 2022

I returned to work this week after having a lovely week’s holiday in the Lake District.  I spent most of the week working for the Books and Borrowing project.  I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system.  This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS.  Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’.  Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS.  For example, in Volume 1 we have ‘1010199l’ followed by ‘1010203l’ followed by ‘1010205l’ and then ‘1010205r’.  This appears to be correct, as the missing pages don’t contain borrowing records.  However, I still needed to figure out how to match up images and page records.  As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image.  There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image.  This left joining up the page records in the CMS as the best option, so I wrote a script that joins the page records, renames them to remove the ‘L’ and ‘R’ suffixes, moves all borrowing records across, renumbers their page order and then deletes the now empty pages.  Thankfully it all seemed to work well.  I also uploaded the images for the final register from the Royal High School, which thankfully was a much more straightforward process as all image files matched references already stored in the CMS for each page.
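
In outline the joining script does something like the following.  This is a simplified sketch only – the real CMS tables and fields differ, and the ‘pages’ and ‘borrowings’ names are stand-ins:

```python
import sqlite3


def join_double_pages(conn: sqlite3.Connection, register_id: int) -> None:
    """Merge 'L'/'R' page pairs into a single double-page record (sketch)."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, label FROM pages WHERE register_id = ? ORDER BY page_order",
        (register_id,))
    pages = cur.fetchall()

    for left_id, label in pages:
        if not label.endswith("l"):
            continue
        base = label[:-1]
        # Find the matching 'R' page, if one exists.
        cur.execute(
            "SELECT id FROM pages WHERE register_id = ? AND label = ?",
            (register_id, base + "r"))
        right = cur.fetchone()

        # Rename the left page so it represents the whole double-page image.
        cur.execute("UPDATE pages SET label = ? WHERE id = ?", (base, left_id))

        if right:
            right_id = right[0]
            # Move the right page's borrowings across, renumbering them so
            # they follow on from the left page's records.
            cur.execute(
                "SELECT COALESCE(MAX(page_order), 0) FROM borrowings WHERE page_id = ?",
                (left_id,))
            offset = cur.fetchone()[0]
            cur.execute(
                "UPDATE borrowings SET page_id = ?, page_order = page_order + ? "
                "WHERE page_id = ?", (left_id, offset, right_id))
            # Delete the now-empty right page.
            cur.execute("DELETE FROM pages WHERE id = ?", (right_id,))
    conn.commit()
```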

I then returned to the development of the front-end for the project.  When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load.  After a bit of investigation I realised that this was caused by new line characters appearing in the JSON data for the map, which was invalidating the file structure.  These had been added in via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map.  I needed to update the script that generated the JSON data to ensure that new line characters were stripped out of the data, and after that the maps loaded again.
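
The fix itself is trivial – strip the newline characters out of the offending field before the JSON is generated.  A minimal illustration (the field names are invented, and a proper JSON encoder would escape the characters anyway, but stripping them at source keeps the popup text tidy either way):

```python
import json


def clean_text(value):
    """Remove newlines / carriage returns and collapse whitespace."""
    if value is None:
        return None
    return " ".join(value.replace("\r", " ").replace("\n", " ").split())


# Hypothetical library record; only the cleaning step matters here.
library = {
    "name": "Example Library",
    "name_variants": "Example Lib.\nExempell Librarie\r\nEx. Library",
}
library["name_variants"] = clean_text(library["name_variants"])
print(json.dumps(library))
```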

Before I went on holiday I’d created a browse page for library books that split the library’s books up based on the initial letter of their titles.  The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue due to it containing many more books than the other libraries (more than 8,500).  Project Co-I Matt Sangster suggested that we should omit some registers from the data as their contents (including book records) are not likely to be worked on during the course of the project.  However, I decided to just leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code.  The book data for a library is associated directly with the library record and not the specific registers and all the queries would need to be rewritten to check which registers a book appears in.  I reckon that if these registers are not going to be tackled by the project it might be better to just delete them, not just to make the system work better but to avoid confusing users with messy data, but I decided to leave everything as it is for now.

This week I added in two further ways of browsing books in a selected library:  By author and by most borrowed.  A drop-down list featuring the three browse options appears at the top of the ‘Books’ page now, and I’ve added in a title and explanatory paragraph about the list type. The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a list of initial letter tabs featuring the initial letter of the author’s surname and a count of the number of books that have an author with a surname beginning with this letter.  Note that any books that don’t have an associated author do not appear in this list.  I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load in.  Note also that if a book has multiple authors then the book will appear multiple times – once for each author.  Here’s a screenshot of how the interface currently looks:

The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books.  The ordering of the records is by author surname then forename then author ID then book title.  This means two authors with the same name will still appear as separate headings with their books ordered alphabetically.  However, this has also uncovered some issues with duplicate author records.

Getting this browse list working actually took a huge amount of effort due to the complex way we store authors.  In our system an author can be associated with any one of four levels of book record (work / edition / holding / item) and an author associated at a higher level needs to cascade down to lower-level book records.  Running queries directly on this structure proved to be too resource-intensive and slow, so instead I wrote a script to generate cached data about authors.  This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record.  It then stores a reference to the ID of the author, the ID of the holding record and the initial letter of the author’s surname in a new table that is much more efficient to reference.  This then gets used to generate the letter tabs with their book counts and to work out which books to return when an author surname beginning with a particular letter is selected.
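
In outline the cache generation works something like this.  The table names, and the helper query that resolves each author association down to its holding record, are stand-ins rather than the actual schema:

```python
import sqlite3


def rebuild_author_cache(conn: sqlite3.Connection) -> None:
    """Regenerate the (holding, author, surname letter) cache table (sketch)."""
    cur = conn.cursor()
    cur.execute("DELETE FROM author_cache")
    # 'resolved_author_links' stands in for whatever query walks the
    # work / edition / holding / item associations and resolves each one
    # down to the holding record it applies to.
    cur.execute("""
        INSERT INTO author_cache (holding_id, author_id, surname_letter)
        SELECT DISTINCT l.holding_id, a.author_id, UPPER(SUBSTR(a.surname, 1, 1))
        FROM resolved_author_links l
        JOIN authors a ON a.author_id = l.author_id
    """)
    conn.commit()

# The letter tabs are then a cheap aggregate over the cache, e.g.:
#   SELECT surname_letter, COUNT(DISTINCT holding_id)
#   FROM author_cache GROUP BY surname_letter;
```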

However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes or additions to authors made in the CMS will not be directly reflected in the library books tab.  This is also true of the ‘browse books by title’ lists I previously created.  I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised the title must have been changed in the CMS since I last generated the cached data.

The ‘most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed.  Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view.  I did consider whether to have tabs allowing you to view all of the books by number of borrowings, but I wasn’t really sure how useful this would be.  In terms of the display of the ‘top 100’ the books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see.  I’ve also added in a number to the top-left of the book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:
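
With the cached borrowing count in place the ‘most borrowed’ list boils down to a single ordered query, something like the following (field and table names assumed for illustration):

```python
# Illustrative only – not the project's actual query.
TOP_BORROWED_SQL = """
    SELECT holding_id, standardised_title, original_title, total_borrowings
    FROM book_holdings
    WHERE library_id = ?
    ORDER BY total_borrowings DESC
    LIMIT 100
"""
```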

I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot).  Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.

Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023.  As it’s the first ‘in person’ DH conference to be held in Europe since 2019 I suspect there will be a huge number of paper submissions, so we’ll just need to see if it gets accepted or not.  Also for Speak For Yersel I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas.  The biggest issue here would be generating the data about the areas:  settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files.  It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region.  This might be a couple of weeks of effort for each area.  It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface.  We’ll see how this develops.  I also wrote a script to export the survey data for further analysis.

Another project I spent some time on this week was Speech Star.  For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me.  I imported all of the data into the same database as is used for the non-disordered speech database and added a flag that decides which content is displayed on which page.  I removed ‘accent’ as a filter option (as all speakers are from the same area) and added in ‘error type’.  Currently the ‘age’ filter defaults to the age group 0-17, as I wasn’t sure how this filter should work given that all speakers are children.

The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab.  In the new page these tabs are for ‘Error type’ and ‘Word’.  I also added in ‘phonemic target’ and ‘phonetic production’ as new columns in the table, as I thought it would be useful to include these.  I also updated the video pop-up for both the new page and the non-disordered page to bring it into line with the popup for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below.  I’ve ensured this is exactly the same for the ‘multiple video’ display too.  At the moment the metadata all just appears on one long line (other than speaker ID, sex and age) so the full width of the popup is used, but we might change this to a two-column layout.

Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet, so I replaced the data.  However, in doing so I spotted something relating to the way I structure the data that might be an issue.  I’d noticed a typo in the earlier spreadsheet (there was a ‘helicopter’ and a ‘helecopter’) and had fixed it, but I forgot to apply the same fix to the newer file before uploading it.  Each prompt is only stored once in the database, even if it is used by multiple speakers, so I was going to go into the database, remove the ‘helecopter’ prompt row that shouldn’t have been generated and point the speaker at the existing ‘helicopter’ prompt.  However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’.  I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database, where the sound is indeed ‘l’.  It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with.  I’m going to have to update the structure next week.
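
One possible way to restructure this – and this is just a sketch with invented table and column names, not a decision – would be to move the sound-related fields off the prompt record and onto the link between a prompt and the dataset/recording that uses it, so the same word can carry different sounds in different contexts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prompts (
        prompt_id INTEGER PRIMARY KEY,
        word      TEXT NOT NULL
    );
    CREATE TABLE prompt_usages (
        usage_id     INTEGER PRIMARY KEY,
        prompt_id    INTEGER NOT NULL REFERENCES prompts(prompt_id),
        dataset      TEXT NOT NULL,   -- e.g. 'non-disordered' or 'child-error'
        sound        TEXT NOT NULL,
        articulation TEXT,
        position     TEXT
    );
""")
conn.execute("INSERT INTO prompts VALUES (1, 'helicopter')")
conn.executemany(
    "INSERT INTO prompt_usages (prompt_id, dataset, sound) VALUES (?, ?, ?)",
    [(1, "non-disordered", "l"), (1, "child-error", "k")])

# The same prompt now carries 'l' in one dataset and 'k' in the other.
for row in conn.execute("""
        SELECT p.word, u.dataset, u.sound
        FROM prompts p JOIN prompt_usages u ON u.prompt_id = p.prompt_id"""):
    print(row)
```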

Also this week I responded to a request for advice from David Featherstone in Geography, who is putting together some sort of digitisation project.  I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary.  She wondered whether I had the original XML and I explained that there was no original XML.  The original data was stored in an ancient FoxPro database that ran from a CD.  When I created the original School Dictionary app I managed to find a way to extract the data and I saved it as two CSV files – one English-Scots, the other Scots-English.  I then ran a script to convert this into JSON, which is what the original app uses.  I gave Pauline a link to download all of the data for the app, including both English and Scots JSON files and the sound files, and I also uploaded the English CSV file in case this would be more useful.
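
For what it’s worth, the conversion itself is the straightforward sort of thing sketched below – the file names and column handling are illustrative rather than what the original script actually did:

```python
import csv
import json


def csv_to_json(csv_path: str, json_path: str) -> None:
    """Read a CSV of dictionary entries and write it out as a JSON array."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # one dict per dictionary entry
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


# e.g. csv_to_json("english-scots.csv", "english-scots.json")
```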

That’s all for this week.  Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.

Week Beginning 10th October 2022

I spent quite a bit of time finishing things off for the Speak For Yersel project.  I created a stats page for the project team to access.  The page allows you to specify a ‘from’ and ‘to’ date (it defaults to showing stats from the end of May to the end of the current day).  If you want a specific day you can enter the same date in ‘from’ and ‘to’ (e.g. ‘2022-10-04’ will display stats for everyone who registered on the Tuesday after the launch).

The stats relate to users registered in the selected period rather than answers submitted in the selected period. If a person registered in the selected period then all of their answers are included in the figures, whether they were submitted in the period or not. If a person registered outside of the selected period but submitted answers during the selected period these are not included.
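
In query terms this means the date filter is applied to the registration date only, roughly as follows.  The table and column names are assumptions for the sake of illustration:

```python
# Illustrative only – not the project's actual query.
STATS_SQL = """
    SELECT u.in_scotland, s.survey_id, COUNT(a.answer_id) AS total_answers
    FROM users u
    JOIN answers a   ON a.user_id = u.user_id
    JOIN questions q ON q.question_id = a.question_id
    JOIN surveys s   ON s.survey_id = q.survey_id
    WHERE u.registered >= :from_date
      AND u.registered < :to_date      -- no date condition on the answers themselves
    GROUP BY u.in_scotland, s.survey_id
"""
```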

The stats display the total number of users registered in the selected period, split into the number who chose a location in Scotland and those who selected elsewhere.  Then the total number of survey answers submitted by these two groups are shown, divided into separate sections for the five surveys.  I might need to update the page to add more in at a later date.  For example, one thing that isn’t shown is the number of people who completed each survey as opposed to only answering a few questions.  Also, I haven’t included stats about the quizzes or activities yet, but these could be added.

I also worked on an abstract about the project for the Digital Humanities 2023 conference.  In preparation for this I extracted all of the text relating to the project from this blog as a record of the development of the project.  It’s more than 21,000 words long and covers everything from our first team discussions about potential approaches in September last year through to the launch of the site last week.  I then went through this and pulled out some of the more interesting sections relating to the generation of the maps, the handling of user submissions and the automatic generation of quiz answers based on submitted data.  I sent this to Jennifer for feedback and then wrote a second version.  Hopefully it will be accepted for the conference, but even if it’s not I’ll hopefully be able to go as the DH conference is always useful to attend.

Also this week I attended a talk about a lemmatiser for Anglo-Norman that some researchers in France have developed using the Anglo-Norman dictionary.  It was a very interesting talk and included a demonstration of the corpus that had been constructed using the tool.  I’m probably going to be working with the team at some point later on, sending them some data from the underlying XML files of the Anglo-Norman Dictionary.

I also replaced the Seeing Speech videos with a new set that Eleanor Lawson had generated, which were mirrored to match the videos we’re producing for the Speech Star project, and investigated how I will get to Zurich for a thesaurus-related workshop in January.

I spent the rest of the week working on the Books and Borrowing project, working on the ‘books’ tab in the library page.  I’d started on the API endpoint for this last week, which returned all books for a library and then processed them.  This was required as books have two title fields (standardised and original title), either one of which may be blank, so to order the books by title the records first need to be returned to see which ‘title’ field to use.  Also, ordering by number of borrowings and by author requires all books to be returned and processed.  This works fine for smaller libraries (e.g. Chambers has 961 books) but returning all books for a large library like St Andrews, which has more than 8,500 books, was taking a long time and resulting in a JSON file that was over 6MB in size.

I created an initial version of the ‘books’ page using this full dataset, with tabs across the top for each initial letter of the title (browsing by author and number of borrowings is still to do) and a count of the number of books in each tab also displayed.  Book records are then displayed in a similar manner to how they appear in the ‘page’ view, but with some additional data, namely total counts of the number of borrowings for the book holding record and counts of borrowings of individual items (if applicable).  These will eventually be linked to the search.

The page looked pretty good and worked pretty well, but was very inefficient as the full JSON file needed to be generated and passed to the browser every time a new letter was selected.  Instead I updated the underlying database to add two new fields to the book holding table.  The first stores the initial letter of the title (standardised if present, original if not) and the second stores a count of the total number of borrowings for the holding record.  I wrote a couple of scripts to add this data in, and these will need to be run periodically to refresh the cached fields as they do not otherwise get updated when changes are made in the CMS.  Having these fields in place means the scripts will be able to pinpoint and return subsets of the books in the library at the database query level rather than returning all data and then subsequently processing it.  This makes things much more efficient as less data is being processed at any one time.
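
The refresh step boils down to something like the following.  This is a sketch with assumed table and field names – in particular I’m assuming borrowings link to holdings via book items, which is a simplification:

```python
# Illustrative only – derive the initial letter from whichever title field is
# present and store a total borrowing count alongside each holding record.
REFRESH_CACHE_SQL = """
    UPDATE book_holdings
    SET title_letter = UPPER(SUBSTR(
            COALESCE(NULLIF(standardised_title, ''), original_title), 1, 1)),
        total_borrowings = (
            SELECT COUNT(*)
            FROM borrowings b
            JOIN book_items i ON i.item_id = b.item_id
            WHERE i.holding_id = book_holdings.holding_id
        )
"""
```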

I still need to add in facilities to browse the books by initial letter of the author’s surname and also facilities to list books by the number of borrowings, but for now you can at least browse books alphabetically by title.  Unfortunately for large libraries there is still a lot of data to process even when only dealing with specific initial letters.  For example, there are 1063 books beginning with ‘T’ in St Andrews so the returned data still takes quite a few seconds to load in.

That’s all for this week.  I’ll be on holiday next week so there won’t be a further report until the week after that.


Week Beginning 3rd October 2022

The Speak For Yersel project launched this week and is now available to use here: https://speakforyersel.ac.uk/.  It’s been a pretty intense project to work on and has required much more work than I’d expected, but I’m very happy with the end result.  We didn’t get as much media attention as we were hoping for, but social media worked out very well for the project and in the space of a week we’d had more than 5,000 registered users completing thousands of survey questions.  I spent some time this week tweaking things after the launch.  For example, I hadn’t added the metadata tags required by Twitter and Facebook / WhatsApp to nicely format links to the website (for example the information detailed here https://developers.facebook.com/docs/sharing/webmasters/) and it took a bit of time to add these in with the correct content.

I also gave some advice to Anja Kuschmann at Strathclyde about applying for a domain for the new VARICS project I’m involved with and investigated a replacement batch of videos that Eleanor had created for the Seeing Speech website.  I’ll need to wait until she gets back to me with files that match the filenames used on the existing site before I can take this further, though.  I also fixed an issue with the Berwickshire place-names website, which had lost its additional CSS, and investigated a problem with the domain for the Uist Saints website that still unfortunately hasn’t been resolved.

Other than these tasks I spent the rest of the week continuing to develop the front-end for the Books and Borrowing project.  I completed an initial version of the ‘page’ view, including all three views (image, text and image and text).  I added in a ‘jump to page’ feature, allowing you (as you might expect) to jump directly to any page in the register when viewing a page.  I also completed the ‘text’ view of the page, which now features all of the publicly accessible data relating to the records – borrowing records, borrowers, book holding and item records and any associated book editions and book works, plus associated authors.  There’s an awful lot of data and it took quite a lot of time to think about how best to lay it all out (especially taking into consideration screens of different sizes), but I’m pretty happy with how this first version looks.

Currently the first thing you see for a record is the transcribed text, which is big and green.  Then all fields relating to the borrowing appear under this.  The record number as it appears on the page plus the record’s unique ID are displayed in the top right for reference (and citation).  Then follows a section about the borrower, with the borrower’s name in green (I’ve used this green to make all of the most important bits of text stand out from the rest of the record but the colour may be changed in future).  Then follows the information about the book holding and any specific volumes that were borrowed.  If there is an associated site-wide book edition record (or records) these appear in a dark grey box, together with any associated book work record (although there aren’t many of these associations yet).  If there is a link to a library record this appears as a button on the right of the record.  Similarly, if there’s an ESTC and / or other authority link for the edition these appear to the right of the edition section.

Authors now cascade down through the data as we initially planned.  If there’s an author associated with a work it is automatically associated with and displayed alongside the edition and holding.  If there’s an author associated with an edition but not a work it is then associated with the holding.  If a book at a specific level has an author specified then this replaces any cascading author from that point downwards in the sequence.  Something that isn’t in place yet is the links from information to search results, as I haven’t developed the search yet, but eventually things like borrower name, author and book title will be links allowing you to search directly for those items.
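
The cascading rule itself is simple enough to sketch – this is an illustrative snippet rather than the actual code used to build the page:

```python
def effective_authors(authors_by_level, level):
    """Return the authors displayed at a given level under the cascade rule.

    authors_by_level maps 'work' / 'edition' / 'holding' / 'item' to
    (possibly empty) lists of authors.  Authors flow down from work to item,
    but as soon as a level has its own authors they replace anything
    cascading from above.
    """
    order = ["work", "edition", "holding", "item"]
    resolved = []
    for lvl in order[: order.index(level) + 1]:
        if authors_by_level.get(lvl):   # this level overrides the cascade
            resolved = authors_by_level[lvl]
    return resolved


book = {"work": ["Author A"], "edition": [], "holding": [], "item": []}
print(effective_authors(book, "holding"))  # ['Author A'] – cascades from the work
book["holding"] = ["Author B"]
print(effective_authors(book, "item"))     # ['Author B'] – holding author takes over
```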

One other thing I’ve added in is the option to highlight a record.  Press anywhere in a record and it is highlighted in yellow; press again to reset it.  This can be quite useful if there are certain records you’re interested in as you scroll through a page with lots of records on it.  You can highlight as many records as you want.  It’s possible that we may add other functionality to this, e.g. the option to download the data for selected records.  Here’s a screenshot of the text view of the page:

I also completed the ‘image and text’ view.  This works best on a large screen (i.e. not a mobile phone, although it is just about possible to use it on one, as I did test this out).  The image takes up about 60% of the screen width and the text takes up the remaining 40%.   The height of the records section is fixed to the height of the image area and is scrollable, so you can scroll down the records whilst still viewing the image (rather than the whole page scrolling and the image disappearing off the screen).  I think this view works really well and the records are still perfectly usable in the more confined area and it’s great to be able to compare the image and the text side by side.  Here’s a screenshot of the same page when viewing both text and image:

I tested the new interface out with registers from all of our available libraries and everything is looking good to me.  Some registers don’t have images yet, so I added in a check to ensure that the image views and page thumbnails don’t appear for such registers.  After that I moved onto developing the interface to browse book holdings when viewing a library.  I created an API endpoint for returning all of the data associated with holding records for a specified library.  This includes all of the book holding data, information about each of the book items associated with the holding record (including the number of borrowing records for each), the total number of borrowing records for the holding, any associated book edition and book work records (and there may be multiple editions associated with each holding) plus any authors associated with the book.  Authors cascade down through the record as they do when viewing borrowing records in the page.  This is a gigantic amount of information, especially as libraries may have many thousands of book holding records.  The API call loads pretty rapidly for smaller libraries (e.g. Chambers Library with 961 book holding records) but for larger ones (e.g. St Andrews with over 8,500 book holding records) the API call takes too long to return the data – in the latter case it takes about a minute and returns a JSON file that’s over 6MB in size.  The problem is that the data needs to be returned in full in order to do things like order it by largest number of borrowings.  Clearly dynamically generating the data each time is going to be too slow, so instead I am going to investigate caching the data.  For example, that 6MB JSON file can just sit there as an actual file rather than being generated each time.  I will then write a script to regenerate the cached files, which I can run whenever data gets updated (or maybe once a week whilst the project is still active).  I’ll continue to work on this next week.
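
The caching approach I have in mind is roughly the following – the paths and function names are purely illustrative:

```python
import json
import os

CACHE_DIR = "cache/library-books"  # hypothetical location for the cached files


def write_library_cache(library_id: int, holdings: list) -> str:
    """Write the generated holdings data for a library out as a static JSON file."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{library_id}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(holdings, f, ensure_ascii=False)
    return path


def read_library_cache(library_id: int):
    """Return the cached data if it exists; otherwise signal a fallback."""
    path = os.path.join(CACHE_DIR, f"{library_id}.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return None  # fall back to generating the data dynamically
```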