Week Beginning 28th August 2023

I spent pretty much the whole week working on the new date facilities for the Dictionaries of the Scots Language.  I have now migrated the headword search to Solr, which was a fairly major undertaking, but was necessary to allow headword searches to be filtered.  I decided to create a new Solr core for DSL entries that would be used by both the headword search and the fulltext / fulltext with no quotes searches.  This made sense because I would otherwise have needed to update the Solr core for fulltext to add the additional fields needed for filtering anyway.  With the new core in place I then updated the script I wrote to generate the Solr index to include the new fields (e.g. headword, forms, dates) and generated the data, which I then imported into Solr.
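
To give a rough idea of the shape of the new core, here is a minimal sketch of posting a document to it in Python. The core name ('dsl_entries') and the field names (headword, forms, att_start, att_end, fulltext) are illustrative assumptions rather than the actual schema:

```python
import requests

# A hypothetical document for the combined entry core; the field names are
# illustrative only (headword, variant forms, attestation dates, full text).
doc = {
    "id": "snd00012345",
    "headword": "dreich",
    "forms": ["dreich", "dreigh"],
    "att_start": 1721,
    "att_end": 2000,
    "fulltext": "... full entry text including quotations ...",
}

# Post to a local Solr instance and commit immediately (core name assumed).
requests.post(
    "http://localhost:8983/solr/dsl_entries/update?commit=true",
    json=[doc],
    timeout=30,
)
```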

With the new Solr core populated with data I then updated the API to work with it, replacing the existing headword search, which queried the database, with a new search that instead connects to Solr.  As all of the fields that get returned by a search are now stored in Solr the database no longer needs to be queried at all, which should make things faster.  Previously the fulltext search queried the Solr index and then, once entries were returned, the database was queried for each of these to add in the other necessary fields, which was a bit inefficient.

With the API updated the website (still only the version running on my laptop) then automatically used the new Solr index for headword searches: the quick search, the predictive search and the headword search in the advanced search.  I did some tests comparing the site on my laptop to the live site and things are looking good, but I will need to tweak the default use of wildcards.  The new headword search matches exact terms by default, which is the equivalent on the live site of surrounding the term with quotes (something the quick search does by default anyway).  I can’t really tweak this easily until I move the new site to our test server, though, as Windows (which my laptop uses) can’t cope with asterisks and quotes in filenames, which means the website on my laptop breaks if a URL includes these characters.

With the new headword and fulltext index in place I then moved on to implementing the date filtering options.  In order to do so I realised I would also have to add the entry attestation dates (from and to) to the quotation index as well, as a ‘first attested’ filter on a quotation search will use these dates.  This meant updating the quotation Solr index structure, tweaking my script that generates the quotation data for Solr, running the script to output the data and then ingesting this into Solr, all of which took some time.

I then worked with the Solr admin interface to figure out how to perform a filter query for both first attestation and the ‘in use’ period.  ‘First attested’ was pretty straightforward as a single year is queried.  Say the year is 1658: a filter would bring the entry back if the filter was a single year that matched (i.e. 1658) or a range that contained the year (e.g. 1650-1700).  The ‘in use’ filter was much more complex to figure out as the data to be queried is a range.  If the attestation period is 1658-1900 and a single year filter is given (e.g. 1700) then we need to check whether this year is within the range.  What is more complicated is when the filter is also a range.  E.g. 1600-1700 needs to return the entry even though 1600 is less than 1658 and 1850-2000 needs to return the entry even though 2000 is greater than 1900.  1600-2000 also needs to return the entry even though both ends extend beyond the period, and 1660-1670 also needs to return the entry as both ends are entirely within the period.

The answer to this headache-inducing problem was to run a query that checked whether the start date of the filter range was less than the end date of the attestation range and the end date of the filter range was greater than the start date of the attestation range.  So for example, suppose the attestation range is 1658-1900.  Filter range 1 is 1600-1700.  1600 is less than 1900 and 1700 is greater than 1658 so the entry is returned.  Filter range 2 is 1850-2000.  1850 is less than 1900 and 2000 is greater than 1658 so the entry is returned.  Filter range 3 is 1600-2000.  1600 is less than 1900 and 2000 is greater than 1658 so the entry is returned.  Filter range 4 is 1660-1670.  1660 is less than 1900 and 1670 is greater than 1658 so the entry is returned.  Filter range 5 is 1600-1650.  1600 is less than 1900 but 1650 is not greater than 1658 so the entry is not returned.  Filter range 6 is 1901-1950.  1901 is not less than 1900 but 1950 is greater than 1658 so the entry is not returned.
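
Stated as code the rule is very small. Below is a minimal Python sketch of the overlap test, using the examples above, together with the Solr filter query it corresponds to (the field names att_start and att_end are assumptions):

```python
def in_use_overlap(att_start, att_end, filter_start, filter_end):
    """Return True if the filter range overlaps the attestation range.

    This encodes the rule described above: the start of the filter range must
    be less than the end of the attestation range and the end of the filter
    range must be greater than its start.  (Whether boundary years should
    count as overlapping, i.e. <= / >=, is left open by the description.)
    """
    return filter_start < att_end and filter_end > att_start

# The worked examples above, with attestation range 1658-1900:
assert in_use_overlap(1658, 1900, 1600, 1700)       # overlaps the start
assert in_use_overlap(1658, 1900, 1850, 2000)       # overlaps the end
assert in_use_overlap(1658, 1900, 1600, 2000)       # contains the whole period
assert in_use_overlap(1658, 1900, 1660, 1670)       # entirely within the period
assert not in_use_overlap(1658, 1900, 1600, 1650)   # ends before the period
assert not in_use_overlap(1658, 1900, 1901, 1950)   # starts after the period

# As a Solr filter query for the filter range 1600-1700 (assumed integer
# fields att_start and att_end; [ ] ranges in Solr are inclusive):
# fq=att_start:[* TO 1700] AND att_end:[1600 TO *]
```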

Having figured out how to implement the filter query in Solr I then needed to update the API to take filter query requests, process them, format the query and then pass this to Solr.  This was a pretty major update and took quite some time to implement, especially as the quotation search needed to be handled differently for the ‘in use’ search, which as agreed with the team was to query the dates of the individual quotations rather than the overall period of attestation for an entry.  I managed to get it all working, though, allowing me to pass searches to the API by changing variables in a URL and filter the results by passing further variables.
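
As an illustration of the plumbing, here is a rough sketch of turning URL variables into a Solr filter query. The parameter names (filter_type, filter_from, filter_to) and field names are hypothetical; the real API will differ in detail:

```python
def build_filter_query(params):
    """Translate hypothetical URL parameters into a Solr fq string.

    params is a dict such as {"filter_type": "first", "filter_from": "1600",
    "filter_to": "1700"}.  Field and parameter names are illustrative only.
    """
    f_from = params.get("filter_from") or "*"
    f_to = params.get("filter_to") or "*"
    if f_from == "*" and f_to == "*":
        return None  # no filter supplied
    if params.get("filter_type") == "first":
        # 'First attested' checks the entry's single first-attested year.
        return f"first_attested:[{f_from} TO {f_to}]"
    # 'In use' checks that the attestation period overlaps the filter range.
    return f"att_start:[* TO {f_to}] AND att_end:[{f_from} TO *]"

print(build_filter_query({"filter_type": "first", "filter_from": "1600", "filter_to": "1700"}))
# first_attested:[1600 TO 1700]
```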

With this in place I could then update the front-end to add in the option of filtering the results.  I decided to add the option as a box above the search results.  Originally I was going to place it down the left-hand side, but space is rather limited due to the two-column layout of a search result that covers both SND and DOST.  The new ‘Filter the results’ box consists of buttons for choosing between ‘First attested’ and ‘In use’ and ‘from’ and ‘to’ boxes where years can be entered.  There is also an ‘Update’ button and a ‘clear’ button.  Supplying a filter and pressing ‘Update’ reloads the page with the results filtered based on your criteria.  It will be possible to bookmark or cite a filtered search as the filters are added to the page URL.  The filter box appears on all search results pages, including the quick search, and seems to be working as intended.

So for example the screenshot below shows a filtered fulltext search for ‘burn’.  Without the filter this brings back more results than we allow, but if you filter the results to only those that are first attested between 1600 and 1700 a more reasonable number is returned, as the screenshot shows:

The second screenshot shows the entries that were in use during this period rather than first attested, which as you can see gives a larger number of results:

As mentioned, the ‘in use’ filter works differently for quotations, limiting those that are displayed to ones in the filter period.  The screenshot below shows an ‘in use’ filter of 1650 to 1750 for a quotation search for ‘burn’:

The filter is ‘remembered’ when you navigate to an entry and then use the ‘back to search results’ button.  You can clear the filter by pressing on the ‘clear’ button or deleting the years in the ‘from’ and ‘to’ boxes and pressing ‘update’.  Previously if there was only one search result the results page would automatically redirect to the entry.  This was also happening when an applied filter only gave one result, which I found very annoying, so now if a filter is present and only one result is returned the results page is still displayed.  Next week I’ll work on adding the first attested dates to the search results and I’ll also begin to develop the sparklines.

Other than this I had a meeting with Joanna Kopaczyk to further discuss a project she’s putting together.  It looks like it will be a fairly small pilot project to begin with and I’ll only be involved in a limited capacity, but it has great potential.

 

Week Beginning 21st August 2023

I spent most of this week working for the Dictionaries of the Scots Language, focusing on the new quotation date search.  I decided to work on the update on a version of the site and its data running on my laptop initially, as I have direct control over the Solr instance there – something I don’t have on the server.  My first task was to create a new Solr index for the quotations and to write a script to export data from the database in a format that Solr could then index.  With over 700,000 quotations this took a bit of time, and I did encounter some issues, such as several tens of thousands of quotations not having date tags, meaning dates for the quotations could not be extracted.  I had a lengthy email conversation with the DSL team about this and thankfully it looks like the issue is not something I need to deal with:  data is being worked on in their editing system and the vast majority of the dating issues I’d encountered will be fixed the next time the data is exported for me to use.  I also encountered some further issues that needed to be addressed as I worked with the data.  For example, I realised I needed to add a count of the total number of quotes for an entry to each quote item in Solr to be able to work out the ranking algorithm for entries, and this meant updating the export script, the structure of the Solr index and then re-exporting all 700,000 quotations.  Below is a screenshot of the Solr admin interface, showing a query of the new quotation index – a search for ‘barrow’.

With this in place I then needed to update the API that processes search requests, connects to Solr and spits out the search results in a suitable format for use on the website.  This meant completely separating out and overhauling the quotation search, as it needed to connect to a different Solr index that featured data that had a very different structure.  I needed to ensure quotations could be grouped by their entries and then subjected to the same ‘max results’ limitations as other searches.  I also needed to create the ranking algorithm for entries based on the number of returned quotes vs the total number of quotes, sort the entries based on this and also ensure a maximum of 10 quotes per entry were displayed.  I also had to add in a further search option for dates, as I’d already detailed in the requirements document I’d previously written.  The screenshot below is of the new quotation endpoint in the API, showing a section of the results for ‘barrow’ in ‘snd’ between 1800 and 1900.
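
In outline the grouping and ranking step described above looks something like the sketch below: group the returned quotation documents by entry, score each entry by the proportion of its quotes that matched, and cap the displayed quotes at ten per entry. The field names (entry_id, total_quotes) are assumptions based on the description rather than the real index:

```python
from collections import defaultdict

def group_and_rank(quote_docs, max_quotes_per_entry=10):
    """quote_docs: quotation documents returned from Solr, each assumed to
    carry 'entry_id' and 'total_quotes' (the cached per-entry quote count)."""
    grouped = defaultdict(list)
    totals = {}
    for doc in quote_docs:
        grouped[doc["entry_id"]].append(doc)
        totals[doc["entry_id"]] = doc["total_quotes"]

    entries = []
    for entry_id, quotes in grouped.items():
        entries.append({
            "entry_id": entry_id,
            # Proportion of the entry's quotes that matched the search.
            "score": len(quotes) / totals[entry_id],
            "quotes": quotes[:max_quotes_per_entry],  # display at most ten
        })
    # Highest proportion first.
    return sorted(entries, key=lambda e: e["score"], reverse=True)
```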

The next step was to update the front-end to add in the new ‘date’ drop-down when quotations are selected and then to ensure the new quotation search information could be properly extracted, formatted and passed to the API to return the relevant data.  The following screenshot shows the search form.  The explanatory text still needs some work as it currently doesn’t feel very elegant – I think there’s a ‘to’ missing somewhere.

The final step for the week was to deal with the actual results themselves, as they are rather different in structure to the previous results: entries now potentially have multiple quotes, each of which contains information relating to the quote (e.g. dates, bib ID) and each of which may feature multiple snippets if the term appears several times within a single quote.  I’ve managed to get the results to display correctly and the screenshot below shows the results of a search for ‘barrow’ in snd between 1800 and 1900.

The new search also now lets you perform a Boolean search on the contents of individual quotations rather than all quotations in an entry.  So for example you can search for ‘Messages AND Wean’ in quotes from 1980-1999 and only find those that match whereas previously if an entry featured one quote with ‘messages’ and another with ‘wean’ it would get returned.  The screenshot below shows the new results.

There are a few things that I need to discuss with the team, though.  The first is the ranking system.  As previously agreed upon, entries are ranked based on the proportion of quotes that contain the search term.  But this is possibly ranking entries that only have one quote too highly.  If there is only one quote and it features the term then 100% of quotes feature the term so the entry is highly ranked, but longer, possibly more important entries are ranked lower because (for example) out of 50 quotes 40 feature the term.  We might want to look into weighting entries that have more quotes overall.  Take, for example, an SND quotation search for ‘prince’ (see below).  ‘Prince’ is ranked first, but then results 2-6 appear because they only have one quote, which happens to feature ‘prince’.
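
Purely as a starting point for that discussion (nothing has been agreed), one option would be to damp the proportion by the size of the entry, for example:

```python
import math

def weighted_score(matched, total):
    """A hypothetical adjustment: the proportion of matching quotes, damped
    by a log factor so that entries with more quotes overall rank higher
    than single-quote entries with the same proportion."""
    return (matched / total) * math.log(1 + total)

print(round(weighted_score(1, 1), 2))    # 0.69 - one quote, one match
print(round(weighted_score(40, 50), 2))  # 3.15 - the larger entry now outranks it
```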

The second issue is that the new system cuts off quotations for entries after the tenth (as you can see for ‘Prince’, above).  We’d agreed on this approach to stop entries with lots of quotes swamping the results, but currently nothing is displayed to say that the results have been snipped.  We might want to add a note under the tenth quote.

The third issue is that the quote field in Solr is currently stemmed, meaning the stems of words are stored and Solr can then match alternative forms.  This can work well – for example the ‘messages AND wean’ results include results for ‘message’ and ‘weans’ too.  But it can also be a bit too broad.  See for example the screenshot below, which shows a quotation search for ‘aggressive’.  As you can see, it has returned quotations that feature ‘aggression’, ‘aggressively’ and ‘aggress’ in addition to ‘aggressive’.  This might be useful, but it might cause confusion and we’ll need to discuss this further at some point.

Next week I’ll hopefully start work on the filtering of search results for all search types, which will involve a major change to the way headword searches work and more big changes to the Solr indexes.

Also this week I investigated applying OED DOIs to the OED lexemes we link to in the Historical Thesaurus.  Each OED sense now has its own DOI that we can get access to, and I was sent a spreadsheet containing several thousand as an example.  The idea is that links from the HT’s lexemes to the OED would be updated to use these DOIs rather than performing a search of the OED for the word, which is what currently happens.

After a few hours of research I reckoned it would be possible to apply the DOIs to the HT data, but there are some things that we’ll need to consider.  The OED spreadsheet looks like it will contain every sense and the HT data does not, so much of the spreadsheet will likely not match anything in our system.  I wrote a little script to check the spreadsheet against the HT’s OED lexeme table and 6186 rows in the spreadsheet match one (or more) lexeme in the database table while 7256 don’t.  I also noted that the combination of entry_id and element_id (in our database called refentry and refid) is not necessarily unique in the HT’s OED lexeme table.  This can happen if a word appears in multiple categories, plus there is a further ID called ‘lemmaid’ that was sometimes used to differentiate specific lexemes in combination with the other two IDs.  In the spreadsheet there are 1180 rows that match multiple rows in the HT’s OED lexeme table.  However, this also isn’t a problem and usually just means a word appears in multiple categories.  It just means that the same DOI would apply to multiple lexemes.
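
The matching script was essentially a join on the two ID columns. Here is a hedged sketch of the idea, with sqlite standing in for the real database and the table name (oed_lexemes) assumed:

```python
import csv
import sqlite3  # a stand-in here; the real check ran against the HT database

conn = sqlite3.connect("ht.db")
matched = unmatched = 0
with open("oed_doi_spreadsheet.csv", newline="") as f:
    for row in csv.DictReader(f):
        # The spreadsheet's entry_id / element_id correspond to the refentry
        # and refid columns in the HT's OED lexeme table (name assumed).
        hits = conn.execute(
            "SELECT COUNT(*) FROM oed_lexemes WHERE refentry = ? AND refid = ?",
            (row["entry_id"], row["element_id"]),
        ).fetchone()[0]
        if hits:
            matched += 1
        else:
            unmatched += 1
print(matched, unmatched)  # 6186 and 7256 in the run described above
```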

What is potentially a problem is that we haven’t matched up all of the OED lexeme records with the HT lexeme records.  While 6186 rows in the spreadsheet match one or more rows in the OED lexeme table, only 4425 rows in the spreadsheet match one or more rows in the HT’s lexeme table.  We will not be able to update the links to switch to DOIs for any HT lexemes that aren’t matched to an OED lexeme.  After checking I discovered that there are 87,713 non-OE lexemes in the HT lexeme table that are not linked to an OED lexeme.  None of these will be able to have a DOI (and neither will the OE words, presumably).

Another potential problem is that the sense an HT lexeme is linked to is not necessarily the main sense for the OED lexeme.  In such cases the DOI then leads to a section of the OED entry that is only accessible to logged in users of the OED site.  An example from the spreadsheet is ‘aardvark’.  Our HT lexeme links to entry_id 22, element_id 16201412, which has the DOI https://doi.org/10.1093/OED/1516256385 which when you’re not logged in displays a ‘Please purchase a subscription’ page.  The other entry for ‘aardvark’ in the spreadsheet has entry_id 22 and element_id 16201390, which has the DOI https://doi.org/10.1093/OED/9531538482 which leads to the summary page, but the HT’s link will be the first DOI above and not the second.  Note that currently we link to the search results on the OED site, which actually might be more useful for many people.  Aardvark as found here: https://ht.ac.uk/category/?type=search&qsearch=aardvark&page=1#id=39313 currently links to this OED page: https://www.oed.com/search/dictionary/?q=aard-vark

To summarise:  I can update all lexemes in the HT’s OED lexeme table that match the entry_id and element_id columns in the spreadsheet to add in the relevant DOI.  I can also then ensure that any HT lexeme records linked to these OED lexemes also feature the DOI, but this will apply to fewer lexemes due to there still being many HT lexemes that are not linked.  I could then update the links through to the OED for these lexemes, but this might not actually work as well as the current link to search results due to many OED DOIs leading to restricted pages.  I’ll need to hear back from the rest of the team before I can take this further.

Also this week I had a meeting with Pauline Mackay and Craig Lamont to discuss an interactive map of Burns’ correspondents.  We’d discussed this about three years ago and they are now reaching a point where they would like to develop the map.  We discussed various options for base maps, data categorisation and time sliders and I gave them a demonstration of the Books and Borrowing project’s Chambers Library map, which I’d previously developed (https://borrowing.stir.ac.uk/chambers-library-map/).  They were pretty impressed with this and thought it would be a good model for their map.  Pauline and Craig are now going to work on some sample data to get me started, and once I receive this I’ll be able to begin development.  We had our meeting in the café of the new ARC building, which I’d never been to before, so it was a good opportunity to see the place.

Also this week I fixed some issues with images for one of the library registers for the Royal High School for the Books and Borrowing project.  These had been assigned the wrong ID in the spreadsheet I’d initially used to generate the data and I needed to write a little script to rectify this.

Finally, I had a chat with Joanna Kopaczyk about a potential project she’s putting together.  I can’t say much about it at this stage, but I’ll probably be able to use the systems I developed last year for the Anglo-Norman Dictionary’s Textbase (see https://anglo-norman.net/textbase-browse/ and https://anglo-norman.net/textbase-search/).  I’m meeting with Joanna to discuss this further next week.

 

Week Beginning 14th August 2023

I was back at work this week after a lovely two-week holiday (although I did spend a couple of hours making updates to the Speech Star website whilst I was away).  After catching up with emails, getting back up to speed with where I’d left off and making a helpful new ‘to do’ list I got stuck into fixing the language tags in the Anglo-Norman Dictionary.

In June the editor Geert noticed that language tags had disappeared from the XML files of many entries.  Further investigation by me revealed that this probably happened during the import of data into the new AND system and had affected entries up to and including the import of R; entries that were part of the subsequent import of S had their language tags intact.  It is likely that the issue was caused by the script that assigns IDs and numbers to <sense> and <senseInfo> tags as part of the import process, as this script edits the XML.  Further testing revealed that the updated import workflow that was developed for S retained all language tags, as does the script that processes single and batch XML uploads as part of the DMS.  This means the error has been rectified, but we still need to fix the entries that have lost their language tags.

I was able to retrieve a version of the data as it existed prior to batch updates being applied to entry senses and from this I was able to extract the missing language tags for these entries.  I was also able to run this extraction process on the R data as it existed prior to upload.  I then ran the process on the live database to extract language tags from entries that featured them, for example entries uploaded during the import of S.  The script was also adapted to extract the ‘certainty’ attribute from the tags if present.  This was represented in the output as the number 50, separated from the language by a bar character (e.g. ‘Arabic|50’).  Where an entry featured multiple language tags these were separated by a comma (e.g. ‘Latin,Hebrew’).

Geert made the decision that language tags, which were previously associated with specific senses or subsenses, should instead be associated with entries as a whole.  This structural change will greatly simplify the reinstatement of missing tags and it will also make it easier to add language tags to entries that do not already feature them.

The language data that I compiled was stored in a spreadsheet featuring three columns: Slug: the unique form of a headword used in entry URLs; Live Langs: language tags extracted from the live database; Old Langs: language tags extracted from the data prior to processing.  A fourth column was also added where manual overrides to the preceding two columns could be added by Geert.  This column could also be used to add entries that did not previously have a language tag but needed one.

Two further issues were addressed at this stage.  The first related to compound words, where the language applied to one part of the word.  In the original data these were represented by combining the language with ‘A.F.’, for example ‘M.E._and_A.F.’.  Continuing with this approach would make it more difficult to search for specific languages and the decision was made to only store the non-A.F. language with a note that the word is a compound.  This was encoded in the spreadsheet with a bar character followed by ‘Y’.  To ensure the data could be more easily machine-readable the compound character would always be the third part of the language data, whether or not certainty was present in the second part.  For example ‘M.E.|50|Y’ represents a word that is possibly from M.E. and is a compound while ‘M.E.||Y’ represents a word that is definitely from M.E and is a compound.

The second issue to be addressed was how to handle entries that featured languages but whose language tags were not needed.  In such cases Geert added the characters ‘$$’ to the fourth column.

The spreadsheet was edited by Geert and currently features 2741 entries that are to be updated.  Each entry in the spreadsheet will be edited using the following workflow:

  1. All existing language tags in the entry will be deleted. These generally occur in senses or subsenses, but some entries feature them in the <head> element.
  2. If the entry has ‘$$’ in column 4 then no further updates will be made
  3. If there is other data in column 4 this will be used
  4. If there is no data in column 4 then data from column 2 will be used
  5. If there is no data in columns 4 or 2 then data from column 3 will be used. Where there are multiple languages separated by a comma these will be split and treated separately.
  6. For each language the presence of a certainty value and / or a compound will be ascertained
  7. In the XML the new language tags will appear below the <head> tag.
  8. An entry will feature one language tag for each language specified
  9. The specific language will be stored in the ‘lang’ attribute
  10. Certainty (if present) will be stored in the ‘cert’ attribute which may only contain ‘50’ to represent ‘uncertain’.
  11. Compound (if present) will be stored in a new ‘compound’ attribute which may only contain ‘true’ to denote the word is a compound.
  12. For example, ‘Latin|50,Greek|50’ will be stored as two <language> tags beneath the <head> tag as follows: <language lang="Latin" cert="50" /><language lang="Greek" cert="50" /> while ‘M.E.||Y’ will be stored as: <language lang="M.E." compound="true" /> (see the sketch below).
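
As a small illustration of steps 6-12, here is a sketch that turns one spreadsheet language string into the corresponding <language> tags. The element and attribute names follow the examples above; the helper itself is hypothetical, not the script actually used:

```python
def languages_to_xml(lang_string):
    """Convert e.g. 'Latin|50,Greek|50' or 'M.E.||Y' into <language> tags."""
    tags = []
    for part in lang_string.split(","):
        pieces = part.split("|")
        lang = pieces[0]
        cert = pieces[1] if len(pieces) > 1 and pieces[1] else None
        compound = len(pieces) > 2 and pieces[2] == "Y"
        attrs = [f'lang="{lang}"']
        if cert:
            attrs.append(f'cert="{cert}"')
        if compound:
            attrs.append('compound="true"')
        tags.append(f"<language {' '.join(attrs)} />")
    return "".join(tags)

print(languages_to_xml("Latin|50,Greek|50"))
# <language lang="Latin" cert="50" /><language lang="Greek" cert="50" />
print(languages_to_xml("M.E.||Y"))
# <language lang="M.E." compound="true" />
```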

I ran and tested the update on a local version of the data and the output was checked by Geert and me.  After backing up the live database I then ran the update on it and all went well.  The dictionary’s DTD also needed to be updated to ensure the new language tag can be positioned as an optional child element of the ‘main_entry’ element.  The DTD was also updated to remove language as a child of ‘sense’, ‘subsense’ and ‘head’.

Previously the DTD had a limited list of languages that could appear in the ‘lang’ attribute, but I’m uncertain whether this ever worked as the XML definitely included languages that were not in the list.  Instead I created a ‘picklist’ for languages that pulls its data from a list of languages stored in the online database.  We use this approach for other things such as semantic labels so it was pretty easy to set up.  I also added in the new optional ‘compound’ attribute.

With all of this in place I then updated the XSLT and some of the CSS in order to display the new language tags, which now appear as italicised text above any part of speech.  For example, an entry with multiple languages, one of which is uncertain: https://anglo-norman.net/entry/ris_3 and an entry that’s a compound with another language: https://anglo-norman.net/entry/rofgable.  Eventually I will update the site further to enable searches for language tags, but this will come at a later date.

Also this week I spent a bit of time in email conversations with the Dictionaries of the Scots Language people, discussing updates to bibliographical entries, the new part of speech system, DOST citation dates that were later than 1700 and making further tweaks to my requirements document for the date and part of speech searches based on feedback received from the team.  We’re all in agreement about how the new feature will work now, which means I’ll be able to get started on the development next week, all being well.

I also gave some advice to Gavin Miller about a new proposal he’s currently putting together, helped out Matthew Creasy with the website for his James Joyce Symposium, spoke to Craig Lamont about the Burns correspondents project and checked how the stats are working on sites that were moved to our newer server a while back (all thankfully seems to be working fine).

I spent the remainder of the week implementing a ‘cite this page’ feature for the Books and Borrowing project, and the feature now appears on every page that features data.  A ‘Cite this page’ button appears in the right-hand corner of the page title.  Pressing the button brings up a pop-up containing citation options in a variety of styles.  I’ve taken this from other projects I’ve been involved with (e.g. the Historical Thesaurus) and we might want to tweak it, but at the moment something along the lines of the following is displayed (full URL crudely ‘redacted’ as the site isn’t live yet):

Developing this feature has taken a bit of time due to the huge variation in the text that describes the page.  This can also make the citation rather long, for example:

Advanced search for ‘Borrower occupation: Arts and Letters, Borrower occupation: Author, Borrower occupation: Curator, Borrower occupation: Librarian, Borrower occupation: Musician, Borrower occupation: Painter/Limner, Borrower occupation: Poet, Borrower gender: Female, Author gender: Female’. 2023. In Books and Borrowing: An Analysis of Scottish Borrowers’ Registers, 1750-1830. University of Stirling. Retrieved 18 August 2023, from [very long URL goes here]

I haven’t included a description of selected filters and ‘order by’ options, but these are present in the URL.  I may add filters and orders to the description, or we can just leave it as it is and let people tweak their citation text if they want.

The ‘cite this page’ button appears on all pages that feature data, not just the search results.  For example register pages and the list of book editions.  Hopefully the feature will be useful once the site goes live.

Week Beginning 19th June 2023

I continued to work for the Books and Borrowing project this week, switching the search facilities over to use a new Solr index that includes author gender.  It is now possible to incorporate author gender into searches, for example bringing back all borrowing records involving books written by women.  This will be a hugely useful feature.  I also fixed an issue with a couple of page images of a register at Leighton library that weren’t displaying.

The rest of my time this week was spent developing a new Bootstrap powered interface for the project’s website, which is now live (https://borrowing.stir.ac.uk/).  You’d struggle to notice any difference between this new version and the old one as the point of creating this new theme was not to change the look of the website but to make Bootstrap (https://getbootstrap.com/) layout options available to the dev site.  This will allow me to make improvements to the layout of things like the advanced search forms.  I haven’t made any such updates yet, but that is what I’ll focus on next.

It has taken quite a bit of time to get the new theme working properly – blog posts with ‘featured images’ that replace the site’s header image proved to be particularly troublesome to get working – but I think all is functioning as it should be now.  There are a few minor differences between the new theme and the old one.  The new theme has a ‘Top’ button that appears in the bottom right when you scroll down a long page, which is something I find useful.  The drop-down menus in the navbar look a bit different, as does the compact navbar shown on narrow screens.  All pages now feature the sidebar whereas previously some (e.g. https://borrowing.stir.ac.uk/libraries/) weren’t showing it.  Slightly more text is shown in the snippets on the https://borrowing.stir.ac.uk/project-news/ and other blog index pages.  Our title font is now used for more titles throughout the site.  I’ve also added in a ‘favicon’ for the site, which appears in the browser tab.  It’s the head of the woman second from the right in the site banner, although it is a bit indistinct.  My first attempt was the book held by the woman in the middle of the banner but this just ended up as a beige blob.

Next week I’ll update the layout of the dev site pages to use Bootstrap.  I’m going to be on holiday the week after next and at a conference the week after that so this might be a good time to share the URL for feedback, as, other than adding in book genre when this is available, everything else should pretty much be finished.

For the Anglo-Norman Dictionary this week I participated in a conference call to discuss collaborative XML editing environments.  The team are wanting to work together directly on XML files and to have a live preview of how these changes appear.  The team are investigating https://www.fontoxml.com/ and also https://paligo.net/ and https://www.xpublisher.com/en/xeditor.  However, none of these solutions give any mention whatsoever of pricing on their websites, which is incredibly frustrating and off-putting.  I also mentioned the DPS system that the DSL uses (https://www.idmgroup.com/content-management/dps-info.html).  We’ll need to give this some further thought.

I also spent a bit of time writing a script to extract language tags from the data.  The script goes through each ‘active’ entry in the online database and picks out all of the language tags from the live entry XML and stores each language and the number of times each language appears in each entry (across all senses, subsenses, locutions).  It does the same for the ‘dms_entry_old’ XML data (i.e. the data that was originally stored in the current system before any transformations or edits were made) for each of these ‘active’ entries (if the XML data exists) and similarly stores each language and frequency as ‘old’ languages.  In addition, the script goes through each of the ‘R’ XML files and picks out all language tags contained in them, augmenting the list of ‘old’ languages.  For each ‘active’ entry that has at least one ‘live’ or ‘old’ language the script exports the slug and the ‘live’ and ‘old’ languages, consisting of each language found and the number of times found in the entry.  This data is then saved in a spreadsheet.

There are 1908 entries that will need to be updated and this update will consist of removing all language tags from each sense / subsense in each listed entry, adding a new language tag at entry level (probably below the <pos> tag) for each distinct language found, updating the DTD to make the newly positioned tags valid and updating the XSLT to ensure the new tags get displayed properly in the web pages.

I also began to think about how I’ll implement date / part of speech searches and sparklines in the Dictionaries of the Scots Language and have started writing a requirements document for the new features.  We had previously discussed adding the date search and filter options to quotations searches only, but subsequent emails from the editor suggest that these would be offered for other searches too and that in addition we would add in a ‘first attested’ search / filter option.

The quotation search will look for a term in the quotations, with each quotation having an associated date or date range.  Filtering the results by date will then narrow the results to only those quotations that have a date within the period specified by the filter.  For example, a quotation search limited to SND for the term ‘dreich’ will find 113 quotations.  Someone could then use the date filter to enter the years 1900-1950 which will then limit the quotations to 26 (those that have dates in this period).

At the moment I’m not sure how a ‘first attested’ filter would work for a quotation search.  A ‘first attested’ date is something that would be stored at entry rather than quotation level and would presumably be the start date of the earliest quotation.  So for example the SND entry for ‘Dreich’ has an earliest quotation date of 1721 and we would therefore store this as the ‘first attested’ date for this entry.

This could be a very useful filter for entry searches and although it could perhaps be useful in a quotation search it might just confuse users.  E.g. the above search for the term ‘Dreich’ in SND finds 113 quotations.  A ‘first attested’ filter would then be used to limit these to quotations associated with entries that have a first attested date in the period selected by the user.  So for example if the user enters 1700-1750 in the ‘Dreich’ results then the 113 quotations would then be limited to those belonging to entries that were first attested in this period, which would include the entry ‘Dreich’.  But the listed quotations would still include all of those for the entry ‘Dreich’ that include the search term ‘Dreich’, not just those from 1700-1750, because the limit was placed on entries with a first attested date in that period and not quotations found in that period.

In addition, the term searched for would not necessarily appear in the quotation that gave the entry its first attested date.  An entry can only have one first attested date and (in the case of a quotation search) the results will only display quotations that feature the search term, which will quite possibly not include the earliest quotation.  A search for quotations featuring ‘dreich’ in SND will not return the earliest quotation for the entry SND ‘Dreich’ as the form in this quotation is actually ‘dreigh’.

If we do want to offer date searching / filtering for all entry searches and not just quotation searches we would also have to consider whether we would then just store the dates of the earliest and last quotations to denote the ‘active’ period for the entry or whether we would need to take into account any gaps in this period as will be demonstrated by the sparklines.  If it’s the former then the ‘active’ period for SND ‘Dreich’ would be 1721-2000, so someone searching the full text of entries for the term ‘dreich’ and then entering ‘1960-1980’ as a ‘use’ filter will then still find this entry.  If it’s the latter then this filter would not find the entry as we don’t have any quotations between 1954 and 1986.

Also this week I had to spend a bit of time fixing a number of sites after a server upgrade stopped a number of scripts working.  It took a bit of time to track all of these down and fix them.  I also responded to a couple of questions from Dimitra Fimi of the Centre for Fantasy and the Fantastic regarding WordPress stats and mailing list software and discussed a new conference website with Matthew Creasy in English Literature.

Week Beginning 17th April 2023

On Monday this week I attended the Books and Borrowing conference in Stirling.  I gave a demonstration of the front-end I’ve been developing which I think went pretty well.  Everyone seems very pleased with the site and its features.

I also spent a bit of time working for the DSL.  I investigated an issue with square brackets in the middle of words which was causing the search to fail.  It would appear that Solr will happily ignore square brackets in the data when they are at the beginning or end of a word, but when they’re in the middle the word is then split into two.  We decided that in future we will strip square brackets out of the data before it gets ingested into Solr.  I also updated the site so that the show/hide quotations and etymology setting is no longer ‘remembered’ by the site when the user navigates between entries.  This was added in as a means of enabling users to tailor how they viewed the entries without having to change the view on every entry they looked at, but I can appreciate it might have been a bit confusing for people so it’s now been removed.  I also implemented a simple Cookie banner for the site and after consultation with the team this went live.

Also this week I responded to a query from Piotr Wegorowski in English Language and Linguistics about hosting podcasts and had an email chat with Mike Irwin, who started this week as head of IT for the College of Arts.  I spent the rest of the week working through the WordPress sites I manage and ensuring they all work via our new external host.  This involved fixing a number of issues with content management systems and APIs, updating plugins, creating child themes and installing new security plugins.  By the end of the week around 20 websites that had been offline since early February were back online again, including this one!

Week Beginning 6th March 2023

This week I continued to focus primarily on the development of the Books and Borrowing front-end, making various updates to the search and browse facilities.  I investigated an issue with authors not appearing in the Solr data and located the cache generation scripts I’d previously written but hadn’t re-run recently.  These include generating cached data for authors and also things like number of borrowings.  I have now updated my ‘Solr data generation how to’ document to note that these cache generation scripts need to be re-run so hopefully this shouldn’t be an issue in future.  I’ve also re-run all of the scripts now but this won’t affect anything until I send new data to the Stirling IT people.

I also updated my Solr generation scripts to split up edition languages and place of publication on the bar character (and also separating out any places in square brackets).  This means each language and place will be independently searchable in future.  I have also updated the advanced search form so that languages and places are split by the bar character.  For language this means that where a language had a bar (e.g. ‘Latin | Arabic’) the associated count then gets added to each of the individual occurrences of the language.  The list of languages on the advanced search form is now much more reasonable and less cluttered.
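
The splitting itself is straightforward. A minimal sketch, assuming the language facet comes back from Solr as (value, count) pairs:

```python
from collections import Counter

def split_language_facet(facet_pairs):
    """facet_pairs: e.g. [("Latin | Arabic", 3), ("Latin", 10)].
    Returns per-language counts with bar-separated values split up and the
    count added to each individual language."""
    counts = Counter()
    for value, count in facet_pairs:
        for lang in value.split("|"):
            lang = lang.strip()
            if lang:
                counts[lang] += count
    return counts

print(split_language_facet([("Latin | Arabic", 3), ("Latin", 10)]))
# Counter({'Latin': 13, 'Arabic': 3})

# Places of publication are handled the same way, with an additional split on
# any alternatives given in square brackets.
```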

I was considering replacing the auto-complete for place of publication with a multi-select, but I’ve decided against this as even with the bar splitting there are still about 350 different places, which is too much.  Instead I’ve updated the autocomplete to bring back individual places.  For example previously ‘Stirling’ appeared in a long list of places all separated by bars and this is what was returned when you typed ‘Stirling’ in.  Now only ‘Stirling’ appears.

I then created a new API endpoint to retrieve the data for a single borrower and I now use this call when a user presses on a borrower name in the search results, enabling me to display the borrower name in the ‘You searched for’ section rather than just the borrower ID.  I have also updated things so that when you press on a borrower name to search for all records involving the borrower you can then press ‘refine your search’ and the borrower’s details (title, forename, surname, othernames) will populate the advanced search form, which is quite nice.  Note however that if you do this then press the ‘Search’ button this then searches on these fields and not the specific borrower ID, so the results may be very different (e.g. multiple John Smiths).

I also realised that I will need to rethink how the pubdate search works as I’d forgotten that we have both ‘pubdate’ and ‘pubdateend’ fields.  Currently I have just been using ‘pubdate’ but this is not going to give accurate results where we have a range of years. Instead I’m generating and saving each year in the range.  This allows the search to work without updating the code, and after testing it out all would appear to work very well.
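
The idea is simply to expand each publication date range into its individual years and store them as a multi-valued field, so a plain year query matches anywhere in the range. A sketch:

```python
def pubdate_years(pubdate, pubdateend=None):
    """Expand a publication date (or pubdate-pubdateend range) into the
    individual years stored as a multi-valued Solr field."""
    if not pubdateend or pubdateend < pubdate:
        return [pubdate]
    return list(range(pubdate, pubdateend + 1))

print(pubdate_years(1745))        # [1745]
print(pubdate_years(1745, 1748))  # [1745, 1746, 1747, 1748]
```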

I then worked on adding in more ‘click to search’ options to the search results page.  I added in a search option for ‘Holding title’ that when clicked on searches for the holding title’s ID (and populates the search form with the holding title if you choose to refine your search).  Unfortunately I realised that I had somehow not included the book holding ID as a field in the Solr data.  I therefore updated the schema on my laptop and regenerated and ingested the data into Solr on my laptop to check that the process worked.  Thankfully all went smoothly, but it does mean that until I next update the live Solr data the ‘click to search’ for holding title doesn’t actually provide any results.

I also added in ‘click to search’ for book edition and book work title.  I had included the book edition and work IDs in the Solr data so thankfully I was able to get these searches working fully.  As with book holding, when you click on a title to perform the search and then choose to refine your search the edition / work title appears in the advanced search form.

I also added in a similar option for authors.  You can now click on an author’s name at any of the levels of association and a search for author ID will be performed, with author forename and surname appearing in the advanced search form if you ‘refine’.

I then moved on to adding in ‘click to search’ for edition language and publication place.  These also had their own challenges as I needed to split up languages / publication places into individual clickable areas based on the bar character and also the square brackets for publication place.  I managed to get it all working, but of course this search won’t work properly until the Solr data is updated anyway.

Also this week I had an email conversation with Ophira Gamliel about a new project that will be starting soon.  We discussed how the project’s website will function, interactive maps, URLs, page design and other such matters.  I’ll be starting on this sometime after Easter.

Week Beginning 13th February 2023

This was a one-day week for me as I’d taken the Monday off to cover the school half-term holiday while Tuesday to Thursday were strike days.  Thankfully the strike days scheduled for the next two weeks have now been called off so I should be able to get a little more done.

I spent my one day of work continuing with the development of the front-end for the Books and Borrowing project, and managed to complete an initial version of the advanced search form.  In my previous update I said I hadn’t quite managed to complete the selection options for the borrower occupation section.  This is complete now – you can select and deselect occupations at any level and corresponding occupations at higher or lower levels of the hierarchy will also select / deselect as required.  I have also added in autocompletes for borrower settlement and street.  If you start typing into one of these boxes (e.g. ‘black’ in settlement) a list of matching options will appear from which you can select one.  Note that the entered text can appear anywhere in the settlement name (e.g. ‘parish of Blackford’ is brought back) and we might want to change this to just match the beginning of settlements.

Street works in the same way (e.g. type in ‘king’).  A couple of things to point out, though.  Firstly: the selection of settlement and of street are currently in no way connected.  E.g. if you select ‘Blackford’ as a settlement and then attempt to type in a street the system doesn’t limit this to just streets within Blackford.  I could update things to connect the two search boxes in such a way, though.  Secondly:  I think we’ll have to give people freedom to ignore the autocomplete if they want.  For example, if you enter ‘king’ in ‘street’ you’ll see lots of very specific addresses (e.g. ’15 great king street’).  If you select one of those you’re obviously limiting your search quite considerably.  Whereas if we allow people to enter ‘great king street’ to bring back all borrowings at all addresses on this street the search might be more useful.  I’ve also added in borrower gender which (as specified in the requirements document) allows one single gender to be selected.  Thinking about it, we might want to make this a multi-select like other things instead.

The book author section is exactly the same as the ‘simple’ search and the book work section is pretty straightforward (and still awaits the addition of genre).  In the book edition section the ESTC field is an autocomplete.  This works slightly differently in that it matches the beginning of the ESTC only (e.g. ‘T1001’ matches IDs beginning with that text) and three characters rather than two need to be entered.  Even three gives a very long list and I may make it four characters before the list appears.  The last autocomplete field is place of publication.  This matches text anywhere in the place.  For example, type in ‘lon’ and you’ll see all of the places involving London, but also places like Bouillon.  I did wonder about making this a multi-select instead, but there are possibly too many to list all at once.

There are also two further multi-select areas for language and format.  Format wasn’t listed in the requirements document as being a multi-select but I think it makes sense for it to be one.  Each of these areas lists the number of book editions that have the language / format and (as previously discussed) the data needs tidying up as it’s a bit messy.  So, that’s the ‘advanced’ form complete, although the layout is not finalised so it will eventually look a lot nicer (I hope).  But there’s no getting around the fact that there are an intimidatingly large number of search options listed and we might need to think some more about this.

Also this week I inserted a missing page into the records for a register for Leighton library and sent the data for the 62 borrowers that have been erroneously assigned the mid-tier ‘Minister/Priest’ occupation to Katie to be assigned a final tier occupation instead.

I also spoke to Matthew Creasy about a conference website he would like to put together and to Ophira Gamliel about some T4 issues, and to discuss an AHRC proposal she submitted a while back, for which I wrote the Data Management Plan and which has now been successfully awarded funding.  This will begin sometime over the summer.

Next week I will continue to implement the advanced search for Books and Borrowing.

Week Beginning 9th January 2023

I attended the workshop ‘The impact of multilingualism on the vocabulary and stylistics of Medieval English’ in Zurich this week.  The workshop ran on Tuesday and Wednesday and I travelled to Zurich with my colleagues Marc Alexander and Fraser Dallachy on Monday.  It was really great to travel to a workshop in a different country again as I’d not been abroad since before Lockdown.  I’d never been to Zurich before and it was a lovely city.  The workshop itself was great, with some very interesting papers and good opportunities to meet other researchers and discuss potential future projects.  I gave a paper on the Historical Thesaurus, its categories and data structures and how semantic web technologies may be used to more effectively structure, manage and share the Historical Thesaurus’s semantically arranged dataset.  It was a half-hour paper with 10 minutes for questions afterwards and it went pretty well.  The audience wasn’t especially technical and I’m not sure how interesting the topic was to most people, but it was well received and I’m glad I had the opportunity to both attend the event and to research the topic as I have greatly increased my knowledge of semantic web technologies such as RDF, graph databases and SPARQL, and as part of the research I managed to write a script that generated an RDF version of the complete HT category data, which may come in handy one day.

I got back home just before midnight on the Wednesday and returned to normal work first thing on Thursday.  This included submitting my expenses from the workshop and replying to a few emails that had come in regarding my office (it looks like the dry rot work is going to take a while to resolve and it also looks like I’ll have to share my temporary office) and attempting to set up web hosting for the VARICS project, which Arts IT Support seem reluctant to do.  I also looked into an issue with the DSL that Ann Ferguson had spotted and spoke to the IT people at Stirling about their current progress with setting up a Solr instance for the Books and Borrowing project.  I also replaced a selection of library register images with better versions for that project and arranged a meeting for next Monday with the project’s PI and Co-I to discuss progress with the front-end.

I spent most of Friday writing a Data Management Plan and attending a Zoom call for a new speech therapy project I’m involved with.  It’s an ESRC funding proposal involving Glasgow and Strathclyde and I’ll be managing the technical aspects.  We had a useful call and I managed to complete an initial version of the DMP that the PI is going to adapt if required.

Week Beginning 19th December 2022

This was the last week before the Christmas holidays, and Friday was a holiday.  I spent some time on Monday making further updates to the Speech Star data.  I fixed some errors in the data and made some updates to the error type descriptions.  I also made ‘poster’ images from the latest batch of child speech videos I’d created last week as this was something I’d forgotten to do at the time.  I also fixed some issues with the non-disordered speech data, including changing a dash to an underscore in the filenames of the files for one speaker as there had been a mismatch between filenames and metadata, causing none of the videos to open in the site.  I also created records for two projects (The Gentle Shepherd and Speak For Yersel) on this very site (see https://digital-humanities.glasgow.ac.uk/projects/last-updated/) as these are the projects I’ve been working on that have actually launched in the past year.  Other major ones such as Books and Borrowing and Speech Star are not yet ready to share.  I also updated all of the WordPress sites I manage to the latest version.

On Tuesday I travelled into the University to locate my new office.  My stuff had been moved across last week after a leak in the building resulted in water pouring through my office.  Plus work is ongoing to fix the dry rot in the building and I would have needed to move out for that anyway.  It took a little time to get the new office in order and to get my computer equipment set up, but once it was all done it was actually a very nice location – much nicer than the horrible little room I’m usually stuck in.

I spent most of Tuesday upgrading Google Analytics for all of the sites I manage that use it.  Google’s current analytics system is being retired in July next year and I decided to use the time in the run-up to Christmas to migrate the sites over to the new Google Analytics 4 platform.  This was a mostly straightforward process, although as usual Google’s systems feel clunky and counterintuitive at times.  It was also a fairly lengthy process as I had to update the code for each site in question.  Nevertheless I managed to get it done and informed all of the staff whose websites would be affected by the change.  I also had a further chat with Geert, the editor of the Anglo-Norman Dictionary, about the new citation edit feature I’m planning at the moment.

On Wednesday I had a meeting with prospective project partners in Strathclyde about a speech therapy proposal we’re putting together.  It was good to meet people and to discuss things.  I’ll be working on the Data Management Plan for the proposal after the holidays.  I spent the rest of the day working on my paper for the workshop I’m attending in Zurich in the second week of January.  I have now finished the paper, which is quite a relief.

On Thursday I spent some time working for the Dictionaries of the Scots Language.  I responded to an email from Ann Fergusson about how we should handle links to ancillary pages in the XML.  There are two issues here that need to be agreed upon.  The first issue is how to represent links to things other than entries in the entry XML.  We currently have the <ref> element that is used to link from one entry to another (e.g. <ref refid="snd00065761">Chowky</ref>).  We could use the HTML element <a> in the XML for links to things other than entries but I personally think it’s best not to use this as (in my opinion) it’s better for XML elements to be meaningful when you look at them and the meaning of <a> isn’t especially clear.  It might be better to use <ref> with a different attribute instead of ‘refid’, for example <ref url="https://dsl.ac.uk/geographical-labels">.  Reusing <ref> means we don’t need to update the DTD (the rules that define which elements can be used where in the XML) to add a new element.

Of course other people may think that inventing our own way of writing HTML links is daft when everyone is already familiar with <a href="https://dsl.ac.uk/geographical-labels"> and we could use the latter if people prefer.  If this is the case we would need to update the DTD to allow such elements to be used.  If we didn’t update the DTD the XML files would fail to validate.

Whichever way is chosen, there is a second issue that will need to be addressed:  I will need to update the XSLT that transforms the XML into HTML to tell the script how to handle either a <ref> with a ‘url’ attribute or a <a> with an ‘href’ attribute.  Without updating the XSLT the links won’t work.  I can add such a rule in when we decide how best to represent links in the XML.

I also made a couple of tweaks to the wildcard search term highlighting feature I was working on last week and then published the update on the live DSL site.  Now when you perform a search for something like ‘chr*mas’ and then select an entry to view, any word that matches the wildcard pattern will be highlighted.  For example, go to this page: https://dsl.ac.uk/results/chr*mas/fulltext/withquotes/both/ and then select one of the entries and you’ll see the term highlighted in the entry page.
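
For anyone curious how this sort of highlighting can work: the wildcard pattern is converted into a regular expression and matching words are wrapped in a highlight element. The sketch below shows the approach in Python rather than the code actually used on the site, and the class name is made up:

```python
import re

def highlight_wildcard(term, html_text):
    """Turn a search term like 'chr*mas' into a regex and wrap matching words
    in a highlight span."""
    pattern = re.escape(term).replace(r"\*", r"\w*").replace(r"\?", r"\w")
    return re.sub(
        rf"\b({pattern})\b",
        r'<span class="highlight">\1</span>',
        html_text,
        flags=re.IGNORECASE,
    )

print(highlight_wildcard("chr*mas", "A blythe Christmas to ane and a'"))
# A blythe <span class="highlight">Christmas</span> to ane and a'
```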

That’s all from me for this year.  Merry chr*mas one and all!

Week Beginning 24th October 2022

I returned to work this week after having a lovely week’s holiday in the Lake District.  I spent most of the week working for the Books and Borrowing project.  I’d received images for two library registers from Selkirk whilst I was away and I set about integrating them into our system.  This required a bit of work to get the images matched up to the page records for the registers that already exist in the CMS.  Most of the images are double-pages but the records in the CMS are of single pages marked as ‘L’ or ‘R’.  Not all of the double-page images have both ‘L’ and ‘R’ in the CMS and some images don’t have any corresponding pages in the CMS.  For example in Volume 1 we have ‘1010199l’ followed by ‘1010203l’ followed by ‘1010205l’ and then ‘1010205r’.  This seems to be quite correct as the missing pages don’t contain borrowing records.  However, I still needed to figure out how to match up images and page records.  As with previous situations, the options were either slicing the images down the middle to create separate ‘L’ and ‘R’ images to match each page or joining the ‘L’ and ‘R’ page records in the CMS to make one single record that then matches the double-page image.  There are several hundred images so manually chopping them up wasn’t really an option, and automatically slicing them down the middle wouldn’t work too well as the page divide is often not in the centre of the image.  This then left joining up the page records in the CMS as the best option and I wrote a script to join the page records, rename them to remove the ‘L’ and ‘R’ affixes, move all borrowing records across, renumber their page order and then delete the now empty pages.  Thankfully it all seemed to work well.  I also uploaded the images for the final register from the Royal High School, which thankfully was a much more straightforward process as all image files matched references already stored in the CMS for each page.
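
In outline the joining script does something like the following. This is only a sketch: the table and column names (pages, borrowings, folio and so on) are assumptions, not the project's actual CMS schema, and the renumbering of borrowing order within the merged page is elided:

```python
def join_pages(conn, register_id):
    """Merge each 'L'/'R' pair of page records into one record that matches
    the corresponding double-page image."""
    left_pages = conn.execute(
        "SELECT id, folio FROM pages WHERE register_id = ? AND folio LIKE '%l'",
        (register_id,),
    ).fetchall()
    for left_id, folio in left_pages:
        base = folio[:-1]  # '1010205l' -> '1010205'
        right = conn.execute(
            "SELECT id FROM pages WHERE register_id = ? AND folio = ?",
            (register_id, base + "r"),
        ).fetchone()
        # Rename the left page so it matches the double-page image reference.
        conn.execute("UPDATE pages SET folio = ? WHERE id = ?", (base, left_id))
        if right:
            # Move the right page's borrowing records across, then delete the
            # now-empty right-hand page record.
            conn.execute(
                "UPDATE borrowings SET page_id = ? WHERE page_id = ?",
                (left_id, right[0]),
            )
            conn.execute("DELETE FROM pages WHERE id = ?", (right[0],))
    conn.commit()
```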

I then returned to the development of the front-end for the project.  When I looked at the library page I’d previously created I noticed that the interactive map of library locations was failing to load.  After a bit of investigation I realised that this was caused by new line characters appearing in the JSON data for the map, which invalidated the file structure.  These had been added via the library ‘name variants’ field in the CMS and were appearing in the data for the library popup on the map.  I needed to update the script that generates the JSON data to ensure that new line characters are stripped out, and after that the map loaded again.
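The fix itself is simple once the cause is known: the script that builds the JSON just needs to strip new line characters from the text fields before output.  A minimal sketch, with made-up field names (the real script pulls the data from the CMS database):

import json

def clean(value):
    """Remove new line characters that would otherwise break the map's JSON file."""
    return " ".join(value.replace("\r", " ").replace("\n", " ").split())

library = {
    "name": "Example Library",
    "name_variants": clean("Variant one\nVariant two\r\n"),  # the field that caused the problem
}
print(json.dumps(library))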

Before I went on holiday I’d created a browse page for library books that split a library’s books up based on the initial letter of their titles.  The approach I’d taken worked pretty well, but St Andrews was still a bit of an issue due to it containing many more books than the other libraries (more than 8,500).  Project Co-I Matt Sangster suggested that we should omit some registers from the data as their contents (including book records) are not likely to be worked on during the course of the project.  However, I decided to leave the data in place for now, as excluding data for specific registers would require quite a lot of reworking of the code: the book data for a library is associated directly with the library record and not with specific registers, so all the queries would need to be rewritten to check which registers a book appears in.  I reckon that if these registers are not going to be tackled by the project it might be better to just delete them, not only to make the system work better but also to avoid confusing users with messy data, but for now everything stays as it is.

This week I added in two further ways of browsing books in a selected library: by author and by most borrowed.  A drop-down list featuring the three browse options now appears at the top of the ‘Books’ page, and I’ve added in a title and explanatory paragraph about the list type.  The ‘by author’ browse works in a similar manner to the ‘by title’ browse, with a row of tabs for the initial letter of the author’s surname and a count of the number of books that have an author whose surname begins with that letter.  Note that any books that don’t have an associated author do not appear in this list.  I did think about adding a ‘no author’ tab as well, but some libraries (e.g. St Andrews) have so many books without specified authors that the data for this tab would take far too long to load.  Note also that if a book has multiple authors then the book will appear multiple times – once for each author.  Here’s a screenshot of how the interface currently looks:

The actual list of books works in a similar way to the ‘title’ list but is divided by author, with authors appearing with their full name and dates in red above a list of their books.  The records are ordered by author surname, then forename, then author ID, then book title.  This means two authors with the same name will still appear as separate headings with their books ordered alphabetically.  However, this has also uncovered some issues with duplicate author records.

Getting this browse list working actually took a huge amount of effort due to the complex way we store authors.  In our system an author can be associated with any one of four levels of book record (work / edition / holding / item) and an author associated at a higher level needs to cascade down to the lower-level book records.  Running queries directly on this structure proved to be too resource-intensive and slow, so instead I wrote a script to generate cached data about authors.  This script goes through every author connection at all levels and picks out the unique authors that should be associated with each book holding record.  It then stores a reference to the ID of the author, the ID of the holding record and the initial letter of the author’s surname in a new table that is much more efficient to query.  This is then used to generate the letter tabs with the book counts and to work out which books to return when an author surname beginning with a particular letter is selected.
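As a rough illustration of what the cache holds (the names and structures here are invented for the example; the real script works across the four levels of the CMS database):

from collections import Counter

def build_author_cache(holding_authors):
    """Sketch: holding_authors maps each book holding ID to the authors connected to it
    at any of the four levels (work / edition / holding / item), after cascading down.
    Returns rows for the cache table: (author_id, holding_id, surname_initial)."""
    rows = []
    for holding_id, authors in holding_authors.items():
        seen = set()
        for author_id, surname in authors:
            if author_id in seen or not surname:
                continue  # keep each author once per holding; skip records with no surname
            seen.add(author_id)
            rows.append((author_id, holding_id, surname[0].upper()))
    return rows

def letter_counts(cache_rows):
    """The letter tabs are then just a count of holdings per surname initial."""
    return Counter(initial for _, _, initial in cache_rows)

example = {101: [(1, "Smith"), (2, "Smollett"), (1, "Smith")], 102: [(3, "Burns")]}
print(letter_counts(build_author_cache(example)))  # Counter({'S': 2, 'B': 1})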

However, one thing we need to consider about using cached tables is that the data only gets updated when I run the script to refresh the cache, so any changes or additions to authors made in the CMS will not be directly reflected in the library books tab.  This is also true of the ‘browse books by title’ lists I previously created.  I noticed when looking at the books beginning with ‘V’ for a library (I can’t remember which) that one of the titles clearly didn’t begin with a ‘V’, which confused me for a while before I realised that the title must have been changed in the CMS since I last generated the cached data.

The ‘most borrowed’ page lists the top 100 most borrowed books for the library, from most to least borrowed.  Thankfully this was rather more straightforward to implement as I had already created the cached fields for this view.  I did consider whether to have tabs allowing you to view all of the books by number of borrowings, but I wasn’t really sure how useful this would be.  In terms of the display of the ‘top 100’, the books are listed in the same way as the other lists, but the number of borrowings is highlighted in red text to make it easier to see.  I’ve also added a number to the top-left of each book record so you can see which place a book has in the ‘hitlist’, as you can see in the following screenshot:

I also added in a ‘to top’ button that appears as you scroll down the page (it appears in the bottom right, as you can see in the above screenshot).  Clicking on this scrolls to the page title, which should make the page easier to use – I’ve certainly been making good use of the button anyway.

Also this week I submitted my paper ‘Speak For Yersel: Developing a crowdsourced linguistic survey of Scotland’ to DH2023.  As it’s the first ‘in person’ DH conference to be held in Europe since 2019 I suspect there will be a huge number of paper submissions, so we’ll just need to see if it gets accepted or not.  Also for Speak For Yersel I had a lengthy email conversation with Jennifer Smith about repurposing the SFY system for use in other areas.  The biggest issue here would be generating the data about the areas:  settlements for the drop-down lists, postcode areas with GeoJSON shape files and larger region areas with appropriate GeoJSON shape files.  It took Mary a long time to gather or create all of this data and someone would have to do the same for any new region.  This might be a couple of weeks of effort for each area.  It turns out that Jennifer has someone in mind for this work, which would mean all I would need to do is plug in a new set of questions, work with the new area data and make some tweaks to the interface.  We’ll see how this develops.  I also wrote a script to export the survey data for further analysis.

Another project I spent some time on this week was Speech Star.  For this I created a new ‘Child Speech Error Database’ and populated it with around 90 records that Eleanor Lawson had sent me.  I imported all of the data into the same database as is used for the non-disordered speech database and added a flag that decides which content is displayed on which page.  I removed ‘accent’ as a filter option (as all speakers are from the same area) and added in ‘error type’.  Currently the ‘age’ filter defaults to the age group 0-17, as I wasn’t sure how this filter should work given that all speakers are children.

The display of records is similar to the non-disordered page in that there are two means of listing the data, each with its own tab.  In the new page these tabs are for ‘error type’ and ‘word’.  I also added in ‘phonemic target’ and ‘phonetic production’ as new columns in the table as I thought it would be useful to include these, and I updated the video pop-up for both the new page and the non-disordered page to bring it into line with the pop-up for the disordered paediatric database, meaning all metadata now appears underneath the video rather than some appearing in the title bar or above the video and the rest below.  I’ve ensured this is exactly the same for the ‘multiple video’ display too.  At the moment the metadata all just appears on one long line (other than speaker ID, sex and age) so the full width of the pop-up is used, but we might change this to a two-column layout.

Later in the week Eleanor got back to me to say she’d sent me the wrong version of the spreadsheet, and I therefore replaced the data.  However, I spotted something relating to the way I structure the data that might be an issue.  I’d noticed a typo in the earlier spreadsheet (there is a ‘helicopter’ and a ‘helecopter’) and had fixed it, but I forgot to fix it in the newer file before uploading it.  Each prompt is only stored once in the database, even if it is used by multiple speakers, so I was going to go into the database, remove the superfluous ‘helecopter’ prompt row and point the speaker at the existing ‘helicopter’ prompt.  However, I noticed that ‘helicopter’ in the spreadsheet has ‘k’ as the sound whereas the existing record in the database has ‘l’.  I realised this is because the ‘helicopter’ prompt had been created as part of the non-disordered speech database, where the sound is indeed ‘l’.  It looks like one prompt may have multiple sounds associated with it, which my structure isn’t set up to deal with.  I’m going to have to update the structure next week.
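One possible restructure (just a sketch with assumed table and column names, not a decision) would be to move the sound off the prompt record and onto the record that links a speaker’s video to a prompt, so the same prompt word can carry different sounds in different databases:

import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: each prompt word is stored once, and the target sound
# (and error type, where relevant) lives on the individual recording instead.
conn.executescript("""
CREATE TABLE prompts (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE              -- e.g. 'helicopter', stored only once
);
CREATE TABLE recordings (
    id INTEGER PRIMARY KEY,
    speaker_id INTEGER,
    prompt_id INTEGER REFERENCES prompts(id),
    sound TEXT,                   -- 'l' in the non-disordered database, 'k' in the new one
    error_type TEXT               -- only used by the child speech error database
);
""")
conn.commit()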

Also this week I responded to a request for advice from David Featherstone in Geography, who is putting together some sort of digitisation project.  I also responded to a query from Pauline Graham at the DSL regarding the source data for the Scots School Dictionary.  She wondered whether I had the original XML and I explained that there was no original XML: the original data was stored in an ancient FoxPro database that ran from a CD.  When I created the original School Dictionary app I managed to find a way to extract the data and I saved it as two CSV files – one English-Scots, the other Scots-English.  I then ran a script to convert these into JSON, which is what the original app uses.  I gave Pauline a link to download all of the data for the app, including both the English and Scots JSON files and the sound files, and I also uploaded the English CSV file in case this would be more useful.
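For reference, the CSV to JSON step is trivial; something along these lines (the column names are assumptions, not the actual headings in the School Dictionary files):

import csv
import json

def csv_to_json(csv_path, json_path):
    """Sketch: convert one of the School Dictionary CSV files into JSON for the app."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))  # e.g. columns such as 'headword' and 'translation'
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

csv_to_json('english-scots.csv', 'english-scots.json')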

That’s all for this week.  Next week I’ll fix the issues with the Speech Star database and continue with the development of the Books and Borrowing front-end.