Week Beginning 18th September 2023

On Monday and Tuesday this week I participated in the UCU strike action.  On my return to work on Wednesday I focussed on writing a Data Management Plan for Jennifer Smith’s ESRC proposal that uses some of the SCOSYA data.  After a few follow-up conversations I completed a version of the plan that Jennifer was happy with.  I informed her that I’d be happy to help out with any further changes or discussions, but other than that my involvement is now complete.

I spent a fair bit of the remainder of the week trying to fix an old resource.  I created the House of Fraser Archive site (https://housefraserarchive.ac.uk/) with Graeme Cannon more than twelve years ago, with Graeme doing the XML parts via an eXist-DB system and me doing the interface and all of the parts that processed and displayed data returned from eXist.  Unfortunately the server the site was running on had to be taken offline and the system moved elsewhere.  A newer version of eXist was required and the old libraries that were used to connect to the XML database would no longer work.  I figured out a way to connect via an alternative method, but this then returned the data in a different structure.  This meant updating every page of the site that processed data, changing both the way the system was queried and the way the returned data was handled.  This took quite a lot of time, but I managed to get all of the ‘browse’ options plus the display of records, tags and images working.  The only thing I couldn’t get to work was the search, as this seems to use further libraries that are no longer available.  I think the issue is structuring the query to work with eXist, but I’m not much of an expert with eXist and I’m not really sure how to untangle things.  I’ve asked Luca if he could have a look at it, as he’s used eXist a lot more than I have.  I’ve not heard back from him yet, but hopefully we’ll manage to get the search working; otherwise we may have to remove the search and have people rely on the browse functions to access the data instead.

For the rest of the week I returned to working on the Books and Borrowing project.  One thing on my ‘to do’ list is to sort out the API.  There are a few endpoints that I haven’t documented yet, plus the existing documentation and structuring of the API could be improved.  I spent some time adding in a license statement and a ‘table of contents’ that lists all endpoints.  I’m currently in the middle of adding in the missing endpoint descriptions.  After that I’ll need to ensure the examples given all work and make sense and then I need to ensure the CSV output works properly for all data types.  I’m fairly certain that some data held in arrays will not output properly as CSV at the moment and this definitely needs sorted.
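The array-to-CSV problem mentioned above is a common one: list-valued fields need flattening (e.g. joined with a separator) before they can be written as CSV cells.  A minimal Python sketch of the idea, with entirely hypothetical field names (the project’s actual API is not written in Python):

```python
import csv
import io

def to_csv(rows, suppress=()):
    """Serialise a list of API result dicts as CSV text, joining
    array-valued fields with '|' and omitting suppressed fields."""
    out = io.StringIO()
    fields = [f for f in rows[0] if f not in suppress]
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for row in rows:
        # Flatten lists into a single delimited cell so the CSV stays regular
        flat = {f: '|'.join(v) if isinstance(v, list) else v
                for f, v in row.items() if f not in suppress}
        writer.writerow(flat)
    return out.getvalue()
```

Without the flattening step, a naive writer would emit Python-style `['a', 'b']` strings for list values, which is presumably the kind of malformed output the current endpoints produce.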

Week Beginning 11th September 2023

I spent a fair amount of time this week preparing for my PDR session – bringing together information about what I’ve done over the past year and filling out the necessary form.  I also had a meeting with Jennifer Smith to discuss an ESRC proposal she’s putting together using some of the data from the SCOSYA project and then spent some further time after the meeting researching some tools the project might use and reading the Case for Support.

I also spent a bit of time working for the Anglo-Norman Dictionary, updating the XSLT file to better handle varlists in citations.  So for example instead of:

( MS: s.xiiiex )  Satureia: (A6) gallice savoroye (var. saveray  (A9) MS: c.1300 ;  saveroy  (A12) MS: s.xiii4/4 ;  savoreie  (B3) MS: s.xiv4/4 ;  savoré  (C35) MS: s.xv )  Plant Names 230

we’d have:

( MS: s.xiiiex )  Satureia: (A6) gallice savoroye (var.  (A9: c.1300) saveray; (A12: s.xiii4/4) saveroy; (B3: s.xiv4/4) savoreie;  (C35: s.xv) savoré)  Plant Names 230

I completed an initial version of the update using test files and after discussions with the editor Geert and a few minor tweaks the update went live on Wednesday.

I also spent a bit of time working to fix the House of Fraser Archive website, which I created with Graeme Cannon many moons ago.  It uses an eXist XML database but needed to be migrated to a new server with more modern versions due to security issues.  I spent some time figuring out how to connect to the new eXist database and had just managed to find a solution when the server went down and I was unable to access it.  It was still offline at the end of the week, which is a bit frustrating.

I also made a couple of minor tweaks to a conference website for Matthew Creasy and gave some advice to Ewan Hannaford about adding people to a mailing list.  My updates to the DSL also went live this week on the DSL’s test server, and I emailed the team a detailed report of the changes, highlighting points for discussion.  I’m sure I’ll need to make a number of changes to the features I’ve developed over the past few weeks once the team have had a chance to test things out.  We’ll see what they say once they get back to me.

I was also contacted this week by Eleanor Lawson with a long list of changes she wanted me to make to the two Speech Star websites.  Many of these were minor tweaks to text, but there were some larger issues too.  I needed to update the way sound filters appear on the website in order to group different sounds together and to ensure the sounds always appear in the correct order.  This was a pretty tricky thing to accomplish as the filters are automatically generated and vary depending on what other filter options the user has selected.  It took a while to get working, but I got there in the end, thankfully.  Eleanor had also sent me a new set of videos that needed to be added to the Edinburgh MRI Modelled Speech Corpus.  These were chunks of some of the existing videos as a decision had been made that splitting them up would be more useful for users.  I therefore had to process the videos and add all of the required data for them to the database.  All is looking good now, though.

Next week I’ll be participating in the UCU strike action on Monday and Tuesday so it’s going to be a short week for me.

Week Beginning 4th September 2023

I continued with the new developments for the Dictionaries of the Scots Language for most of this week, focussing primarily on implementing the sparklines for dates of attestation.  I decided to use the same JavaScript library as I used for the Historical Thesaurus (https://omnipotent.net/jquery.sparkline) to produce a mini bar chart for the date range, with a 1 for each year where a date is present and a 0 where it is not.  In order to create the ranges for an entry, all of the citations that have a date are returned in date order.  For SND the sparkline range is 1700 to 2000 and for DOST the range is 1050 to 1700.  Any citations with dates beyond these ranges are clamped to the start or end as applicable.  Each year in the range is assigned a zero by default and then my script iterates through the citations to figure out which of the years should be assigned a 1, taking into consideration citations that have a date range as well as ones that have a single year.  After that my script iterates through the years to generate blocks of 1 values wherever individual 1s are found 25 years or less from each other, as I’d agreed with the team, in order to show continuous periods of usage.  My script also generates a textual representation of the blocks and individual years that is then used as a tooltip for the sparkline.
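The year-array and block-merging logic described above might be sketched as follows.  This is a simplified Python stand-in for illustration only (the project’s actual scripts are not in Python, and the function name and data shapes here are my own):

```python
def build_sparkline(citations, start, end, gap=25):
    """citations: list of (from_year, to_year) tuples, to_year may be None.
    Returns a {year: 0/1} dict plus a textual summary for the tooltip."""
    years = {y: 0 for y in range(start, end + 1)}
    for y_from, y_to in citations:
        y_to = y_to or y_from
        # Clamp dates that fall outside the dictionary's period
        y_from = min(max(y_from, start), end)
        y_to = min(max(y_to, start), end)
        for y in range(y_from, y_to + 1):
            years[y] = 1
    # Merge attested years that are <= `gap` years apart into blocks
    attested = sorted(y for y, v in years.items() if v)
    blocks = []
    for y in attested:
        if blocks and y - blocks[-1][1] <= gap:
            blocks[-1][1] = y
        else:
            blocks.append([y, y])
    # Fill the merged blocks so they render as continuous periods
    for b_start, b_end in blocks:
        for y in range(b_start, b_end + 1):
            years[y] = 1
    # Single-year blocks are output as one year, not '1710-1710'
    tooltip = ', '.join(f'{a}' if a == b else f'{a}-{b}' for a, b in blocks)
    return years, tooltip
```

Note the final line handles the single-year quirk mentioned later in this post, where a range like ‘1710-1710’ should collapse to just ‘1710’.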

I’d originally intended each year in the range to then appear as a bar in the sparkline, with no gaps between the bars in order to make larger blocks, but the bar chart sparkline that the library offers has a minimum bar width of 1 pixel.  As the DOST period is 650 years this meant the sparkline would be 650 pixels wide.  The screenshot below shows how this would have looked (note that in this and the following two screenshots the data represented in the sparklines is test data and doesn’t correspond to the individual entries):

I then tried grouping the individual years into bars representing five years instead.  If a 1 was present in a five-year period then the value for that five-year block was given a 1, otherwise it was given a 0.  As you can see in the following screenshot, this worked pretty well, giving the same overall view of the data but in a smaller space.  However, the sparklines were still a bit too long.  I also added in the first attested date for the entry to the left of the sparkline here, as specified in the requirements document:

As a further experiment I grouped the individual years into bars representing a decade, and again if a year in that decade featured a 1 the decade was assigned a 1, otherwise it was assigned a 0.  This resulted in a sparkline that I reckon is about the right size, as you can see in the screenshot below:
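Assuming the per-year 0/1 values described above are held in a dict keyed by year, the decade grouping might look like this (again a hedged Python sketch, not the project’s actual code):

```python
def years_to_decades(years, start, end):
    """Collapse per-year 0/1 values into one bar per decade: a decade
    gets a 1 if any year within it is attested, otherwise a 0."""
    bars = []
    for decade_start in range(start, end, 10):
        decade = range(decade_start, decade_start + 10)
        bars.append(1 if any(years.get(y, 0) for y in decade) else 0)
    return bars

# For DOST (1050-1700) this yields 65 bars; for SND (1700-2000), 30 bars.
```

At one pixel per bar this matches the sparkline widths discussed below.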

With this in place I then updated the Solr indexes for entries and quotations to add in fields for the sparkline data and the sparkline tooltip text.  I then updated my scripts that generated entry and quotation data for Solr to incorporate the code for generating the sparklines, first creating blocks of attestation where individual citation dates were separated by 25 years or less and then further grouping the data into decades.  It took some time to get this working just right.  For example, on my first attempt when encountering individual years the textual version was outputting a range with the start and end year the same (e.g. 1710-1710) when it should have just outputted a single year.  But after a few iterations the data outputted successfully and I imported the new data into Solr.

With the sparkline data in Solr I then needed to update the API to retrieve the data alongside other data types and after that I could work with the data in the front-end, populating the sparklines for each result with the data for each entry and adding in the textual representation as a tooltip.  Having previously worked with a DOST entry as a sample, I realised at this point that as the SND period is much shorter (300 years as opposed to 650) the SND sparklines would be a lot shorter (30 pixels as opposed to 65).  Thankfully the sparkline library allows you to specify the width of the bars as each sparkline is generated and I set the width of SND bars to two pixels as opposed to the one pixel for DOST, making the SND sparklines a comparable 60 pixels wide.  It does mean that the visualisation of the SND data is not exactly the same as for DOST (e.g. an individual year is represented as 2 pixels as opposed to 1) but I think the overall picture given is comparable and I don’t think this is a problem – we are just giving an overall impression of periods of attestation after all.  The screenshot below shows the search results with the sparklines working with actual data, and also demonstrates a tooltip that displays the actual periods of attestation:

At this point I spotted another couple of quirks that needed to be dealt with.  Firstly, we have some entries that don’t feature any citations that include dates.  These understandably displayed a blank sparkline.  In such cases I have updated the tooltip text to display ‘No dates of attestation currently available’.  Secondly, there is a bug in the sparkline library that means an empty sparkline is displayed if all data values are identical.  Having spotted this I updated my code to ensure a full block of colour was displayed in the sparkline instead of white.

With the sparklines in the search results now working I then moved onto the display of sparklines in the entry page.  I wasn’t entirely sure where was the best place to put the sparkline so for now I’ve added it to the ‘About this entry’ section.  I’ve also added in the dates of attestation to this section too.  This is a simplified version showing the start and end dates.  I’ve used ‘to’ to separate the start and end date rather than a dash because both the start and end dates can in themselves be ranges.  This is because here I’m using the display version of the first date of the earliest citation and the last date of the latest citation (or first date if there is no last date).  Note that this includes prefixes and representations such as ’15..’.  The sparkline tooltip uses the raw years only.  You can see an entry with the new dates and sparkline below:

The design of the sparklines isn’t finalised yet and we may choose to display them differently.  For example, we don’t need to use the purple I’ve chosen and we could have rounded ends.  The following screenshot shows the sparklines with the blue from the site header as a bar colour and rounded ends.  This looks quite pleasing, but rounded ends do make it a little more difficult to see the data at the ends of the sparkline.  See for example DOST ‘scunner n.’ where the two lines at the very right of the sparkline are a bit hard to see.

I also managed to complete the final task in this block of work for the DSL, which was to add in links to the search results to download the data as a CSV.  The API already has facilities to output data as a CSV, but I needed to tweak this a bit to ensure the data was exported as we needed it.  Fields that were arrays were not displaying properly and certain fields needed to be suppressed.  For other sites I’ve developed I was able to link directly to the API’s CSV output from the front-end, but the DSL’s API is not publicly accessible so I had to do things a little differently here.  Instead, pressing the ‘download’ link fires an AJAX call to a PHP script that passes the query string to the API without exposing the URL of the API, then takes the CSV data and presents it as a downloadable file.  This took a bit of time to sort out as the API was itself offering the CSV as a downloadable file and this wasn’t working when being passed to another script.  Instead I had to set the API to output the CSV data on screen, meaning the scripts called via AJAX could then grab this data and process it.

With all of this working I put in a Helpdesk request to get the Solr instances set up and populated on the server and I then copied all of the updated files to the DSL’s test instance.  As of Friday the new Solr indexes don’t seem to be working but hopefully early next week everything will be operational.  I’ll then just need to tweak the search strings of the headword search so that the new Solr headword search matches the existing search.

Also this week I had a chat with Thomas Clancy about the development of the front-end for the Iona place-names project.  About a year ago I wrote a specification for the front-end but never heard anything further about it, but it looks like development will be starting soon.  I also had a chat with Jennifer Smith about the data for the Speak For Yersel spin-off projects and it looks like this will be coming together in the next few weeks too.  We also discussed another project that may use the data from SCOSYA and I might have some involvement in this.

Other than that I spent a bit of time on the Anglo-Norman Dictionary, creating a CSS file to style the entry XML in the Oxygen XML editor’s ‘Author’ view.  The team are intending to use this view to collaborate on the entries and previously we hadn’t created any styles for it.  I had to generate styles that replicated the look of the online dictionary as much as possible, which took some time to get right.  I’m pretty happy with the end result, though, which you can see in the following screenshot:

Week Beginning 28th August 2023

I spent pretty much the whole week working on the new date facilities for the Dictionaries of the Scots Language.  I have now migrated the headword search to Solr, which was a fairly major undertaking, but was necessary to allow headword searches to be filtered.  I decided to create a new Solr core for DSL entries that would be used by both the headword search and the fulltext / fulltext with no quotes searches.  This made sense because I would otherwise have needed to update the Solr core for fulltext to add the additional fields needed for filtering anyway.  With the new core in place I then updated the script I wrote to generate the Solr index to include the new fields (e.g. headword, forms, dates) and generated the data, which I then imported into Solr.

With the new Solr core populated with data I then updated the API to work with it, replacing the existing headword search, which queried the database, with a new search that instead connects to Solr.  As all of the fields that get returned by a search are now stored in Solr the database no longer needs to be queried at all, which should make things faster.  Previously the fulltext search queried the Solr index and then, once entries were returned, the database was queried for each of these to add in the other necessary fields, which was a bit inefficient.

With the API updated the website (still only the version running on my laptop) then automatically used the new Solr index for headword searches: the quick search, the predictive search and the headword search in the advanced search.  I did some tests comparing the site on my laptop to the live site and things are looking good, but I will need to tweak the default use of wildcards.  The new headword search matches exact terms by default, which is the equivalent on the live site of surrounding the term with quotes, (something the quick search does by default anyway).  I can’t really tweak this easily until I move the new site to our test server, though, as Windows (which my laptop uses) can’t cope with asterisks and quotes in filenames, which means the website on my laptop breaks if a URL includes these characters.

With the new headword and fulltext index in place I then moved on to implementing the date filtering options.  In order to do so I realised I would also have to add the entry attestation dates (from and to) to the quotation index as well, as a ‘first attested’ filter on a quotation search will use these dates.  This meant updating the quotation Solr index structure, tweaking my script that generates the quotation data for Solr, running the script to output the data and then ingesting this into Solr, all of which took some time.

I then worked with the Solr admin interface to figure out how to perform a filter query for both first attestation and the ‘in use’ period.  ‘First attested’ was pretty straightforward as a single year is queried: say the year is 1658, a filter would return the entry if the filter was a single matching year (i.e. 1658) or a range containing the year (e.g. 1650-1700).  The ‘in use’ filter was much more complex to figure out as the data to be queried is a range.  If the attestation period is 1658-1900 and a single year filter is given (e.g. 1700) then we need to check whether this year is within the range.  What is more complicated is when the filter is also a range.  E.g. 1600-1700 needs to return the entry even though 1600 is less than 1658, and 1850-2000 needs to return the entry even though 2000 is greater than 1900.  1600-2000 also needs to return the entry even though both ends extend beyond the period, and 1660-1670 needs to return the entry as both ends are entirely within the period.

The answer to this headache-inducing problem was to run a query that checks whether the start date of the filter range is less than the end date of the attestation range and the end date of the filter range is greater than the start date of the attestation range; both conditions must hold.  So for example the attestation range is 1658-1900.  Filter range 1 is 1600-1700: 1600 is less than 1900 and 1700 is greater than 1658, so the entry is returned.  Filter range 2 is 1850-2000: 1850 is less than 1900 and 2000 is greater than 1658, so the entry is returned.  Filter range 3 is 1600-2000: 1600 is less than 1900 and 2000 is greater than 1658, so the entry is returned.  Filter range 4 is 1660-1670: 1660 is less than 1900 and 1670 is greater than 1658, so the entry is returned.  Filter range 5 is 1600-1650: 1600 is less than 1900 but 1650 is not greater than 1658, so the entry is not returned.  Filter range 6 is 1901-1950: 1901 is not less than 1900, so the entry is not returned.
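This is the standard interval-overlap test, which can be sketched in a few lines (using inclusive comparisons so that boundary years match; the function and field names here are my own, not the project’s):

```python
def in_use_match(att_start, att_end, f_start, f_end):
    """True if the filter range [f_start, f_end] overlaps the
    attestation range [att_start, att_end] (inclusive ends)."""
    return f_start <= att_end and f_end >= att_start

# In Solr the same logic could be expressed as two range filters,
# assuming hypothetical fields att_start and att_end, e.g. for an
# 'in use' filter of 1600-1700:
#   fq=att_end:[1600 TO *]&fq=att_start:[* TO 1700]
```

The six worked examples above (attestation range 1658-1900) all behave as described under this check.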

Having figured out how to implement the filter query in Solr I then needed to update the API to take filter query requests, process them, format the query and then pass this to Solr.  This was a pretty major update and took quite some time to implement, especially as the quotation search needed to be handled differently for the ‘in use’ search, which as agreed with the team was to query the dates of the individual quotations rather than the overall period of attestation for an entry.  I managed to get it all working, though, allowing me to pass searches to the API by changing variables in a URL and filter the results by passing further variables.

With this in place I could then update the front-end to add in the option of filtering the results.  I decided to add the option as a box above the search results.  Originally I was going to place it down the left-hand side, but space is rather limited due to the two-column layout of a search result that covers both SND and DOST.  The new ‘Filter the results’ box consists of buttons for choosing between ‘First attested’ and ‘In use’ and ‘from’ and ‘to’ boxes where years can be entered.  There is also an ‘Update’ button and a ‘clear’ button.  Supplying a filter and pressing ‘Update’ reloads the page with the results filtered based on your criteria.  It will be possible to bookmark or cite a filtered search as the filters are added to the page URL.  The filter box appears on all search results pages, including the quick search, and seems to be working as intended.

So for example the screenshot below shows a filtered fulltext search for ‘burn’.  Without the filter this brings back more results than we allow, but if you filter the results to only those that are first attested between 1600 and 1700 a more reasonable number is returned, as the screenshot shows:

The second screenshot shows the entries that were in use during this period rather than first attested, which as you can see gives a larger number of results:

As mentioned, the ‘in use’ filter works differently for quotations, limiting those that are displayed to ones in the filter period.  The screenshot below shows an ‘in use’ filter of 1650 to 1750 for a quotation search for ‘burn’:

The filter is ‘remembered’ when you navigate to an entry and then use the ‘back to search results’ button.  You can clear the filter by pressing the ‘clear’ button or by deleting the years in the ‘from’ and ‘to’ boxes and pressing ‘update’.  Previously, if there was only one search result the results page would automatically redirect to the entry.  This was also happening when an applied filter gave only one result, which I found very annoying, so now if a filter is present and only one result is returned the results page is still displayed.  Next week I’ll work on adding the first attested dates to the search results and I’ll also begin to develop the sparklines.

Other than this I had a meeting with Joanna Kopaczyk to further discuss a project she’s putting together.  It looks like it will be a fairly small pilot project to begin with and I’ll only be involved in a limited capacity, but it has great potential.