Week Beginning 15th April 2024

The Books and Borrowing project has its official launch next week, and I spent some time this week preparing for it.  This included fixing a speed issue with the site-wide facts page (https://borrowing.stir.ac.uk/facts/) that was taking far too long to load its default view, something that had somehow got worse since I originally created the feature.  Some investigation revealed that the main sticking point was loading the data for the two ‘Borrowings through time’ visualisations, which took a long time to calculate across all libraries.  I therefore created cached files of this data for all libraries and set the API to use these rather than querying the database whenever specific libraries are not requested.  This has greatly sped up the loading of the page, which (for me at least) is now practically instantaneous, while before it was taking up to 30 seconds.

I also went through all of my blog posts and extracted the text relating to the project, going back four years, to create a single document.  It’s 130 pages long and contains more than 56,000 words, covering all of the work I did for the project.  I then spent some time preparing a talk I’ll give at the launch about the development of the resource.  I haven’t finished working on this yet and will continue with it next week.  I also met with Matt to discuss the British Association for Romantic Studies journal (https://www.bars.ac.uk/review/index.php/barsreview), which needs an overhaul.  I’m going to work on this for Matt and hopefully get an updated version in place before the end of May.

Also this week I continued to work for the Speech Star project, adding another batch of videos to the website.  This took a fair amount of time to implement as the videos needed to be added to a variety of different pages.

I then returned to working on the Speak For Yersel follow-on project, focussing on Wales this time.  I needed to generate a top-level ‘regions’ GeoJSON file that amalgamated the area polygons into 22 larger regions.  I did this in QGIS, manually selecting and merging the areas.  With the regions file in place I could then set up the Wales survey using my survey setup script.  This all went pretty smoothly, although there were some inconsistencies between the area GeoJSON file and the settlements spreadsheet that I needed to fix before I could get the survey to set up successfully.

I’d also received feedback about the Republic of Ireland and Northern Ireland surveys, including many changes that needed to be made to the survey questions.  I decided that the easiest way to handle this would be to delete the surveys (which were still only test versions with no real data) and start again with updated question / answer spreadsheets.  There were some other updates that had been suggested that would apply to all surveys and I updated all three sites to incorporate these.  I then spent a bit of time investigating QGIS and whether it could be used to create simpler versions of the region polygons in order to generate a logo for each of our new sites that would be analogous to the Scottish survey.  After discussing this with Jennifer and Mary we agreed that Mary would take this forward, so we’ll hopefully see the outcome next week.

Week Beginning 8th April 2024

I’d taken a day off this week, so only worked four days.  I began the week continuing to work on the Anglo-Norman Dictionary, making a tweak to the publications scripts I was working on last week and then planning a new search of language tags that the editor wanted to be added.  Language tags are at entry level (i.e. they apply to the whole entry) and are used to denote loanwords.

There are only 2,660 entries that currently feature the language tag (and 24,762 that don’t), so the search is going to be fairly limited.  I explored two possible developments.  Firstly, we could have a separate tab on the ‘Advanced Search’ page for ‘Language’, as we do for ‘Semantic & Usage Labels’.  The new tab would work in a similar way to this (see https://anglo-norman.net/search/), with a list of languages and a count of the number of associated entries.  We could either make a search run as soon as a language is clicked on, or we could allow multiple languages to be selected and then joined with Booleans as with the ‘Label’ search.  The latter would be more consistent, but I’m not sure how useful it would be as there aren’t many entries that have multiple language tags (so ‘AND’ and ‘NOT’ would not be so helpful).  I guess ‘OR’ would be more useful, but the user could just perform separate searches as there won’t be a huge number of results anyway.

Secondly, we could add a language selector to the ‘Headwords & Forms’ search tab, underneath ‘Citation date’.  We could provide a list of languages (including a note explaining how the languages are used and that they are not widely applied), with each language appearing as a checkbox (checking multiple will act as ‘OR’).  The language search could then be used on its own (leaving ‘Headword’ and ‘Citation date’ blank) or in conjunction with the other search options.

So, for example, a search could:

  1. Retrieve all of the words of Scandinavian origin
  2. Retrieve all of the words of Scandinavian origin that have a headword / form beginning ‘sc’
  3. Retrieve all of the words of Scandinavian origin that have a headword / form beginning ‘sc’ whose entries feature a citation with a date between 1400 and 1450

After further consultation with the editor, Geert, we decided that I’d start by developing the separate tab option, and we may expand the headword search to incorporate language at a later date, if it’s still considered necessary.

Incorporating a language search is going to mean updating the database and the entry publication scripts (both in the management system and my batch scripts) to extract language data from the entries when they are edited or created. I’ll also need to update the ‘view entry’ page in the DMS so the language data is listed.

My plan of action is to do the following:

  1. Create a new database table that will hold entry IDs, language IDs and whether the entry is a compound.  Where an entry has multiple languages it will have multiple rows in this table.
  2. Write a script that will iterate through the entries, will extract language data and will populate this table
  3. Incorporate the script into the publication workflow and ensure an entry’s language listing is cleared when an entry is deleted prior to a major batch upload.
  4. Update the ‘entry search’ facility in the site’s API to add a new search type for language.  This will accept similar arguments to the existing ‘label’ search type: one or more language IDs, Booleans to be used between the language IDs and  whether the search should be limited to compound words
  5. Add a further endpoint that will return a list of all languages together with a count of the number of entries that feature each language
  6. Update the advanced search page to add in a new ‘Language’ tab.  This will have a similar structure to the ‘Labels’ tab and will feature a list of languages together with counts of associated entries in a scrollable area on the left.  It will be possible to click on a language to add or remove it from a further section on the right of the page where selected languages will be listed.  If multiple languages are selected a drop-down list of Boolean options will appear between each language.  Pressing on the ‘Search’ button after selecting one or more languages will perform a language search.  This will list all corresponding entries in the same way as a Headword search.

I managed to complete the first two tasks, extracting 3,097 language tags and adding these to the database.
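As a rough illustration of the extraction step (the real AND entry XML schema may well differ; the <language> element and its ‘lang’ attribute are assumptions here), a script along these lines would pull out one database row per language tag:

```javascript
// Extract language tags from an entry's XML, producing one row per tag
// to match the new entry-ID / language-ID table design. The element and
// attribute names are hypothetical, not the dictionary's actual schema.
function extractLanguageRows(entryId, entryXml) {
  const rows = [];
  const re = /<language\s+lang="([^"]+)"\s*\/?>/g;
  let m;
  while ((m = re.exec(entryXml)) !== null) {
    rows.push({ entryId: entryId, language: m[1] });
  }
  return rows;
}
```

A batch version would simply loop this over every entry and insert the rows, which is essentially step 2 of the plan above.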

Also this week I had discussions with the Books and Borrowing people about the official launch of the resource that’s taking place in a couple of weeks.  I’m going to be speaking at the launch so I needed to figure out what I should be talking about.  I also returned to the ‘Browse book editions’ page on the website (https://borrowing.stir.ac.uk/books/), which was at this point taking a long time to load.  This is because the page defaults to displaying all book editions in the system that have a title beginning with ‘A’: almost 3,000 books.  I did consider adding pagination to the facility, but I personally find it easier to scroll through a long page rather than flicking between many smaller pages, plus it means a user can use ‘Find’ in their browser to search the listing.  Another option I considered was to limit the default display to a particular genre of book rather than all genres, but I decided that this might confuse people if they don’t notice the limit has been applied.  Instead I set the page not to load a specific letter tab by default.  The tabs load, but to view the content of one of them the user actually has to select it.  This means the page now loads instantaneously and people get to choose what they want to view without a long wait.
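The deferred-tab approach can be sketched as a small client-side loader that only fetches a letter’s content the first time its tab is selected, then keeps it in a cache; this is illustrative, not the site’s actual code:

```javascript
// Build a tab loader around a fetch function (e.g. an AJAX call that
// returns the HTML listing for one letter). Content is only fetched on
// first selection and cached afterwards, so the initial page load stays
// instant no matter how large the listings are.
function makeTabLoader(fetchTabContent) {
  const cache = new Map();
  return function loadTab(letter) {
    if (!cache.has(letter)) {
      cache.set(letter, fetchTabContent(letter));
    }
    return cache.get(letter);
  };
}
```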

Also this week I made some further updates to the Speech Star websites, adding in new ExtIPA animation videos to both the ‘pre’ and ‘post’ 2015 charts.  This was a bit fiddly and took some time but we now have most of the animations in place.  I also exported all of the Historical Thesaurus data for Fraser, as a project needs an up to date copy of it.


Week Beginning 1st April 2024

It was Easter Monday this week, and I’d taken the rest of the week except Friday off as holiday.  I’ll be off for some of next week as well.  After writing my blog post and catching up with emails on Friday I spent the rest of the day working for the Anglo-Norman Dictionary, as the editor was ready to launch a major update to the content – replacing every entry for the letter ‘T’, plus a number of updates to entries in other letters.  I have a documented process that I created for such updates, so thankfully I could follow these instructions (which of course include making backups before changing anything), but it’s still a little scary when thousands of entries are getting deleted and replacement data is generated.

After backing everything up and making all necessary preparations I set the update scripts in motion and thankfully all went smoothly.   The update replaced or created 4,472 dictionary entries, of which 1,805 were ‘main’ entries, with the rest being cross references.  The ‘main’ entries included 3,310 main senses, 1,286 subsenses, 2,958 locutions, 2,060 locution senses and 328 locution subsenses.  Overall the entries included some 15,380 citations.

However, during the afternoon one of the editors encountered a problem when updating one of the new entries via the Dictionary’s online management system.  The script that publishes updates quit midway through with an SQL error, after which the entry could no longer be found on the live site by searching for the headword or variants, and duplicates of the entry appeared in the ‘Browse’ pane.

This was clearly rather concerning, so all work on the dictionary stopped whilst I investigated.  Thankfully I managed to figure out what was going on.  When the entry was subsequently edited some UTF-8 characters had ended up in the attestation dates, but the corresponding columns in the database weren’t set up to store UTF-8.  When the system attempted to insert the characters the database complained and the publication script stopped.  This meant the remaining parts of the script didn’t get a chance to execute, and it was in these parts that older versions of the entry were archived, search terms were generated and everything was tidied up.  As these parts didn’t run, duplicate entries crept in and search terms weren’t working properly.  I therefore updated the database so all columns could handle UTF-8 data and removed the duplicate entries, and thankfully all was working fine again afterwards.
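The failure mode can be illustrated with a small check: a single-byte (latin1-style) column can only hold code points up to U+00FF, so a date string containing, say, an en dash is enough to make the INSERT fail. (In MySQL terms, the fix amounts to converting the affected columns to a UTF-8 character set.)

```javascript
// Returns true if every character in the text fits in a single-byte
// latin1 column (code points up to U+00FF); characters beyond that,
// such as typographic dashes, would cause the database insert to fail.
function fitsInLatin1(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 0xFF) return false;
  }
  return true;
}
```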

Week Beginning 25th March 2024

This was a four-day week as Friday was the Good Friday holiday.  I’ll also be on holiday for all of next week other than Friday, giving me a nice Easter break.  I spent a lot of this week continuing to overhaul the Speech Star website (the version for speech therapists that hasn’t been publicly launched yet).  I’d made a start with this last week but still had many updates to implement.  This included incorporating the IPA and extIPA charts into the website, but ensuring that only the animation videos rather than the MRI scans were included.  I had also been given a selection of new animations that needed to be incorporated into the IPA and extIPA charts on other websites such as Seeing Speech, which meant updating various databases, creating thumbnails and other such tasks.

The new website now features the IPA / extIPA charts and their animation videos, a selection of speech sound videos, such as animations showing the consonants of English and a page featuring ultrasound therapy videos, as the following screenshot shows.  It’s a very different website to how it was a few weeks ago.

Also this week I was in communication with Susan Rennie regarding the old pilot Scots Thesaurus website, which launched almost ten years ago now.  The domain for this website is due to expire soon and Susan is going to point the domain at a different website, with the original website being archived.  This all required some communication with Susan, the University’s Hostmaster and the Research Data Management team, but I think we’re getting there now.

My final major task of this four-day week was to write an instruction manual for setting up a Speak For Yersel linguistic survey website using the tool I recently created.  The 13-page, 3,500-word document provides step-by-step instructions for creating the data structures the tool requires, the setup facility, customisation and exporting data.   It’s hardly the most riveting of reads, but it should be useful for anyone using the tool in future (including myself!).

Week Beginning 18th March 2024

I was off sick from Tuesday to Thursday this week having caught some sort of virus that laid me low.  However, on Monday I managed to complete the migration of all 156 poems in the Anthology of 16th and Early 17th Century Scots Poetry from ancient HTML to TEI XML.  It’s something I’ve been working on since the New Year and it’s great to have finally completed it.  The site is not yet live, though, as I need to wait until I receive feedback from the project PI.  I’ll probably need to update and expand the information in the TEI header of each poem and I’d also like to make the XML files of the poems available as a downloadable ZIP file from the website, ideally under a Creative Commons license to enable researchers to reuse the data.  I’ll also need to put in redirects from the old URLs, plus the PI has some updates to the poems and their glosses that I will need to incorporate.  But until I hear back from him I can’t do anything else for the resource.

When I returned to work on Friday I had an update to the Anglo-Norman Dictionary waiting for me to process.  The editor Geert had updated all entries beginning with ‘Y’, plus a handful of other related entries and I needed to run my scripts to delete the existing entries and replace them with the new versions.  Thankfully all went smoothly.

I then moved onto a significant reworking of the Speech Star website (one of two websites for the project and the one that has not yet been officially launched).  Previously this website featured access to several databases of ultrasound video files, but the new update is going to shift the emphasis to a more focussed set of animation video files, plus incorporating the IPA and extIPA charts and associated videos.  There’s a lot to get through in this update and by the end of the day I had implemented only a part of it, so I’ll continue with this next week.

Week Beginning 11th March 2024

I spent some time this week further tweaking the Speak For Yersel survey tool I’ve been working on recently.  I completed an initial version of the tool last week, using it to publish a test version of the linguistic survey for the Republic of Ireland (not yet publicly available) and this week I ran the data for Northern Ireland through the tool.  As I did so I began to think about the instructions that would be needed at each stage and I also reworked the final stages of the tool.

The final stage previously involved importing the area GeoJSON file and the settlement CSV file in order to associate actual settlements with larger geographical areas and to generate the areas within which survey responses will be randomly assigned a location.  This was actually a two-step process and I therefore decided to split the stage in two: first ensuring the GeoJSON file is successfully parsed and imported, and only then importing the settlement CSV file.  I also created a final ‘setup complete’ stage, as previously the tool didn’t give any feedback that the process had completed successfully.
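The first of the two steps might validate along these lines, assuming standard GeoJSON; the function name and return shape are illustrative:

```javascript
// Parse the uploaded area file and confirm it is a GeoJSON
// FeatureCollection whose features all carry a geometry, before
// anything is written to the database.
function validateAreaFile(jsonText) {
  let doc;
  try {
    doc = JSON.parse(jsonText);
  } catch (e) {
    return { ok: false, error: 'Not valid JSON' };
  }
  if (doc.type !== 'FeatureCollection' || !Array.isArray(doc.features)) {
    return { ok: false, error: 'Not a GeoJSON FeatureCollection' };
  }
  const missing = doc.features.filter(f => !f.geometry);
  if (missing.length > 0) {
    return { ok: false, error: missing.length + ' feature(s) lack geometry' };
  }
  return { ok: true, count: doc.features.length };
}
```

Failing fast here means the settlement CSV step only ever runs against a known-good set of areas.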

With the updates in place, creating the Northern Ireland survey using the tool was a pretty straightforward process, taking a matter of minutes.  I then moved on to creating our third and final survey for Wales, but unfortunately I soon realised that we didn’t have a top-level ‘regions’ GeoJSON file for this survey area.  The ‘regions’ file provides the broadest level of geographical areas, which are visible on the maps when you hover over them.  For example, in the original SFY resource for Scotland there are 14 top-level regions, such as ‘Fife’ or ‘Borders’, with labels that are visible when using the map, such as the ones here: https://speakforyersel.ac.uk/explore-maps/lexical/.

Initially I tried creating my own regions in QGIS, using the area GeoJSON file to group areas by the region given in the settlement CSV files (e.g. ‘Anglesey’).  However, this resulted in around 22 regions, which I think is too many for a survey area the size of Wales: for the Republic of Ireland we have 8 and for Northern Ireland we have 11.  I asked the team about this and they are going to do some investigation, so until I hear back from them Wales is on hold.

I also spent quite a bit of time this week continuing to migrate the Anthology of 16th and Early 17th Century Scots Poetry from ancient HTML to TEI XML.  Previously the poems I’ve been migrating have varied from 14-line sonnets to poems up to around 200 lines in length.  I’ve been manually copying each line into the Oxygen XML editor, as I needed to check and replace ‘3’ characters that had been used to represent yoghs, add in the glosses and check for other issues.  This week I reached the King’s Quair, which compared to the other poems is a bit of an epic, weighing in at over 1,300 lines.  I realised manually pasting in each line wasn’t an option if I wanted to keep my sanity, and therefore I wrote a little jQuery script that extracted the text from the HTML table cells and generated the necessary XML line syntax.  I was then able to run the script, make a few further tweaks to line groups and then paste the whole poem into Oxygen.  This was significantly quicker than manual migration, but I did still need to add in the glosses, of which there were over 200, so it still took some time.  I continued to import other poems using my new method and I really feel like I’ve broken the back of the anthology now: by the end of the week I had completed the migration of 114 poems.  Hopefully I’ll be able to launch the new site before Easter.
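Reduced to a pure function, the conversion looks something like the sketch below; in the browser the lines came from jQuery (e.g. something like $('td').text().split('\n')), and note that blindly swapping every ‘3’ for a yogh is a simplification of what the script did, since in practice each replacement needed checking:

```javascript
// Wrap each non-empty line of plain text in a TEI <l> element,
// substituting '3' (the old HTML's stand-in for yogh) with the real
// character U+021D. Glosses still had to be added by hand afterwards.
function linesToTei(lines) {
  return lines
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => '<l>' + line.replace(/3/g, '\u021D') + '</l>')
    .join('\n');
}
```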

Also this week I began investigating the WCAG accessibility guidelines (https://www.w3.org/WAI/WCAG22/quickref/?versions=2.1) after we received a request about an accessibility statement for the Anglo-Norman Dictionary.  I spoke to a few people in the University who have used accessibility tools to validate websites and managed to perform an initial check of the AND website, which held up pretty well.  I’m intending to look through the guidelines and tools in greater detail and hopefully update the sites I manage to make them more accessible after Easter.

Also this week I spoke to Susan Rennie about transferring ownership of the Scots Thesaurus domain to her after the DNS registration expires in April, added some statements to a couple of pages of the Books and Borrowing website referencing the project’s API and giving some information about it, and spoke to B&B project PI Katie Halsey about creating a preservation dataset for the project and depositing it with a research repository.

Week Beginning 4th March 2024

I continued to work on the new Speak For Yersel survey creation tool this week, using the Republic of Ireland as my test area.  I managed to complete the ‘maps’ pages, which let users view all of the survey data on interactive maps.  We’re still testing the new system so there’s not much data to actually view, but below is a screenshot showing one of the maps:

I also tweaked things to make it possible to cite / share / bookmark a specific map.  This does mean that each time you select a different map the entire page needs to reload (as opposed to replacing the map only) but I think it’s handy to be able to link to specific maps and reloading the page is not a massive issue.  The ‘attribution and copyright’ link in the bottom right of the map also now works, with the content of this being set in the config file so it’s easy to change.
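The citable-map idea amounts to encoding the current selection in the URL’s query string; the parameter names here are illustrative, not necessarily what the site uses:

```javascript
// Build a shareable URL for a specific map, and read one back.
// Using the query string means a citation / bookmark reproduces the
// exact map, at the cost of a full page reload when switching maps.
function mapUrl(baseUrl, surveyId, questionId) {
  const params = new URLSearchParams({ survey: surveyId, question: questionId });
  return baseUrl + '?' + params.toString();
}

function mapFromUrl(search) {
  const params = new URLSearchParams(search);
  return { survey: params.get('survey'), question: params.get('question') };
}
```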

I then moved onto the stats and data download page for project staff.  As with SFY, you can specify a period to which the stats and data are limited, and the number of users and answers for each survey (within and beyond the area of study) are listed.  Unlike SFY, there are also options to download this data as CSV files.  You can download the users, or you can download the answers for a chosen survey, with options for downloading data for users within the area of study or outside it.  Answer downloads also include fields about the user (and the question).  Here’s a screenshot of how the interface looks:

That’s the tool pretty much completed now and I just need to see if there’s any feedback from the project team before I create similar resources for Northern Ireland and Wales using the tool.

I spent much of Monday going through the content management system for the Saints Places website, fixing errors that had appeared as a result of the migration of the site to a new server earlier in the year.  I didn’t create this site or its CMS, but have inherited responsibility for it.  The site (https://saintsplaces.gla.ac.uk/) launched more than ten years ago, with the project officially ending in 2013, but a researcher is still making updates to it so I agreed to get things working again.  It took somewhat longer than I’d expected but at least it’s done now.  Having said that, it’s possible more errors lurk in places I’ve not been able to fully test, so we’ll just need to see.

On Tuesday I participated in a meeting with Calum McMillan about ‘Change management’ in IT services at the University.  This is about tracking what changes are made to IT systems, ensuring people are informed and strategies are in place in the event of changes failing.  It was very interesting to hear more about this and I hope Arts developers will be kept informed of any changes that will be made to the servers on which our sites are hosted.

I had a further meeting on Friday with Matthew Creasy and his RA Jennifer Alexander about the James Joyce conference website (https://ijjf2024.glasgow.ac.uk/).  I helped out with a couple of technical issues, gave some advice on how to present some materials and made a couple of minor tweaks to the website interface.  I also updated the Seeing Speech (https://www.seeingspeech.ac.uk/), Dynamic Dialects (https://www.dynamicdialects.ac.uk/)  and Speech Star (https://www.seeingspeech.ac.uk/speechstar/) websites to add in licensing and copyright statements.

I also returned to the new data and sparkline facilities for the Dictionaries of the Scots Language.  I’d developed these features last year and had been waiting for feedback from the DSL people, which finally came last week.  One thing they had agreed on was that we should limit the start date of the sparklines to 1375.  When I refamiliarised myself with the work I’d previously done on the sparklines several months ago I realised I still needed to regenerate the data to make the DOST sparklines begin in 1375, and I also realised that the data displayed on our test server was not the current version (featuring the 50-year cut-off and SND extended to 2005) and that this data only existed on my laptop.  I then had a moment of panic when I realised I’d deleted data from my laptop since I completed work on the date features last year, but thankfully I managed to reinstate it.

I also realised that the way dates outside the sparkline’s range are handled will need to be changed.  Currently, any dates beyond the scope of the sparkline result in the sparkline being ‘on’ at its start or end date, to show that the entry is attested beyond the sparkline’s range.  Dates before / after the sparkline’s start / end date then become the start / end date and are not individually displayed in the sparkline text, and the new start date is treated as the actual start date for the purposes of the 50-year rule (which generates blocks of attestation where individual dates fall within 50 years of each other).

This currently happens on our test server for an SND entry where the earliest citation date is 1568 and the second earliest is 1721.  When the sparkline data is generated 1568 becomes 1700 (the start year for SND) and, as the gap between this and the next citation is less than the 50-year threshold, the sparkline displays a block from 1700-1721.  The ‘dates of attestation’ hover-over and in-page text then display ‘1700-1721’, which is not at all accurate.

We need to have some kind of line on the sparkline at the start / end to demonstrate an entry’s dates of attestation continue beyond the scope, so for the above example there must be a line at 1700, even though this actually represents the year 1568.  However, such a line needs to be flagged in the system as to be ignored for the purposes of building the blocks so that when the system finds the next date (1721) it compares this to the original date (1568) and not the line created at 1700.  The system would therefore not generate a block from 1700-1721 but instead would create an individual line at 1721 (with an individual line at 1700 as well).

We will also need to ensure that the sparkline text includes all attestations before the sparkline’s start year rather than bundling these all up as ‘1700’.  Actually, I guess there are two options here:  We could just bundle them all up and display ‘<1700’ or we could display them all.  It depends how verbose we want to be and how important it is to list all of the dates beyond the scope of the sparkline.  For the example above the text would either start ‘<1700, 1721, 1773-1825’ or it would start ‘1568, 1721, 1773-1825’.  The same process would also need to happen at the end of the sparkline too.  This needs further discussion with the team before I proceed further.
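The proposed fix can be sketched as follows: blocks are built from the original citation dates (so the 50-year rule never sees the clamped boundary value), and clamping to the sparkline’s range only happens when deciding where each segment is drawn. The names and output structure are illustrative.

```javascript
// Group citation years into sparkline segments. Runs of years within
// maxGap of each other (using the REAL years) become blocks; isolated
// years become single lines. Years outside [rangeStart, rangeEnd] are
// clamped to the boundary only for display, so an out-of-range date
// shows as a line at the boundary rather than merging into a block.
function buildSegments(dates, rangeStart, rangeEnd, maxGap = 50) {
  const sorted = [...dates].sort((a, b) => a - b);
  const clamp = y => Math.min(Math.max(y, rangeStart), rangeEnd);
  const segments = [];
  let runStart = null, runEnd = null;
  const flush = () => {
    if (runStart === null) return;
    const s = clamp(runStart), e = clamp(runEnd);
    segments.push(s === e ? { type: 'line', year: s }
                          : { type: 'block', start: s, end: e });
  };
  for (const year of sorted) {
    if (runStart === null) {
      runStart = runEnd = year;
    } else if (year - runEnd <= maxGap) {
      runEnd = year; // extend the current run using the real dates
    } else {
      flush();
      runStart = runEnd = year;
    }
  }
  flush();
  return segments;
}
```

With the SND example, 1568 is drawn as a line at 1700 but, because the grouping uses 1568 itself, the 1721 citation is no longer merged into a spurious 1700-1721 block.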

Despite this issue needing further consideration, I did manage to make a number of other updates to the test site based on the team’s feedback.  This included updating the CSV download on the search results page to add in the search criteria, changing the wording of the advanced search results page, removing date filter options from the search results pages in many situations (e.g. quotation searches) and changing how the filter options are displayed, moving from an always visible approach to a drop-down section that appears only when a user chooses to open it.

Finally this week I continued to work through the Anthology of 16th and Early 17th Century Scots Poetry and I’m now slightly over half-way through the migration of the ancient HTML to TEI XML.

Week Beginning 26th February 2024

I spent much of Monday this week continuing to migrate the Anthology of 16th and Early 17th Century Scots Poetry to TEI XML.  I’ve now migrated 43 poems, so I think I’m about a third of the way there.  It’s still going to take quite some time, but I’ll just keep tackling a few when I find the time, until they’re all done.

I spent the rest of the week continuing to work on the new Speak For Yersel regions.  I’m starting with the Republic of Ireland as my test area for the development of a more generic tool that can then be used to create similar surveys in other geographical areas.  Last week my data import script spotted a few errors with the geographical data for ROI and thankfully the team was able to address these and get an updated dataset to me this week.  When I ran my setup script again the areas (used for deciding where a user’s answer marker should appear on the maps) imported successfully and I could then move onto developing the front-end.

Developing the front-end has involved rationalising the code I’d originally written for the Speak For Yersel website (https://speakforyersel.ac.uk/).  This website has a broader focus than the new surveys will have, involving two additional survey types, plus quizzes and additional activities.  It was also very much an active research project and the code I wrote needed to be updated significantly as the project developed and our ideas and understanding changed.  This meant the code ended up a bit tangled and messy.

My task in developing the front-end for the new survey areas was to extract only those parts of the code that were relevant to the three survey types that will be included and to rationalise it – straightening it out and making it less tangled.  The new code also had to work with configuration variables set during the creation of the resource (e.g. database table prefixes, site subdirectories).

I began developing the scripts to handle the layout and the connections to the config options and the database, generating placeholder pages for things like the homepage and the ‘About’ pages.  I then developed the user registration facility, which I connected to the ROI geographical data, enabling a user to begin typing the name of their settlement and see possible matches in a drop-down list.  Users are then saved in the database and stored using HTML5 Local Storage, including the GeoJSON shape associated with their chosen location, enabling markers to be generated within this area at random locations each time the user answers a survey question.  The screenshot below shows the user details registered for a test user I created:
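The ‘random location within the user’s area’ idea can be sketched with rejection sampling over the polygon’s bounding box plus a ray-casting point-in-polygon test; this is an illustration, not the site’s actual code, and it handles a single GeoJSON-style ring of [x, y] pairs:

```javascript
// Standard ray-casting test: is the point (x, y) inside the ring?
function pointInRing(x, y, ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i], [xj, yj] = ring[j];
    if ((yi > y) !== (yj > y) &&
        x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Pick a random point inside the ring: sample uniformly from the
// bounding box and retry until the sample falls inside the polygon.
function randomPointInRing(ring) {
  const xs = ring.map(p => p[0]), ys = ring.map(p => p[1]);
  const minX = Math.min(...xs), maxX = Math.max(...xs);
  const minY = Math.min(...ys), maxY = Math.max(...ys);
  for (;;) {
    const x = minX + Math.random() * (maxX - minX);
    const y = minY + Math.random() * (maxY - minY);
    if (pointInRing(x, y, ring)) return [x, y];
  }
}
```

Running this each time a question is answered gives every answer its own marker position while keeping the user’s actual settlement private.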

I then began working on the display of the surveys, completing the scripts that list the available surveys, display the intro text for a selected survey, load and display a question for each survey type and handle the progress bar.  I also ensured that images can be displayed for questions in each survey type (if supplied) and have ensured that any associated audio files will play.

The screenshot below shows a ‘phonology’ question, with an associated image, just for test purposes (it’s of no relevance to the question).  I also decided to move the ‘question number’ to the top right of the main pane, which I think helps declutter things a bit (for SFY it was below the progress bar and above the question).  I also made the audio ‘play’ button a fixed width as previously the width was slightly different depending on whether ‘play’ or ‘stop’ was displayed, which made the adjacent button jump slightly during playback.

With this in place I then moved onto the processing of submitted answers.  This has included saving all answers (including where multiple answer options are allowed), displaying the maps, dealing with map filters (e.g. respondent age group) and loading the next question, as you can see in the following screenshot:

I also created the ‘survey complete’ page and have ensured that the system logs which surveys a user has completed (including adding a tick to the survey button of completed surveys on the survey index page).  I still need to create the maps page, add in the map attribution popup and develop the staff page with CSV download options, which I will start on next week.

Week Beginning 19th February 2024

On Monday this week I addressed a couple of issues with the Books and Borrowing search that had been identified last Friday.  Multi-word searches were not working as intended and were returning far too many results.  The reason, as mentioned last week, was that a search for ‘Lord Byron’ (without quotes) searched the specified field for ‘Lord’ and then all fields for ‘Byron’.  It was rather tricky to think through this issue as multi-word searches surrounded by quotes need to be treated differently, as do multi-word searches that contain a Boolean.  We don’t actually mention Booleans in the search help, but AND, OR and NOT (which must be upper-case) can be used in the search fields.

I wrote a new function that hopefully sorts out the search strings as required, but note that search strings containing multiple sets of quotes are not supported as this would be much more complicated to sort out and it seemed like a bit of an edge case.  This new function has been applied to all free-text search fields other than the quick search, which is set to search all fields anyway.  After running several tests I made the update live, and now searching author surnames for ‘Lord Byron’ finds no results, which is as it should be.
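The gist of the new function can be sketched hypothetically in Python (the site itself is PHP, and the field syntax here is illustrative of Solr-style queries rather than copied from the project code): an unquoted multi-word search becomes an explicit AND of the individual words, all scoped to the one field, while quoted phrases and upper-case Booleans pass through.

```python
def build_field_query(field, search):
    """Turn a user search string into a Solr-style query against a single field.

    A single quoted phrase and upper-case Booleans (AND, OR, NOT) pass through;
    a plain multi-word search becomes an AND of the individual words, all scoped
    to the given field.  Multiple sets of quotes are not supported.
    """
    search = search.strip()
    if search.startswith('"') and search.endswith('"'):
        # A quoted phrase: search the field for the exact phrase.
        return f'{field}:{search}'
    words = search.split()
    if any(w in ('AND', 'OR', 'NOT') for w in words):
        # Boolean search: scope each non-operator word to the field.
        parts = [w if w in ('AND', 'OR', 'NOT') else f'{field}:{w}' for w in words]
        return ' '.join(parts)
    # Plain multi-word search: require every word in the field.
    return ' AND '.join(f'{field}:{w}' for w in words)
```

This is why a surname search for ‘Lord Byron’ now correctly finds nothing: both words must match the surname field.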

Here are some examples that do return content.


  1. If you search book titles for ‘rome’ you currently find 2046 records:

  2. If you search book titles for ‘rome popes’ you currently find 100 records (as this is the equivalent of searching book titles for ‘rome’ AND ‘popes’):

  3. Using the Boolean ‘AND’ gives the same results:

  4. A search for ‘rome OR popes’ currently returns 2046 records, presumably because all book titles containing ‘popes’ also contain ‘rome’ (at least I hope that’s the case):

  5. A search for ‘rome NOT popes’ currently brings back 1946 records:

  6. And searches for a full quoted string also work as intended, for example a search for “see of rome”:

With this update in place I then slightly changed the structure of the Solr index to add a new ‘copy’ field that stores publication place as a string, rather than text.  This is then used in the facets, ensuring the full text of the place is displayed rather than being split into tokens.  I then regenerated the cache and asked the helpful IT people in Stirling to update this on the project’s server.  Once the update had been made everything then worked as it should.
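For reference, a Solr ‘copy’ field of this kind is declared in the schema roughly as follows (the field names here are illustrative, not the project’s actual ones): the tokenised `text` field remains searchable, while the untokenised `string` copy keeps the full place name intact for faceting.

```xml
<!-- Hypothetical field names: a 'string' copy of the tokenised place field,
     so facets display the full place name rather than individual tokens. -->
<field name="place_of_publication" type="text_general" indexed="true" stored="true"/>
<field name="place_of_publication_str" type="string" indexed="true" stored="true"/>
<copyField source="place_of_publication" dest="place_of_publication_str"/>
```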

Also this week I exported all of the dictionary entries beginning with ‘Y’ for the Anglo-Norman Dictionary as these are now being overhauled.  I also fixed an issue with the Saints Places website – a site I didn’t develop but I’m responsible for now.  A broken query was causing various errors to appear.  The strange thing is the broken query must have been present for years, but presumably it was previously failing silently, while the server on which the site now resides must be stricter.

I spent the rest of the week developing the tool for publishing linguistic surveys for the Speak For Yersel project.  I’d spent a lot of time previously working with the data for the three new linguistic areas, ensuring it was consistently stored so that a tool could be generated to publish the data for all three areas (and more in future).  This week I began developing the tool.  I spent the week developing a ‘setup’ script that would run in a web browser and allow someone to create a new survey website – specifying one or more survey types (e.g. Phonology, Morphology) and uploading sets of questions and answer options for each survey type.  The setup script then provides facilities to integrate the places that the survey area will use to ascertain where a respondent is from and where a marker corresponding to their answer should be located.  This data includes both GeoJSON data and CSV data, both of which need to be analysed by the script and brought together in a relational database.  It took most of the week to create all of the logic for processing all of the above, as you can see in the screenshot of one of the stages below:

As I developed the script I realised further tweaks needed to be made to the data, including the addition of a ‘map title’ field that will appear on the map page.  Other fields such as the full question were too verbose to be used here.  I therefore had to update the spreadsheets for all three survey types in all three areas to add this field in.  Similarly, one survey allowed more than one answer to be selected for a few questions, with differing numbers of answers being permitted.  I therefore had to update the spreadsheets to add in a new ‘max answers’ column that the system will then use to ascertain how the answer options should be processed.

I also needed to generate new regions for the Republic of Ireland survey.  Last week I’d created regions using QGIS based on the ‘county’ stored in the GeoJSON data.  This resulted in 26 regions, which is rather more than we used for Scotland.  The team reckoned this was too many and they decided to use a different grouping called NUTS regions (see https://en.wikipedia.org/wiki/NUTS_statistical_regions_of_Ireland).  A member of the team updated the spreadsheet to include these regions for all locations and I was then able to generate GeoJSON data for these new regions using QGIS following the same method as I documented last week.

When processing the data for the Republic of Ireland (which I’m using as my first test area) my setup script spotted some inconsistencies between the data as found in the GeoJSON files and the spreadsheets.  I’ve passed these on to the team who are going to investigate next week.  I’ll also continue to develop the tool next week, moving onto the development of the front-end now that the script for importing the data is more-or-less complete.
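The kind of cross-checking the setup script performs when bringing the GeoJSON and CSV data together can be sketched in Python (a hypothetical simplification: the real tool writes to a relational database, and the field names here are illustrative).  Each settlement row is matched to its area polygon, and mismatches on either side are flagged for the team.

```python
import csv
import json

def link_areas(geojson_path, csv_path, key='Area'):
    """Match settlement rows from a CSV against area polygons in a GeoJSON file,
    returning the linked records and any inconsistencies found on either side."""
    with open(geojson_path) as f:
        features = json.load(f)['features']
    areas = {feat['properties'][key]: feat['geometry'] for feat in features}
    linked, missing = [], []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            if row[key] in areas:
                linked.append({**row, 'geometry': areas[row[key]]})
            else:
                missing.append(row[key])  # settlement refers to an unknown area
    unused = set(areas) - {r[key] for r in linked}  # areas with no settlements
    return linked, missing, sorted(unused)
```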

Week Beginning 12th February 2024

I’d taken Monday off this week and on Tuesday I continued to work on the Speak For Yersel follow-on projects.  Last week I started working with the data and discovered that it wasn’t stored in a particularly consistent manner.  I overhauled the ‘lexis’ data and this week I performed a similar task for the ‘morphology’ and ‘phonology’ data.  I also engaged in email conversations with Jennifer and Mary about the data and how it will eventually be accessed by researchers, in addition to the general public.

I then moved on to looking at the GeoJSON data that will be used to ascertain where a user is located and which area the marker representing their answer should be randomly positioned in.  Wales was missing its area data, but thankfully Mary was able to track it down.

For Speak For Yersel we had three levels of locations:

Location:  The individual places that people can select when they register (e.g. ‘Hillhead, Glasgow’).

Area: The wider area that the location is found in.  We store GeoJSON coordinates for these areas and they are then used as the boundaries for placing a random marker to represent the answer of a person who selected a specific location in the area when they registered.  So for example we have a GeoJSON shape for ‘Glasgow Kelvin’ that Hillhead is located in.  Note that these shapes are never displayed on any maps.

Region: The broader geographical region that the area is located in.  These are the areas that appear on the maps (e.g. ‘Glasgow’ or ‘Fife’) and they are stored as GeoJSON files.
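For illustration, the three-level hierarchy could be modelled as follows (a hypothetical Python sketch; the project stores these in a database rather than as classes):

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """Broad region shown on the maps (e.g. 'Glasgow'); stored as GeoJSON."""
    name: str
    geojson: dict = field(default_factory=dict)

@dataclass
class Area:
    """Wider area used only as a boundary for random marker placement; never drawn."""
    name: str
    region: Region
    geojson: dict = field(default_factory=dict)

@dataclass
class Location:
    """Individual place a user can pick at registration (e.g. 'Hillhead, Glasgow')."""
    name: str
    area: Area
```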

For the new areas we didn’t have the ‘region’ data.  I therefore did some experimenting with the QGIS package and I found a way of merging areas to form regions, as the following screenshot demonstrates:

I was therefore able to create the necessary region shapes myself using the following method:

  1. I opened the GeoJSON file in QGIS via the file browser and added the OpenStreetMap XYZ layer in ‘XYZ Tiles’, ensuring this was the bottom layer in the layer browser.
  2. In the layer styling right-hand panel I selected the ‘ABC’ labels icon and chose ‘County’ as the value, meaning the county names are displayed on the map.
  3. In the top row of icons I selected the ‘Select Features by area or single click’ icon (the 23rd icon along in my version of QGIS).
  4. I could then ‘Ctrl+click’ to select multiple areas.
  5. I then selected the ‘Vector’ menu, then ‘Geoprocessing’, then ‘Dissolve’.
  6. In the dialog box I had to press the green ‘reload’ icon to make the ‘Selected features only’ checkbox clickable, then I clicked it.
  7. I then pressed ‘Run’, which created a new, merged shape.
  8. The layer then needed to be saved using the layer browser in the left panel.
  9. This gave me separate GeoJSON files for each region, but I was then able to merge them into one file by opening the ‘Toolbox’ via the cog icon in the top menu bar, searching for ‘merge’, opening ‘Vector general’ -> ‘Merge vector layers’, selecting the input layers, ensuring the destination CRS is WGS84, then entering a filename and running the script to merge all layers.
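Step 9’s merge can also be done outside QGIS: combining GeoJSON files into a single FeatureCollection is straightforward in plain Python (a sketch, assuming the inputs are valid GeoJSON already in the same CRS):

```python
import json

def merge_geojson(paths, out_path):
    """Combine several single-region GeoJSON files into one FeatureCollection,
    the pure-Python equivalent of QGIS's 'Merge vector layers' step."""
    merged = {'type': 'FeatureCollection', 'features': []}
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        # Accept either a full FeatureCollection or a single Feature.
        if data.get('type') == 'FeatureCollection':
            merged['features'].extend(data['features'])
        else:
            merged['features'].append(data)
    with open(out_path, 'w') as f:
        json.dump(merged, f)
    return merged
```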

I was then able to edit / create / delete attributes for each region area by pressing on the ‘open attribute table’ icon in the top menu bar.  It’s been a good opportunity to learn more about QGIS and next week I’ll begin updating the code to import the data and set up the systems.

Also this week I created an entry for the Books and Borrowing project on this site (see https://digital-humanities.glasgow.ac.uk/project/?id=160).  On Friday afternoon I also investigated a couple of issues with the search that Matt Sangster had spotted.  He noticed that an author surname search for ‘Byron’ wasn’t finding Lord Byron, and entering ‘Lord Byron’ into the surname search was bringing back lots of results that didn’t have this text in the author surname.

It turned out that Byron hadn’t been entered into the system correctly and was in as forename ‘George’, surname ‘Gordon’ with ‘Lord Byron’ as ‘othername’.  I’ll need to regenerate the data once this error has been fixed.  But the second issue, whereby an author surname search for ‘Lord Byron’ was returning lots of records, is a strange one.  This would appear to be an issue with searches for multiple words and unfortunately it’s something that will need a major reworking.  I hadn’t noticed previously, but if you search for multiple words without surrounding them with quotes, Solr searches the first word against the specified field and the remaining words against all fields.  So a surname search for ‘Lord Byron’ becomes “surname ‘Lord’ OR any field ‘Byron’”, whereas what the query should be doing is “surname ‘Lord’ AND surname ‘Byron’”.  This will probably affect all free-text fields.  I’m going to have to update the search to ensure multi-word searches without quotes are processed correctly, which will take some time; I’ll try to tackle it next week.  I also need to create a ‘copy’ field for place of publication as this is being tokenised in the search facet options.  So much for thinking my work on this project was at an end!

Also this week I spent many hours going through the Iona map site to compile a spreadsheet listing all of the text that appears in English in order to make the site multilingual.  There is a Gaelic column in the spreadsheet and the plan is that someone will supply the appropriate forms.  There are 157 separate bits of text, with some being individual words and others being somewhat longer.  By far the longest is the content of the copyright and attribution popup, although we might also want to change this as it references the API which might not be made public.  We might also want to change some of the other English text, such as the ‘grid ref’ tooltip that gives as an example a grid reference that isn’t relevant to Iona.  I’ll hold off on developing the multilingual interface until I’m sure the team definitely want to proceed with this.

Finally this week I continued to migrate some of the poems from the Anthology of 16th and Early 17th Century Scots Poetry to TEI XML.  It’s going to take a long time to get through all of them, but progress is being made.