Week Beginning 19th December 2022

This was the last week before the Christmas holidays, and Friday was a holiday.  I spent some time on Monday making further updates to the Speech Star data.  I fixed some errors in the data and made some updates to the error type descriptions.  I also made ‘poster’ images from the latest batch of child speech videos I’d created last week, as this was something I’d forgotten to do at the time.  In addition, I fixed some issues with the non-disordered speech data, including changing a dash to an underscore in the filenames for one speaker, as a mismatch between filenames and metadata had been preventing any of the videos from opening in the site.  I also created records for two projects (The Gentle Shepherd and Speak For Yersel) on this very site (see https://digital-humanities.glasgow.ac.uk/projects/last-updated/), as these are the projects I’ve been working on that have actually launched in the past year.  Other major ones, such as Books and Borrowing and Speech Star, are not yet ready to share.  I also updated all of the WordPress sites I manage to the latest version.

On Tuesday I travelled into the University to locate my new office.  My stuff had been moved across last week after a leak in the building resulted in water pouring through my office.  Plus work is ongoing to fix the dry rot in the building and I would have needed to move out for that anyway.  It took a little time to get the new office in order and to get my computer equipment set up, but once it was all done it was actually a very nice location – much nicer than the horrible little room I’m usually stuck in.

I spent most of Tuesday upgrading Google Analytics for all of the sites I manage that use it.  Google’s current analytics system is being retired in July next year and I decided to use the time in the run-up to Christmas to migrate the sites over to the new Google Analytics 4 platform.  This was a mostly straightforward process, although as usual Google’s systems feel clunky and counterintuitive at times.  It was also a fairly lengthy process, as I had to update the tracking code for each site in question.  Nevertheless I managed to get it done and informed all of the staff whose websites would be affected by the change.  I also had a further chat with Geert, the editor of the Anglo-Norman Dictionary, about the new citation edit feature I’m planning at the moment.
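To illustrate the Google Analytics side of this: the migration essentially means replacing the old Universal Analytics tracking code on each site with the standard GA4 gtag.js snippet, roughly as sketched below (the measurement ID ‘G-XXXXXXXXXX’ is a placeholder, not a real property).

```html
<!-- Google tag (gtag.js) for GA4; the measurement ID below is a placeholder -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXXXXXXXX"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', 'G-XXXXXXXXXX');
</script>
```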

On Wednesday I had a meeting with prospective project partners in Strathclyde about a speech therapy proposal we’re putting together.  It was good to meet people and to discuss things.  I’ll be working on the Data Management Plan for the proposal after the holidays.  I spent the rest of the day working on my paper for the workshop I’m attending in Zurich in the second week of January.  I have now finished the paper, which is quite a relief.

On Thursday I spent some time working for the Dictionaries of the Scots Language.  I responded to an email from Ann Fergusson about how we should handle links to ancillary pages in the XML.  There are two issues here that need to be agreed upon.  The first is how to represent links to things other than entries in the entry XML.  We currently have the <ref> element, which is used to link from one entry to another (e.g. <ref refid="snd00065761">Chowky</ref>).  We could use the HTML element <a> in the XML for links to things other than entries, but I personally think it’s best not to, as (in my opinion) XML elements should be meaningful when you look at them and the meaning of <a> isn’t especially clear.  It might be better to use <ref> with a different attribute instead of ‘refid’, for example <ref url="https://dsl.ac.uk/geographical-labels">.  Reusing <ref> means we don’t need to update the DTD (the rules that define which elements can be used where in the XML) to add a new element.

Of course other people may think that inventing our own way of writing HTML links is daft when everyone is already familiar with <a href="https://dsl.ac.uk/geographical-labels">, and we could use that approach if people prefer.  If so, we would need to update the DTD to allow such elements to be used; without updating the DTD the XML files would fail to validate.
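For what it’s worth, if the <a> option were chosen the DTD change would be along the following lines.  This is a hypothetical sketch only – the content model, and the elements within which <a> would be permitted, would all need to be agreed.

```xml
<!-- Hypothetical sketch: declare an <a> element and its 'href' attribute.
     <a> would also need adding to the content models of whichever elements
     are allowed to contain links. -->
<!ELEMENT a (#PCDATA)>
<!ATTLIST a href CDATA #REQUIRED>
```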

Whichever way is chosen, there is a second issue that will need to be addressed:  I will need to update the XSLT that transforms the XML into HTML to tell the script how to handle either a <ref> with a ‘url’ attribute or an <a> with an ‘href’ attribute.  Without updating the XSLT the links won’t work.  I can add such a rule in once we decide how best to represent links in the XML.
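To give an idea of what such a rule might look like (assuming the <ref> with a ‘url’ attribute option proposed above, and that it would slot into the existing stylesheet), the XSLT addition would be roughly:

```xml
<!-- Sketch of a template handling <ref> elements that carry a 'url' attribute -->
<xsl:template match="ref[@url]">
  <a href="{@url}">
    <xsl:apply-templates/>
  </a>
</xsl:template>
```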

I also made a couple of tweaks to the wildcard search term highlighting feature I was working on last week and then published the update on the live DSL site.  Now when you perform a search for something like ‘chr*mas’ and then select an entry to view, any word that matches the wildcard pattern will be highlighted.  For example, go to this page: https://dsl.ac.uk/results/chr*mas/fulltext/withquotes/both/ and then select one of the entries and you’ll see the term highlighted in the entry page.

That’s all from me for this year.  Merry chr*mas one and all!

Week Beginning 12th December 2022

A problem that started on Friday with the server hosting a lot of our major sites, such as the Historical Thesaurus and Seeing Speech, left all of the sites offline until Monday.  This was a really embarrassing and frustrating situation and I had to deal with lots of emails from users of the sites who were unable to access them.  As I don’t have command-line access to the servers, all I could do was report the issue via our IT Helpdesk system.  Thankfully by mid-morning on Monday the sites were all back up again, but the incident raised serious issues about the state of Arts IT Support, who are massively understaffed at the moment.  Arts IT also refused to set up hosting for a project that we’re collaborating on with Strathclyde University, and in fact stated that they would not set up hosting for any further websites, which will have a massive negative impact on several projects that are still in the pipeline and ultimately means I will not be able to work on any new projects until this is resolved.  The PI for the new project with Strathclyde is Jane Stuart-Smith, and thankfully she was also not very happy with the situation.  We arranged a meeting with Liz Broe, who oversees Arts IT Support, to discuss the issues and had a good discussion about how we ended up in this state and how things will be resolved.  In the short term some additional support is being drafted in from other colleges while new staff are recruited in the medium term, and Liz has stated that hosting for new websites (including the Strathclyde one) will continue to be offered, which is quite a relief.

I also discovered this week that there has been a leak in 13 University Gardens and water has been pouring through my office.  I was already scheduled to be moved out of the building due to the dry rot that they’ve found all the way up the back wall (which my office is on) but this has made things a little more urgent.  I’m still generally working from home every day except Tuesday and apparently all my stuff has been moved to a different building, so I’ll just need to see how the process has gone when I’m back in the University next week.

In terms of actual work this week, I spent a bit more time writing my paper about the Historical Thesaurus and Semantic Web technologies for the workshop in January.  This is coming together now, although I still need to shape it into a presentation, which will take time.  I also spent some time working on the Speech Star project, updating the speech error database to fix a number of issues with the data that Eleanor had spotted and adding in descriptions for the new error types that had been included.  I also added in some ancillary page content and had a chat with Eleanor about the database system the website uses.

I also spent some time working for the DSL this week.  Rhona had noted that when you perform a full text or quotation search (i.e. a search using Solr) with wildcards (e.g. chr*mas) the search results display entries with snippets that highlight the whole word where the search string occurred (e.g. ‘Christmas’).  However, when clicking through to the entry page such highlighting was not appearing, even though highlighting in the entry page does work when performing a search without wildcards.

Highlighting in the entry page was handled by a jQuery plugin, but this was not written with wildcards in mind and only works on full words.  I spent some time trying to get wildcard highlighting working myself using regular expressions, but I find regular expressions pretty awful to work with – an ancient relic from the early days of computing – and although I managed to get something working it wasn’t ideal.  Thankfully I found an existing JavaScript library, mark.js (https://markjs.io/), that can handle wildcard highlighting, and I was able to replace the existing plugin with this script and update the code to work with it.  I tested this out on our test DSL site and all seems to work well.  I haven’t updated the live site yet, though, as the DSL team need to test the new approach out more fully in case they encounter any problems with it.  I also noticed that there was an issue with the quotation search whereby if you returned to the search results from an entry by clicking on the ‘return to results’ button you got an empty page.  I fixed this in both our live and test sites.
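For illustration, the mark.js-based highlighting boils down to something like the following sketch.  The container selector here is an assumption for the example rather than the actual DSL markup.

```javascript
// Highlight a wildcard search term (e.g. 'chr*mas') within the entry page.
// '#entry-content' is an assumed container; the real selector will differ.
var highlighter = new Mark(document.querySelector('#entry-content'));
highlighter.mark('chr*mas', {
  wildcards: 'enabled',        // '*' matches any number of non-space characters
  separateWordSearch: false    // treat the search string as a single term
});
```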

I also spent some time working for the Anglo-Norman Dictionary this week.  I updated the citation search on the public website.  Previously the citation text was only added into the search results if you also searched for a specific form within a siglum, for example https://anglo-norman.net/search/citation/%22tout%22/null/A-N_Falconry, and other citation searches (e.g. just selecting a siglum and / or a siglum date) would only return the entries the siglum appeared in, without the individual citations.  Now the citations appear in these searches too.  For example, all citations from A-N Falconry: https://anglo-norman.net/search/citation/null/null/A-N_Falconry and all citations where the citation date is 1400: https://anglo-norman.net/search/citation/null/1400.  This also means that when you view the citations by pressing the ‘Search AND Citations’ button for a siglum in the bibliography you now see each citation for the listed entries.

I then spent most of a day thinking through all of the issues relating to the new ‘DMS citation search and edit’ feature that the editor wants me to implement and wrote an initial document detailing how the feature will work.  There has been quite a lot to think through and I thought it wise to document the feature rather than just launching into its creation without a clear plan.  I might have some time to start work on this next week as I’m working up to and including Thursday, but it depends how I get on with some other tasks I need to do for other projects.

Also this week I attended the Christmas lunch for the Books and Borrowing project in Edinburgh.  Unfortunately there was a train strike that day and I decided to get the bus through to Edinburgh.  The journey there was fine, taking about an hour and a half, but I got the 4pm bus on the way back and it was a nightmare, taking two hours and forty minutes.  I’ll never get the bus between Glasgow and Edinburgh anywhere near rush hour again.

Week Beginning 5th December 2022

I continued my research into RDF, the semantic web and linked open data and how they could be applied to the data of the Historical Thesaurus this week, in preparation for a paper I’ll be giving at a workshop in January and also to learn more about these technologies and concepts in general.  I followed a few tutorials about RDF, for example here https://cambridgesemantics.com/blog/semantic-university/learn-rdf/ and read up about linked open data, for example here https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/.  I also found a site that visualises linked open data projects here https://lod-cloud.net/.

I then manually created a small sample of the HT’s category structure, featuring multiple hierarchical levels and both main and subcategories, in the RDF/XML format using the Simple Knowledge Organization System (SKOS) model.  This is a W3C standard for representing thesaurus data in RDF.  More information about it can be found on Wikipedia here: https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System and on the W3C’s website here https://www.w3.org/TR/skos-primer/ and here https://www.w3.org/TR/swbp-skos-core-guide/ and here https://www.w3.org/TR/swbp-thesaurus-pubguide/ and here https://www.w3.org/2001/sw/wiki/SKOS/Dataset.  I also referenced a guide to SKOS for Information Professionals here https://www.ala.org/alcts/resources/z687/skos.  I then imported this manually created sample into the Apache Jena server I set up last week to test that it would work, which thankfully it did.
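The hand-made sample looked something along these lines.  The URIs and category labels below are purely illustrative placeholders rather than the real HT identifiers; the point is the SKOS broader/narrower structure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- A parent category and one of its children, linked both ways -->
  <skos:Concept rdf:about="http://example.org/ht/category/01">
    <skos:prefLabel xml:lang="en">The world</skos:prefLabel>
    <skos:narrower rdf:resource="http://example.org/ht/category/01.01"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/ht/category/01.01">
    <skos:prefLabel xml:lang="en">The earth</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/ht/category/01"/>
  </skos:Concept>
</rdf:RDF>
```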

I then wrote a small script to generate a comparable RDF structure for the entire HT category system.  I ran this on an instance of the database on my laptop to avoid overloading the server, and after a few minutes of processing I had an RDF representation of the HT’s hierarchically arranged categories in an XML file about 100MB in size.  I fed this into my Apache Jena instance and the import was a success.  I then spent quite a bit of time getting to grips with SPARQL, the query language used to query RDF data, and by the end of the week I had managed to replicate some of the queries we use in the HT to generate the tree browser, for example ‘get all noun main categories at this level’ or ‘get all noun main categories that are direct children of a specified category’.
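As an example, the ‘direct children of a specified category’ query comes out roughly as follows against the SKOS structure sketched above.  The base URI is illustrative, and the real data would also need to encode part of speech in order to restrict the results to noun categories.

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Direct children of an illustrative category URI
SELECT ?child ?label
WHERE {
  ?child skos:broader <http://example.org/ht/category/01> ;
         skos:prefLabel ?label .
}
ORDER BY ?label
```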

I then began experimenting with other RDF tools in the hope of being able to generate some nice visualisations of the RDF data, but this is where things came a bit unstuck.  I set up a nice desktop RDF database called GraphDB (https://www.ontotext.com/products/graphdb/) and also experimented with the Neo4j graph database (https://neo4j.com/), as my assumption had been that graph databases (which store data as nodes and edges, much like RDF triples) would include functionality to visualise these connections.  Unfortunately I have not been able to find any tools that allow you to just plug RDF data in and visualise it.  I found a Stack Overflow page about this (https://stackoverflow.com/questions/66720/are-there-any-tools-to-visualize-a-rdf-graph-please-include-a-screenshot) but none of the suggestions on the page seemed to work.  I tried downloading the desktop visualisation tool Gephi (https://gephi.org/) as apparently it had a plugin that would enable RDF data to be used, but the plugin is no longer available, and other visualisation frameworks such as D3 do not work with RDF data directly but require the data to be migrated to another format first.  It seems strange that data structured in a way that makes it ideal for network-style visualisations should have no tools available to natively visualise it, and I am rather disappointed by the situation.  Of course it could just be that my Google skills have failed me, but I don’t think so.

In addition to the above I spent some time actually writing the paper that all of this will go into.  I also responded to a query from a researcher at Strathclyde who is putting together a speech and language therapy proposal and wondered whether I’d be able to help out, given my involvement in several other such projects.  I also spoke to the IT people at Stirling about the Solr instance for the Books and Borrowing project and made a few tweaks to the Speech Star project’s introductory text.

Week Beginning 28th November 2022

There was another strike day on Wednesday this week, so it was a four-day week for me.  On Monday I attended a meeting about the Historical Thesaurus, and afterwards I dealt with some issues that cropped up.  These included getting an up-to-date dump of the HT database to Marc and Fraser, investigating a new subdomain to use for test purposes, looking into adding a new ‘sensitive’ flag to the database for categories that contain potentially offensive content, reminding people where our latest stats page is located and looking into connections between the HT and Mapping Metaphor datasets.  I also spent some more time this week researching semantic web technologies and how these could be used for thesaurus data.  This included setting up an Apache Jena instance on my laptop with a Fuseki server for querying RDF triples using the SPARQL query language.  See https://jena.apache.org/ and https://jena.apache.org/documentation/fuseki2/index.html for more information on these.  I played around with some sample datasets and thought about how our thesaurus data might be structured to use a similar approach.  Hopefully next week I’ll migrate some of the HT data to RDF and experiment with it.

Also this week I spent quite a bit of time speaking to IT Services about the state of the servers that Arts hosts, and migrating the Cullen Project website to a new server, as the server it is currently on badly needs upgrades and there is currently no-one to manage this.  Migrating the Cullen Project website took the best part of a day to complete, as all database queries in the code needed to be upgraded.  This took some investigation, as it turns out ‘mysqli_’ requires a connection to be passed to many of its functions where ‘mysql_’ doesn’t, and where ‘mysql_’ does take a connection, ‘mysqli_’ expects the connection and the query string in the opposite order.  There were also some character encoding issues cropping up.  It turned out that these were caused by the database not being UTF-8, and the database connection script needed to set the character set to ‘latin1’ for the characters to display properly.  Luca also helped with the migration, dealing with the XML and eXistDB side of things, and by the end of the week we had a fully operational version of the site running at a temporary URL on a new server.  We put in a request to have the DNS for the project’s domain switched to the new server and once this takes effect we’ll be able to switch the old server off.
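To illustrate the kind of change the database upgrade involved (a rough sketch only – the credentials, database name and query below are placeholders, not the actual Cullen code):

```php
<?php
// Old style (mysql_): the connection is implicit, or passed as the second argument
//   $result = mysql_query($sql);
//   $result = mysql_query($sql, $connection);

// New style (mysqli_): the connection is explicit and passed first.
$connection = mysqli_connect('localhost', 'db_user', 'db_password', 'cullen');
mysqli_set_charset($connection, 'latin1');   // database is not UTF-8, so force latin1

$sql = 'SELECT id, title FROM documents';    // illustrative query only
$result = mysqli_query($connection, $sql);
while ($row = mysqli_fetch_assoc($result)) {
    // process $row as before
}
```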

Also this week I fixed a couple of minor issues with two of the place-name resources, participated in an interview panel for a new role at college level, duplicated a section of the Seeing Speech website on the Dynamic Dialects website at the request of Eleanor Lawson, and had discussions about moving out of my office due to work being carried out in the building.