Week Beginning 5th December 2022

I continued my research into RDF, the semantic web and linked open data and how they could be applied to the data of the Historical Thesaurus this week, in preparation for a paper I’ll be giving at a workshop in January and also to learn more about these technologies and concepts in general.  I followed a few tutorials about RDF, for example here https://cambridgesemantics.com/blog/semantic-university/learn-rdf/ and read up about linked open data, for example here https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/.  I also found a site that visualises linked open data projects here https://lod-cloud.net/.

I then manually created a small sample of the HT’s category structure, featuring multiple hierarchical levels and both main and subcategories, in the RDF/XML format using the Simple Knowledge Organization System (SKOS) model, a W3C standard for representing thesaurus data in RDF.  More information about it can be found on Wikipedia here: https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System and on the W3C’s website here https://www.w3.org/TR/skos-primer/ and here https://www.w3.org/TR/swbp-skos-core-guide/ and here https://www.w3.org/TR/swbp-thesaurus-pubguide/ and here https://www.w3.org/2001/sw/wiki/SKOS/Dataset.  I also referenced a guide to SKOS for Information Professionals here https://www.ala.org/alcts/resources/z687/skos.  I then imported this manually created sample into the Apache Jena server I set up last week to test that it would work, which thankfully it did.
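
My hand-written sample isn’t reproduced here, but the short sketch below, written in Python with the rdflib library, builds the same kind of SKOS structure: a parent concept and a child concept joined with skos:broader / skos:narrower and serialised as RDF/XML.  The namespace, identifiers and labels are invented placeholders rather than the HT’s actual URIs and category names, and the real sample was authored directly in RDF/XML rather than generated like this.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Placeholder namespace and identifiers, not the HT's real URI scheme
HT = Namespace("http://example.org/ht/category/")

g = Graph()
g.bind("skos", SKOS)
g.bind("ht", HT)

parent = HT["01"]     # an invented main category
child = HT["01.01"]   # an invented subcategory beneath it

for concept, label in [(parent, "Example main category"), (child, "Example subcategory")]:
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="en")))

# Hierarchical links in both directions
g.add((child, SKOS.broader, parent))
g.add((parent, SKOS.narrower, child))

# Serialise as RDF/XML, the format the manual sample was written in
g.serialize(destination="ht-sample.rdf", format="pretty-xml")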

After that I wrote a small script to generate a comparable RDF structure for the entire HT category system.  I ran this on an instance of the database on my laptop to avoid overloading the server, and after a few minutes of processing I had an RDF representation of the HT’s hierarchically arranged categories in an XML file that was about 100MB in size.  I fed this into my Apache Jena instance and the import was a success.  I then spent quite a bit of time getting to grips with the SPARQL query language that is used to interrogate RDF data, and by the end of the week I had managed to replicate some of the queries we use in the HT to generate the tree browser, for example ‘get all noun main categories at this level’ or ‘get all noun main categories that are direct children of a specified category’.
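
To give a flavour of this, the sketch below shows a query along the lines of ‘get all main categories that are direct children of a specified category’.  It is an illustration rather than the exact query I wrote: it assumes the placeholder URIs and skos:broader links from the sketch above, leaves out the part-of-speech filter (which depends on how that information is modelled in the real data), and for convenience runs against the local file with rdflib rather than the Jena SPARQL endpoint.

from rdflib import Graph

g = Graph()
g.parse("ht-sample.rdf", format="xml")  # the small sample from the sketch above

# Direct children of a specified category; the URI is the placeholder one used above.
# Filtering for nouns would need whichever property holds the part of speech.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?child ?label WHERE {
    ?child skos:broader <http://example.org/ht/category/01> ;
           skos:prefLabel ?label .
}
"""
for row in g.query(query):
    print(row.child, row.label)

The same SPARQL text can also be pasted into Jena Fuseki’s query interface to run against the imported dataset there.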

I then began experimenting with other RDF tools in the hope of being able to generate some nice visualisations of the RDF data, but this is where things came a bit unstuck.  I set up a nice desktop RDF database called GraphDB (https://www.ontotext.com/products/graphdb/) and also experimented with the Neo4J graph database (https://neo4j.com/), as my assumption had been that graph databases (which store data as nodes and edges, much as RDF triples do) would include functionality to visualise these connections.  Unfortunately I have not been able to find any tools that allow you to simply plug RDF data in and visualise it.  I found a Stack Overflow page about this (https://stackoverflow.com/questions/66720/are-there-any-tools-to-visualize-a-rdf-graph-please-include-a-screenshot) but none of the suggestions on the page seemed to work.  I tried downloading the desktop visualisation tool Gephi (https://gephi.org/) as apparently it had a plugin that would enable RDF data to be used, but the plugin is no longer available, and other visualisation frameworks such as D3 do not work with RDF data directly but require it to be migrated to another format first.  It seems strange that data structured in a way that makes it ideal for network-style visualisations has no tools available to visualise it natively, and I am rather disappointed by the situation.  Of course it could just be that my Google skills have failed me, but I don’t think so.
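
To give a sense of what that migration step would involve, the rough Python sketch below (again using rdflib, with invented filenames) flattens the RDF triples into the sort of nodes/links JSON that D3’s force-directed layout expects.  It is only an illustration of the extra conversion that would be needed, not something I have built into a working visualisation.

import json
from rdflib import Graph, URIRef

g = Graph()
g.parse("ht-categories.rdf", format="xml")  # assumed filename for the full export

# Flatten the triples into the nodes/links shape D3's force layout expects,
# keeping only links where both ends are resources rather than literals.
nodes, links = {}, []
for s, p, o in g:
    if isinstance(o, URIRef):
        nodes.setdefault(str(s), {"id": str(s)})
        nodes.setdefault(str(o), {"id": str(o)})
        links.append({"source": str(s), "target": str(o), "predicate": str(p)})

with open("ht-graph.json", "w") as f:
    json.dump({"nodes": list(nodes.values()), "links": links}, f)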

In addition to the above I spent some time actually writing the paper that all of this will go into.  I also responded to a query from a researcher at Strathclyde who is putting together a speech and language therapy proposal and wondered whether I’d be able to help out, given my involvement in several other such projects.  I also spoke to the IT people at Stirling about the Solr instance for the Books and Borrowing project and made a few tweaks to the Speech Star project’s introductory text.