This week I began the major task of integrating book genre with the Books and Borrowing dataset. The team had been working on a spreadsheet that enabled them to assign top-level Book Work records to more than 13,000 Book Edition records and also assign up to three genres to each Work. I had to write a script to parse this data, which involved extracting and storing the distinct genres, creating Book Work records, assigning Book Work authors, adding in associations to Book Edition records, deleting any author associations at Edition level and creating associations between Works and genres. It took the best part of two days to create and test the script, running it on a local version of the data stored on my laptop. After final testing the number of active Book Works increased from 75 to 9,808 and the number of active Book Editions that have a Work association grew from 72 to 13,099. The number of genre connections for Works stood at 11,536 and the number of active Book Works that have at least one author association stood at 9,808, up from 70, while the number of active Book Editions with at least one direct author association decreased to 2,191 from 14,384, due to the author associations being shifted up to Work level (from where they will cascade down).
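The core of the import script can be sketched as follows, using an in-memory structure in place of the real database (all field and table names here are hypothetical stand-ins for the actual schema):

```python
# Minimal sketch of the genre-import logic: one pass over the spreadsheet
# rows creates distinct genres, Work records, Work-level author and genre
# associations, and Edition-to-Work links. The row keys are assumptions.
from collections import defaultdict

def import_works(rows):
    """Each row: {'edition_id', 'work_title', 'authors', 'genres'} (up to 3 genres)."""
    genres = {}            # distinct genre name -> genre id
    works = {}             # work title -> work record
    edition_to_work = {}   # edition id -> work title
    for row in rows:
        title = row['work_title']
        if title not in works:
            works[title] = {'authors': set(), 'genres': set()}
        work = works[title]
        # authors move up to Work level; the real script also deletes any
        # direct author association left at Edition level at this point
        work['authors'].update(row['authors'])
        for g in row['genres']:
            genres.setdefault(g, len(genres) + 1)   # store distinct genres
            work['genres'].add(g)
        edition_to_work[row['edition_id']] = title
    return genres, works, edition_to_work

rows = [
    {'edition_id': 1, 'work_title': 'Waverley', 'authors': ['Scott'], 'genres': ['Fiction']},
    {'edition_id': 2, 'work_title': 'Waverley', 'authors': ['Scott'], 'genres': ['Fiction', 'History']},
]
genres, works, e2w = import_works(rows)
```

In the real script each of these dictionaries corresponds to inserts and updates against the project database rather than in-memory structures.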
With the data import sorted I then moved onto updating the project’s content management system to incorporate facilities to add, edit, browse and delete genres. This included creating facilities for associating genres with book records at any level (from Work down to Item) wherever books can be edited in the CMS. The ‘Browse Genres’ page works in a similar way to ‘Browse Authors’, giving you a list of genres and a count of the books at each level that have an association, as the following screenshot shows:
Pressing on a number opens a pop-up containing a list of the associated books and you can connect through to each book record from this. As with authors, genre will cascade down from whichever level of book it is associated with to all lower levels. You only need to make an association at a lower level if it differs from the genre at a higher level. The counts in the ‘browse’ page show only the direct associations, so for now there are no editions or lower with any numbers listed. Wherever a book at any level can be edited in the CMS a new ‘Genre’ section has been added to the edit form. This consists of a list of genres with checkboxes beside them, as the following screenshot demonstrates:
You can tick as many checkboxes as are required and when updating the record the changes will be made. I tested out the new genre features in the CMS and all seem to be working well. I also imported all of the genre data so hopefully everything is now in place. Next week I will move onto the front-end, where there is much to do – not only making genre visible wherever books are viewed but updating the search facilities and adding in a number of new visualisations for genre as well. I also fixed a few issues with images of Registers from the Royal High School – a few that were missing I added in and the order of others needed to be updated.
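The genre cascade described earlier can be sketched like this (the record structures are assumptions; in reality the lookup is resolved via database queries rather than in-memory dictionaries):

```python
# Sketch of the genre cascade: a book record inherits genres from the
# nearest ancestor level (Work down to Item) that has a direct
# association, and a direct association at a lower level overrides the
# levels above it.
def effective_genres(record, ancestors):
    """ancestors: the record's parent records, nearest level first."""
    for level in [record] + ancestors:
        if level['genres']:
            return level['genres']   # nearest direct association wins
    return set()

work = {'genres': {'Fiction'}}       # genre assigned at Work level
edition = {'genres': set()}          # no direct association at Edition level
item = {'genres': set()}             # the Item inherits 'Fiction' from the Work
```

This is why the ‘browse’ counts only show direct associations: lower levels without a direct association simply resolve upwards at display time.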
Also this week I finalised the project website for Rhona Brown’s new project. It’s not live yet, but my work on it is now complete. I was also involved in the migration of a number of my sites to a new server. As always seems to be the case, the DSL website migration did not go very smoothly, with the DNS update taking many hours to propagate; in the meantime the domain was serving up the Anglo-Norman Dictionary, which was not good at all. This wasn’t something I had any direct control over, unfortunately, but thankfully the situation rectified itself the following day.
I also had to make a number of tweaks to the data in the Child speech error database for Speech Star, after many transcriptions were revised. I also updated the Mull / Ulva place-names CMS to add in a facility to export place-names for publication limited by one or more selected islands. In addition I began creating a new website for a project Gavin Miller is running and I created some new flat spreadsheet exports of the Historical Thesaurus for Fraser Dallachy and Marc Alexander to work with.
I came down with some sort of flu-like illness last Friday evening and was still unwell on Monday and unable to work. Thankfully I was well enough to work again on Tuesday, although getting through the day was hard work. I was also off on holiday on Friday this week so only ended up working three days. I’ll be on holiday all of next week as well as it’s the school half-term and we have a family holiday booked.
I was involved in the migration of the Historical Thesaurus website to a new server for a lot of this week. This required a lot of testing of the newly migrated site and a significant number of small updates to the code to ensure everything worked properly. Thankfully by Thursday all was working well and I was able to go on my holiday without worrying about the site.
Also this week I did some further work on the Books and Borrowing project, which included generating several different spreadsheets of book holdings that have no associated borrowing records and discussing the options of creating downloadable bundles of all data associated with each specific library.
I also did some work for the Dictionaries of the Scots Language, including investigating an issue with the new quotations search that is not yet live but is running on our test server. A phrase search for quotations was not working, but an identical phrase search using the full-text index was working fine. This was a bit of a strange one, as it looks like the new Solr quotation search is not picking up the fact that a phrase search is being run. I tried running the search directly on the Solr instance I’d set up on my laptop and the same thing was happening: I gave it a phrase surrounded by double quotes but these were being ignored. An identical search on the fulltext Solr index picked up the presence of quotes and successfully performed a search for the phrase. The only difference between the two fields is that the fulltext field was set to ‘text_general’ while the quote search field was set to ‘text_en’. I therefore set up a new version of the quote index with the field set to ‘text_general’ and this solved the problem. I’m still in the dark as to why, though, and I can’t find any information online about the issue.
I also responded to a request from Craig Lamont in Scottish Literature about a new proposal he’s putting together. If it gets funded I’ll be involved with the project, making a website, an interactive map and a timeline. I also had a conversation with Rhona Brown about the website for her new project, which I’ll set up after I’m back from my holiday.
I spent most of this week working for the Dictionaries of the Scots Language, working on the new quotation date search. I decided to work on the update initially on a version of the site and its data running on my laptop, as I have direct control over the local Solr instance – something I don’t have on the server. My first task was to create a new Solr index for the quotations and to write a script to export data from the database in a format that Solr could then index. With over 700,000 quotations this took a bit of time, and I did encounter some issues, such as several tens of thousands of quotations not having date tags, meaning dates for these quotations could not be extracted. I had a lengthy email conversation with the DSL team about this and thankfully it looks like the issue is not something I need to deal with: the data is being worked on in their editing system and the vast majority of the dating issues I’d encountered will be fixed the next time the data is exported for me to use. I also encountered some further issues that needed to be addressed as I worked with the data. For example, I realised I needed to add a count of the total number of quotes for an entry to each quote item in Solr to be able to work out the ranking algorithm for entries, and this meant updating the export script and the structure of the Solr index and then re-exporting all 700,000 quotations. Below is a screenshot of the Solr admin interface, showing a query of the new quotation index – a search for ‘barrow’.
With this in place I then needed to update the API that processes search requests, connects to Solr and spits out the search results in a suitable format for use on the website. This meant completely separating out and overhauling the quotation search, as it needed to connect to a different Solr index featuring data with a very different structure. I needed to ensure quotations could be grouped by their entries and then subjected to the same ‘max results’ limitations as other searches. I also needed to create the ranking algorithm for entries based on the number of returned quotes vs the total number of quotes, sort the entries based on this, and ensure a maximum of 10 quotes per entry were displayed. I also had to add in a further search option for dates, as I’d already detailed in the requirements document I’d previously written. The screenshot below is of the new quotation endpoint in the API, showing a section of the results for ‘barrow’ in ‘snd’ between 1800 and 1900.
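The grouping and ranking logic can be sketched as follows (field names such as total_quotes are assumptions standing in for the real Solr fields):

```python
# Sketch of the entry ranking: group the returned quote documents by
# entry, score each entry as matched quotes / total quotes (the total is
# the extra count field added to every Solr quote document), sort by
# score, and cap the displayed quotes at 10 per entry.
from collections import defaultdict

MAX_QUOTES_SHOWN = 10

def rank_entries(solr_docs):
    """solr_docs: list of {'entry_id', 'quote', 'total_quotes'} from the quote index."""
    grouped = defaultdict(list)
    totals = {}
    for doc in solr_docs:
        grouped[doc['entry_id']].append(doc['quote'])
        totals[doc['entry_id']] = doc['total_quotes']
    results = []
    for entry_id, quotes in grouped.items():
        score = len(quotes) / totals[entry_id]   # proportion of quotes matching
        results.append({
            'entry_id': entry_id,
            'score': score,
            'quotes': quotes[:MAX_QUOTES_SHOWN],  # cap at 10 quotes per entry
        })
    results.sort(key=lambda r: r['score'], reverse=True)
    return results
```

The real API applies the same ‘max results’ limit to the list of entries after this sort, just as the other search types do.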
The next step was to update the front-end to add in the new ‘date’ drop-down when quotations are selected and then to ensure the new quotation search information could be properly extracted, formatted and passed to the API to return the relevant data. The following screenshot shows the search form. The explanatory text still needs some work as it currently doesn’t feel very elegant – I think there’s a ‘to’ missing somewhere.
The final step for the week was to deal with the actual results themselves, as they are rather different in structure from the previous results: entries now potentially have multiple quotes, each of which contains information relating to the quote (e.g. dates, bib ID) and each of which may feature multiple snippets if the term appears several times within a single quote. I’ve managed to get the results to display correctly and the screenshot below shows the results of a search for ‘barrow’ in SND between 1800 and 1900.
The new search also now lets you perform a Boolean search on the contents of individual quotations rather than on all quotations in an entry. So, for example, you can search for ‘Messages AND Wean’ in quotes from 1980-1999 and only find quotations that match both terms, whereas previously an entry featuring one quote with ‘messages’ and another with ‘wean’ would also get returned. The screenshot below shows the new results.
There are a few things that I need to discuss with the team, though. The first is the ranking system. As previously agreed, entries are ranked based on the proportion of quotes that contain the search term. But this possibly ranks entries that only have one quote too highly. If there is only one quote and it features the term then 100% of quotes feature the term, so the entry is highly ranked, while longer, possibly more important entries are ranked lower because (for example) only 40 of their 50 quotes feature the term. We might want to look into weighting entries that have more quotes overall. Take, for example, an SND quotation search for ‘prince’ (see below). ‘Prince’ is ranked first, but results 2-6 appear next because they each have only one quote, which happens to feature ‘prince’.
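One possible adjustment, purely a sketch and not something the team has agreed on, is to smooth the proportion with a pseudocount so that single-quote entries cannot automatically score 100%:

```python
# Smoothed ranking score: adding a pseudocount k to the denominator
# behaves like assuming k extra unmatched quotes per entry, which
# penalises entries with very few quotes. k = 2 is an arbitrary choice.
def smoothed_score(matched, total, k=2):
    return matched / (total + k)

# An entry with 1 matching quote out of 1 no longer outranks an entry
# with 40 matching quotes out of 50.
```

A larger k weights total quote counts more heavily; the right value would need to be tuned against real searches like the ‘prince’ example above.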
The second issue is that the new system cuts off quotations for entries after the tenth (as you can see for ‘Prince’, above). We’d agreed on this approach to stop entries with lots of quotes swamping the results, but currently nothing is displayed to say that the results have been snipped. We might want to add a note under the tenth quote.
The third issue is that the quote field in Solr is currently stemmed, meaning the stems of words are stored and Solr can then match alternative forms. This can work well – for example the ‘messages AND wean’ results include results for ‘message’ and ‘weans’ too. But it can also be a bit too broad. See for example the screenshot below, which shows a quotation search for ‘aggressive’. As you can see, it has returned quotations that feature ‘aggression’, ‘aggressively’ and ‘aggress’ in addition to ‘aggressive’. This might be useful, but it might cause confusion and we’ll need to discuss this further at some point.
Next week I’ll hopefully start work on the filtering of search results for all search types, which will involve a major change to the way headword searches work and more big changes to the Solr indexes.
Also this week I investigated applying OED DOIs to the OED lexemes we link to in the Historical Thesaurus. Each OED sense now has its own DOI that we can get access to, and I was sent a spreadsheet containing several thousand of these as an example. The idea is that links from the HT’s lexemes to the OED would be updated to use these DOIs rather than performing a search of the OED for the word, which is what currently happens.
After a few hours of research I reckoned it would be possible to apply the DOIs to the HT data, but there are some things that we’ll need to consider. The OED spreadsheet looks like it will contain every sense while the HT data does not, so much of the spreadsheet will likely not match anything in our system. I wrote a little script to check the spreadsheet against the HT’s OED lexeme table: 6,186 rows in the spreadsheet match one (or more) lexeme in the database table while 7,256 don’t. I also noted that the combination of entry_id and element_id (in our database called refentry and refid) is not necessarily unique in the HT’s OED lexeme table. This can happen if a word appears in multiple categories, and there is a further ID called ‘lemmaid’ that was sometimes used in combination with the other two IDs to differentiate specific lexemes. In the spreadsheet there are 1,180 rows that match multiple rows in the HT’s OED lexeme table. However, this isn’t really a problem: it usually just means a word appears in multiple categories, so the same DOI would apply to multiple lexemes.
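The checking script essentially does the following (modelled here with in-memory lists rather than the real spreadsheet and database):

```python
# Sketch of the spreadsheet-vs-database check: index the OED lexeme
# table on (refentry, refid), then count spreadsheet rows that match
# nothing, match one lexeme, or match several (a word in multiple
# categories). The dict keys are stand-ins for the real columns.
from collections import defaultdict

def match_spreadsheet(sheet_rows, oed_lexemes):
    index = defaultdict(list)
    for lex in oed_lexemes:
        index[(lex['refentry'], lex['refid'])].append(lex)
    matched, unmatched, multi = 0, 0, 0
    for row in sheet_rows:
        hits = index.get((row['entry_id'], row['element_id']), [])
        if not hits:
            unmatched += 1
        else:
            matched += 1
            if len(hits) > 1:
                multi += 1   # same DOI would apply to every matching lexeme
    return matched, unmatched, multi
```

Run over the real data this produces the 6,186 / 7,256 / 1,180 figures quoted above.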
What is potentially a problem is that we haven’t matched up all of the OED lexeme records with the HT lexeme records. While 6,186 rows in the spreadsheet match one or more rows in the OED lexeme table, only 4,425 rows in the spreadsheet match one or more rows in the HT’s lexeme table. We will not be able to update the links to switch to DOIs for any HT lexemes that aren’t matched to an OED lexeme. After checking I discovered that there are 87,713 non-OE lexemes in the HT lexeme table that are not linked to an OED lexeme. None of these will be able to have a DOI (and neither will the OE words, presumably).
Another potential problem is that the sense an HT lexeme is linked to is not necessarily the main sense for the OED lexeme. In such cases the DOI leads to a section of the OED entry that is only accessible to logged-in users of the OED site. An example from the spreadsheet is ‘aardvark’. Our HT lexeme links to entry_id 22, element_id 16201412, which has the DOI https://doi.org/10.1093/OED/1516256385 which when you’re not logged in displays a ‘Please purchase a subscription’ page. The other entry for ‘aardvark’ in the spreadsheet has entry_id 22 and element_id 16201390, which has the DOI https://doi.org/10.1093/OED/9531538482 which leads to the summary page; but the HT’s link will be the first DOI above and not the second. Note that currently we link to the search results on the OED site, which actually might be more useful for many people. ‘Aardvark’ as found here: https://ht.ac.uk/category/?type=search&qsearch=aardvark&page=1#id=39313 currently links to this OED page: https://www.oed.com/search/dictionary/?q=aard-vark
To summarise: I can update all lexemes in the HT’s OED lexeme table that match the entry_id and element_id columns in the spreadsheet to add in the relevant DOI. I can also then ensure that any HT lexeme records linked to these OED lexemes also feature the DOI, but this will apply to fewer lexemes due to there still being many HT lexemes that are not linked. I could then update the links through to the OED for these lexemes, but this might not actually work as well as the current link to search results, due to many OED DOIs leading to restricted pages. I’ll need to hear back from the rest of the team before I can take this further.
Also this week I had a meeting with Pauline Mackay and Craig Lamont to discuss an interactive map of Burns’ correspondents. We’d discussed this about three years ago and they are now reaching a point where they would like to develop the map. We discussed various options for base maps, data categorisation and time sliders and I gave them a demonstration of the Books and Borrowing project’s Chambers’ Library map, which I’d previously developed (https://borrowing.stir.ac.uk/chambers-library-map/). They were pretty impressed with this and thought it would be a good model for their map. Pauline and Craig are now going to work on some sample data to get me started, and once I receive this I’ll be able to begin development. We had our meeting in the café of the new ARC building, which I’d never been to before, so it was a good opportunity to see the place.
Also this week I fixed some issues with images for one of the library registers for the Royal High School for the Books and Borrowing project. These had been assigned the wrong ID in the spreadsheet I’d initially used to generate the data and I needed to write a little script to rectify this.
Finally, I had a chat with Joanna Kopaczyk about a potential project she’s putting together. I can’t say much about it at this stage, but I’ll probably be able to use the systems I developed last year for the Anglo-Norman Dictionary’s Textbase (see https://anglo-norman.net/textbase-browse/ and https://anglo-norman.net/textbase-search/). I’m meeting with Joanna to discuss this further next week.
I attended the workshop ‘The impact of multilingualism on the vocabulary and stylistics of Medieval English’ in Zurich this week. The workshop ran on Tuesday and Wednesday and I travelled to Zurich with my colleagues Marc Alexander and Fraser Dallachy on Monday. It was really great to travel to a workshop in a different country again as I’d not been abroad since before Lockdown. I’d never been to Zurich before and it was a lovely city. The workshop itself was great, with some very interesting papers and good opportunities to meet other researchers and discuss potential future projects. I gave a paper on the Historical Thesaurus, its categories and data structures and how semantic web technologies may be used to more effectively structure, manage and share the Historical Thesaurus’s semantically arranged dataset. It was a half-hour paper with 10 minutes for questions afterwards and it went pretty well. The audience wasn’t especially technical and I’m not sure how interesting the topic was to most people, but it was well received and I’m glad I had the opportunity to both attend the event and to research the topic as I have greatly increased my knowledge of semantic web technologies such as RDF, graph databases and SPARQL, and as part of the research I managed to write a script that generated an RDF version of the complete HT category data, which may come in handy one day.
I got back home just before midnight on the Wednesday and returned to normal work first thing on Thursday. This included submitting my expenses from the workshop and replying to a few emails that had come in regarding my office (it looks like the dry rot work is going to take a while to resolve and it also looks like I’ll have to share my temporary office) and attempting to set up web hosting for the VARICS project, which Arts IT Support seem reluctant to do. I also looked into an issue with the DSL that Ann Ferguson had spotted and spoke to the IT people at Stirling about their current progress with setting up a Solr instance for the Books and Borrowing project. I also replaced a selection of library register images with better versions for that project and arranged a meeting for next Monday with the project’s PI and Co-I to discuss progress with the front-end.
I spent most of Friday writing a Data Management Plan and attending a Zoom call for a new speech therapy project I’m involved with. It’s an ESRC funding proposal involving Glasgow and Strathclyde and I’ll be managing the technical aspects. We had a useful call and I managed to complete an initial version of the DMP that the PI is going to adapt if required.
The first week back after the Christmas holidays was supposed to be a three-day week, but unfortunately after returning to work on Wednesday I started with some sort of winter vomiting virus that affected me throughout Wednesday night and I was off work on Thursday. I was still feeling very shaky on Friday but I managed to do a full day’s work nonetheless.
My two days were mostly spent creating my slides for the talk I’m giving at a workshop in Zurich next week and then practising the talk. I also engaged in an email conversation about the state of Arts IT Support after the database on the server that hosts many of our most important websites went down on the first day of the Christmas holidays and remained offline for the best part of two weeks. This took down sites such as the Historical Thesaurus, Seeing Speech, The Glasgow Story and the Emblems websites and I had to spend time over the holidays replying and apologising to people who contacted me about the sites being unavailable. As I don’t have command-line access to the servers there was nothing I could do to fix the issue and despite several members of staff contacting Arts IT Support no response was received from them. The issue was finally resolved on the 3rd of January but we have still received no communication from Arts IT Support to inform us that the issue has been resolved, to let us know what caused it, or to apologise for the incident, which is really not good enough. Arts IT Support are in a shocking state at the moment due to critical staff leaving and not being replaced and I’m afraid it looks like the situation may not improve for several months yet, meaning issues with our websites are likely to continue in 2023.
This was the last week before the Christmas holidays, and Friday was a holiday. I spent some time on Monday making further updates to the Speech Star data. I fixed some errors in the data and made some updates to the error type descriptions. I also made ‘poster’ images from the latest batch of child speech videos I’d created last week as this was something I’d forgotten to do at the time. I also fixed some issues with the non-disordered speech data, including changing a dash to an underscore in the filenames of the files for one speaker as there had been a mismatch between filenames and metadata, causing none of the videos to open in the site. I also created records for two projects (The Gentle Shepherd and Speak For Yersel) on this very site (see https://digital-humanities.glasgow.ac.uk/projects/last-updated/) as these are the projects I’ve been working on that have actually launched in the past year. Other major ones such as Books and Borrowing and Speech Star are not yet ready to share. I also updated all of the WordPress sites I manage to the latest version.
On Tuesday I travelled into the University to locate my new office. My stuff had been moved across last week after a leak in the building resulted in water pouring through my office. Plus work is ongoing to fix the dry rot in the building and I would have needed to move out for that anyway. It took a little time to get the new office in order and to get my computer equipment set up, but once it was all done it was actually a very nice location – much nicer than the horrible little room I’m usually stuck in.
I spent most of Tuesday upgrading Google Analytics for all of the sites I manage that use it. Google’s current analytics system is being retired in July next year and I decided to use the time in the run-up to Christmas to migrate the sites over to the new Google Analytics 4 platform. This was a mostly straightforward process, although as usual Google’s systems feel clunky and counterintuitive at times. It was also a fairly lengthy process as I had to update the code for each site in question. Nevertheless I managed to get it done and informed all of the staff whose websites would be affected by the change. I also had a further chat with Geert, the editor of the Anglo-Norman Dictionary, about the new citation edit feature I’m planning at the moment.
On Wednesday I had a meeting with prospective project partners in Strathclyde about a speech therapy proposal we’re putting together. It was good to meet people and to discuss things. I’ll be working on the Data Management Plan for the proposal after the holidays. I spent the rest of the day working on my paper for the workshop I’m attending in Zurich in the second week of January. I have now finished the paper, which is quite a relief.
On Thursday I spent some time working for the Dictionaries of the Scots Language. I responded to an email from Ann Fergusson about how we should handle links to ancillary pages in the XML. There are two issues here that need to be agreed upon. The first issue is how to represent links to things other than entries in the entry XML. We currently have the <ref> element that is used to link from one entry to another (e.g. <ref refid="snd00065761">Chowky</ref>). We could use the HTML element <a> in the XML for links to things other than entries, but I personally think it’s best not to use this as (in my opinion) it’s better for XML elements to be meaningful when you look at them and the meaning of <a> isn’t especially clear. It might be better to use <ref> with a different attribute instead of ‘refid’, for example <ref url="https://dsl.ac.uk/geographical-labels">. Reusing <ref> means we don’t need to update the DTD (the rules that define which elements can be used where in the XML) to add a new element.
Of course other people may think that inventing our own way of writing HTML links is daft when everyone is already familiar with <a href="https://dsl.ac.uk/geographical-labels"> and we could use the latter if people prefer. If this is the case we would need to update the DTD to allow such elements to be used. If we didn’t update the DTD the XML files would fail to validate.
Whichever way is chosen, there is a second issue that will need to be addressed: I will need to update the XSLT that transforms the XML into HTML to tell the script how to handle either a <ref> with a ‘url’ attribute or a <a> with an ‘href’ attribute. Without updating the XSLT the links won’t work. I can add such a rule in when we decide how best to represent links in the XML.
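Whichever representation is chosen, the XSLT rule would do the equivalent of this Python sketch (shown here for the <ref url="…"> option; a <ref refid="…"> would still become an entry link as before):

```python
# Sketch of the transform step: rewrite any <ref url="..."> element in
# the entry XML into an HTML <a href="..."> when generating the display
# markup. The real site does this in XSLT rather than Python.
import xml.etree.ElementTree as ET

def transform_refs(xml_string):
    root = ET.fromstring(xml_string)
    for ref in root.iter('ref'):
        if 'url' in ref.attrib:
            ref.tag = 'a'                          # becomes an HTML anchor
            ref.set('href', ref.attrib.pop('url')) # url attribute -> href
    return ET.tostring(root, encoding='unicode')

html = transform_refs('<sense>See <ref url="https://dsl.ac.uk/geographical-labels">labels</ref>.</sense>')
```

The equivalent XSLT would simply be an extra template matching `ref[@url]` alongside the existing `ref[@refid]` rule.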
I also made a couple of tweaks to the wildcard search term highlighting feature I was working on last week and then published the update on the live DSL site. Now when you perform a search for something like ‘chr*mas’ and then select an entry to view, any word that matches the wildcard pattern will be highlighted. For example, go to this page: https://dsl.ac.uk/results/chr*mas/fulltext/withquotes/both/ and then select one of the entries and you’ll see the term highlighted in the entry page.
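The highlighting logic is roughly equivalent to the following sketch (the span class name is an assumption; the live site may use different markup):

```python
# Sketch of wildcard highlighting: convert the user's search term into a
# regular expression ('*' = any run of letters, '?' = a single letter)
# and wrap every whole word that matches in a highlight span.
import re

def highlight(term, text):
    # escape regex metacharacters first, then expand the search wildcards
    pattern = re.escape(term).replace(r'\*', '[a-zA-Z]*').replace(r'\?', '[a-zA-Z]')
    return re.sub(r'\b' + pattern + r'\b',
                  lambda m: '<span class="searchTermHighlight">' + m.group(0) + '</span>',
                  text, flags=re.IGNORECASE)

out = highlight('chr*mas', 'A Christmas carol')
```

The `\b` word boundaries are what make the whole matching word light up rather than just the literal characters of the search string.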
That’s all from me for this year. Merry chr*mas one and all!
There was a problem with the server on which a lot of our major sites such as the Historical Thesaurus and Seeing Speech are hosted that started on Friday and left all of the sites offline until Monday. This was a really embarrassing and frustrating situation and I had to deal with lots of emails from users of the sites who were unable to access them. As I don’t have command-line access to the servers all I could do was report the issue via our IT Helpdesk system. Thankfully by mid-morning on Monday the sites were all back up again, but the incident raised serious issues about the state of Arts IT Support, who are massively understaffed at the moment. Arts IT also refused to set up hosting for a project that we’re collaborating with Strathclyde University on, and in fact stated that they would not set up hosting for any further websites, which will have a massive negative impact on several projects that are still in the pipeline and ultimately means I will not be able to work on any new projects until this is resolved. The PI for the new project with Strathclyde is Jane Stuart-Smith, and thankfully she was also not very happy with the situation. We arranged a meeting with Liz Broe, who oversees Arts IT Support, to discuss the issues and had a good discussion about how we ended up in this state and how things will be resolved. In the short-term some additional support is being drafted in from other colleges while new staff will be recruited in the medium term, and Liz has stated that hosting for new websites (including the Strathclyde one) will continue to be offered, which is quite a relief.
I also discovered this week that there has been a leak in 13 University Gardens and water has been pouring through my office. I was already scheduled to be moved out of the building due to the dry rot that they’ve found all the way up the back wall (which my office is on) but this has made things a little more urgent. I’m still generally working from home every day except Tuesday and apparently all my stuff has been moved to a different building, so I’ll just need to see how the process has gone when I’m back in the University next week.
In terms of actual work this week, I spent a bit more time writing my paper about the Historical Thesaurus and Semantic Web technologies for the workshop in January. This is coming together now, although I still need to shape it into a presentation, which will take time. I also spent some time working on the Speech Star project, updating the speech error database to fix a number of issues with the data that Eleanor had spotted and then adding in new error type descriptions for new error types that had been included. I also added in some ancillary page content and had a chat with Eleanor about the database system the website uses.
I also spent some time working for the DSL this week. Rhona had noted that when you perform a full text or quotation search (i.e. a search using Solr) with wildcards (e.g. chr*mas) the search results display entries with snippets that highlight the whole word where the search string occurred (e.g. ‘Christmas’). However, when clicking through to the entry page such highlighting was not appearing, even though highlighting in the entry page does work when performing a search without wildcards.
I also spent some time working for the Anglo-Norman Dictionary this week. I updated the citation search on the public website. Previously the citation text was only added into the search results if you also searched for a specific form within a siglum, for example https://anglo-norman.net/search/citation/%22tout%22/null/A-N_Falconry and other citation searches (e.g. just selecting a siglum and / or a siglum date) would only return the entries the siglum appeared in, without the individual citations. Now the citations appear in these searches too. For example, all citations from A-N Falconry: https://anglo-norman.net/search/citation/null/null/A-N_Falconry and all citations where the citation date is 1400: https://anglo-norman.net/search/citation/null/1400. This also means that when you view the citations by pressing on the ‘Search AND Citations’ button for a siglum in the bibliography you now see each citation for the listed entries.
I then spent most of a day thinking through all of the issues relating to the new ‘DMS citation search and edit’ feature that the editor wants me to implement and wrote an initial document detailing how the feature will work. There has been quite a lot to think through and I thought it wise to document the feature rather than just launching into its creation without a clear plan. I might have some time to start work on this next week as I’m working up to and including Thursday, but it depends how I get on with some other tasks I need to do for other projects.
Also this week I attended the Christmas lunch for the Books and Borrowing project in Edinburgh. Unfortunately there was a train strike that day so I decided to get the bus through to Edinburgh. The journey there was fine, taking about an hour and a half, but I got the 4pm bus on the way back and it was a nightmare, taking two hours and forty minutes. I will never get the bus between Glasgow and Edinburgh anywhere near rush hour again.
I continued my research into RDF, the semantic web and linked open data and how they could be applied to the data of the Historical Thesaurus this week in preparation for a paper I’ll be giving at a workshop in January, and also to learn more about these technologies and concepts in general. I followed a few tutorials about RDF, for example here https://cambridgesemantics.com/blog/semantic-university/learn-rdf/ and read up about linked open data, for example here https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/. I also found a site that visualises linked open data projects here https://lod-cloud.net/.
I then manually created a small sample of the HT’s category structure, featuring multiple hierarchical levels and both main and subcategories, in RDF/XML using the Simple Knowledge Organization System (SKOS) model. This is a W3C standard for representing thesaurus data in RDF. More information about it can be found on Wikipedia here: https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System and on the W3C’s website here https://www.w3.org/TR/skos-primer/ and here https://www.w3.org/TR/swbp-skos-core-guide/ and here https://www.w3.org/TR/swbp-thesaurus-pubguide/ and here https://www.w3.org/2001/sw/wiki/SKOS/Dataset. I also referenced a guide to SKOS for Information Professionals here https://www.ala.org/alcts/resources/z687/skos. I then imported this manually created sample into the Apache Jena server I set up last week to test that it would work, which thankfully it did.
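A minimal SKOS sample along those lines might look like the following (the example.org URIs, category number and label are placeholders rather than the actual HT data):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:ConceptScheme rdf:about="http://example.org/ht/scheme"/>
  <!-- One category, linked to its parent via skos:broader -->
  <skos:Concept rdf:about="http://example.org/ht/category/01.02">
    <skos:prefLabel xml:lang="en">Life</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/ht/category/01"/>
    <skos:inScheme rdf:resource="http://example.org/ht/scheme"/>
  </skos:Concept>
</rdf:RDF>
```

The hierarchy emerges from the chain of skos:broader links, which is what makes the model a natural fit for thesaurus data.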
After that I wrote a small script to generate a comparable RDF structure for the entire HT category system. I ran this on an instance of the database on my laptop to avoid overloading the server, and after a few minutes of processing I had an RDF representation of the HT’s hierarchically arranged categories in an XML file that was about 100MB in size. I fed this into my Apache Jena instance and the import was a success. I then spent quite a bit of time getting to grips with the SPARQL query language that is used to query RDF data, and by the end of the week I had managed to replicate some of the queries we use in the HT to generate the tree browser, for example ‘get all noun main categories at this level’ or ‘get all noun main categories that are direct children of a specified category’.
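The second of those tree-browser queries can be sketched in SPARQL roughly as follows (the example.org URIs, and the ht:pos and ht:isMainCategory properties, are placeholders reflecting my sample modelling rather than the final structure):

```
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ht:   <http://example.org/ht/schema#>

# Direct children of a given category that are noun main categories
SELECT ?child ?label WHERE {
  ?child skos:broader <http://example.org/ht/category/01> ;
         skos:prefLabel ?label ;
         ht:pos "n" ;
         ht:isMainCategory true .
}
ORDER BY ?label
```

Because skos:broader only links one level up, the query naturally returns direct children rather than all descendants; retrieving a full subtree would use a property path such as `skos:broader+`.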
I then began experimenting with other RDF tools in the hope of being able to generate some nice visualisations of the RDF data, but this is where things came a bit unstuck. I set up a nice desktop RDF database called GraphDB (https://www.ontotext.com/products/graphdb/) and also experimented with the Neo4J graph database (https://neo4j.com/) as my assumption had been that graph databases (which store data as dots and lines, like RDF triples) would include functionality to visualise these connections. Unfortunately I have not been able to find any tools that allow you to just plug RDF data in and visualise it. I found a Stack Overflow page about this (https://stackoverflow.com/questions/66720/are-there-any-tools-to-visualize-a-rdf-graph-please-include-a-screenshot) but none of the suggestions on the page seemed to work. I tried downloading the desktop visualisation tool Gephi (https://gephi.org/) as apparently it had a plugin that would enable RDF data to be used, but the plugin is no longer available and other visualisation frameworks such as D3 do not work with RDF data but require the data to be migrated to another format first. It seems strange that data structured in such a way as to make it ideal for network style visualisations should have no tools available to natively visualise the data and I am rather disappointed by the situation. Of course it could just be that my Google skills have failed me, but I don’t think so.
In addition to the above I spent some time actually writing the paper that all of this will go into. I also responded to a query from a researcher at Strathclyde who is putting together a speech and language therapy proposal and wondered whether I’d be able to help out, given my involvement in several other such projects. I also spoke to the IT people at Stirling about the Solr instance for the Books and Borrowing project and made a few tweaks to the Speech Star project’s introductory text.
There was another strike day on Wednesday this week so it was a four-day week for me. On Monday I attended a meeting about the Historical Thesaurus, and afterwards I dealt with some issues that cropped up. These included getting an up to date dump of the HT database to Marc and Fraser, investigating a new subdomain to use for test purposes, looking into adding a new ‘sensitive’ flag to the database for categories that contain potentially offensive content, reminding people where our latest stats page is located and looking into connections between the HT and Mapping Metaphor datasets. I also spent some more time this week researching semantic web technologies and how these could be used for thesaurus data. This included setting up an Apache Jena instance on my laptop with a Fuseki server for querying RDF triples using the SPARQL query language. See https://jena.apache.org/ and https://jena.apache.org/documentation/fuseki2/index.html for more information on these. I played around with some sample datasets and thought about how our thesaurus data might be structured to use a similar approach. Hopefully next week I’ll migrate some of the HT data to RDF and experiment with it.
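For reference, once the Fuseki distribution is unpacked it can be started with an in-memory dataset using its bundled launch script (the dataset name /ht here is just a placeholder):

```
./fuseki-server --mem /ht
# SPARQL endpoint is then available at http://localhost:3030/ht/sparql
# and a web UI for running queries at http://localhost:3030/
```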
Also this week I spent quite a bit of time speaking to IT Services about the state of the servers that Arts hosts, and migrating the Cullen Project website to a new server, as the server it is currently on badly needs upgrades and there is currently no-one to manage this. Migrating the Cullen Project website took the best part of a day to complete, as all database queries in the code needed to be upgraded. This took some investigation, as it turns out ‘mysqli_’ requires a connection to be passed to many of its functions where ‘mysql_’ doesn’t, and where ‘mysql_’ does accept a connection, ‘mysqli_’ takes the connection and the query string in the opposite order. There were also some character encoding issues cropping up. It turned out these were caused by the database not being UTF-8, and the database connection script needed to set the character set to ‘latin1’ for the characters to display properly. Luca also helped with the migration, dealing with the XML and eXistDB side of things, and by the end of the week we had a fully operational version of the site running at a temporary URL on a new server. We put in a request to have the DNS for the project’s domain switched to the new server and once this takes effect we’ll be able to switch the old server off.
Also this week I fixed a couple of minor issues with a couple of the place-names resources, participated in an interview panel for a new role at college level, duplicated a section of the Seeing Speech website on the Dynamic Dialects website at the request of Eleanor Lawson and had discussions about moving out of my office due to work being carried out in the building.
I spent almost all of this week working with a version of Apache Solr installed on my laptop, experimenting with data from the Books and Borrowing project and getting to grips with setting up a data core and customising a schema for the data, preparing data for ingest into Solr, importing the data and running queries on it, including facetted searching.
I started the week experimenting with our existing database, creating a cache table and writing a script to import a sample of 100 records. This cache table could hold all of the data that the quick search would need to query and would be very speedy to search, but I realised that other aspects related to the searching would still be slow. Facetted searching would still require several other database queries to be executed, as would extracting all of the fields that would be necessary to display the search results and it seemed inadvisable to try and create all of this functionality myself when an existing package like Solr could already do it all.
Solr is considerably faster than using the database approach and its querying is much more flexible. It also offers facetted search options that are returned pretty much instantaneously which would be hopelessly slow if I attempted to create something comparable directly with the database. For example, I can query the Solr data to find all borrowing records that involve a book holding record with a standardised title that includes the word ‘Roman’, returning 3325 records, but Solr can then also return a breakdown of the number of records by other fields, for example publication place:
"8vo., plates, port., maps.",88,
"8vo., plates: maps.",16
These would then allow me to build in the options to refine the search results further by one (or more) of the above criteria. Although it would be possible to build such a query mechanism myself using the database it is likely that such an approach would be much slower and would take me time to develop. It seems much more sensible to use an existing solution if this is going to be possible.
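As a sketch, a facetted request like the ‘Roman’ example above might look something like this in Solr’s HTTP API (the core name and field names are my assumptions, not the final schema):

```
/solr/bnb/select
    ?q=standardisedtitle:Roman
    &rows=0
    &facet=true
    &facet.field=publicationplace
    &facet.mincount=1
```

Setting rows=0 returns just the facet counts without the records themselves, which is useful when only the breakdown is needed.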
In my experiments with Solr on my laptop I initially imported 100 borrowing records exported via the API call I created to generate the search results page. This gave me a good starting point to experiment with Solr’s search capabilities, but the structure of the JSON file returned from the API was rather more complicated than we’d need purely for search purposes and included a lot of data that isn’t really needed, as the returned data contains everything required to display the full borrowing record. I therefore worked out a simpler JSON structure that would only contain the fields that we would either want to search or could be used in a simplified search results page. Here’s an example:
"lname": "Glasgow University Library",
"transcription": "Euseb: Eclesiastical History",
"standardisedtitle": "Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned historiographers, Eusebius, Socrates, and Evagrius.",
"bfullnames": ["Charles Wilson"],
"boccs": ["University Student", "Education"],
"asnames": ["Eusebius of Caesarea"],
"afullnames": ["Eusebius of Caesarea"],
"edtitles": ["Ancient ecclesiasticall histories of the first six hundred years after Christ; written in the Greek tongue by three learned historiographers, Eusebius, Socrates, and Evagrius."],
I wrote a script that would export individual JSON files like the above for each active borrowing record in our system (currently 141,335 records). I ran this on a version of the database stored on my laptop rather than running it on the server to avoid overloading the server. I then created a Solr Core for the data and specified an appropriate schema. This defines each of the above fields and the types of data the fields can hold (e.g. some fields can hold multiple values, such as borrower occupations, some fields are text strings, some are integers, some are dates). I then ran the Solr script that ingests the data.
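By way of illustration, the schema field definitions take roughly this form in Solr (the field names follow the JSON example above; the types shown are plausible choices rather than the exact ones I used):

```xml
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="transcription" type="text_general" indexed="true" stored="true"/>
<field name="boccs" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="borrowed" type="pdate" indexed="true" stored="true"/>
<field name="returned" type="pint" indexed="true" stored="true"/>
```

The multiValued attribute is what allows fields such as borrower occupations to hold several values per record.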
It took a lot of time to get things working as I needed to experiment with the structure of the JSON files that my script generated in order to account for various complexities in the data. I also encountered some issues with the data that only became apparent at the point of ingest when records were rejected. These issues only affected a few records out of nearly 150,000 so I needed to tweak and re-run the data export many times until all issues were ironed out. As both the data export and the ingest scripts took quite a while to run the whole process took several days to get right.
Some issues encountered include:
- Empty fields in the data resulting in no value for the corresponding JSON field (e.g. "bday": <nothing here>), which invalidated the JSON file structure. I needed to update the data export script to ensure such empty fields were not included.
- Solr’s date fields requiring a full date (e.g. 1792-02-16), meaning partial dates (e.g. 1792) failed. I ended up reverting to an integer field for returned dates, as these are generally much vaguer, and generating placeholder days and months where required for the borrowed date.
- Solr’s default (and required) ID field having to be a string rather than an integer, which is what I’d set it to in order to match our BNID field. This was a bit of a strange one as I would have expected an integer ID to be allowed and it took some time to investigate why my nice integer ID was failing.
- Realising more fields should be added to the JSON output as I went on, and therefore having to regenerate the data each time (e.g. I added in borrower gender and IDs for borrowers, editions, works and authors)
- Issues with certain characters appearing in the text fields causing the import to break. For example, double quotes needed to be converted to the entity ‘&quot;’ as their appearance in the JSON caused the structure to be invalid. I therefore updated the translation, original title and standardised title fields, but then the import still failed as a few borrowers also have double quotes in their names.
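To give a flavour of the fixes above, here is a minimal Python sketch of the kind of per-record export logic involved (the field names are hypothetical and the real script covers many more fields):

```python
import json

def borrowing_to_solr_doc(row):
    """Build one Solr-ready document from a borrowing record.

    Field names here are illustrative; a real export would cover every
    field in the schema.
    """
    # Solr's required 'id' field must be a string, so cast the integer BNID
    doc = {"id": str(row["bnid"])}
    # Pad a partial borrowed date (e.g. '1792') out to a full Solr date
    borrowed = row.get("borrowed")
    if borrowed:
        parts = str(borrowed).split("-")
        while len(parts) < 3:
            parts.append("01")
        doc["borrowed"] = "-".join(parts) + "T00:00:00Z"
    # Omit empty fields entirely rather than writing invalid empty values
    for field in ("transcription", "standardisedtitle", "bday"):
        value = row.get(field)
        if value not in (None, ""):
            doc[field] = value
    return doc

# json.dumps escapes embedded double quotes, avoiding invalid JSON output
print(json.dumps(borrowing_to_solr_doc(
    {"bnid": 123, "borrowed": "1792",
     "transcription": 'Euseb: "Eclesiastical" History'})))
```

Using a serialiser such as json.dumps, rather than concatenating strings, sidesteps the quote-escaping problems described above.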
However, once all of these issues were addressed I managed to successfully import all 141,355 borrowing records into the Solr instance running on my laptop and was able to experiment with queries, all of which are running very quickly and will serve our needs very well. And now that the data export script is properly working I’ll be able to re-run this and ingest new data very easily in future.
The big issue now is whether we will be allowed to install an Apache Solr instance on a server at Stirling. We would need the latest release of Solr (v9 https://solr.apache.org/downloads.html) to be installed on a server. This requires Java JRE version 11 or higher (https://solr.apache.org/guide/solr/latest/deployment-guide/system-requirements.html). Solr uses the Apache Lucene search library and as far as I know it fires up a Java based server called Jetty when it runs. The deployment guide can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html
When Solr runs, a web-based admin interface is available through which the system can be managed and the data can be queried. This would need securing, and instructions about doing so can be found here: https://solr.apache.org/guide/solr/latest/deployment-guide/securing-solr.html
I think basic authentication would be sufficient, ideally with access limited to on-campus / VPN users. Other than for testing purposes there should only be one script that connects to the Solr URL (our API) so we could limit access to the IP address of this server, or if Solr is going to be installed on the same server then limiting access to localhost could work.
In terms of setting up the Solr instance, we would only need a single node installation (not SolrCloud). Once Solr is running we’d need a Core to be created. I have the schema file the core would require and can give instructions about setting this up. I’m assuming that I would not be given command-line access to the server, which would unfortunately mean that someone in Stirling’s IT department would need to execute a few commands for me, including setting up the Core and ingesting the data each time we have a new update.
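For reference, with a single-node installation the core creation and data ingest boil down to a couple of commands from the standard Solr distribution (the core name and path are placeholders):

```
bin/solr create -c bnb
bin/post -c bnb /path/to/exported/json/
```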
One downside to using Solr is that it is a separate system from the B&B database and will not reflect changes made to the project’s data until we run a new data export / ingest process. We won’t want to do this too frequently, as exporting the data takes at least an hour and transferring the files to the server for ingest also takes a long time (uploading hundreds of thousands of small files to a server can take hours; zipping them up, uploading the zip file and extracting it is also slow). Then someone with command-line access to the server will need to run the command to ingest the data. We’ll need to see if Stirling are prepared to do this for us.
Until we hear more about the chances of using Solr I’ll hold off doing any further work on B&B. I’ve got quite a lot to do for other projects that I’ve been putting off whilst I focus on this issue so I need to get back into that.
Other than the above B&B work I did spend a bit of time on other projects. I answered a query about a potential training event based on Speak For Yersel that Jennifer Smith emailed me about and I uploaded a video to the Speech Star site. I deleted a spurious entry from the Anglo-Norman Dictionary and fixed a typo on the ‘Browse Textbase’ page. I also had a chat with the editor about further developments of the Dictionary Management System that I’m going to start looking into next week. I also began doing some research into semantic web technologies for structuring thesaurus data in preparation for a paper I’ll be giving in Zurich in January.
Finally, I investigated potential updates to the Dictionaries of the Scots Language quotations search after receiving a series of emails from the team, who had been meeting to discuss how dates will be used in the site.
Currently the quotations are stripped of all tags to generate a single block of text that is then stored in the Solr indexing system and queried against when an advanced ‘quotes only’ search is performed. So for example in a search for ‘dreich’ (https://dsl.ac.uk/results/dreich/quotes/full/both/) Solr looks for the term in the following block of text for the entry https://dsl.ac.uk/entry/snd/dreich (block snipped to save space):
<field name="searchtext_onlyquotes">I think you will say yourself it is a dreich business.
Sic dreich wark. . . . For lang I tholed an’ fendit.
Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain.
And he’ll no fin his day’s dark ae hue the dreigher for wanting his breakfast on account of sic a cause.
It’s a dreich job howkin’ tatties wi’ the caul’ win’ in yer duds.
Driche and sair yer pain.
And even the ugsome driech o’ this Auld clarty Yirth is wi’ your kiss Transmogrified.
See a blanket of September sorrows unremitting drich and drizzle permeates our light outerwear.
The way Solr handles returning snippets is described on this page: https://solr.apache.org/guide/8_7/highlighting.html and the size of the snippet is set by the hl.fragsize variable, which “Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100.”. We don’t currently override this default so 100 characters is what we use per snippet (roughly – it can extend more than this to ensure complete words are displayed).
The hl.snippets variable specifies the maximum number of highlighted snippets that are returned per entry and this is currently set to 10. If you look at the SND result for ‘Dreich adj’ you will see that there are 10 snippets listed and this is because the maximum number of snippets has been reached. ‘Dreich’ actually occurs many more than 10 times in this entry. We can change this maximum, but I think 10 gives a good sense that the entry in question is going to be important.
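Putting those parameters together, the highlighting portion of the Solr request looks roughly like this (the core name is a placeholder; the hl parameters are the ones discussed above):

```
/solr/dsl/select
    ?q=searchtext_onlyquotes:dreich
    &hl=true
    &hl.fl=searchtext_onlyquotes
    &hl.snippets=10
    &hl.fragsize=100
```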
As the quotations block of text is just one massive block and isn’t split into individual quotations the snippets don’t respect the boundaries between quotations. So the first snippet for ‘Dreich Adj’ is:
“I think you will say yourself it is a dreich business. Sic dreich wark. . . . For lang I tholed an”
Which actually comprises the text from almost the entire first two quotes, while the next snippet:
“’ fendit. Ay! dreich an’ dowie’s been oor lot, An’ fraught wi’ muckle pain. And he’ll no fin his day’s”
Includes the last word of the second quote, all of the third quote and some of the fourth quote (which doesn’t actually include ‘dreich’ but ‘dreigher’ which is not highlighted).
So essentially while the snippets may look like they correspond to individual quotes this is absolutely not the case and the highlighted word is generally positioned around the middle of around 100 characters of text that can include several quotations. It also means that it is not possible to limit a search to two terms that appear within one single quotation at the moment because we don’t differentiate individual quotations – the search doesn’t know where one quotation ends and the next begins.
I have no idea how Solr works out exactly how to position the highlighted term within the 100 characters, and I don’t think this is something we have any control over. However, I think we will need to change the way we store and query quotations in order to better handle the snippets, allow Boolean searches to be limited to the text of specific quotes rather than the entire block and to enable quotation results to be refined by a date / date range, which is what the team wants.
We’ll need to store each quotation for an entry individually, each with its own date fields and potentially other fields later on such as part of speech. This will ensure snippets will in future only feature text from the quotation in question and will ensure that Boolean searches will be limited to text within individual quotations. However, it is a major change and it will require some time and experimentation to get working correctly and it may introduce other unforeseen issues.
I will need to change the way the search data is stored in Solr and I will need to change how the data is generated for ingest into Solr. The display of the search results will need to be reworked as the search will now be based around quotations rather than entries. I’ll need to group quotations into entries and we’ll need to decide whether to limit the number of quotations that get displayed per entry as for something like ‘dreich adj’ we would end up with many tens of quotations being returned, which would swamp the results page and make it difficult to use. It is also likely that the current ranking of results will no longer work as individual quotations will be returned rather than entire entries. The quotations themselves will be ranked, but that’s not going to be very helpful if we still want the results to be grouped by entry. I’ll need to look at alternatives, such as ranking entries by the number of quotations returned.
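As a sketch of that last idea, grouping per-quotation hits back into entries and ranking entries by hit count could look like this in Python (the document structure and field names are hypothetical):

```python
from collections import defaultdict

def group_hits_by_entry(quotation_hits):
    """Group per-quotation search hits by their parent entry and rank
    entries by the number of matching quotations (one possible ranking)."""
    grouped = defaultdict(list)
    for hit in quotation_hits:
        grouped[hit["entry"]].append(hit)
    # Entries with the most matching quotations come first
    return sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)

# Hypothetical hits as they might come back from a per-quotation index
hits = [
    {"entry": "snd/dreich", "quote": "Sic dreich wark.", "year": 1915},
    {"entry": "snd/dreich", "quote": "Driche and sair yer pain.", "year": 1923},
    {"entry": "snd/dowie", "quote": "dreich an' dowie's been oor lot", "year": 1887},
]
for entry, quotes in group_hits_by_entry(hits):
    print(entry, len(quotes))
```

This would also be the natural place to cap the number of quotations displayed per entry before rendering the results page.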
The DSL team has proposed that a date search could be provided as a filter on the search results page, and we would certainly be able to do this and incorporate other filters such as part of speech in future. This is known as ‘facetted searching’ and it’s the kind of thing you see in online shops: you view the search results and then see a list of limiting options, generally to the left of the results, often as a series of checkboxes with a number showing how many of the results each filter applies to. The good news is that Solr has this kind of faceting built in (in fact it is used to power many online shops). More good news is that this fits in with the work I’m already doing for the Books and Borrowing project as discussed at the start of this post, so I’ll be able to share my expertise between both projects.
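As a sketch, a quotation search limited to a date range with a part-of-speech facet might look like this in Solr’s query syntax (the core name and field names are assumptions about the future per-quotation index):

```
/solr/dsl-quotes/select
    ?q=quote:dreich
    &fq=year:[1800 TO 1900]
    &facet=true
    &facet.field=pos
```

The fq (filter query) parameter restricts the results without affecting relevance ranking, which is exactly what a date filter on the results page needs.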