I spent most of this week working on the new quotation date search for the Dictionaries of the Scots Language. I decided to develop the update initially on a version of the site and its data running on my laptop, as I have direct control over the Solr instance there – something I don’t have on the server. My first task was to create a new Solr index for the quotations and to write a script to export data from the database in a format that Solr could then index. With over 700,000 quotations this took a bit of time, and I did encounter some issues, such as several tens of thousands of quotations not having date tags, meaning dates for the quotations could not be extracted. I had a lengthy email conversation with the DSL team about this and thankfully it looks like the issue is not something I need to deal with: data is being worked on in their editing system and the vast majority of the dating issues I’d encountered will be fixed the next time the data is exported for me to use. I also encountered some further issues that needed to be addressed as I worked with the data. For example, I realised that each quote item in Solr needed to store the total number of quotes for its entry so that the ranking algorithm for entries could be worked out, and this meant updating the export script and the structure of the Solr index and then re-exporting all 700,000 quotations. Below is a screenshot of the Solr admin interface, showing a query of the new quotation index – a search for ‘barrow’.
With this in place I then needed to update the API that processes search requests, connects to Solr and spits out the search results in a suitable format for use on the website. This meant completely separating out and overhauling the quotation search, as it needed to connect to a different Solr index that featured data that had a very different structure. I needed to ensure quotations could be grouped by their entries and then subjected to the same ‘max results’ limitations as other searches. I also needed to create the ranking algorithm for entries based on the number of returned quotes vs the total number of quotes, sort the entries based on this and also ensure a maximum of 10 quotes per entry were displayed. I also had to add in a further search option for dates, as I’d already detailed in the requirements document I’d previously written. The screenshot below is of the new quotation endpoint in the API, showing a section of the results for ‘barrow’ in ‘snd’ between 1800 and 1900.
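The grouping and ranking described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the API's actual code: the field names (`entry_id`, `quote`, `total_quotes`), the default `max_results` value and the tie-breaking rule are all assumptions made for the example.

```python
from collections import defaultdict

MAX_QUOTES_PER_ENTRY = 10  # display cap per entry, as described above


def rank_entries(matching_quotes, max_results=500):
    """Group matching quote documents by entry and rank the entries.

    Each quote is a dict with the hypothetical keys 'entry_id', 'quote'
    and 'total_quotes' (the per-entry total stored on every quote item).
    """
    grouped = defaultdict(list)
    for quote in matching_quotes:
        grouped[quote["entry_id"]].append(quote)

    entries = []
    for entry_id, quotes in grouped.items():
        total = quotes[0]["total_quotes"]
        entries.append({
            "entry_id": entry_id,
            "matched": len(quotes),
            "total": total,
            # Proportion of the entry's quotes that contain the term.
            "score": len(quotes) / total,
            # Only the first ten quotes are kept for display.
            "quotes": quotes[:MAX_QUOTES_PER_ENTRY],
        })

    # Highest proportion first; ties broken by number of matching quotes
    # (the tie-break is an assumption, not something agreed with the team).
    entries.sort(key=lambda e: (-e["score"], -e["matched"]))
    return entries[:max_results]
```

Storing the entry's total on every quote item means the score can be worked out from the returned quotes alone, without a second lookup per entry.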
The next step was to update the front-end to add in the new ‘date’ drop-down when quotations are selected and then to ensure the new quotation search information could be properly extracted, formatted and passed to the API to return the relevant data. The following screenshot shows the search form. The explanatory text still needs some work as it currently doesn’t feel very elegant – I think there’s a ‘to’ missing somewhere.
The final step for the week was to deal with the actual results themselves, as they are rather different in structure to the previous results, as entries now potentially have multiple quotes, each of which contains information relating to the quote (e.g. dates, bib ID) and each of which may feature multiple snippets, if the term appears several times within a single quote. I’ve managed to get the results to display correctly and the screenshot below shows the results of a search for ‘barrow’ in snd between 1800 and 1900.
The new search also now lets you perform a Boolean search on the contents of individual quotations rather than all quotations in an entry. So for example you can search for ‘Messages AND Wean’ in quotes from 1980-1999 and only find those that match whereas previously if an entry featured one quote with ‘messages’ and another with ‘wean’ it would get returned. The screenshot below shows the new results.
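The difference between the old entry-level behaviour and the new quote-level behaviour can be shown with a small Python sketch. The two sample quotes are invented for illustration, and the matching here is plain whole-word comparison with no stemming, so it is a simplification of what Solr actually does.

```python
def contains_all(text, terms):
    """True if every term occurs as a whole word in the text."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


def entry_level_match(quotes, terms):
    """Old behaviour: the terms may be spread across different quotes."""
    return contains_all(" ".join(quotes), terms)


def quote_level_matches(quotes, terms):
    """New behaviour: all terms must occur within the same quote."""
    return [quote for quote in quotes if contains_all(quote, terms)]


# Invented sample data: one entry with two quotes.
quotes = ["away for the messages", "the wean was greeting"]
terms = ["messages", "wean"]

print(entry_level_match(quotes, terms))    # entry would have been returned
print(quote_level_matches(quotes, terms))  # no single quote has both terms
```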
There are a few things that I need to discuss with the team, though. The first is the ranking system. As previously agreed upon, entries are ranked based on the proportion of quotes that contain the search term. But this is possibly ranking entries that only have one quote too highly. If there is only one quote and it features the term then 100% of quotes feature the term so the entry is highly ranked, but longer, possibly more important entries are ranked lower because (for example) out of 50 quotes 40 feature the term. We might want to look into weighting entries that have more quotes overall. Take, for example, an SND quotation search for ‘prince’ (see below). ‘Prince’ is ranked first, but then results 2-6 appear because they only have one quote, which happens to feature ‘prince’.
The second issue is that the new system cuts off quotations for entries after the tenth (as you can see for ‘Prince’, above). We’d agreed on this approach to stop entries with lots of quotes swamping the results, but currently nothing is displayed to say that the results have been snipped. We might want to add a note under the tenth quote.
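To make the arithmetic concrete, here is a small Python comparison of the current proportion with one possible weighting. The damped version, which adds a constant `k` to the total, is purely a hypothetical option and not something that has been agreed; the value of `k` is arbitrary.

```python
def proportion(matched, total):
    """Current ranking score: share of an entry's quotes that match."""
    return matched / total


def damped_proportion(matched, total, k=5):
    """Hypothetical alternative: pretend each entry has k extra quotes
    that do not match, so very short entries cannot reach a top score."""
    return matched / (total + k)


for matched, total in [(1, 1), (40, 50)]:
    print(matched, total,
          round(proportion(matched, total), 3),
          round(damped_proportion(matched, total), 3))
```

With the plain proportion the single-quote entry outranks the 40-out-of-50 entry, while with the damped version the order is reversed. Whether that is the right trade-off would depend on what the team wants the ranking to favour.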
The third issue is that the quote field in Solr is currently stemmed, meaning the stems of words are stored and Solr can then match alternative forms. This can work well – for example the ‘messages AND wean’ results include results for ‘message’ and ‘weans’ too. But it can also be a bit too broad. See for example the screenshot below, which shows a quotation search for ‘aggressive’. As you can see, it has returned quotations that feature ‘aggression’, ‘aggressively’ and ‘aggress’ in addition to ‘aggressive’. This might be useful, but it might cause confusion and we’ll need to discuss this further at some point.
Next week I’ll hopefully start work on the filtering of search results for all search types, which will involve a major change to the way headword searches work and more big changes to the Solr indexes.
Also this week I investigated applying OED DOIs to the OED lexemes we link to in the Historical Thesaurus. Each OED sense now has its own DOI that we can get access to, and I was sent a spreadsheet containing several thousand as an example. The idea is that links from the HT’s lexemes to the OED would be updated to use these DOIs rather than performing a search of the OED for the word, which is what currently happens.
After a few hours of research I reckoned it would be possible to apply the DOIs to the HT data, but there are some things that we’ll need to consider. The OED spreadsheet looks like it will contain every sense and the HT data does not, so much of the spreadsheet will likely not match anything in our system. I wrote a little script to check the spreadsheet against the HT’s OED lexeme table and 6186 rows in the spreadsheet match one (or more) lexeme in the database table while 7256 don’t. I also noted that the combination of entry_id and element_id (in our database called refentry and refid) is not necessarily unique in the HT’s OED lexeme table. This can happen if a word appears in multiple categories, plus there is a further ID called ‘lemmaid’ that was sometimes used to differentiate specific lexemes in combination with the other two IDs. In the spreadsheet there are 1180 rows that match multiple rows in the HT’s OED lexeme table. However, this isn’t a problem either, as it usually just means a word appears in multiple categories and the same DOI would apply to multiple lexemes.
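A check of this kind could look something like the following Python sketch. It is not the script I actually ran: the function name and the shape of the input rows are assumptions, and in practice the rows would come from the spreadsheet and the database table rather than from lists in memory.

```python
from collections import defaultdict


def check_spreadsheet(sheet_rows, lexeme_rows):
    """Count spreadsheet rows that match no, one or several lexemes.

    sheet_rows:  iterable of (entry_id, element_id, doi)
    lexeme_rows: iterable of (lexeme_id, refentry, refid)
    """
    # (refentry, refid) is not unique, so each key maps to a list of IDs.
    lexemes_by_key = defaultdict(list)
    for lexeme_id, refentry, refid in lexeme_rows:
        lexemes_by_key[(refentry, refid)].append(lexeme_id)

    counts = {"matched": 0, "unmatched": 0, "multiple": 0}
    doi_to_lexemes = {}
    for entry_id, element_id, doi in sheet_rows:
        lexeme_ids = lexemes_by_key.get((entry_id, element_id), [])
        if not lexeme_ids:
            counts["unmatched"] += 1
            continue
        counts["matched"] += 1
        if len(lexeme_ids) > 1:
            # Same word in several categories: one DOI, several lexemes.
            counts["multiple"] += 1
        doi_to_lexemes[doi] = lexeme_ids
    return counts, doi_to_lexemes
```

The returned mapping from DOI to lexeme IDs is what an update script would then need in order to write the DOI to every matching row.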
What is potentially a problem is that we haven’t matched up all of the OED lexeme records with the HT lexeme records. While 6186 rows in the spreadsheet match one or more rows in the OED lexeme table, only 4425 rows in the spreadsheet match one or more rows in the HT’s lexeme table. We will not be able to update the links to switch to DOIs for any HT lexemes that aren’t matched to an OED lexeme. After checking I discovered that there are 87,713 non-OE lexemes in the HT lexeme table that are not linked to an OED lexeme. None of these will be able to have a DOI (and neither will the OE words, presumably).
Another potential problem is that the sense an HT lexeme is linked to is not necessarily the main sense for the OED lexeme. In such cases the DOI then leads to a section of the OED entry that is only accessible to logged in users of the OED site. An example from the spreadsheet is ‘aardvark’. Our HT lexeme links to entry_id 22, element_id 16201412, which has the DOI https://doi.org/10.1093/OED/1516256385 which when you’re not logged in displays a ‘Please purchase a subscription’ page. The other entry for ‘aardvark’ in the spreadsheet has entry_id 22 and element_id 16201390, which has the DOI https://doi.org/10.1093/OED/9531538482 which leads to the summary page, but the HT’s link will be the first DOI above and not the second. Note that currently we link to the search results on the OED site, which actually might be more useful for many people. ‘Aardvark’ as found here: https://ht.ac.uk/category/?type=search&qsearch=aardvark&page=1#id=39313 currently links to this OED page: https://www.oed.com/search/dictionary/?q=aard-vark
To summarise: I can update all lexemes in the HT’s OED lexeme table that match the entry_id and element_id columns in the spreadsheet to add in the relevant DOI. I can also then ensure that any HT lexeme records linked to these OED lexemes also feature the DOI, but this will apply to fewer lexemes due to there still being many HT lexemes that are not linked. I could then update the links through to the OED for these lexemes, but this might not actually work as well as the current link to search results due to many OED DOIs leading to restricted pages. I’ll need to hear back from the rest of the team before I can take this further.
Also this week I had a meeting with Pauline Mackay and Craig Lamont to discuss an interactive map of Burns’ correspondents. We’d discussed this about three years ago and they are now reaching a point where they would like to develop the map. We discussed various options for base maps, data categorisation and time sliders and I gave them a demonstration of the Books and Borrowing project’s Chamber’s library map, which I’d previously developed (https://borrowing.stir.ac.uk/chambers-library-map/). They were pretty impressed with this and thought it would be a good model for their map. Pauline and Craig are now going to work on some sample data to get me started, and once I receive this I’ll be able to begin development. We had our meeting in the café of the new ARC building, which I’d never been to before, so it was a good opportunity to see the place.
Also this week I fixed some issues with images for one of the library registers for the Royal High School for the Books and Borrowing project. These had been assigned the wrong ID in the spreadsheet I’d initially used to generate the data and I needed to write a little script to rectify this.
Finally, I had a chat with Joanna Kopaczyk about a potential project she’s putting together. I can’t say much about it at this stage, but I’ll probably be able to use the systems I developed last year for the Anglo-Norman Dictionary’s Textbase (see https://anglo-norman.net/textbase-browse/ and https://anglo-norman.net/textbase-search/). I’m meeting with Joanna to discuss this further next week.
I was back at work this week after a lovely two-week holiday (although I did spend a couple of hours making updates to the Speech Star website whilst I was away). After catching up with emails, getting back up to speed with where I’d left off and making a helpful new ‘to do’ list I got stuck into fixing the language tags in the Anglo-Norman Dictionary.
In June the editor Geert noticed that language tags had disappeared from the XML files of many entries. Further investigation by me revealed that this probably happened during the import of data into the new AND system and had affected entries up to and including the import of R; entries that were part of the subsequent import of S had their language tags intact. It is likely that the issue was caused by the script that assigns IDs and numbers to <sense> and <senseInfo> tags as part of the import process, as this script edits the XML. Further testing revealed that the updated import workflow that was developed for S retained all language tags, as does the script that processes single and batch XML uploads as part of the DMS. This means the error has been rectified, but we still need to fix the entries that have lost their language tags.
I was able to retrieve a version of the data as it existed prior to batch updates being applied to entry senses and from this I was able to extract the missing language tags for these entries. I was also able to run this extraction process on the R data as it existed prior to upload. I then ran the process on the live database to extract language tags from entries that featured them, for example entries uploaded during the import of S. The script was also adapted to extract the ‘certainty’ attribute from the tags if present. This was represented in the output as the number 50, separated from the language by a bar character (e.g. ‘Arabic|50’). Where an entry featured multiple language tags these were separated by a comma (e.g. ‘Latin,Hebrew’).
Geert made the decision that language tags, which were previously associated with specific senses or subsenses, should instead be associated with entries as a whole. This structural change will greatly simplify the reinstatement of missing tags and it will also make it easier to add language tags to entries that do not already feature them.
The language data that I compiled was stored in a spreadsheet featuring three columns: Slug: the unique form of a headword used in entry URLs; Live Langs: language tags extracted from the live database; Old Langs: language tags extracted from the data prior to processing. A fourth column was also added where manual overrides to the preceding two columns could be added by Geert. This column could also be used to add entries that did not previously have a language tag but needed one.
Two further issues were addressed at this stage. The first related to compound words, where the language applied to one part of the word. In the original data these were represented by combining the language with ‘A.F.’, for example ‘M.E._and_A.F.’. Continuing with this approach would make it more difficult to search for specific languages and the decision was made to only store the non-A.F. language with a note that the word is a compound. This was encoded in the spreadsheet with a bar character followed by ‘Y’. To make the data easier to process by machine, the compound character would always be the third part of the language data, whether or not certainty was present in the second part. For example ‘M.E.|50|Y’ represents a word that is possibly from M.E. and is a compound while ‘M.E.||Y’ represents a word that is definitely from M.E. and is a compound.
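The encoding can be unpicked with a few lines of code. The following Python sketch is only an illustration of the format described above, not the script used on the AND data; the `LanguageTag` type and the function name are made up for the example, and it assumes language names never contain a comma or a bar.

```python
from typing import List, NamedTuple, Optional


class LanguageTag(NamedTuple):
    lang: str
    cert: Optional[str]  # '50' when the language is uncertain, else None
    compound: bool       # True when the third part is 'Y'


def parse_languages(field: str) -> List[LanguageTag]:
    """Parse a value such as 'Latin|50,Greek|50' or 'M.E.||Y'."""
    tags = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split("|")
        # Pad to three parts so that 'Latin' and 'Arabic|50' also work.
        parts += [""] * (3 - len(parts))
        lang, cert, compound = parts[:3]
        tags.append(LanguageTag(lang, cert or None, compound == "Y"))
    return tags
```

Keeping the compound flag in a fixed third position is what makes the padding trick possible: the second part can simply be empty when there is no certainty value.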
The second issue to be addressed was how to handle entries that featured languages but whose language tags were not needed. In such cases Geert added the characters ‘$$’ to the fourth column.
The spreadsheet was edited by Geert and currently features 2741 entries that are to be updated. Each entry in the spreadsheet will be edited using the following workflow:
- All existing language tags in the entry will be deleted. These generally occur in senses or subsenses, but some entries feature them in the <head> element.
- If the entry has ‘$$’ in column 4 then no further updates will be made
- If there is other data in column 4 this will be used
- If there is no data in column 4 then data from column 2 will be used
- If there is no data in columns 4 or 2 then data from column 3 will be used
- Where there are multiple languages separated by a comma these will be split and treated separately.
- For each language the presence of a certainty value and / or a compound will be ascertained
- In the XML the new language tags will appear below the <head> tag.
- An entry will feature one language tag for each language specified
- The specific language will be stored in the ‘lang’ attribute
- Certainty (if present) will be stored in the ‘cert’ attribute which may only contain ‘50’ to represent ‘uncertain’.
- Compound (if present) will be stored in a new ‘compound’ attribute which may only contain ‘true’ to denote the word is a compound.
- For example, ‘Latin|50,Greek|50’ will be stored as two <language> tags beneath the <head> tag as follows: <language lang="Latin" cert="50" /><language lang="Greek" cert="50" /> while ‘M.E.||Y’ will be stored as: <language lang="M.E." compound="true" />
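The workflow above can be sketched in Python using the standard library's ElementTree. This is an illustration under several assumptions rather than the actual update script: the function names are hypothetical, <head> is assumed to be a direct child of the entry's root element, and any text that follows a deleted language tag is ignored.

```python
import xml.etree.ElementTree as ET

SKIP = "$$"  # marker in column 4 meaning no language tags are wanted


def choose_language_data(live, old, override):
    """Pick the source value: column 4 (override), then 2 (live), then 3 (old).

    Returns None when no new tags should be added.
    """
    override = override.strip()
    if override == SKIP:
        return None
    return override or live.strip() or old.strip() or None


def build_language_tags(field):
    """Turn e.g. 'Latin|50,Greek|50' into a list of <language> elements."""
    elements = []
    for item in field.split(","):
        parts = item.strip().split("|") + ["", ""]
        lang, cert, compound = parts[0], parts[1], parts[2]
        element = ET.Element("language", {"lang": lang})
        if cert:
            element.set("cert", cert)
        if compound == "Y":
            element.set("compound", "true")
        elements.append(element)
    return elements


def apply_to_entry(entry, field):
    """Delete all existing language tags, then add new ones after <head>."""
    for parent in list(entry.iter()):
        for child in list(parent):
            if child.tag == "language":
                parent.remove(child)  # note: the child's tail text is lost
    if field is None:
        return
    head = entry.find("head")
    position = list(entry).index(head) + 1
    for offset, element in enumerate(build_language_tags(field)):
        entry.insert(position + offset, element)


entry = ET.fromstring(
    '<main_entry><head>ris</head>'
    '<sense><language lang="Latin" /></sense></main_entry>'
)
apply_to_entry(entry, choose_language_data("", "Latin|50,Greek|50", ""))
print(ET.tostring(entry, encoding="unicode"))
```

Deleting first and adding afterwards mirrors the order of the steps in the list, which is why an entry marked ‘$$’ ends up with no language tags at all.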
I ran and tested the update on a local version of the data and the output was checked by Geert and me. After backing up the live database I then ran the update on it and all went well. The dictionary’s DTD also needed to be updated to ensure the new language tag can be positioned as an optional child element of the ‘main_entry’ element. The DTD was also updated to remove language as a child of ‘sense’, ‘subsense’ and ‘head’.
Previously the DTD had a limited list of languages that can appear in the ‘lang’ attribute, but I’m uncertain whether this ever worked as the XML definitely included languages that were not in the list. Instead I created a ‘picklist’ for languages that pulls its data from a list of languages stored in the online database. We use this approach for other things such as semantic labels so it was pretty easy to set up. I also added in the new optional ‘compound’ attribute.
With all of this in place I then updated the XSLT and some of the CSS in order to display the new language tags, which now appear as italicised text above any part of speech. For example, an entry with multiple languages, one of which is uncertain: https://anglo-norman.net/entry/ris_3 and an entry that’s a compound with another language: https://anglo-norman.net/entry/rofgable. Eventually I will update the site further to enable searches for language tags, but this will come at a later date.
Also this week I spent a bit of time in email conversations with the Dictionaries of the Scots Language people, discussing updates to bibliographical entries, the new part of speech system, DOST citation dates that were later than 1700 and making further tweaks to my requirements document for the date and part of speech searches based on feedback received from the team. We’re all in agreement about how the new feature will work now, which means I’ll be able to get started on the development next week, all being well.
I also gave some advice to Gavin Miller about a new proposal he’s currently putting together, helped out Matthew Creasy with the website for his James Joyce Symposium, spoke to Craig Lamont about the Burns correspondents project and checked how the stats are working on sites that were moved to our newer server a while back (all thankfully seems to be working fine).
I spent the remainder of the week implementing a ‘cite this page’ feature for the Books and Borrowing project, and the feature now appears on every page that features data. A ‘Cite this page’ button appears in the right-hand corner of the page title. Pressing the button brings up a pop-up containing citation options in a variety of styles. I’ve taken this from other projects I’ve been involved with (e.g. the Historical Thesaurus) and we might want to tweak it, but at the moment something along the lines of the following is displayed (full URL crudely ‘redacted’ as the site isn’t live yet):
Developing this feature has taken a bit of time due to the huge variation in the text that describes the page. This can also make the citation rather long, for example:
Advanced search for ‘Borrower occupation: Arts and Letters, Borrower occupation: Author, Borrower occupation: Curator, Borrower occupation: Librarian, Borrower occupation: Musician, Borrower occupation: Painter/Limner, Borrower occupation: Poet, Borrower gender: Female, Author gender: Female’. 2023. In Books and Borrowing: An Analysis of Scottish Borrowers’ Registers, 1750-1830. University of Stirling. Retrieved 18 August 2023, from [very long URL goes here]
I haven’t included a description of selected filters and ‘order by’ options, but these are present in the URL. I may add filters and orders to the description, or we can just leave it as it is and let people tweak their citation text if they want.
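Putting a citation like this together is mostly string formatting. Below is a minimal Python sketch modelled on the example above; it is not the site's actual code. The function name is hypothetical, the year is assumed to be the year the page was accessed, and the month name relies on the default locale being English.

```python
from datetime import date

PROJECT = ("Books and Borrowing: An Analysis of Scottish Borrowers’ "
           "Registers, 1750-1830")
PUBLISHER = "University of Stirling"


def cite_page(description, url, accessed=None):
    """Build a citation string for a page, given its description and URL."""
    accessed = accessed or date.today()
    # Day without a leading zero, followed by month name and year.
    retrieved = f"{accessed.day} {accessed.strftime('%B %Y')}"
    return (f"{description}. {accessed.year}. In {PROJECT}. {PUBLISHER}. "
            f"Retrieved {retrieved}, from {url}")


print(cite_page("Advanced search for ‘Borrower gender: Female’",
                "https://example.org/search",  # placeholder URL
                date(2023, 8, 18)))
```

The hard part in practice is generating the description, as it varies so much from page to page; the formatting itself stays the same.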
The ‘cite this page’ button appears on all pages that feature data, not just the search results: for example, register pages and the list of book editions. Hopefully the feature will be useful once the site goes live.