This was something of an odd week as I tested positive for Covid. I’m not entirely sure how I managed to get it, but I’d noticed on Friday last week that I’d lost my sense of taste and thought it would be sensible to get tested, and the result came back positive. I’d been feeling a bit under the weather last week and this continued throughout this week too, but thankfully the virus never affected my chest or throat and I managed to more or less work all week. However, with our household in full-on isolation our son was off school all week, and will be all next week, which did impact on the work I could do.
My biggest task of the week was to complete the work in preparation for the launch of the second edition of the Historical Thesaurus. This included fixing the full-size timelines to ensure that words that have been updated to have post-1945 end dates display properly. As we had changed the way these were stored to record the actual end date rather than ‘9999’, the end points of the dates on the timeline were stopping short and not having a pointy end to signify ‘current’. New words that only had post-1999 dates were also not displaying properly. Thankfully I managed to get these issues sorted. I also updated the search terms to fix some of the unusual characters that had not migrated over properly but had been replaced by question marks. I then updated the advanced search options to provide two checkboxes to allow a user to limit their search to new words or words that have been updated (or both), which is quite handy, as it means you can find out all of the new words in a particular decade, for example all of the new words that have a first date some time in the 1980s.
I also tweaked the text that appears beside the links to the OED and added the Thematic Heading codes to the drop-down section of the main category. We also had to do some last-minute renumbering of categories, which affected several hundred categories and subcategories in ‘01.02’, and we manually moved a couple of other categories to new locations. After that we were all set for the launch. The new second edition is now fully available, as you can see from the above link.
Other than that I worked on a few other projects this week. I helped to migrate a WordPress site for Bryony Randall’s Imprints of New Modernist Editing project, which is now available here: https://imprintsarteditingmodernism.glasgow.ac.uk/ and responded to a query about software purchased from Lisa Kelly in TFTS.
I spent the rest of the week continuing with the redevelopment of the Anglo-Norman Dictionary website. I updated my script that extracts citations and their dates, which I’d started to work on last week. I figured out why my script was not extracting all citations (it was only picking out the citations from the first sense and subsense in each entry rather than all senses) and managed to get all citations out. With dates extracted for each entry I was then able to store the earliest date for each entry and update the ‘browse’ facility to display this date alongside the headword.
With this in place I moved on to looking at the advanced search options. I created the tab-based interface for the various advanced search options and implemented searches for headwords and citations. The headword search works in a very similar way to the quick search – you can enter a term and use wildcards or double quotes for an exact search. You can also combine this with a date search. This allows you to limit your results to only those entries that have a citation in the year or range of years you specify. I would imagine entering a range of years would be more useful than a single year. You can also omit the headword and just specify a citation year to find all entries with a citation in the year or range, e.g. all entries with a citation in 1210.
The citation search is also in place and this works rather differently. As mentioned in the overview document, this works in a similar (but not identical) way to the old ‘concordance search of citations’. You can search for a word or a part of a word using the same wildcards as for the headword and limit your search to particular citation dates. When you submit the search this then loads an intermediary page that lists all of the word forms in citations that your search matches, plus a count of the number of citations each form is in. From this page you can then select a specific form and view the results. So, for example, a search for words beginning with ‘tre’ with a citation date between 1200 and 1250 lists 331 forms in citations, and you can then choose a specific form, e.g. ‘tref’, to see the results. The citation results include all of the citations for an entry that include the word, with the word highlighted in yellow. I still need to think about how this might work better, as currently there is no quick way to get back to the intermediary list of forms. But progress is being made.
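The intermediary list of forms can be sketched roughly as follows. This is only an illustration of the idea, not the code behind the site: the wildcard characters (‘*’ for any run of characters, ‘?’ for a single character), the function names, the dictionary keys and the sample citations are all assumptions made up for the example.

```python
import re
from collections import Counter


def wildcard_to_regex(term):
    """Turn a search term into a regex (assumed wildcards: * and ?)."""
    pattern = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in term
    )
    return re.compile(pattern, re.IGNORECASE)


def matching_forms(citations, term, year_from, year_to):
    """Count how many citations each matching word form appears in."""
    pattern = wildcard_to_regex(term)
    counts = Counter()
    for citation in citations:
        if not (year_from <= citation["year"] <= year_to):
            continue
        # A set is used so a form repeated within one citation counts once.
        forms = {
            word.lower()
            for word in re.findall(r"\w+", citation["text"])
            if pattern.fullmatch(word)
        }
        counts.update(forms)
    return counts


# Made-up placeholder citations, purely to exercise the function.
sample_citations = [
    {"year": 1210, "text": "tref tref tres"},
    {"year": 1240, "text": "le tref"},
    {"year": 1300, "text": "tref"},
]
form_counts = matching_forms(sample_citations, "tre*", 1200, 1250)
```

Each key in `form_counts` would be a row on the intermediary page, with the count shown beside it.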
I was back at work this week after having a lovely holiday the previous week. It was a pretty busy week, mostly spent continuing to work on the preparations for the second edition of the Historical Thesaurus, which needs to be launched before the end of the month. I updated the OED date extraction script that formats all of the OED dates as we need them in the HT, including making full entries in the HT dates table, generating the ‘full date’ text string that gets displayed on the website and generating cached first and last dates that are used for searching. I’d somehow managed to get the plus and dash connectors the wrong way round in my previous version of the script (a plus should be used where there is a gap of more than 150 years, otherwise it’s a dash) so I fixed this. I also stripped out dates that were within a 150 year time span, which really helped to make the full date text more readable. I also updated the category browser so that the category’s thematic heading is displayed in the drop-down section.
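The corrected connector rule is simple enough to show as a small sketch. The function names are made up for illustration, the treatment of a gap of exactly 150 years (a dash here) is an assumption, and the joined string is a simplification rather than the real ‘full date’ format.

```python
GAP_THRESHOLD = 150  # years, as described above


def connector(earlier, later):
    """Plus for a gap of more than 150 years, otherwise a dash."""
    return "+" if later - earlier > GAP_THRESHOLD else "-"


def join_years(years):
    """Join a chronological list of years with the appropriate connectors."""
    if not years:
        return ""
    parts = [str(years[0])]
    for previous, following in zip(years, years[1:]):
        parts.append(connector(previous, following))
        parts.append(str(following))
    return "".join(parts)


example = join_years([1300, 1350, 1600])
```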
Fraser had made some suggested changes to the script I’d written to figure out whether an OED lexeme was new or already in the system so I made some changes to this and regenerated the output. I also made further tweaks to the date extraction script so that we record the actual final date in the system rather than converting it to ‘9999’ and losing this information that will no doubt be useful in future. I then worked on the post-1999 lexemes, which followed a similar set of processes.
With this all in place I could then run a script that would actually import the new lexemes and their associated dates into the HT database. This included changelog codes, new search terms and new dates (cached firstdate and lastdate, fulldate and individual entries in the dates table). A total of 11116 new words were added, although I subsequently noticed there were a few duplicates that had slipped through the net. With these stripped out we had a total of 804,830 lexemes in the HT, and it’s great to have broken through the 800,000 mark. Next week I need to fix a few things (e.g. the fullsize timelines aren’t set up to cope with post-1945 dates that don’t end in ‘9999’ if they’re current) but we’re mostly now good to launch the second edition.
Also this week I worked on setting up a website for the ‘Symposium for Seventeenth-Century Scottish Literature’ for Roslyn Potter in Scottish Literature and set up a subdomain for an art catalogue website for Bryony Randall’s ‘Imprints of the New Modernist Editing’ project. I also helped Megan Coyer out with an issue she was having in transcribing multi-line brackets in Word and travelled to the University to collect a new, higher-resolution monitor and some other peripherals to make working from home more pleasant. I also fixed a couple of bugs in the Books and Borrowing CMS, including one that was resulting in BC dates of birth and death for authors being lost when data was edited. I also spent some time thinking about the structure for the Burns Correspondence system for Pauline Mackay, resulting in a long email with a proposed database structure. I met with Thomas Clancy and Alasdair Whyte to discuss the CMS for the Iona place-names project (it now looks like this is going to have to be a completely separate system from Alasdair’s existing Mull / Ulva system) and replied to Simon Taylor about a query he had regarding the Place-names of Fife data.
I also found some time to continue with the redevelopment of the Anglo-Norman Dictionary website. I updated the way cognate references were processed to enable multiple links to be displayed for each dictionary. I also added in a ‘Cite this entry’ button, which now appears in the top right of the entry and when clicked on opens a pop-up where citation styles will appear (they’re not there yet). I updated the left-hand panel to make it ‘sticky’: if you scroll down a long entry the panel stays visible on screen (unless you’re viewing on a narrow screen like a mobile phone, in which case the left-hand panel appears full-width before the entry). I also added in a top bar that appears when you scroll down the screen that contains the site title, the entry headword and the ‘cite’ button. I then began working on extracting the citations, including their dates and text, which will be used for search purposes. I ran an extraction script that extracted about 60,0000 citations, but I realised that this was not extracting all of the citations and further work will be required to get this right next week.
This was a four-day week for me as I’d taken Friday off as it was an in-service day at my son’s school before next week’s half-term, which I’ve also taken off. I had rather a lot to try and get done before my holiday so it was a pretty intense week, split mostly between the Historical Thesaurus and the Anglo-Norman Dictionary.
For the Historical Thesaurus I continued with the preparations for the second edition, starting off by creating a little statistics page that lists all of the words and categories that have been updated for the second edition and the changelog codes that have been applied to them. Marc had sent a list of all of the category number sequences that we have updated so I then spent a bit of time updating the database to apply the changelog codes to all of these categories. It turns out that almost 200,000 categories have been revised and relocated (out of about 235,000) so it’s pretty much everything. At our meeting last week we had proposed updating the ‘new OED words’ script I’d written last week to separate out some potential candidates into an eighth spreadsheet (these are words that have a slash in them, which now get split up on the slash and each part is compared against the HT’s search words table to see whether they already exist). Whilst working through some of the other tasks I realised that I hadn’t included the unique identifiers for OED lexemes in the output, which was going to make it a bit difficult to work with the files programmatically, especially since there are some occasions where the OED has two identical lexemes in a category. I therefore updated my script and regenerated the output to include the lexeme ID, making it possible to differentiate identical lexemes and also making it easier to grab dates for the lexeme in question.
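As a rough illustration of the slash check mentioned above (the eighth spreadsheet), here is a minimal sketch. The function name and the idea of holding the search words in a Python set are assumptions for the example, and the sample words are invented.

```python
def slash_parts_in_ht(oed_lexeme, ht_search_words):
    """Split a lexeme on slashes and report which parts the HT already has."""
    parts = [part.strip().lower() for part in oed_lexeme.split("/") if part.strip()]
    return {part: part in ht_search_words for part in parts}


# Invented sample data: a stand-in for the HT search words table.
known_words = {"grey"}
result = slash_parts_in_ht("grey/gray", known_words)
```

A lexeme where at least one part comes back as already present would be a candidate for the extra spreadsheet rather than being treated as definitely new.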
The issue of there being multiple identical lexemes in an OED category was a strange one. For example, one category had two ‘Amber pudding’ lexemes. I wrote a script that extracted all of these duplicates and there are possibly a hundred or so of them, and also other OED lexemes that appear to have no associated dates. I passed these over to Marc and Fraser for them to have a look at. After that I worked on a script to go through each of the almost 12,000 lexemes that we have identified as OED lexemes that are definitely not present in the HT data, extract their OED dates and then format these as HT dates.
The script generates date entries as they would be added to the HT lexeme dates table (used for timelines), the HT fulldate field (used for display) and the HT firstdate and lastdate fields (used for searching). Dates earlier than 1150 are stored as their actual values in the dates table, but are stored as ‘650’ in the ‘firstdate’ field and are displayed as ‘OE’ in the ‘fulldate’. Dates after 1945 are stored as ‘9999’ in both the dates table and the ‘lastdate’ field. Where there is a ‘yearend’ in the OED date (i.e. the date is a range) this is stored as the ‘year_b’ in the HT date and appears after a slash in the ‘fulldate’, following the rules for slashes. If the date is the last date then the ‘year_b’ is used as the HT lastdate. If the ‘year_b’ is after 1945 but the ‘year’ isn’t then ‘9999’ is used. So for example ‘maiden-skate’ has a last date of ‘1880/1884’, which appears in the ‘fulldate’ as ‘1880/4’ and the ‘lastdate’ is ‘1884’. Where there is a gap of more than 150 years between dates the connector between dates is a dash and where the gap is less than this it is a plus. One thing that needed further work was how we handle multiple post-1945 dates. In my initial script if there are multiple post-1945 dates then only one of these is carried over as an HT date, and it’s set to ‘9999’. This is because all post-1945 dates are stored as ‘9999’ and having several of these didn’t seem to make sense and confused the generation of the fulldate. There was also an issue with some OED lexemes only having dates after 1945. In my first version of the script these ended up with only one HT date entry of 9999 and 9999 as both firstdate and lastdate, and a fulldate consisting of just a dash, which was not right. After further discussion with Marc I updated the script so that in such cases the date information that is carried over is the first date (even if it’s after 1945) and a dash to show that it is current.
For example, ‘ecoregion’ previously had a ‘full date’ of ‘-’, one HT date of ‘9999’ and a start date of ‘9999’ and in the updated output has a full date of ‘1962-’, two HT dates and a start date of 1962. Where a lexeme has a single date this also now has a specific end date rather than it being ‘9999’. I passed the output of the script over to Marc and Fraser for them to work with whilst I was on holiday.
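The caching rules for the search fields, as they stood in this version of the script, could be sketched like this. The function and constant names are made up for illustration, and the sketch deliberately leaves out the slash abbreviation and the connectors, since the full rules for those aren’t spelled out here.

```python
OE_CUTOFF = 1150        # dates earlier than this are treated as Old English
OE_FIRSTDATE = 650      # cached 'firstdate' value used for OE dates
CURRENT_CUTOFF = 1945   # dates after this are treated as current
CURRENT = 9999          # cached value used to mean 'current'


def cached_firstdate(year):
    """Value for the 'firstdate' search field."""
    return OE_FIRSTDATE if year < OE_CUTOFF else year


def cached_lastdate(year, year_b=None):
    """Value for the 'lastdate' search field; year_b is the end of a range."""
    end = year_b if year_b is not None else year
    return CURRENT if end > CURRENT_CUTOFF else end


def display_first_year(year):
    """How a first date is shown at the start of the 'fulldate'."""
    return "OE" if year < OE_CUTOFF else str(year)
```

With these rules the ‘maiden-skate’ example gets a lastdate of 1884, and a first date of 1962 is kept as 1962 rather than being turned into 9999.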
For the Anglo-Norman Dictionary I continued to work on the entry page. I added in the cognate references (i.e. references to other dictionaries), which proved to be rather tricky due to the way they have been structured in the Editors’ XML files (in the current live site the cognate references are stored in a separate hash file and are somehow injected into the entry page when it is generated, but we wanted to rationalise this so that the data that appears on the site is all contained in the Editors’ XML where possible). The main issue was with how the links to other online dictionaries were stored, as it was not entirely clear how to generate actual links to specific pages in these resources from them. This was especially true for links to FEW (I have no idea what FEW stands for as the acronym doesn’t appear to be expanded anywhere, even on the FEW website).
They appear in the Editors’ XML like this:
<FEW_refs siglum="FEW" linkable="yes"><link_form>A</link_form><link_loc>24,1a</link_loc></FEW_refs>
Which ends up linking to here:
<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>
Which ends up linking to here:
Based on this my script for generating links needed to:
- Store the base URL https://apps.atilf.fr/lecteurFEW/lire/volume
- Split the <link_loc> on the comma
- multiply the part before the comma by 10 (so 24 becomes 240, 9 becomes 90 etc)
- strip out any non-numeric character from the part after the comma (i.e. getting rid of ‘a’ and ‘b’)
- generate the full URL, such as https://apps.atilf.fr/lecteurFEW/lire/volume/240/page/1 using these two values.
After discussion with Heather and Geert at the AND it turned out to be even more complicated than this, as some of the references are further split into subvolumes using a slash and a Roman numeral, so we have things like ‘15/i,108b’ which then needs to link to https://apps.atilf.fr/lecteurFEW/lire/volume/151/page/108. It took some time to write a script that could cope with all of these quirks, but I got there in the end.
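The steps above, including the subvolume quirk, can be sketched roughly as follows. This is an illustration rather than the script I actually wrote: the function name is made up, and the mapping of Roman numerals to digits is an assumption beyond ‘i’, which is the only case confirmed by the example above.

```python
import re

FEW_BASE_URL = "https://apps.atilf.fr/lecteurFEW/lire/volume"

# Only 'i' -> 1 is confirmed by the example; 'ii' and 'iii' are assumed.
SUBVOLUME_DIGITS = {"i": 1, "ii": 2, "iii": 3}


def few_url(link_loc):
    """Build a FEW reader URL from a <link_loc> value such as '24,1a'."""
    volume_part, page_part = link_loc.split(",", 1)
    if "/" in volume_part:
        # e.g. '15/i' -> 15 * 10 + 1 = 151
        volume, subvolume = volume_part.split("/", 1)
        volume_number = int(volume) * 10 + SUBVOLUME_DIGITS[subvolume.strip().lower()]
    else:
        # e.g. '24' -> 240
        volume_number = int(volume_part) * 10
    # Strip anything non-numeric from the page, e.g. '108b' -> '108'
    page = re.sub(r"\D", "", page_part)
    return f"{FEW_BASE_URL}/{volume_number}/page/{page}"
```

A subvolume numeral that isn’t in the mapping raises a `KeyError`, which seems safer than silently generating a link to the wrong volume.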
Also this week I updated the citation dates so they now display their full information with ‘MS:’ where required and superscript text. I then finished work on the commentaries, adding in all of the required formatting (bold, italic, superscript etc) and links to other AND entries and out to other dictionaries. Where the commentaries are longer than a few lines they are cut off and an ‘expand’ button is shown. I also updated the ‘Results’ tab so it shows you the number of results in the tab header and have added in the ‘entry log’ feature that tracks which entries you have looked at in a session. The number of these also appears in the tab header and I’m personally finding it a very useful feature as I navigate around the entries for test purposes. The log entries appear in the order you opened them and there is no scrolling of entries as I would imagine most people are unlikely to have more than 20 or so listed. You can always clear the log by pressing on the ‘Clear’ button. I also updated the entry page so that the cross references in the ‘browse’ now work. If the entry has a single cross reference then this is automatically displayed when you click on its headword in the ‘browse’, with a note at the top of the page stating it’s a cross reference. If the entry has multiple cross references these are not all displayed but instead links to each entry are displayed. There are two reasons for this: Firstly, displaying multiple entries can result in long and complicated pages that may be hard to navigate; secondly, the entry page as it currently stands was designed to display one entry, and uses HTML IDs to identify certain elements. An HTML ID must be unique on a page so if multiple entries were displayed things would break. There is still a lot of work to do on the site, but the entry page is at least nearing completion. Below is a screenshot showing the entry log, the cognate references and the commentary complete with formatting and the ‘Expand’ option:
I did also work on some other projects this week as well. For Books and Borrowing I set up a user account for a volunteer and talked her through getting access to the system. For the Mull / Ulva site I automatically generated historical forms for all of the place-names that had come from the GB1900 crowdsourced data. These are now associated with the ‘OS 6 inch 2nd edn’ source and about 1670 names have been updated, although many of these are abbreviations like ‘F.P.’. I also updated the database and the CMS to incorporate a new field for deciding which ‘front-end’ the place-name will be displayed on. This is a drop-down list that can be selected when adding or editing a place-name, allowing you to choose from ‘Both’, ‘Mull / Ulva only’ and ‘Iona only’. There is still a further option for stating whether the place-name appears on the website or not (‘on website: Y/N’) so it will be possible to state that a place-name is associated with one project but shouldn’t appear on that project’s website. I also updated the search option on the ‘Browse placenames’ page to allow a user to limit the displayed placenames to those that have ‘front-end display’ set to one of the options. Currently all place-names are set to ‘Mull / Ulva only’. With this all in place I then created user accounts for the CMS for all of the members of the Iona project team who will be using this CMS to work with the data. I also made a few further tweaks to the search results page of the DSL. After all of this I was very glad to get away for a holiday.
I was off on Monday this week for the September Weekend holiday. My four working days were split across many different projects, but the main ones were the Historical Thesaurus and the Anglo-Norman Dictionary.
For the HT I continued with the preparations for the second edition. I updated the front-end so that multiple changelog items are now checked for and displayed (these are the little tooltips that say whether a lexeme’s dates have been updated in the second edition). Previously only one changelog was being displayed but this approach wasn’t sufficient as a lexeme may have a changed start and end date. I also fixed a bug in the assigning of the ‘end date verified as after 1945’ code, which was being applied to some lexemes with much earlier end dates. My script set the type to 3 in all cases where the last HT date was 9999. What it needed to do was to only set it to type 3 if the last HT date was 9999 and the last OED date was after 1945. I wrote a little script to fix this, which affected about 7,400 lexemes.
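The corrected condition boils down to something like the following sketch. The function and constant names are invented for illustration, and returning `None` when the code doesn’t apply is an assumption about how ‘no code’ would be represented.

```python
CURRENT = 9999             # HT value meaning 'still current'
VERIFIED_AFTER_1945 = 3    # the 'end date verified as after 1945' type


def end_date_changelog_type(last_ht_date, last_oed_date):
    """Return type 3 only when both conditions hold, otherwise None."""
    if (
        last_ht_date == CURRENT
        and last_oed_date is not None
        and last_oed_date > 1945
    ):
        return VERIFIED_AFTER_1945
    return None
```

The buggy version only checked the first condition, so a lexeme with an HT last date of 9999 but a much earlier OED last date was wrongly given type 3.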
I also wrote a script to check off a bunch of HT and OED categories that had been manually matched by an RA. I needed to make a few tweaks to the script after testing it out, but after running it on the data we had a further 846 categories matched up, which is great. Fraser had previously worked on a document listing a set of criteria for working out whether an OED lexeme was ‘new’ or not (i.e. unlinked to an HT lexeme). This was a pretty complicated document with many different stages, with the results of the various stages needing to be output to seven different spreadsheets, and it took quite a long time to write and test a script that would handle all of these stages. However, I managed to complete work on it and after a while it finished executing and resulted in the 7 CSV files, one for each code mentioned in the document. I was very glad that I had my new PC as I’m not sure my old one could have coped with it. For the Levenshtein tests, for example, every word in the HT had to be stored in memory throughout the script’s execution. On Friday I had a meeting with Marc and Fraser where we discussed the progress we’d been making and further tweaks to the script were proposed that I’ll need to implement next week.
For the Anglo-Norman Dictionary I continued to work on the ‘Entry’ page, implementing a mixture of major features and minor tweaks. I updated the way the editor’s initials were being displayed as previously these were the initials of the editor who made the most recent update in the changelog where what was needed were the initials of the person who created the record, contained in the ‘lead’ attribute of the main entry. I also attempted to fix an issue with references in the entry that were set to ‘YBB’. Unlike other references, these were not in the data I had as they were handled differently. I thought I’d managed to fix this, but it looks like ‘YBB’ is used to refer to many different sources so can’t be trusted to be a unique identifier. This is going to need further work.
Minor tweaks included changing the font colour of labels, making the ‘See Also’ header bigger and clearer, removing the final semi-colon from lists of items, adding in line breaks between parts of speech in the summary and other such things. I then spent quite a while integrating the commentaries. These were another thing that weren’t properly integrated with the entries but were added in as some sort of hack. I decided it would be better to have them as part of the editors’ XML rather than attempting to inject them into the entries when they were requested for display. I managed to find the commentaries in another hash file and thankfully managed to extract the XML from this using the Python script I’d previously written for the main entry hash file. I then wrote a script that identified which entry the commentary referred to, retrieved the entry and then inserted the commentary XML into the middle of it (underneath the closing </head> element).
It took somewhat longer than I expected to integrate the data as some of the commentaries contained Greek, and the underlying database was not set up to handle multi-byte UTF-8 characters (which Greek characters are), meaning these commentaries could not be added to the database. I needed to change the structure of the database and re-import all of the data, as simply changing the character encoding of the columns gave errors. I managed to complete this process and import the commentaries and then begin the process of making them appear in the front-end. I still haven’t completely finished this (no formatting or links in the commentaries are working yet) and I’ll need to continue with this next week.
Also this week I added numbers to the senses. This also involved updating the editor’s XML to add a new ‘n’ attribute to the <sense> tag, e.g. <sense id="AND-201-47B626E6-486659E6-805E33CE-A914EB1F-S001" n="1">. As with the current site, the sense numbers reset to 1 when a new part of speech begins. I also ensured that [sic] now appears, as does the language tag, with a question mark if the ‘cert’ attribute is present and not 100. Uncertain parts of speech are also now visible (again if ‘cert’ is present and not 100). I increased the font size of the variant forms, and citation dates are now visible. There is still a huge amount of work to do, but progress is definitely being made.
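The numbering logic can be sketched as below. This is a simplification that assumes the senses have already been pulled out of the XML as (part of speech, sense ID) pairs in document order; the function name and the sample values are made up, and the real structure of the entries is more involved.

```python
def number_senses(senses):
    """Assign an 'n' to each sense, restarting at 1 for each part of speech.

    `senses` is a list of (part_of_speech, sense_id) pairs in document order.
    """
    numbered = []
    current_pos = None
    n = 0
    for pos, sense_id in senses:
        if pos != current_pos:
            # A new part of speech has begun, so the numbering starts again.
            current_pos = pos
            n = 0
        n += 1
        numbered.append((sense_id, n))
    return numbered


# Invented sample: two senses under one part of speech, then one under another.
sample = [("s.", "S001"), ("s.", "S002"), ("v.", "S003")]
numbered = number_senses(sample)
```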
Also this week I reviewed the transcriptions from a private library that we are hoping to incorporate into the Books and Borrowing project and tweaked the way ‘additional fields’ are stored to enable the RAs to enter HTML characters into them. I also created a spreadsheet template for recording the correspondence of Robert Burns for Craig Lamont and spoke to Eila Williamson about the design of the new Names Studies website. I updated the text on the homepage of this site, which Lorna Hughes sent me, and gave some advice to Luis Gomes about a data management plan he is preparing. I also updated the wording on the search results page for ‘V3’ of the DSL to bring it into line with ‘V2’ and participated in a Zoom call for the Iona project where we discussed the new website and images that might be used in the design.
I’d taken Thursday and Friday off this week as it was the Glasgow September Weekend holiday, meaning this was a three-day week for me. It was a week where focussing on any development tasks was rather tricky as I had four Zoom calls and a dentist’s appointment on the other side of the city during my three working days.
On Monday I had a call with the Historical Thesaurus people to discuss the ongoing task of integrating content from the OED for the second edition. There’s still rather a lot to be done for this, and we’re needing to get it all complete during October, so things are a little stressful. After the meeting I made some further updates to the display of icons signifying a second edition update. I updated the database and front-end to allow categories / subcats to have a changelog (in addition to words). These appear in a transparent circle with a white border and a white number, right aligned. I also updated the display of the icon for words. These also appear as a transparent circle, right aligned, but have the teal colour for a border and the number. I also realised I hadn’t added in the icons for words in subcats, so put these in place too.
After that I set about updating the dates of HT lexemes based on some rules that Fraser had developed. I created and ran scripts that updated the start dates of 91,364 lexemes based on OED dates and then ran a further script that updated the end dates of 157,156 lexemes. These took quite a while to run (the latter I was dealing with during my time off) but it’s good that progress is being made.
My second Zoom call of the week was for the Books and Borrowing project, and was with the project PI and Co-I and someone who is transcribing library records from a private library that we’re now intending to incorporate into the project’s system. We discussed the data and the library and made a plan for how we’re going to work with the data in future. My third and fourth Zoom calls were for the new Place-names of Iona project that is just starting up. It was a good opportunity to meet the rest of the project team (other than the RA who has yet to be appointed) and discuss how and when tasks will be completed. We’ve decided that we’ll use the same content management system as the one I already set up for the Mull and Ulva project, as this already includes Iona data from the GB1900 project. I’ll need to update the system so that we can differentiate place-names that should only appear on the Iona front-end, the Mull and Ulva front-end or both. This is because for Iona we are going to be going into much more detail, down to individual monuments and other ‘microtoponyms’, whereas the names in the Mull and Ulva project are much more high level.
For the rest of my available time this week I made some further updates to the script I wrote last week for Fraser’s Scots Thesaurus project, ordering the results by part of speech and ensuring that hyphenated words are properly searched for (as opposed to being split into separate words joined by an ‘or’). I also spent some time working for the DSL people, firstly updating the text on the search results page and secondly tracking down the certificate for the Android version of the School Dictionary app. This was located on my PC at work, so I had arranged to get access to my office whilst I was already in the West End for my dentist’s appointment. Unfortunately what I thought was the right file turned out to be the certificate for an earlier version of the app, meaning I had to travel all the way back to my office again later in the week (when I was on holiday) to find the correct file.
I also managed to find a little time to continue to work on the new Anglo-Norman Dictionary site, continuing to work on the display of the ‘entry’ page. I updated my XSLT to ensure that ‘parglosses’ are visible and that cross reference links now appear. Explanatory labels are also now in place. These currently appear with a grey background but eventually these will be links to the label search results page. Semantic labels are also now in place and also currently have a grey background but will be links through to search results. However, the System XML notes whether certain semantic labels should be shown or not. So, for example <label type="sem" show="no">med.</label> doesn’t get shown. Unfortunately there is nothing comparable in the Editors’ XML (it’s just <semantic value="med."/>) so I can’t hide such labels. Finally, the initials of the editor who made the last update now appear in square brackets to the right of the end of the entry.
Also, my new PC was delivered on Thursday and I spent a lot of time over the weekend transferring all of my data and programs across from my old PC.
This was another busy week involving lots of projects. For the Books and Borrowing project I wrote an import script to process the Glasgow Professors borrowing records, comprising more than 7,000 rows in a spreadsheet. It was tricky to integrate this with the rest of the project’s data and it took about a day to write the necessary processing scripts. I can only run the scripts on the real data in the evening as I need to take the CMS offline to do so, otherwise changes made to the database whilst I’m integrating the data will be lost. Unfortunately it took three attempts to get the import to work properly. There are a few reasons why this data has been particularly tricky. Firstly, it needs to be integrated with existing Glasgow data, rather than being a ‘fresh’ upload to a new library. This caused some problems as my scripts that match up borrowing records and borrowers were getting confused with the existing Student borrowers. Secondly, the spreadsheet order was not in page order for each register: the order appears to have been ‘10r’, ‘10v’, then ‘11r’ etc, then after ‘19v’ came ‘1r’. This is presumably to do with Excel ordering numbers as text. I tried reordering on the ‘sort order’ column but this also ordered things weirdly (all the numbers beginning with 1, then all the numbers beginning with 2 etc). I tried changing the data type of this field to a number rather than text but that just resulted in Excel giving errors in all of the fields. What this meant was I needed to sort the data in my own script before I could use it (otherwise the ‘next’ and ‘previous’ page links would all have been wrong), and it took time to implement this. However, I got there in the end.
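The sorting problem is a classic case of needing a ‘natural’ sort. A minimal sketch of the idea follows, assuming page labels always consist of a number followed by ‘r’ (recto) or ‘v’ (verso); the function names are made up and this is not the import script itself.

```python
import re


def page_sort_key(page_label):
    """Sort key for labels like '10r': numeric page first, then r before v."""
    match = re.fullmatch(r"(\d+)([rv])", page_label.strip().lower())
    if match is None:
        raise ValueError(f"Unexpected page label: {page_label!r}")
    number, side = match.groups()
    return (int(number), 0 if side == "r" else 1)


def page_links(ordered_pages):
    """Map each page to its (previous, next) neighbours in the sorted order."""
    links = {}
    for index, page in enumerate(ordered_pages):
        previous_page = ordered_pages[index - 1] if index > 0 else None
        next_page = ordered_pages[index + 1] if index < len(ordered_pages) - 1 else None
        links[page] = (previous_page, next_page)
    return links


# The kind of order the spreadsheet arrived in.
pages = ["10r", "10v", "11r", "19v", "1r", "1v", "2r"]
ordered = sorted(pages, key=page_sort_key)
links = page_links(ordered)
```

Sorting on the numeric part as an integer is what stops ‘10r’ from landing before ‘1r’, and the ‘previous’ and ‘next’ links then fall out of the sorted list.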
I also continued working on the Historical Thesaurus database and front-end to allow us to use the new date fields and to enable us to keep track of what lexemes and categories had been updated in the new Second Edition. I have now fully migrated my second edition test site to using the new date system, including the advanced search for labels and both the ‘simple’ and ‘advanced’ date search. I have also now created the database structure for dealing with second edition updates. As we agreed at the last team meeting, the lexeme and category tables have been updated to each have two new fields: ‘last_updated’, which holds a human-readable date (YYYY-MM-DD) that will be automatically populated when rows are updated, and ‘changelogcode’, which holds the ID of the row in the new ‘changelog’ table that applies to the lexeme or category. This new table consists of an ID, a ‘type’ (lexeme or category) and the text of the changelog. I’ve created two changelogs for test purposes: ‘This word was antedated in the second edition’ and ‘This word was postdated in the second edition’. I’ve realised that this structure means only one changelog can be associated with a lexeme, with a new one overwriting the old one. A more robust system would record all of the changelogs that have been applied to a lexeme or category and the dates these were applied, and depending on what Marc and Fraser think I may update the system with an extra joining table that would allow this paper trail to be recorded.
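To make the joining-table idea a little more concrete, here is a rough sketch using an in-memory SQLite database. The table name `lexeme_changelog`, its column names, the lexeme ID and the dates are all hypothetical, and the real database may well use a different engine and structure; a parallel table would be needed for categories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE changelog (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,   -- 'lexeme' or 'category'
    text TEXT NOT NULL
);
-- Hypothetical joining table: one row per changelog applied to a lexeme,
-- so earlier changelogs are kept rather than overwritten.
CREATE TABLE lexeme_changelog (
    lexeme_id    INTEGER NOT NULL,
    changelog_id INTEGER NOT NULL REFERENCES changelog(id),
    applied      TEXT NOT NULL   -- YYYY-MM-DD
);
""")

conn.executemany(
    "INSERT INTO changelog (id, type, text) VALUES (?, ?, ?)",
    [
        (1, "lexeme", "This word was antedated in the second edition"),
        (2, "lexeme", "This word was postdated in the second edition"),
    ],
)
# Invented sample data: lexeme 1 receives both changelogs on different days.
conn.executemany(
    "INSERT INTO lexeme_changelog (lexeme_id, changelog_id, applied) VALUES (?, ?, ?)",
    [(1, 1, "2020-09-01"), (1, 2, "2020-10-15")],
)

def history(conn, lexeme_id):
    """Return every (date, changelog text) pair for a lexeme, oldest first."""
    rows = conn.execute(
        """SELECT lc.applied, c.text
           FROM lexeme_changelog AS lc
           JOIN changelog AS c ON c.id = lc.changelog_id
           WHERE lc.lexeme_id = ?
           ORDER BY lc.applied""",
        (lexeme_id,),
    )
    return rows.fetchall()

for applied, text in history(conn, 1):
    print(applied, text)
```

With this structure the existing ‘changelogcode’ field would no longer be strictly needed, although keeping it as a pointer to the most recent changelog might still be convenient for the front-end.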
For now I’ve updated two lexemes in category 1 to use the two changelogs for test purposes. I’ve updated the category browser in the front end to add in a ‘2’ in a circle where ‘second edition’ changelog IDs are present. These have tooltips that display the changelog text when hovered over, as the following screenshot demonstrates:
I haven’t added these circles to the search results yet or the full timeline visualisations, but it is likely that they will need to appear there too.
I also spent some time working on a new script for Fraser’s Scots Thesaurus project. This script allows a user to select an HT category to bring back all of the words contained in it. It then queries the DSL for each of these words and returns a list of those entries that contain at least two of the category’s words somewhere in the entry text. The script outputs the name of the category that was searched for, a list of returned HT words so you can see exactly what is being searched for, and the DSL entries that feature at least two of the words in a table that contains fields such as source dictionary, parts of speech, a link through to the DSL entry, headword etc. I may have to tweak this further next week, but it seems to be working pretty well.
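The core matching step, keeping only entries that contain at least two of the category’s words, could be sketched along the following lines. This is only an illustration: the function name, the whole-word matching rule and the sample words and entries are all assumptions, and the querying of the DSL itself is not reproduced here.

```python
import re

def entries_with_multiple_words(words, entries, minimum=2):
    """Return (entry_id, matched_words) for entries whose text contains
    at least `minimum` of the given words as whole words."""
    results = []
    for entry_id, text in entries.items():
        found = [
            word for word in words
            if re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
        ]
        if len(found) >= minimum:
            results.append((entry_id, found))
    return results

# Invented sample data standing in for an HT category and DSL entry text.
category_words = ["burn", "stream", "water"]
dsl_entries = {
    "e1": "A burn is a small stream.",
    "e2": "To burn wood on the fire.",
    "e3": "Water drawn from the stream.",
}

print(entries_with_multiple_words(category_words, dsl_entries))
# [('e1', ['burn', 'stream']), ('e3', ['stream', 'water'])]
```

Matching on whole words avoids counting ‘burn’ inside ‘burning’, which may or may not be what is wanted for Scots spelling variants.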
I spent most of the rest of the week working on the redevelopment of the Anglo-Norman Dictionary. We had a bit of a shock at the start of the week because the entire old site was offline and inaccessible. It turned out that the domain name subscription had expired, and thankfully it was possible to renew it and the site became available again. I spent a lot of time this week continuing to work on the entry page, trying to untangle the existing XSLT script and work out how to apply the necessary rules to the editors’ version of the XML, which differs from the system version of the XML that was generated by an incomprehensible and undocumented series of processes in the old system.
I started off with the references located within the variant forms. In the existing site these link through to source texts, with information appearing in a pop-up when the reference is clicked on. To get these working I needed to figure out where the list of source texts was being stored and also how to make the references appear properly. The Editors’ XML and the System XML differ in structure, and only the latter actually contains the text that appears as the link. So, for example, while the latter has:
<cit> <bibl siglum="Secr_waterford1" loc="94.787"><i>Secr</i> <sc>waterford</sc><sup>1</sup> 94.787</bibl> </cit>
The former only has:
<varref> <reference><source siglum="Secr_waterford1" target=""><loc>94.787</loc></source></reference> </varref>
This meant that the text to display and its formatting (<i>Secr</i> <sc>waterford</sc><sup>1</sup>) is not available to me. Thankfully I managed to track down an XML file that contained the list of texts, which contained this formatting and also all of the information that should appear in the pop-up that is opened when the link is clicked on, e.g.
<item id="Secr_waterford1" cits="552">
<siglum><i>Secr</i> <span class="sc">WATERFORD</span><sup>1</sup></siglum>
<bibl>Yela Schauwecker, <i>Die Diätetik nach dem ‘Secretum secretorum’ in der Version
von Jofroi de Waterford: Teiledition und lexikalische Untersuchung</i>, Würzburger medizinhistorische Forschungen 92, Würzburg, 2007
<date>c.1300 (text and MS)</date>
I then turned my attention to the cognate references section, and there were also some issues here with the Editors’ XML not including information that is in the System XML. The structure of the cognate references in the System XML is like this:
<xr_group type="cognate" linkable="yes"> <xr><ref siglum="FEW" target="90/page/231" loc="9,231b">posse</ref></xr> </xr_group>
Note that there is a ‘target’ attribute that provides a link. The Editors’ XML does not include this information. Here’s the same reference:
<FEW_refs siglum="FEW" linkable="yes"> <link_form>posse</link_form><link_loc>9,231b</link_loc> </FEW_refs>
There’s nothing in there that I can use to ascertain the correct link to add in. However, I have found a ‘hash’ file called ‘cognate_hash’ which, once extracted, turned out to contain a list of cognate references and targets. These don’t include entry identifiers so I’m not sure how they were connected to entries, but by combining the ‘siglum’ and the ‘loc’ it looks like it might be possible to find the target, e.g.:
<xr_group type="cognate" linkable="yes">
<ref siglum="FEW" target="90/page/231" loc="*9,231b">posse</ref>
I’m not sure why there’s an asterisk, though. I also found another hash file called ‘commentary_hash’ that I guess contains the commentaries that appear in some entries but not in their XML. We’ll probably need to figure out whether we want to properly integrate these with the editor’s XML as well.
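As a rough sketch of how the ‘siglum’ plus ‘loc’ lookup might work, the following builds an index from the hash data and then uses it to find the target for a reference in the Editors’ XML. Several things here are assumptions: the `<cognates>` wrapper element is invented so the fragment parses, the function names are hypothetical, and stripping the leading asterisk from ‘loc’ is only a guess at how the two values could be made to match.

```python
import xml.etree.ElementTree as ET

# Invented wrapper around the kind of fragment found in 'cognate_hash'.
HASH_XML = """<cognates>
  <xr_group type="cognate" linkable="yes">
    <ref siglum="FEW" target="90/page/231" loc="*9,231b">posse</ref>
  </xr_group>
</cognates>"""

EDITOR_XML = (
    '<FEW_refs siglum="FEW" linkable="yes">'
    "<link_form>posse</link_form><link_loc>9,231b</link_loc>"
    "</FEW_refs>"
)

def build_target_index(hash_xml):
    """Map (siglum, loc) pairs to their link targets."""
    index = {}
    for ref in ET.fromstring(hash_xml).iter("ref"):
        # Assumption: the leading asterisk is not part of the location.
        loc = ref.get("loc", "").lstrip("*")
        index[(ref.get("siglum"), loc)] = ref.get("target")
    return index

def find_target(editor_xml, index):
    """Look up the target for one cognate reference from the Editors' XML."""
    element = ET.fromstring(editor_xml)
    loc = element.findtext("link_loc", default="")
    return index.get((element.get("siglum"), loc))

index = build_target_index(HASH_XML)
print(find_target(EDITOR_XML, index))  # 90/page/231
```

If the asterisk turns out to be meaningful, the `lstrip` call would need to be replaced with whatever rule the old system actually used.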
I completed work on the ‘cognate references’ section, omitting the links out for now (I’ll add these in later) and then moved on to the ‘summary’ box that contains links through to lower sections of the entry. Unfortunately the ‘sense’ numbers are something else that is not present in any form in the Editors’ XML. In the System XML each sense has a number, e.g. ‘<sense n="1">’ but in the Editors’ XML there is no such number. I spent quite a bit of time trying to increment a number in XSLT and apply it to each sense, but it turns out you can’t increment a variable in XSLT as variables cannot be changed once they are set, even though there are ‘for-each’ loops where such an incrementing number would be easy to implement in other languages.
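For comparison, here is how the numbering could be added outside of XSLT, as a pre-processing step in Python. This is a sketch only: it assumes the Editors’ XML uses `<sense>` elements, and the sample entry is invented. Within XSLT itself, the `xsl:number` instruction or the `position()` function may be another route to the same result, though I haven’t shown that here.

```python
import xml.etree.ElementTree as ET

def number_senses(entry_xml):
    """Add an 'n' attribute to each <sense>, counting from 1 in document order."""
    root = ET.fromstring(entry_xml)
    for n, sense in enumerate(root.iter("sense"), start=1):
        sense.set("n", str(n))
    return ET.tostring(root, encoding="unicode")

# Invented sample entry with unnumbered senses.
sample = "<entry><sense>first</sense><sense>second</sense></entry>"

print(number_senses(sample))
# <entry><sense n="1">first</sense><sense n="2">second</sense></entry>
```

Note that any nested senses would simply be numbered in the same running sequence, which may not match how the System XML numbers them.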
I still need to add in the non-locution xrefs, labels and some other things, but overall I’m very happy with the progress I’ve made this week. Below is an example of an entry in the old site, with the entry as it currently looks in the new test site I’m working on (be aware that the new interface is only a placeholder). Before:
This was a pretty busy week, involving lots of different projects. I set up the systems for a new place-name project focusing on Ayrshire this week, based on the system that I initially developed for the Berwickshire project and that has subsequently been used for Kirkcudbrightshire and Mull. It didn’t take too long to port the system over, but the PI also wanted the system to be populated with data from the GB1900 crowdsourcing project. This project has transcribed every place-name on the GB1900 Ordnance Survey maps across the whole of the UK and is an amazing collection of data totalling some 2.5 million names. I had previously extracted a subset of names for the Mull and Ulva project so thankfully had all of the scripts needed to get the information for Ayrshire. Unfortunately what I didn’t have was the data in a database, as I’d previously extracted it to my PC at work. This meant that I had to run the extraction script again on my home PC, which took about three days to work through all of the rows in the monstrous CSV file. Once this was complete I could then extract the names found in the Ayrshire parishes that the project will be dealing with, resulting in almost 4,000 place-names. However, this wasn’t the end of the process as while the extracted place-names had latitude and longitude they didn’t have grid references or altitude. My place-names system is set up to automatically generate these values and I could customise the scripts to automatically apply the generated data to each of the 4,000 places. Generating the grid reference was pretty straightforward but grabbing the altitude was less so, as it involved submitting a query to Google Maps and then inserting the returned value into my system using an AJAX call.
I ran into difficulties with my script exceeding the allowed number of Google Map queries and also the maximum number of page requests on our server, resulting in my PC getting blocked by the server and a ‘Forbidden’ error being displayed instead, but with some tweaking I managed to get everything working within the allowed limits.
I also continued to work on the Second Edition of the Historical Thesaurus. I set up a new version of the website that we will work on for the Second Edition, and created new versions of the database tables that this new site connects to. I also spent some time thinking about how we will implement some kind of changelog or ‘history’ feature to track changes to the lexemes, their dates and corresponding categories. I had a Zoom call with Marc and Fraser on Wednesday to discuss the developments and we realised that the date matching spreadsheets I’d generated last week could do with some additional columns from the OED data, namely links through to the entries on the OED website and also a note to say whether the definition contains ‘(a)’ or ‘(also’ as these would suggest the entry has multiple senses that may need a closer analysis of the dates.
I then started to update the new front-end to use the new date structure that we will use for the Second Edition (with dates stored in a separate date table rather than split across almost 20 different date fields in the lexeme table). I updated the timeline visualisations (mini and full) to use this new date table, and although this took quite some time to get my head around the resulting code is MUCH less complicated than the horrible code I had to write to deal with the old 20-odd date columns. For example, the code to generate the data for the mini timelines is about 70 lines long now as opposed to over 400 previously.
The timelines use the new data tables in the category browse and the search results. I also spotted some dates weren’t working properly with the old system but are working properly now. I then updated the ‘label’ autocomplete in the advanced search to use the labels in the new date table. What I still need to do is update the search to actually search for the new labels and also to search the new date tables for both ‘simple’ and ‘complex’ year searches. This might be a little tricky, and I will continue on this next week.
Also this week I gave Gerry McKeever some advice about preserving the data of his Regional Romanticism project, spoke to the DSL people about the wording of the search results page, gave feedback on and wrote some sections for Matthew Creasy’s Chancellor’s Fund proposal, gave feedback to Craig Lamont regarding the structure of a spreadsheet for holding data about the correspondence of Robert Burns and gave some advice to Rob Maslen about the stats for his ‘City of Lost Books’ blog. I also made a couple of tweaks to the content management system for the Books and Borrowers project based on feedback from the team.
I spent the remainder of the week working on the redevelopment of the Anglo-Norman dictionary. I updated the search results page to style the parts of speech to make it clearer where one ends and the next begins. I also reworked the ‘forms’ section to add in a cut-off point for entries that have a huge number of forms. In such cases the long list is cut off and an ellipsis is added in, together with an ‘expand’ button. Pressing on this scrolls down the full list of forms and the button is replaced with a ‘collapse’ button. I also updated the search so that it no longer includes cross references (these are to be used for the ‘Browse’ list only) and the quick search now defaults to an exact match search whether you select an item from the auto-complete or not. Previously it performed an exact match if you selected an item but defaulted to a partial match if you didn’t. Now if you search for ‘mes’ (for example) and press enter or the search button your results are for “mes” (exactly). I suspect most people will select ‘mes’ from the list of options, which already did this, though. It is also still possible to use the question mark wildcard with an ‘exact’ search, e.g. “m?s” will find 14 entries that have three letter forms beginning with ‘m’ and ending in ‘s’.
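To illustrate how the ‘?’ and ‘*’ wildcards can combine with an exact match, here is a small sketch that turns a search term into a regular expression and matches it against whole forms. The real search presumably runs as a database query rather than in Python, so the function names and the sample forms here are purely illustrative.

```python
import re

def wildcard_to_regex(term):
    """Translate '?' (one character) and '*' (any run of characters) to a regex."""
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def search_forms(term, forms):
    """Return the forms that match the whole term exactly, wildcards included."""
    pattern = wildcard_to_regex(term)
    return [form for form in forms if pattern.fullmatch(form)]

# Invented sample list of forms.
forms = ["mes", "mas", "mais", "mesme", "ament", "from"]

print(search_forms("mes", forms))    # ['mes']
print(search_forms("m?s", forms))    # ['mes', 'mas']
print(search_forms("*ment", forms))  # ['ament']
```

Using `fullmatch` rather than `search` is what makes the match ‘exact’: ‘mes’ does not return ‘mesme’ unless a wildcard is added.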
I also updated the display of the parts of speech so that they are in order of appearance in the XML rather than alphabetically and I’ve updated the ‘v.a.’ and ‘v.n.’ labels as the editor requested. I also updated the ‘entry’ page to make the ‘results’ tab load by default when reaching an entry from the search results page or when choosing a different entry in the search results tab. In addition, the search result navigation buttons no longer appear in the search tab if all the results fit on the page and the ‘clear search’ button now works properly. Also, on the search results page the pagination options now only appear if there is more than one page of results.
On Friday I began to process the entry XML for display on the entry page, which was pretty slow going, wading through the XSLT file that is used to transform the XML to HTML for display. Unfortunately I can’t just use the existing XSLT file from the old site because we’re using the editor’s version of the XML and not the system version, and the two are structurally very different in places.
So far I’ve been dealing with forms and have managed to get the forms listed, with grammatical labels displayed where available and commas separating forms and semi-colons separating groups of forms. Deviant forms are surrounded by brackets. Where there are lots of forms the area is cut off as with the search results. I still need to add in references where these appear, which is what I’ll tackle next week. Hopefully, now that I’ve started to get my head around the XML a bit, progress with the rest of the page will be a little speedier, but there will undoubtedly be many more complexities that will need to be dealt with.
I worked on many different projects this week, and the largest amount of my time went into the redevelopment of the Anglo-Norman Dictionary. I processed a lot of the data this week and have created database tables and written extraction scripts to export labels, parts of speech, forms and cross references from the XML. The data extracted will be used for search purposes, for display on the website in places such as the search results or will be used to navigate between entries. The scripts will also be used when updating data in the new content management system for the dictionary when I write it. I have extracted 85,397 parts of speech, 31,213 cross references, 150,077 forms and their types (lemma / variant / deviant) and 86,269 labels which correspond to one of 157 unique labels (usage or semantic), which I also extracted.
I have also finished work on the quick search feature, which is now fully operational. This involved creating a new endpoint in the API for processing the search. This includes the query for the predictive search (i.e. the drop-down list of possible options that appears as you type), which returns any forms that match what you’re typing in and the query for the full quick search, which allows you to use ‘?’ and ‘*’ wildcards (and also “” for an exact match) and returns all of the data about each entry that is needed for the search results page. For example, if you type in ‘from’ in the ‘Quick Search’ box a drop-down list containing all matching forms will appear. Note that these are forms not only headwords so they include lemmas but also variants and deviants. If you select a form that is associated with one single entry then the entry’s page will load. If you select a form that is associated with more than one entry then the search results page will load. You can also choose to not select an item from the drop-down list and search for whatever you’re interested in. For example, enter ‘*ment’ and press enter or the search button to view all of the forms ending in ‘ment’, as the following screenshot demonstrates (note that this is not the final user interface but one purely for test purposes):
With this example you’ll see that the results are paginated, with 100 results per page. You can browse through the pages using the next and previous buttons or select one of the pages to jump directly to it. You can bookmark specific results pages too. Currently the search results display the lemma and homonym number (if applicable) and display whether the entry is an xref or not. Associated parts of speech appear after the lemma. Each one currently has a tooltip and we can add in descriptions of what each POS abbreviation means, although these might not be needed. All of the variant / deviant forms are also displayed as otherwise it can be quite confusing for users if the lemma does not match the term the user entered but a form does. All associated semantic / usage labels are also displayed. I’m also intending to add in earliest citation date and possibly translations to the results as well, but I haven’t extracted them yet.
When you click on an entry from the search results this loads the corresponding entry page. I have updated this to add in tabs to the left-hand column. In addition to the ‘Browse’ tab there is a ‘Results’ tab and a ‘Log’ tab. The latter doesn’t contain anything yet, but the former contains the search results. This allows you to browse up and down the search results in the same way as the regular ‘browse’ feature, selecting another entry. You can also return to the full results page. I still need to do some tweaking to this feature, such as ensuring the ‘Results’ tab loads by default if coming from a search result. The ‘clear’ option also doesn’t currently work properly. I’ll continue with this next week.
For the Books and Borrowing project I spent a bit of time getting the page images for the Westerkirk library uploaded to the server and the page records created for each corresponding page image. I also made some final tweaks to the Glasgow Students pilot website that Matthew Sangster and I worked on and this is now live and available here: https://18c-borrowing.glasgow.ac.uk/.
There are three new place-name related projects starting up at the moment and I spent some time creating initial websites for all of these. I still need to add in the place-name content management systems for two of them, and I’m hoping to find some time to work on this next week. I also spoke to Joanna Kopaczyk about a website for an RSE proposal she’s currently putting together and gave some advice to some people in Special Collections about a project that they are planning.
On Tuesday I had a Zoom call with the ‘Editing Robert Burns’ people to discuss developing the website for phase two of the Editing Robert Burns project. We discussed how the website would integrate with the existing website (https://burnsc21.glasgow.ac.uk/) and discussed some of the features that would be present on the new site, such as an interactive map of Burns’ correspondence and a database of forged items.
I also had a meeting with the Historical Thesaurus people on Tuesday and spent some time this week continuing to work on the extraction of dates from the OED data, which will feed into a new second edition of the HT. I fixed all of the ‘dot’ dates in the HT data. This is where there isn’t a specific date but a dot is used instead (e.g. 14..) but sometimes a specific year is given in the year attribute (e.g. 1432) but at other times a more general year is given (e.g. 1400). We worked out a set of rules for dealing with these and I created a script to process them. I then reworked my script that extracts dates for all lexemes that match a specific date pattern (YYYY-YYYY, where the first year might be Old English and the last year might be ‘Current’) and sent this to Fraser so that the team can decide which of these dates should be used in the new version of the HT. Next week I’ll begin work on a new version of the HT website that uses an updated dataset so we can compare the original dates with the newly updated ones.
I needed a further two trips to the dentist this week, which lost me some time due to my dentist being the other side of the city from where I live (but very handy for my office at work that I’m not currently allowed to use). Despite these interruptions I managed to get a decent amount done this week. For the Books and Borrowing project I processed the images of a register from Westerkirk library. For this register I needed to stitch together the images of the left and right pages to make a single image, as each spread features a table that covers both pages. As we didn’t want to have to manually join hundreds of images I wrote a script that did this, leaving a margin between the two images as they don’t line up perfectly. I used the command-line tool ImageMagick to achieve this, firstly adding the margin to the left-hand image and secondly joining this to the right-hand image. I then needed to generate tilesets of the images using Zoomify, but when I came to do so the converter processed the images the wrong way round, treating them as portrait rather than landscape and resulting in tilesets that were all wrong. I realised that when joining the page images together the image metadata hadn’t been updated: two portrait images were joined together to make one landscape image, but the metadata still suggested that the image was portrait, which confused the Zoomify converter. I therefore had to run the images through ImageMagick again to strip out all of the metadata and then rotate the images 90 degrees clockwise, which resulted in a set of images I could then upload to the server.
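The sequence of ImageMagick steps could look something like the sketch below, which only builds the command lines rather than running them. This is not the actual script: the 40 pixel margin, the white background, the file naming and the function name are all assumptions, and `convert` is the ImageMagick 6 command name (newer installations use `magick` instead).

```python
def stitch_commands(left, right, out, margin=40):
    """Build ImageMagick command lines to join two page images side by side."""
    padded = left + ".padded.png"  # hypothetical temporary file name
    return [
        # 1. Add a blank strip to the right-hand edge of the left page.
        ["convert", left, "-gravity", "east", "-background", "white",
         "-splice", f"{margin}x0", padded],
        # 2. Join the padded left page and the right page horizontally.
        ["convert", padded, right, "+append", out],
        # 3. Strip the stale metadata, then rotate 90 degrees clockwise.
        ["convert", out, "-strip", "-rotate", "90", out],
    ]

for command in stitch_commands("page_left.jpg", "page_right.jpg", "spread.jpg"):
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` to execute it, looping over every pair of page images in the register.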
Also this week I made some further tweaks to Matthew Sangster’s pilot project featuring the Glasgow Student data, which we will be able to go live with soon. This involved adding in a couple of missing page images, fixing some encoding issues with Greek characters in a few book titles, fixing a bug that was preventing the links to pages from the frequency lists working, ensuring any rows that are to be omitted from searches were actually being omitted and adding in tooltips for the table column headers to describe what the columns mean.
I also made some progress with the redevelopment of the Anglo-Norman Dictionary. I had a Zoom meeting with the editors on Wednesday, which went very well, and resulted in me making some changes to the system documentation I had previously written. I also worked on an initial structure for the new dictionary website, setting up WordPress for the ancillary pages and figuring out how to create a WordPress theme that is based on Bootstrap. This was something I hadn’t done before and it was a good learning experience. It mostly went pretty smoothly, but getting a WordPress menu to use Bootstrap’s layout was a little tricky. Thankfully someone has already solved the issue and has made the code available to use (see https://github.com/wp-bootstrap/wp-bootstrap-navwalker) so I could just integrate this with my theme.
I completed work on the theme and generated placeholder pages and menu items for all the various parts of the site. The page design is just my temporary page design for now, looking very similar to the Books and Borrowing CMS design, but this will be replaced with something more suitable in time. With this in place I regenerated the XML data from the existing CMS based on the final ‘entry_hash’ data I had. This was even more successful than my first attempt with an earlier version of the data last week and resulted in all but 35 of the 54,025 dictionary entries being generated. This XML has the same structure as the files being used by the editors, so we will now be able to standardise on this structure.
With the new data imported I then started work on an API for the site. This will process all requests for data and will then return the data in either JSON or CSV format (with the front-end using JSON). I created the endpoints necessary to make the ‘browse’ panel work – returning a section of the dictionary as headwords and links either based on entry ‘slugs’ (the URL-safe versions of headwords) or headword text, depending on whether the ‘browse up/down’ option or the ‘jump to’ option is chosen. I also created an endpoint for displaying an entry, which returns all of the data for an entry including its full XML record.
I then began work on the ‘entry’ page in the front-end, focussing initially on the ‘browse’ feature. By the end of the week this was fully operational, allowing the user to scroll up and down the list, select an item to load it or enter text into the ‘jump to’ box. There’s also a pop-up where info about how to use the browse can be added. The ‘jump to’ still needs some work as if you type fast into it it sometimes gets confused as to what content to show. I haven’t done anything about displaying the entry yet, other than displaying the headword. Currently the full versions of both the editor’s and the existing system XML are displayed. Below is a screenshot of how things currently look:
My last task of the week for the AND was to write a script to extract all of the headwords, variants and deviants from the entries to enable the quick search to work. I set the script running and by the time it had finished executing there were more than 150,000 entries in the ‘forms’ table I’d created.
Also this week I helped Rob Maslen to migrate his ‘City of Lost Books’ blog to a new URL, had a chat with the DSL people about updates to the search results page based on the work I did last week and had a chat with Thomas Clancy about three upcoming place-names projects.
I also returned to the Historical Thesaurus project and our ongoing attempts to extract dates from the Oxford English Dictionary in order to update the dates of attestation in the Historical Thesaurus. Firstly, I noticed that there were some issues with end dates for ranged dates before 1000 and I’ve fixed these (there were about 50 or so). Secondly, I noticed there are about 20 dates that don’t have a ‘year’ as presumably the ‘year’ attribute in the XML was empty. Some of these I can fix (and I have), but others also have an empty ‘fullyear’ too, meaning the date tag was presumably empty in the XML and I therefore deleted these.
We still needed to figure out how to handle OED dates that have a dot in them. These are sometimes used (well, used about 4,000 times) to show roughly where a date comes so that it is placed correctly in the sequence of dates (e.g. ’14..’ is given the year ‘1400’). But sometimes a date has a dot and a specific year (e.g. ’14..’ but ‘1436’). We figured out that this is to ensure the correct ordering of the date after an earlier specific date. Fraser therefore wanted these dates to be ‘ante’ the next known date. I therefore wrote a script that finds all lexemes that have at least one date that has a dot and a specific year, then for each of these lexemes it gets all of the dates in order. Each date is displayed, with the ‘fullyear’ displayed first and the ‘year’ in brackets. If the date is a ‘.’ date then it is highlighted in yellow. For each of these the script then tries to find the next date in sequence that isn’t another ‘.’ date (as sometimes there are several). If it finds one then the date becomes this row’s ‘year’ plus ‘a’. If it doesn’t find one (e.g. if the ‘.’ date is the last date for the lexeme) then it retains the year from the ‘.’ date but with ‘a’ added. Next week I will run this script to actually update the data and we will then move on to using the new OED data with the HT’s lexemes.
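The rule for the ‘.’ dates can be sketched as follows. This is an illustration rather than the actual script: it assumes the dates for a lexeme arrive as (‘fullyear’, ‘year’) pairs already in sequence, that ‘plus a’ means an ‘a’ prefix for ‘ante’, and the sample dates are invented.

```python
def resolve_dot_dates(dates):
    """dates: list of (fullyear, year) pairs in sequence order.
    Returns one display value per date, with '.' dates made 'ante' dates."""
    resolved = []
    for i, (fullyear, year) in enumerate(dates):
        if "." not in fullyear:
            resolved.append(str(year))
            continue
        # Find the next date in sequence that is not another '.' date.
        next_year = next(
            (y for f, y in dates[i + 1:] if "." not in f), None
        )
        # Fall back to the '.' date's own year if nothing follows it.
        resolved.append(f"a{next_year if next_year is not None else year}")
    return resolved

# Invented sample sequence for one lexeme.
sample = [("1420", 1420), ("14..", 1436), ("14..", 1440),
          ("1450", 1450), ("15..", 1500)]

print(resolve_dot_dates(sample))
# ['1420', 'a1450', 'a1450', '1450', 'a1500']
```

The two consecutive ‘14..’ dates both become ‘a1450’ because each one looks past the other to the next specific date, while the final ‘15..’ has nothing after it and so keeps its own year.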
I lost most of Tuesday this week to root canal surgery, which was uncomfortable and exhausting but thankfully not too painful. Unfortunately my teeth are still not right and I now have a further appointment booked for next week, but at least the severe toothache that I had previously has now stopped.
I continued to work on the requirements document for the redevelopment of the Anglo-Norman Dictionary this week, and managed to send a first completed version of it to Heather Pagan for feedback. It will no doubt need some further work, but it’s good to have a clearer picture of how the new version of the website will function. Also this week I investigated another bizarre situation with the AND’s data. I have access to the full dataset as is used to power the existing website as a single XML file containing all of the entries. The Editors are also working on individual entries as single XML files that are then uploaded to the existing website using a content management system. What we didn’t realise up until now is that the structure of the XML files is transformed when an entry is ingested into the online system. For example, the ‘language’ tag is changed from <language lang="M.E."/> to <lbl type="lang">M.E.</lbl>. Similarly, part of speech has been transformed from <pos type="s."/> to <gramGrp><pos>s.</pos></gramGrp>. We have no idea why the developer of the system chose to do this, as it seems completely unnecessary and it’s a process that doesn’t appear to be documented anywhere. The crazy thing is that the transformed XML still then needs to be further transformed to HTML for display so what appears on screen is two steps removed from the data the editors work with. It also means that I don’t have access to the data in the form that the editors are working with, meaning I can’t just take their edits and use them in the new site.
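To give a feel for what reversing this transformation involves, here is a small sketch that converts the two examples above from the system structure back to the editors’ structure. It covers only these two mappings, the `<entry>` wrapper and function name are invented, and a full conversion would obviously need to handle many more elements.

```python
import xml.etree.ElementTree as ET

def to_editor_form(system_xml):
    """Rewrite <lbl type="lang"> and <gramGrp><pos> into the editors' forms."""
    root = ET.fromstring(system_xml)
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "lbl" and child.get("type") == "lang":
                new = ET.Element("language", {"lang": child.text or ""})
            elif child.tag == "gramGrp" and child.find("pos") is not None:
                new = ET.Element("pos", {"type": child.findtext("pos") or ""})
            else:
                continue
            new.tail = child.tail   # keep any text that followed the old element
            parent[index] = new     # swap the element in place
    return ET.tostring(root, encoding="unicode")

# Invented wrapper around the two examples from the system XML.
sample = '<entry><lbl type="lang">M.E.</lbl><gramGrp><pos>s.</pos></gramGrp></entry>'

print(to_editor_form(sample))
```

Replacing elements by index keeps everything else in the entry in its original position, which matters when the surrounding elements are order-sensitive.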
As we ideally want to avoid the situation where we have two structurally different XML datasets for the dictionary I wanted to try and find a way to transform the data I have into the structure used by the editors. I attempted to do this by looking at the code for the existing content management system to try to decipher where the XML is getting transformed. There is an option for extracting an entry from the online system for offline editing and this transforms the XML into the format used by the editors. I figured that if I can understand how this process works and replicate it then I will be able to apply this to the full XML dictionary file and then I will have the complete dataset in the same format as the editors are working with and we can just use this in the redevelopment.
It was not easy to figure out what the system is up to, but I managed to ascertain that when you enter a headword for export this triggers a Perl script, which in turn uses an XSLT stylesheet. I managed to track down a version of this stylesheet that appears to have been last updated in 2014. I then wrote a little script that takes the XML of the entry for ‘padlock’ as found in the online data and applies this stylesheet to it, in the hope that it would give me an XML file identical to the one exported by the CMS.
The script executed successfully, but the resulting XML was not quite identical to the file exported by the CMS. There was no ‘doctype’ and DTD reference; the ‘attestation’ ID was the entry ID with an auto-incrementing ‘C’ number appended to it (AND-201-02592CE7-42F65840-3D2007C6-27706E3A-C001) rather than the ID of the <cit> element (C-11c4b015); and <dateInfo> was not processed, so only the contents of the tags within <dateInfo> were being displayed.
I’m not sure why these differences exist. It’s possible I only have access to an older version of the XSLT file. I’m guessing this must be the case because the missing or differently formatted data does not appear to be handled elsewhere (e.g. in the Perl script). What I then did was to modify the XSLT file to ensure that the changes are applied: the doctype is added in, the ‘attestation’ ID is correct and the <dateInfo> section contains the full data.
I could try applying this script to every entry in the full data file I have, although I suspect there will be other situations that the XSLT file I have is not set up to successfully process.
I therefore tried to investigate another alternative, which was to write a script that will pass the headword of every dictionary item to the ‘Retrieve an entry for editing’ script in the CMS, saving the results of each. I considered that might be more likely to work reliably for every entry, but that we may run into issues with the server refusing so many requests. After a few test runs, I set the script loose on all 53,000 or so entries in the system and although it took several hours to run, the process did appear to work for the most part. I now have the data in the same structure as the editors work with, which should mean we can standardise on this format and abandon the XML structure used by the existing online system.
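The bulk export described above can be sketched roughly as follows. This is an illustration under assumptions rather than the script I actually ran: the URL, the ‘headword’ parameter name and the function names are all hypothetical. It pauses between requests and retries failures, with the aim of reducing the risk of the server refusing requests.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address; the real CMS export script lives elsewhere.
EXPORT_URL = "https://cms.example.org/retrieve-entry"


def fetch_entry(headword):
    """Request the editor-format XML for one headword (parameter name assumed)."""
    with urlopen(EXPORT_URL + "?" + urlencode({"headword": headword})) as response:
        return response.read().decode("utf-8")


def export_all(headwords, fetch=fetch_entry, delay=0.5, retries=3):
    """Fetch every headword, pausing between requests and retrying failures."""
    exported, failed = {}, []
    for headword in headwords:
        for attempt in range(retries):
            try:
                exported[headword] = fetch(headword)
                break
            except OSError:  # urllib's network errors are subclasses of OSError
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
        else:
            failed.append(headword)  # every attempt failed; note it for later
        time.sleep(delay)  # be gentle with the server
    return exported, failed
```

Keeping a list of failed headwords means the handful of entries that don’t come through can be re-run or checked by hand afterwards.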
Also this week I fixed an issue with links through to the Bosworth and Toller Old English dictionary from the Thesaurus of Old English. Their site has been redeveloped and they’ve changed the way their URLs work without putting redirects from the old URLs, meaning all our links for words in the TOE to words on their site are broken. URLs for their entries now just use a unique ID rather than the word (e.g. http://bosworthtoller.com/28286), which seems like a bit of a step backwards. They’ve also got rid of length marks and are using acute accents on characters instead, which is a bit strange. The change to an ID in the URL means we can no longer link to a specific entry as we can’t possibly know what IDs they’re using for each word. However, we can link to their search results page instead, e.g. http://bosworthtoller.com/search?q=sōfte works and I updated TOE to use such links.
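As a small illustration of the new link format, a search link can be built from a TOE word like this. It is a sketch in Python rather than the code the TOE site actually uses, and the function name is made up. Percent-encoding the word keeps characters such as ‘ō’ safe in the URL.

```python
from urllib.parse import urlencode


def bosworth_toller_search_url(word):
    """Build a link to the Bosworth and Toller search results page for a word."""
    return "http://bosworthtoller.com/search?" + urlencode({"q": word})


url = bosworth_toller_search_url("sōfte")
# 'ō' is percent-encoded as UTF-8: http://bosworthtoller.com/search?q=s%C5%8Dfte
```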
I also continued with the processing of OED dates for use in the Historical Thesaurus, after my date extraction script finished executing over the weekend. This week I investigated OED dates that have a dot in them instead of a full date. There are 4,498 such dates and for most of them the OED records the lower date in the ‘year’ attribute, e.g. ‘138.’ is 1380 and ‘17..’ is 1700. However, sometimes a specific date is given in the ‘year’ attribute despite the presence of a full stop in the date tag. For example, one entry has ‘1421’ in the ‘year’ attribute but ‘14..’ in the date tag. There are just over a thousand dates where there are two dots but the ‘year’ given does not end in ‘00’. Fraser reckons this is to do with ordering the dates in the OED and I’ll need to do some further work on this next week.
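The check described above boils down to something like the following Python sketch. The function names are mine and this is not the actual extraction script: each dot is treated as a zero to get the lower date, and any dotted date whose ‘year’ attribute disagrees with that is flagged.

```python
def lower_bound_year(date_text):
    """Treat each dot as a zero, so '138.' gives 1380 and '17..' gives 1700."""
    return int(date_text.replace(".", "0"))


def year_disagrees(date_text, year_attr):
    """True when a dotted date's 'year' attribute is not simply the lower date."""
    return "." in date_text and lower_bound_year(date_text) != int(year_attr)


needs_review = year_disagrees("14..", "1421")  # the '1421' example from above
```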
In addition to the above I continued to work on the Books and Borrowing project. I made some tweaks to the CMS to make it easier to edit records. When a borrowing record is edited the page automatically scrolls down to the record that was edited. This also happens for books and borrowers when accessed and edited from the ‘Books’ and ‘Borrowers’ tabs in a library. I also wrote an initial script that will help to merge some of the duplicate author records we have in the system due to existing data with different formats being uploaded from different libraries. What it does is strip all of the non-alpha characters from the forename and surname fields, make them lower case and then join them together. So for example, author ID (AID) 111 has ‘Arthur’ as forename and ‘Bedford’ as surname while AID 1896 has nothing for forename and ‘Bedford, Arthur, 1668-1745’ as surname. When stripped and joined together these both become ‘bedfordarthur’ and we have a match.
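In Python terms the matching works along these lines. This is a simplified sketch rather than the actual script: the function names are made up, and joining surname before forename is an assumption based on the ‘bedfordarthur’ example.

```python
import re
from collections import defaultdict


def match_key(forename, surname):
    """Lower-case the joined name and strip everything except a-z."""
    return re.sub(r"[^a-z]", "", (surname + forename).lower())


def find_duplicates(authors):
    """Group AIDs by key and keep only the groups with more than one record."""
    groups = defaultdict(list)
    for aid, forename, surname in authors:
        groups[match_key(forename, surname)].append(aid)
    return {key: aids for key, aids in groups.items() if len(aids) > 1}


# The two records from the example above.
authors = [
    (111, "Arthur", "Bedford"),
    (1896, "", "Bedford, Arthur, 1668-1745"),
]
duplicates = find_duplicates(authors)
```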
There are 162 matches that have been identified, some consisting of more than two matched author records. I exported these as a spreadsheet. Each row includes the author’s AID, title, forename, surname, othername, born and died (each containing ‘c’ where given), a count of the number of books the record is associated with and the AID of the record that is set to be retained for the match. This defaults to the first record, which also appears in bold, to make it easier to see where a new batch of duplicates begins.
The editors can then go through this spreadsheet and reassign the ‘AID to keep’ field to a different row. E.g. for Francis Bacon the AID to keep is given as 1460. If the second record for Francis Bacon should be kept instead the editor would just need to change the value in this column for all three Francis Bacons to the AID for this row, which is 163. Similarly, if something has been marked as a duplicate and it’s wrong, then the ‘AID to keep’ can be set accordingly. For example, there are four ‘David Hume’ records, but looking at the dates at least one of these is a different person. To keep the record with AID 1610 separate, replace the AID 1623 in the ‘AID to keep’ column with 1610. It is likely that this spreadsheet will also be used to manually split up the imported authors that just have all their data in the surname column. Someone could, for example, take the record that has ‘Hume, David, 1560?-1630?’ in the surname column and split this into the correct columns.
I also generated a spreadsheet containing all of the authors that appear to be unique. This will also need checking for other duplicates that haven’t been picked up, as there are a few. For example, AID 1956 ‘Heywood, Thomas, d. 1641’ and 1570 ‘Heywood, Thomas, -1641.’ haven’t been matched because of that ‘d’. Similarly, AID 1598 ‘Buffon, George Louis Leclerc, comte de, 1707-1788’ and 2274 ‘Buffon, Georges Louis Leclerc, comte de, 1707-1788.’ haven’t been matched up because one is ‘George’ and the other ‘Georges’. Accented characters have also not been properly matched, e.g. AID 1457 ‘Beze, Theodore de, 1519-1605’ and 397 ‘Bèze, Théodore de, 1519-1605.’. I could add in a Levenshtein test that matches up things that are one character different and update the script to properly take into account accented characters for matching purposes, or these are things that could just be sorted manually.
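If we do go down the automated route, the two improvements could look something like the sketch below. It is in Python with made-up function names and has not been added to the real script. Accents are folded to their base letters before stripping, and a plain Levenshtein distance picks up keys that are one character apart.

```python
import re
import unicodedata


def fold_accents(text):
    """Decompose accented letters and drop the combining marks ('è' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(name):
    """Fold accents, lower-case, then strip everything except a-z."""
    return re.sub(r"[^a-z]", "", fold_accents(name).lower())


def levenshtein(a, b):
    """Edit distance between two strings, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def near_match(name_a, name_b, max_distance=1):
    """True when the two keys are at most max_distance edits apart."""
    return levenshtein(match_key(name_a), match_key(name_b)) <= max_distance
```

A threshold of one character would catch all three examples above, but it would also pair up genuinely different authors whose keys happen to differ by a single letter, so any matches found this way would still need to be checked by the editors.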
Ann Fergusson of the DSL got back to me this week after having rigorously tested the search facilities of our new test versions of the DSL API (V2 containing data from the original API and V3 containing data that has been edited since the original API was made). Ann had spotted some unexpected behaviour in some of the searches and I spent some time investigating these and fixing things where possible. There were some cases where incorrect results were being returned when a ‘NOT’ search was performed on a selected source dictionary, due to the positioning of the source dictionary in the query. This was thankfully easy to fix. There was also an issue with some exact searches of the full text failing to find entries. When the full text is ingested into Solr all of the XML tags are stripped out. If there are no spaces between tagged words then the words end up squashed together. For example: ‘Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>’. With the tags (and punctuation) stripped out we’re left with ‘WestminsterB’. So an exact search for ‘westminster’ fails to find this entry. A search for ‘westminsterb’ finds the entry, which confirms this. I suspect this situation is going to crop up quite a lot, so I will need to update the script that prepares content for Solr so that it adds a space after each tag before stripping the tags and then collapses any multiple spaces between words.
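The planned fix amounts to something like this Python sketch. The real preparation script isn’t shown here and the function names are mine. Replacing each tag with a space has the same effect as adding a space after the tag and then stripping it, and the leftover runs of spaces are then collapsed. Punctuation is left alone in this sketch.

```python
import re

TAG = re.compile(r"<[^>]+>")


def strip_tags_naive(xml_text):
    """Current behaviour: tags are simply removed, so adjacent words fuse."""
    return TAG.sub("", xml_text)


def strip_tags_spaced(xml_text):
    """Proposed behaviour: each tag becomes a space, then whitespace is collapsed."""
    return " ".join(TAG.sub(" ", xml_text).split())


sample = "Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>"
fused = strip_tags_naive(sample)     # 'WestminsterB. Attrib'
spaced = strip_tags_spaced(sample)   # 'Westminster B . Attrib'
```

With the spaced version ‘Westminster’ survives as a word in its own right, so an exact search for it should then find the entry.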