This was my first week back after the Christmas holidays and after catching up with emails I spent the best part of two days fixing the content management system of one of the resources that had been migrated at the end of last year. The Saints Places resource (https://saintsplaces.gla.ac.uk/) is not one I created but I’ve taken on responsibility for it due to my involvement with other place-names resources. The front-end was migrated by Luca and was working perfectly, but he hadn’t touched the CMS, which is understandable given that the project launched more than ten years ago. However, I was contacted during the holidays by one of the project team who said that the resource is still regularly updated and I therefore needed to get the CMS up and running again. This required updates to database query calls and session management and it took quite some time to update and test everything. I also lost an hour or so with a script that was failing to initiate a session, even though the session start code looked identical to other scripts that worked. It turned out that this was due to the character encoding of the script: it had been saved as UTF-8 with a BOM (byte order mark), meaning PHP output hidden characters to the browser before the session was instantiated, which caused the session to fail. Thankfully, once I realised this it was straightforward to convert the script to regular UTF-8 without a BOM, which solved the problem.
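For anyone who hits the same thing: the BOM is three bytes (EF BB BF) at the very start of the file, and because they sit before the opening <?php tag, PHP sends them to the browser as output, after which session_start() can no longer set its headers. A quick way to check for and strip the BOM (a hypothetical stand-alone checker, not the actual fix – I simply resaved the files in my editor):

    <?php
    // Check a script for a UTF-8 BOM (the bytes EF BB BF) and strip it if found.
    $file = 'some_cms_script.php'; // hypothetical filename
    $bytes = file_get_contents($file);
    if (substr($bytes, 0, 3) === "\xEF\xBB\xBF") {
        file_put_contents($file, substr($bytes, 3));
        echo "BOM removed from $file\n";
    } else {
        echo "No BOM in $file\n";
    }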
With this unexpected task out of the way I then returned to my work on the new map interface for the Place-names of Iona project, working through the ‘to do’ list I’d created after our last project meeting just before Christmas. I updated the map legend filter list to add in a ‘select all’ option. This took some time to implement but I think it will be really useful. You can now deselect the ‘select all’ to be left with an empty map, allowing you to start adding in the data you’re interested in rather than having to manually remove all of the uninteresting categories. You can also reselect ‘select all’ to add everything back in again.
I did a bit of work on the altitude search, making it possible to search for an altitude of zero (either on its own or with a range starting at zero such as ‘0-10’). This was not previously working as zero was being treated as empty, meaning the search didn’t run. I’ve also fixed an issue with the display of place-names with a zero altitude – previously these displayed an altitude of ‘nullm’ but they now display ‘0m’. I also updated the altitude filter groups to make them more fine-grained and updated the colours to make them more varied rather than the shades of green we previously had. Now 0-24m is sandy yellow, 25-49m is light green, 50-74m is dark green, 75-99m is brown and anything over 99m is dark grey (currently no matching data).
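I didn’t note down exactly which check was at fault, but the classic culprit here is PHP’s empty(), which treats the string ‘0’ as empty. A minimal illustration (hypothetical variable name):

    <?php
    $altitude = '0'; // zero altitude submitted from the search form
    if (!empty($altitude)) {
        // never reached: empty('0') is true, so the search silently doesn't run
    }
    if ($altitude !== '' && is_numeric($altitude)) {
        // reached: explicitly checking for a non-blank numeric value handles zero
    }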
I also made the satellite view the default map tileset, with the previous default moved to third in the list and labelled ‘Relief’. This proved to be trickier to update than I thought it would be (e.g. pressing the ‘reset map’ button was still loading the old default even though it shouldn’t have) but I managed to get it sorted. I also updated the map popups so they have a white background and a blue header to match the look of the full record and removed all references to Landranger maps in the popup as these were not relevant. Below is a screenshot showing these changes:
I then moved onto the development of the elements glossary, which I completed this week. This can now be accessed from the ‘Element glossary’ menu item and opens in a pop-up the same as the advanced search and the record. By default elements across all languages are loaded but you can select a specific language from the drop-down list. It’s also possible to cite or bookmark a specific view of the glossary, which will load the map with the glossary open at the required place.
I’ve tried to make better use of space than similar pages on the old place-names sites by using three columns. The place-name elements are links and pressing on one performs a search for the element in question. I also updated the full record popup to link the elements listed in it to the search results. I had intended to link to the glossary rather than the search results, which is what happens in the other place-names sites, but I thought it would be more useful and less confusing to link directly to the search results instead. Below is a screenshot showing the glossary open and displaying elements in Scottish Standard English:
I also think I’ve sorted out the issue with in-record links not working as they should in Chrome and other issues involving bar characters. I’ve done quite a bit of testing with Chrome and all seems fine to me, but I’ll need to wait and see if other members of the team encounter any issues. I also added in the ‘translation’ field to the popup and full record (although there are only a few records that currently have this field populated), relabelled the historical OS maps and fixed a bug in the CMS that was resulting in multiple ampersands being generated when an ampersand was used in certain fields.
My final update for the project this week was to change the historical forms in the full record to hide the source information by default. You now need to press a ‘show sources’ checkbox above the historical forms to turn these on. I think having the sources turned off really helps to make the historical forms easier to understand.
I also spent a bit of time this week on the Books and Borrowing project, including participating in a project team Zoom call on Monday. I had thought that we’d be ready for a final cache generation and the launch of the full website this week, but the team are still making final tweaks to the data and this has therefore been pushed back to Wednesday next week. In the meantime I updated the ‘genre through time’ visualisation, as it turned out that the query that returned the number of borrowing records per genre per year wasn’t quite right and was giving somewhat inflated figures, which I managed to resolve. I also created records for the first volume of the Leighton Library Minute Books. There will be three such volumes in total, all of which will feature digitised images only (no transcriptions). I processed the images and generated page records for the first volume and will tackle the other two once the images are ready.
Also this week I made a few visual tweaks to the Erskine project website (https://erskine.glasgow.ac.uk/) and I fixed a misplaced map marker in the Place-names of Berwickshire resource (https://berwickshire-placenames.glasgow.ac.uk/). For some reason the longitude was incorrect for the place-name, even though the latitude was fine, which resulted in the marker displaying in Wales. I also fixed a couple of issues with the Old English Thesaurus for Jane Roberts and responded to a query from Jennifer Smith regarding the Speak For Yersel resource.
Finally, I investigated an issue with the Anglo-Norman Dictionary. An entry was displaying what appeared to be an erroneous first date so I investigated what was going on. The earliest date for the entry was being generated from this attestation:
<attestation id="C-e055cdb1">
  <dateInfo>
    <text_date post="1390" pre="1314" cert="">1390-1412</text_date>
    <ms_date post="1400" pre="1449" cert="">s.xv<sup>1</sup></ms_date>
  </dateInfo>
  <quotation>luy donantz aussi congié et eleccion d’estudier en divinitee ou en loy canoun a son plesir, et ce le plus favorablement a cause de nous</quotation>
  <reference><source siglum="Lett_and_Pet" target=""><loc>412.19</loc></source></reference>
</attestation>
Specifically the text date:
<text_date post="1390" pre="1314" cert="">1390-1412</text_date>
This particular attestation was being picked as the earliest due to a typo in the ‘pre’ date, which is 1314 when it should be 1412. Where there is a range of dates the code generates a single year at the midpoint that is used as a hidden first date for ordering purposes (this was agreed upon back when we were first adding in first dates of attestation). The code to do this subtracts the ‘post’ date from the ‘pre’ date, divides the result in two and then adds it to the ‘post’ date, which finds the middle point. With the typo the code therefore subtracts 1390 from 1314, giving -76. This is divided in two, giving -38, which is then added onto the ‘post’ date of 1390, giving 1352. 1352 is then the earliest hidden date for any of the entry’s attestations and therefore the entry’s displayed earliest date is set to ‘1390-1412’. Fixing the typo in the XML and processing the file would therefore rectify the issue.
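In code the midpoint calculation is just a couple of lines; a sketch of the logic (not the actual AND script):

    <?php
    // Hidden ordering year for a date range: post + (pre - post) / 2.
    function midpointYear(int $post, int $pre): int {
        return (int) ($post + ($pre - $post) / 2);
    }
    echo midpointYear(1390, 1412); // 1401 - what the entry should get
    echo midpointYear(1390, 1314); // 1352 - what the typo produced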
After spending much of my time over the past three weeks adding genre to the Books and Borrowing project I turned my attention to other projects for most of this week. One of my main tasks was to go through the feedback from the Dictionaries of the Scots Language people regarding the new date and quotation searches I’d developed back in September. There was quite a lot to go through, fixing bugs and updating the functionality and layout of the new features. This included fixing a bug with the full text Boolean search, which was querying the headword field rather than the full text, and changing the way quotation search ranking works. Previously quotation search results were ranked by the percentage of matching quotes, and if this was the same then the entry with the largest number of quotes would appear higher. Unfortunately this meant that entries with only one quote ended up ranked higher than entries with large numbers of quotes, not all of which contained the term. I updated this so that the algorithm now counts the number of matching quotes and ranks primarily on this, only using the percentage of matching quotes when two entries have the same number of matching quotes. So now a quotation search for ‘dreich’ ranks what are hopefully the most important entries first.
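The new ranking boils down to a two-level comparison – count first, percentage as the tie-breaker. A sketch with made-up field names and sample figures (not the actual DSL code):

    <?php
    $results = [
        ['entry' => 'example A', 'matching_quotes' => 40, 'percent_matching' => 80.0],
        ['entry' => 'example B', 'matching_quotes' => 1,  'percent_matching' => 100.0],
    ];
    // Rank primarily on the number of matching quotes, using the percentage
    // of matching quotes only when two entries have the same count.
    usort($results, function (array $a, array $b): int {
        return [$b['matching_quotes'], $b['percent_matching']]
           <=> [$a['matching_quotes'], $a['percent_matching']];
    });
    // 'example A' now ranks first despite its lower percentage.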
I also updated the display of dates in quotations to make them bold and updated the CSV download option to limit the number of fields that get returned. I also noticed that when a quotation search exceeded the maximum number of allowed results (e.g. ‘heid’) it was returning no results due to a bug in the code, which I fixed. I also fixed a bug that was stopping wildcards in quick searches from working as intended and fixed an issue with the question mark wildcard in the advanced headword search.
I then made updates to the layout of the advanced search page, including adding placeholder ‘YYYY’ text to the year boxes, adding a warning about the date range when dates provided are beyond the scope of the dictionaries and overhauling the search help layout down the right of the search form. The help text scroll down/up was always a bit clunky so I’ve replaced it with what I think is a neater version. You can see this, and the year warning in the following screenshot:
I also tweaked the layout of the search results page, including updating the way the information about what was searched for is displayed, moving some text to a tooltip, moving the ‘hide snippets’ option to the top menu bar and ensuring the warning that is displayed when too many results are returned appears directly above the results. You can see all of this in the following screenshot:
I then moved onto updates to the sparklines. The team decided they wanted the gap length between attestations to be increased from 25 to 50 years. This would mean individual narrow lines would then be grouped into thicker blocks. They also wanted the SND sparkline to extend to 2005, whereas previously it was cut off at 2000 (with any attestations after this point given the year 2000 in the visualisation). These updates required me to make changes to the scripts that generate the Solr data and to then regenerate the data and import it into Solr. This took some time to develop and process, and currently the results are only running on my laptop as it’s likely the team will want further changes made to the data. The following screenshot shows a sparkline when the gap length was set to 25 years:
And the following screenshot shows the same sparkline with the gap length set to 50 years:
I also updated the dates that are displayed in an entry beside the sparkline to include the full dates of attestation as found in the sparkline tooltip rather than just displaying the first and last dates of attestation.
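Going back to the gap-length change: the grouping behind the blocks amounts to merging attestation ranges whose gaps fall under the threshold. A simplified sketch of that logic (the real work happens in the scripts that generate the Solr data):

    <?php
    // Merge [start, end] year ranges when the gap between them is <= $gap years.
    function mergeRanges(array $ranges, int $gap): array {
        sort($ranges);
        $blocks = [];
        foreach ($ranges as [$start, $end]) {
            $last = count($blocks) - 1;
            if ($last >= 0 && $start - $blocks[$last][1] <= $gap) {
                $blocks[$last][1] = max($blocks[$last][1], $end); // extend the current block
            } else {
                $blocks[] = [$start, $end]; // start a new block
            }
        }
        return $blocks;
    }
    // A 35-year gap: separate blocks at a 25-year threshold, one block at 50.
    print_r(mergeRanges([[1700, 1710], [1745, 1760]], 50));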
I completed going through the feedback and making updates on Wednesday and now I need to wait and see whether further updates are required before we go live with the new date and quotation search facilities.
I spent the rest of the week working on various projects. I made a small tweak to remove an erroneous category from the Old English Thesaurus and dealt with a few data issues for the Books and Borrowing project too, including generating spreadsheets of data for checking (e.g. a list of all of the distinct borrower titles) and then making updates to the online database after these spreadsheets had been checked. I also fixed a bug with the genre search, which was joining multiple genre selections with Boolean AND when it should have been joining them with Boolean OR.
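The genre fix was essentially a one-word change in how the WHERE clause gets assembled; schematically (hypothetical table and column names):

    <?php
    $genres = ['Poetry', 'History']; // genres ticked by the user
    // Each selection should broaden the results, so the conditions are ORed:
    $conditions = implode(' OR ', array_fill(0, count($genres), 'genre = ?'));
    $sql = "SELECT * FROM borrowings WHERE ($conditions)";
    // Previously the conditions were joined with AND, which required a single
    // record to match every selected genre at once.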
I also returned to working on the Anglo-Norman Dictionary. This included updating the XSLT so that legiturs in variant lists display properly (see ‘la noitement (l. l’anoitement)’ here: https://anglo-norman.net/entry/anoitement). Whilst sorting this out I noticed that some entries had multiple ‘active’ records in the database – a situation that should never happen. After spotting this I did some frantic investigation to understand what was going on. Thankfully it turned out that the issue had only affected 23 entries, with all but two of them having two active records. I’m not sure what happened with ‘bland’ to result in 36 active records, or ‘anoitement’ with 9, but I figured out a way to resolve the issue and ensure it doesn’t happen again in future. I updated the script that publishes holding area entries to ensure any existing ‘active’ records are removed when the new record is published. Previously the script was only dealing with one ‘active’ entry (as that is all there should have been), which I think may have been how the issue cropped up. In future the duplicate issue will rectify itself whenever one of the entries with duplicate active records is edited – at the point of publication all existing ‘active’ records will be moved to the ‘history’ table.
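In outline, the publication script now demotes every currently active record rather than just the first one it finds; a sketch using PDO and hypothetical table and column names:

    <?php
    $pdo  = new PDO('mysql:host=localhost;dbname=and', 'user', 'pass'); // placeholder credentials
    $slug = 'anoitement'; // the entry being published
    $pdo->beginTransaction();
    // Move ALL existing active records to the history table, however many there are...
    $pdo->prepare("INSERT INTO entry_history
                   SELECT * FROM entries WHERE slug = ? AND status = 'active'")
        ->execute([$slug]);
    $pdo->prepare("DELETE FROM entries WHERE slug = ? AND status = 'active'")
        ->execute([$slug]);
    // ...then the newly published record is inserted as the single active entry.
    $pdo->commit();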
Also for the AND this week I updated the DTD to ensure that superscript text is allowed in commentaries. I also removed the embedded Twitter feed from the homepage as it looks like this facility has been permanently removed by Twitter / X. I’ve also tweaked the logo on narrow screens so it doesn’t display so large, which should make the site better to use on mobile phones, and I fixed an issue with the entry proofreader, which was referencing an older version of jQuery that no longer existed. I also fixed the dictionary’s ‘browse up’ facility, which had broken.
I also found some time to return to working on the new map interface for the Iona place-names project and have now added in the full record details. When you press on a marker to open the popup there is now a ‘View full record’ button. Pressing on this opens an overlay the same as the ‘Advanced search’ that contains all of the information about the record, in the same way as the record page on the other place-name resources. This is divided into a tab for general information and another for historical forms as you can see from the following screenshot:
Finally this week I kept project teams updated on another server move that took place overnight on Thursday. This resulted in downtime for all affected websites, but all was working again the next morning. I needed to go through all of the websites to ensure they were working as intended after the move, and thankfully all was well.
Monday was a holiday this week, so ordinarily this would have been a four-day week for me. However, I was unfortunately picked for jury duty and I was obliged to attend court on Wednesday and Friday, making it a two-day week. I’m also going to have to attend court on Tuesday next week as well (after next Monday’s coronation holiday) but hopefully that will be an end to the disruption.
On Tuesday this week I spent a bit of time working on the migration of sites to external hosting and spent the remainder of the day adding the new MRI 2 recordings to the IPA chart on the Speech Star website. I uploaded all of the videos and added in a new ‘MRI 2’ video type. I then uploaded and integrated all of the metadata. It took quite a long time to get all of this working (pretty much all day), adding the data to all four of the IPA charts, but I got it all done. I will need to update the charts on the Seeing Speech website too once everyone is happy with how the charts look.
On Thursday I made some further tweaks to the Edinburgh’s Enlightenment map and migrated three further sites to external hosting. I also spent some time updating the shared spreadsheet we’re using to keep track of the Arts websites, adding in contact details for all of the sites I’m responsible for and making a note of the sites I’ve migrated.
I also made some tweaks to the Speech Star feedback pages I’d created last week, populated a few pages of the Speech Star website with content from Seeing Speech, added content to the ‘contact us’ page, fixed some broken links that people had spotted in the site, swapped around a couple of video files in the charts that needed fixing and added explanatory text to the extIPA chart page. I also added in some new symbols to the IPA charts for sounds that were not present on the original versions but that we now have videos for in the MRI 2 data.
I also investigated a strange issue that Jane Roberts had encountered when adding works to the Old English Thesaurus using the CMS. Certain combinations of characters in the ‘notes’ field were getting blocked by Apache, and once I’d figured this out we were able to address the issue.
I also spent a bit of time on the Books and Borrowing project, running a query and generating data about all of the book holding records that currently have no associated book edition record in the system (there are about 10,000 such records). We had also received the images for the final two registers in the Advocates Library from the NLS digitisation unit and I spent some time downloading these, processing the images to remove blank pages and update the filenames, uploading the images to our server and then running a script to generate register and page records for each page in both registers. These should be the last registers that need to get added to the system so it’s something of a milestone.
This was a four-day week as the latest round of UCU strike action began on Wednesday. Strike action is going to continue for the next two months, which is going to have a major impact on what I can achieve each week.
I spent almost all of this week working on the Books and Borrowing project. The first two days were mainly spent dealing with data-related issues. This included writing a script to merge duplicate editions based on a spreadsheet of editions that I’d previously sent Matt, to which he had added a column to denote which duplicate should be merged with which. It took quite some time to write the script due to having to deal with associated book works and authors. Some of the duplicates that were to be deleted had book work associations whilst the edition to keep didn’t. These cases had to be checked for and the book work association transferred over.
Authors were a little more complicated as both the duplicate to be deleted and the one to keep may have multiple associated authors. If the edition to keep had no authors but the one to be deleted did, then each of these had to be associated with the edition to keep. But if both the edition to delete and the one to keep had authors, only those authors from the ‘to delete’ edition that were not already represented in the ‘to keep’ edition’s author list had to be associated. In such cases where an author did need to be associated with the ‘to keep’ edition I also added in a further check to ensure the author being associated didn’t have the same name (but a different ID) as one already associated, as there are duplicate authors in the system.
With all of this done the script then had to reassign the holding records from the ‘to delete’ edition to the ‘to keep’ one and then finally delete the relevant edition. As the script makes significant changes to the data I first ran it on a version of the data I had running on my laptop to check that the script worked as intended, which thankfully it did. After completing the test I then (after taking another backup of the database in case of problems) ran the script on the live data. The process resulted in 541 duplicate editions being deleted from the system and as far as I can tell all is well. We now have 13,086 editions in the system and 13,014 of these do not have an associated book work. We only have 75 book works in the system.
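The closing steps of the script look roughly like this (a sketch with hypothetical table names; as described above, the real thing was first tested against a local copy of the data):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=bnb', 'user', 'pass'); // placeholder credentials
    $keepId = 111; $deleteId = 222; // hypothetical IDs from the merge spreadsheet
    $pdo->beginTransaction();
    // Reassign the holding records from the duplicate to the edition being kept...
    $pdo->prepare("UPDATE holdings SET edition_id = ? WHERE edition_id = ?")
        ->execute([$keepId, $deleteId]);
    // ...then delete the duplicate edition itself.
    $pdo->prepare("DELETE FROM editions WHERE id = ?")->execute([$deleteId]);
    $pdo->commit();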
The next step is to assign book works to editions and add in book genres. In order to do this I created a further spreadsheet containing the editions, with columns for book work, authors and three columns that can be used to record up to three genres. I also sent Matt and Katie a further spreadsheet containing the details of the 75 existing book works in our system. It’s going to be rather complicated to fill in the spreadsheet as there’s a lot going on, and it took me quite a while to figure out a workflow for doing so. Hopefully with that in place the process should be straightforward, if time-consuming.
I also ran some queries, did some checks and generated some spreadsheets for the Wigtown data for Gerry McKeever. With these data-related issues out of the way I then returned to developing the front-end. Whilst working on an issue relating to ordering the results by date I noticed that we have quite a lot of borrowing records in the system that have no dates. There are almost 12,000 that don’t have a ‘borrowed year’. There may be a good reason for this, but 2,376 of these have a borrowed day and a borrowed month but no year, which seems stranger. I emailed Katie and Matt about this and they’re going to investigate.
I managed to finish work on the ‘Year borrowed’ bar chart this week. Without providing a year filter the bar chart shows the distribution of borrowing records divided into decades, for example this search for ‘rome’, ordered by date borrowed:
You can then click on one of the decade bars to limit the results to just those in the chosen decade, for example clicking on the ‘1780’ bar:
This then displays a bar chart showing a breakdown of borrowing records per year within the selected decade. You are given the option of clearing the year filter to return to the full view and you can also click on an individual year bar to limit the results to just that year, for example limiting to the year 1788:
When you reach this level no bar chart is displayed as year is the unit that’s filtered and there is only one year selected. But options are given to return to the decade view or clear the year filter. You can of course combine the year filter with any of the other filter options. I guess at year level we could display a similar bar chart for borrowings per month, but this might be too fine-grained and confusing (plus would be a lot more work as everything is currently set up to work with year only). It’s something to consider, though.
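Under the hood the decade and year views are just grouping queries; something like the following (hypothetical schema):

    <?php
    // Decade view: bucket borrowing counts into decades.
    $decades = "SELECT FLOOR(borrowed_year / 10) * 10 AS decade, COUNT(*) AS total
                FROM borrowings GROUP BY decade ORDER BY decade";
    // Clicking the '1780' bar re-runs the search with a decade filter applied:
    $years = "SELECT borrowed_year, COUNT(*) AS total FROM borrowings
              WHERE borrowed_year BETWEEN 1780 AND 1789
              GROUP BY borrowed_year ORDER BY borrowed_year";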
I did spot a problem with the bar chart: I realised that when you searched for an individual year or a range within an individual year the results were still showing the options to view the decade and clear the year filter, both of which then gave errors. This has now been sorted – no year filter options should be shown when the main search is only for a year.
For the remainder of the week I began working on the advanced search. As specified in the requirements document, the advanced search page currently features two tabs, one for a ‘simple’ advanced search and one for an ‘advanced’ advanced search. So far I’ve just been working on the forms, which in turn has necessitated making some changes to the API (to bring back a simple list of all libraries and to enable an entire list of registers to be returned). The forms allow you to select / deselect libraries and select / deselect all. In the ‘Simple’ tab there are also textboxes for entering date of borrowing, author forename and surname, year of birth / death and book title, plus a placeholder for genre. The requirements document stated that date of borrowing would have boxes for entering years and days and a drop-down list for selecting month, with two sets to be used for range dates. I’ve decided that since the quick search already allows dates to be entered directly as text, it would make sense to just follow the same method for the advanced search.
Author dates as currently specified are going to be a bit messy for BC dates, where people need to enter a negative value. This is messy because a dash is used for date ranges so we may end up with something like ‘-1000–200’ (that’s two dashes in the middle). I’m not sure what we can do about this, though. I guess having different boxes for ‘from’ and ‘to’ for ranged dates would avoid the issue. For the ‘advanced’ advanced search lists of selectable registers will appear depending on the libraries that are selected. This is what I’m still in the middle of working on.
If I have the time I would like to create a new theme for the website that will look pretty similar but will use the Bootstrap front-end toolkit (https://getbootstrap.com/). The current WordPress theme doesn’t use this, which means creating complex layouts is more difficult and messy. I created a Bootstrap-based WordPress theme for the Anglo-Norman Dictionary (e.g. this search form: https://anglo-norman.net/textbase-search/) but I’ll just have to see how much time I have, as I think it’s better to get the essentials in place first. In the meantime this means things like the search form layout will possibly not be finalised (but will be functional).
It turns out that the code I’d written to generate the data for the quotations was only set to pick up the direct contents of <q> and to ignore the contents of any child elements such as <i>. This is not the case with the full text and ‘exclude quotations’ data. I identified the issue and updated the code, running a test entry through it to check that the italicised text in quotes is now getting indexed properly. It may well be that there was a reason why the code was set up in this way, though, as Ann mentioned that there are other tags within quotes whose content should be ignored. I’ll need further input from the team before I do anything further about this.
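The difference comes down to reading only the direct text children of <q> versus its full text content; in PHP DOM terms (a simplified illustration rather than the actual indexing script):

    <?php
    $doc = new DOMDocument();
    $doc->loadXML('<q>par grant <i>ire</i> de curage</q>'); // made-up quotation
    $q = $doc->documentElement;

    // Direct text nodes only - skips the contents of <i> (the old behaviour):
    $direct = '';
    foreach ($q->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            $direct .= $node->nodeValue;
        }
    }
    echo $direct;         // 'par grant  de curage'
    echo $q->textContent; // 'par grant ire de curage' - includes child elements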
I participated in an event about Digital Humanities in the College of Arts that Luca had organised on Monday, at which I discussed the Books and Borrowing project. It was a good event and I hope there will be more like it in future. Luca also discussed a couple of his projects and mentioned that the new Curious Travellers project is using Transkribus (https://readcoop.eu/transkribus/) which is an OCR / text recognition tool for both printed text and handwriting that I’ve been interested in for a while but haven’t yet needed to use for a project. I will be very interested to hear how Curious Travellers gets on with the tool in future. Luca also mentioned a tool called Voyant (https://voyant-tools.org/) that I’d never heard of before that allows you to upload a text and then access many analysis and visualisation tools. It looks like it has a lot of potential and I’ll need to investigate it more thoroughly in future.
Also this week I had to prepare for and participate in a candidate shortlisting session for a new systems developer post in the College of Arts, and Luca and I had a further meeting with Liz Broe of College of Arts admin about security issues relating to the servers and websites we host. We need to improve the chain of communication from Central IT Services to people like me and Luca so that security issues that are identified can be addressed speedily. As yet we’ve still not heard anything further from IT Services, so I have no idea what these security issues are, whether they actually relate to any websites I’m in charge of and whether they relate to the code or the underlying server infrastructure. Hopefully we’ll hear more soon.
The above took a fair bit of time out of my week and I spent most of the remainder of the week working on the Books and Borrowing project. One of the project RAs had spotted an issue with a library register page appearing out of sequence so I spent a little time rectifying that. Other than that I continued to develop the front-end, working on the quick search that I had begun last week and by the end of the week I was still very much in the middle of working through the quick search and the presentation of the search results.
I have an initial version of the search working now and I created an index page on the test site I’m working on that features a quick search box. This is just a temporary page for test purposes – eventually the quick search box will appear in the header of every page. The quick search does now work for both dates using the pattern matching I discussed last week and for all other fields that the quick search needs to cover. For example, you can now view all of the borrowing records with a borrowed date between February 1790 and September 1792 (1790/02-1792/09) which returns 3426 borrowing records. Results are paginated with 100 records per page and options to navigate between pages appear at the top and bottom of each results page.
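The date handling relies on pattern matching against the typed text; a simplified sketch of the sort of parsing involved (not the production code):

    <?php
    // Parse quick-search dates such as '1790', '1790/02' or '1790/02-1792/09'.
    function parseDateQuery(string $q): ?array {
        if (!preg_match('/^(\d{4})(?:\/(\d{2}))?(?:-(\d{4})(?:\/(\d{2}))?)?$/', $q, $m)) {
            return null; // not a date - fall through to the text search
        }
        return [
            'from_year'  => $m[1],
            'from_month' => ($m[2] ?? '') !== '' ? $m[2] : null,
            'to_year'    => ($m[3] ?? '') !== '' ? $m[3] : null,
            'to_month'   => ($m[4] ?? '') !== '' ? $m[4] : null,
        ];
    }
    print_r(parseDateQuery('1790/02-1792/09'));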
The search results currently display the complete borrowing record for each result, which is the same layout as you find for borrowing records on a page. The only difference is additional information about the library, register and page the borrowing record appears on can be found at the top of the record. These appear as links and if you press on the page link this will open the page centred on the selected borrowing record. For date searches the borrowing date for each record is highlighted in yellow, as you can see in the screenshot below:
The non-date search also works, but is currently a bit too slow. For example a search for all borrowing records that mention ‘Xenophon’ takes a few seconds to load, which is too long. Non-date quick searches currently do a very simple find and replace to highlight the matched text in all relevant fields. At present this makes the matched text upper case, but I don’t intend to leave it like this. You can also search for things like the ESTC.
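The plan for the highlighting is a case-insensitive regex replacement that wraps the match rather than replacing it, so the original casing survives; a sketch (the class name is made up):

    <?php
    // Wrap each case-insensitive occurrence of the term in a span, keeping
    // the matched text exactly as it appears in the record.
    function highlight(string $text, string $term): string {
        return preg_replace(
            '/' . preg_quote($term, '/') . '/i',
            '<span class="search-highlight">$0</span>', // $0 = the match as typed
            $text
        );
    }
    echo highlight('Works of Xenophon; XENOPHON cited', 'xenophon');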
However, there are several things I’m not especially happy about:
- The speed issue: the current approach is just too slow
- Ordering the results: currently there are no ordering options because the non-date quick search performs five different queries that return borrowing IDs and these are then just bundled together. To work out the ordering (such as by date borrowed, by borrower name) many more fields in addition to borrowing ID would need to be returned, potentially for thousands of records and this is going to be too slow with the current data structure
- The search results themselves are a bit overwhelming for users, as you can see from the above screenshot. There is so much data it’s a bit hard to figure out what you’re interested in and I will need input from the project team as to what we should do about this. Should we have a more compact view of results? If so what data should be displayed? The difficulty is if we omit a field that is the only field that includes the user’s search term it’s potentially going to be very confusing
- This wasn’t mentioned in the requirements document I wrote for the front-end, but perhaps we should provide more options for filtering the search results. I’m thinking of facetted searching like you get in online stores: You see the search results and then there are checkboxes that allow you to narrow down the results. For example, we could have checkboxes containing all occupations in the results allowing the user to select one or more. Or we have checkboxes for ‘place of publication’ allowing the user to select ‘London’, or everywhere except ‘London’.
- Also not mentioned, but perhaps we should add some visualisations to the search results too. For example, a bar graph showing the distribution of all borrowing records in the search results over time, or another showing the occupations or gender of the borrowers in the search results etc. I feel that we need some sort of summary information as the results themselves are just too detailed to easily get an overall picture of.
I came across the Universal Short Title Catalogue website this week (e.g. https://www.ustc.ac.uk/explore?q=xenophon). It does a lot of the things I’d like to implement (graphs, facetted search results) and it does it all very speedily with a pleasing interface; I think we could learn a lot from this.
Whilst thinking about the speed issues I began experimenting with Apache Solr (https://solr.apache.org/), which is a free search platform that is much faster than a traditional relational database and provides options for facetted searching. We use Solr for the advanced search on the DSL website so I’ve had a bit of experience with it. Next week I’m going to continue to investigate whether we might be better off using it, or whether creating cached tables in our database might be simpler and work just as well for our data. But if we are potentially going to use Solr then we would need to install it on a server at Stirling. Stirling’s IT people might be ok with this (they did allow us to set up an IIIF server for our images, after all) but we’d need to check. I should have a better idea as to whether Solr is what we need by the end of next week, all being well.
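As an illustration of why Solr is attractive here: one request can return both the matching records and the facet counts needed for the checkbox filters mentioned above. A sketch using only standard Solr parameters, with a hypothetical core name and field:

    <?php
    $params = http_build_query([
        'q'           => 'Xenophon',
        'rows'        => 100,
        'facet'       => 'true',
        'facet.field' => 'occupation', // hypothetical field; repeat the parameter for more facets
        'wt'          => 'json',
    ]);
    $response = json_decode(
        file_get_contents("http://localhost:8983/solr/borrowings/select?$params"),
        true
    );
    // Facet counts come back alongside the results, ready for building checkboxes.
    print_r($response['facet_counts']['facet_fields']['occupation'] ?? []);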
Also this week I spent some time working on the Speech Star project. I updated the database to highlight key segments in the ‘target’ field; these had been highlighted in the original spreadsheet version of the data and were marked by surrounding each segment with bar characters. I’d suggested this approach because when exporting data from Excel to a CSV file all Excel formatting such as bold text is lost, but unfortunately I hadn’t realised that there may be more than one highlighted segment in the ‘target’ field. This made figuring out how to split the field and apply a CSS style to the necessary characters a little trickier, but I got there in the end. After adding in the new extraction code I reprocessed the data, and currently the key segment appears in bold red text, as you can see in the following screenshot:
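Since there can be multiple marked segments, the extraction is a regex over the whole field rather than a simple split; a sketch (the CSS class name is made up):

    <?php
    // Turn every |marked| segment into a styled span, however many there are.
    function markSegments(string $target): string {
        return preg_replace('/\|([^|]+)\|/', '<span class="key-segment">$1</span>', $target);
    }
    echo markSegments('s|t|r|ai|ght');
    // s<span class="key-segment">t</span>r<span class="key-segment">ai</span>ght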
I also spent some time adding text to several of the ancillary pages of the site, such as the homepage and the ‘about’ page and restructured the menus, grouping the four database pages together under one menu item.
Also this week I tweaked the help text that appears alongside the advanced search on the DSL website and fixed an error with the data of the Thesaurus of Old English website that Jane Roberts had accidentally introduced.
I spent most of my time this week getting back into the development of the front-end for the Books and Borrowing project. It’s been a long time since I was able to work on this, due to commitments to other projects and also because processing images and generating associated data in the project’s content management system over the summer involved a lot more work than I was expecting. However, I have been able to get back into the development of the front-end this week and managed to make some pretty good progress. The first thing I did was to make some changes to the ‘libraries’ page based on feedback I received ages ago from the project’s Co-I Matt Sangster. The map of libraries used clustering to group libraries that are close together when the map is zoomed out, but Matt didn’t like this. I therefore removed the clusters and turned the library locations back into regular individual markers. However, it is now rather difficult to distinguish the markers for a number of libraries. For example, the markers for Glasgow and the Hunterian libraries (back when the University was still on the High Street) are on top of each other and you have to zoom in a very long way before you can even tell there are two markers there.
I also updated the tabular view of libraries. Previously the library name was a button that when clicked on opened the library’s page. Now the name is text and there are two buttons underneath. The first one opens the library page while the second pans and zooms the map to the selected library, whilst also scrolling the page to the top of the map. This uses Leaflet’s ‘flyTo’ function which works pretty well, although the map tiles don’t quite load in fast enough for the automatic ‘zoom out, pan and zoom in’ to proceed as smoothly as it ought to.
After that I moved onto the library page, which previously just displayed the map and the library name. I updated the tabs for the various sections to display the number of registers, books and borrowers that are associated with the library. The Introduction page also now features the information recorded about the library that has been entered into the CMS. This includes location information, dates, links to the library etc. Beneath the summary info there is the map, and beneath this is a bar chart showing the number of borrowings per year at the library. Beneath the bar chart you can find the longer textual fields about the library such as descriptions and sources. Here’s a screenshot of the page for St Andrews:
I also worked on the ‘Registers’ tab, which now displays a tabular list of the selected library’s registers, and I also ensured that when you select one of the tabs other than ‘Introduction’ the page automatically scrolls down to the top of the tabs to avoid the need to manually scroll past the header image (but we still may make this narrower eventually). The tabular list of registers can be ordered by any of the columns and includes data on the number of pages, borrowers, books and borrowing records featured in each.
When you open a register, the information about it is displayed (e.g. descriptions, dates, stats about the number of books etc. referenced in the register), along with large thumbnails of each page together with page numbers and the number of records on each page. The thumbnails are rather large and I could make them smaller, but doing so would mean that all the pages end up looking the same – beige rectangles. The thumbnails are generated on the fly by the IIIF server and the first time a register is loaded it can take a while for the thumbnails to load in. However, generated thumbnails are then cached on the server so subsequent page loads are a lot quicker. Here’s a screenshot of a register page for St Andrews:
One thing I also did was write a script to add in a new ‘pageorder’ field to the ‘page’ database table. I then wrote a script that generated the page order for every page in every register in the system. This picks out the page that has no preceding page and iterates through pages based on the ‘next page’ ID. Previously pages in lists were ordered by their auto-incrementing ID, but this meant that if new pages needed to be inserted for a register they ended up stuck at the end of the list, even though the ‘next’ and ‘previous’ links worked successfully. This new ‘pageorder’ field ensures lists of pages are displayed in the proper order. I’ve updated the CMS to ensure this new field is used when viewing a register, although I haven’t as of yet updated the CMS to regenerate the ‘pageorder’ for a register if new pages are added out of sequence. For now if this happens I’ll need to manually run my script again to update things.
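The generation script is essentially a linked-list walk; a sketch with hypothetical table and column names:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=bnb', 'user', 'pass'); // placeholder credentials
    // id => next_page_id for every page in one register (register 1 as an example).
    $pages = $pdo->query("SELECT id, next_page_id FROM page WHERE register_id = 1")
                 ->fetchAll(PDO::FETCH_KEY_PAIR);
    // The first page is the one that no other page points to.
    $firstCandidates = array_diff(array_keys($pages), $pages);
    $id = reset($firstCandidates);
    $order = 1;
    $update = $pdo->prepare("UPDATE page SET pageorder = ? WHERE id = ?");
    while ($id) {
        $update->execute([$order++, $id]);
        $id = $pages[$id] ?? null; // follow the 'next page' pointer
    }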
Anyway, back to the front-end: the new ‘pageorder’ is used in the list of pages mentioned above so the thumbnails get displayed in the correct order. I may add pagination to this page, as all of the thumbnails are currently on one page and it can take a while to load, although these days people seem to prefer having long pages rather than having data split over multiple pages.
The final section I worked on was the page for viewing an actual page of the register, and this is still very much in progress. You can open a register page by pressing on its thumbnail and currently you can navigate through the register using the ‘next’ and ‘previous’ buttons or return to the list of pages. I still need to add in a ‘jump to page’ feature here too. As discussed in the requirements document, there will be three views of the page: Text, Image and Text and Image side-by-side. Currently I have implemented the image view only. Pressing on the ‘Image view’ tab opens a zoomable / pannable interface through which the image of the register page can be viewed. You can also make this interface full screen by pressing on the button in the top right. Also, if you’re viewing the image and you use the ‘next’ and ‘previous’ navigation links you will stay on the ‘image’ tab when other pages load. Here’s a screenshot of the ‘image view’ of the page:
Also this week I wrote a three-page requirements document for the redevelopment of the front-ends for the various place-names projects I’ve created using the system originally developed for the Berwickshire place-names project which launched back in 2018. The requirements document proposes some major changes to the front-end, moving to an interface that operates almost entirely within the map and enabling users to search and browse all data from within the map view rather than having to navigate to other pages. I sent the document off to Thomas Clancy, for whom I’m currently developing the systems for two place-names projects (Ayr and Iona) and I’ll just need to wait to hear back from him before I take things further.
I also responded to a query from Marc Alexander about the number of categories in the Thesaurus of Old English, investigated a couple of server issues that were affecting the Glasgow Medical Humanities site, removed all existing place-name elements from the Iona place-names CMS so that the team can start afresh and responded to a query from Eleanor Lawson about the filenames of video files on the Seeing Speech site. I also made some further tweaks to the Speak For Yersel resource ahead of its launch next week. This included adding survey numbers to the survey page, updating the navigation links and writing a script that purges a user and all related data from the system. I ran this to remove all of my test data from the system. If we do need to delete a user in future (either because their data is clearly spam or a malicious attempt to skew the results, or because a user has asked us to remove their data) I can run this script again. I also ran through every single activity on the site to check everything was working correctly. The only thing I noticed is that I hadn’t updated the script to remove the flags for completed surveys when a user logs out, meaning that after logging out and creating a new user the ticks for completed surveys were still displaying. I fixed this.
I also fixed a few issues with the Burns mini-site about Kozeluch, including updating the table sort options which had stopped working correctly when I added a new column to the table last week and fixing some typos with the introductory text. I also had a chat with the editor of the Anglo-Norman Dictionary about future developments and responded to a query from Ann Ferguson about the DSL bibliographies. Next week I will continue with the B&B developments.
My son returned to school on Monday this week, marking an end to the home-schooling that began after the Christmas holidays. It’s quite a relief to no longer have to split my day between working and home-schooling after so long. This week I continued with some Data Management Plan related activities, completing a DMP for the metaphor project involving Duncan of Jordanstone College of Art and Design in Dundee and drafting a third version of the DMP for Kirsteen McCue’s proposal following a Zoom call with her on Wednesday.
I also spent some further time on the Books and Borrowing project, creating tilesets and page records for several new volumes. In fact, we ran out of space on the server. The project is digitising around 20,000 pages of library records from 1750-1830 and we’re approaching 5,000 pages so far. I’d originally suggested that we’d need about 60GB of server space for the images (3MB per image x 20,000). However, the JPEGs we’ve been receiving from the digitisation units have been generated at maximum quality / minimum compression and are around 9MB each, so my estimates were out. Dropping the JPEG quality setting down from 12 to 10 would result in 3MB files, so I could do this to save space if required. However, there is another issue. The tilesets I’m generating for each image so that they can be zoomed and panned like a Google Map are taking up as much as 18MB per image. So we may need a minimum of 540GB of space (possibly 600GB to be safe): 9MB x 20,000 for the JPEGs plus 18MB x 20,000 for the tilesets. This is an awful lot of space, and storing image tilesets wouldn’t actually be necessary if an IIIF server (https://iiif.io/about/) could be set up. IIIF is now well established as the best means of hosting images online and it would be hugely useful to use. Rather than generating and hosting thousands of tilesets at different zoom levels we could store just one image per page on the server and the IIIF server would serve up the necessary subsection at the required zoom level based on the request from the client. The issue is that people in charge of servers don’t like having to support new software. I entered into discussions with Stirling’s IT people about the possibility of setting up an IIIF server, and these talks are currently ongoing, so in the meantime I still need to generate the tilesets.
Also this week I discussed a couple of issues with the Thesaurus of Old English with Jane Roberts. A search was bringing back some word results but when loading the category browser no content was being displayed. Some investigations uncovered that these words were in subcategories of ’02.03.03.03.01’ but there was no main category with that number in the system. A subcategory needs a main category in order to display in the tree browser and as none was available nothing was displaying. Looking at the underlying database I discovered that while there was no ’02.03.03.03.01’ main category there were two ’02.03.03.03.01|01’ subcategories: ‘A native people’ and ‘Natives of a country’. I bumped the former up from subcategory to main category and the search results then worked.
I spent the rest of the week continuing with the development of the Anglo-Norman Dictionary. I made the new bibliography pages live this week (https://anglo-norman.net/bibliography/), which also involved updating the ‘cited source’ popup in the entry page so that it displays all of the new information. For example, go to this page: https://anglo-norman.net/entry/abanduner and click on the ‘A-N Med’ link to see a record with multiple items in it. I also updated the advanced search for citations so that the ‘Citation siglum’ drop-down list uses the new data too.
After that I continued to update the Dictionary Management System. I updated the ‘View / Download Entry’ page so that the ‘Phase’ of the entry can be updated if necessary. In the ‘Phase’ section of the page all of the phases are now listed as radio buttons, with the entry’s phase checked. If you need to change the entry’s phase you can select a different radio button and press the ‘Update Phase’ button. I also added facilities to manage phase statements via the DMS. In the menu there’s now an ‘Add Phase’ button, through which you can add a new phase, and a ‘Browse Phases’ button which lists all of the active phases, the number of entries assigned to each, and an option to edit the phase statement. If there’s a phase statement that has no associated entries you’ll find an option to delete it here too.
I’m still working on the facilities to upload and manage XML entry files via the DMS. I’ve added in a new menu item labelled ‘Upload Entries’ that, when pressed, loads a page through which you can upload entry XML files. There’s a text box where you can supply the lead editor initials to be added to the batch of files you upload (any files that already have a ‘lead’ attribute will not be affected) and an option to select the phase statement that should be applied to the batch of files. Below this area is a section where you can either click to open a file browser and select files to upload or drag and drop files from Windows Explorer (or another file browser). When files are attached they will be processed, with the results shown in the ‘Update log’ section below the upload area. Uploaded files are kept entirely separate from the live dictionary until they’ve been reviewed and approved (I haven’t written these sections yet). The upload process will generate all of the missing attributes I mentioned last week – ‘lead’ initials, the various ID fields, POS, sense numbers etc. If any of these are present the system won’t overwrite them, so it should be able to handle various versions of files. The system does not validate the XML files – the editors will need to ensure that the XML is valid before it is uploaded. However, the ‘preview’ option (see below) will quickly let you know if your file is invalid, as the entry won’t display properly. Note also that you can change the ‘lead’ and the phase statement between batches – you can drag and drop a set of files with one lead and statement selected, then change these and upload another batch. You can of course choose to upload a single file too.
When XML files are uploaded, the ‘update log’ includes links directly through to a preview of each entry, but you can also find all entries that have been uploaded but not yet published on the website in the ‘Holding Area’, which is linked to in the DMS menu. There are currently two test files in this. The holding area lists the information about the XML entries that have been uploaded but not yet published, such as the IDs, the slug, the phase statement etc. There is also an option to delete the holding entry. The last two columns in the table are links to any live entry. The first links to the entry as specified by the numerical ID in the XML filename, which will be present in the filename of all XML files exported via the DMS’s ‘Download Entry’ option. This is the ‘existing ID’ column in the table. The second linking column is based on the ‘slug’ of the holding entry (generated from the ‘lemma’ in the XML). The ‘slug’ is unique in the data, so if a holding entry has a link in this column it means it will overwrite this entry if it’s made live. For XML files exported via the DMS and then uploaded, both ‘live entry’ links should be the same, unless the editor has changed the lemma. For new entries both these columns should be blank.
The ‘Review’ button opens up a preview of the uploaded holding entry in the interface of the live site. This allows the editors to proofread the new entry to ensure that the XML is valid and that everything looks right. You can return to the holding area from this page by pressing on the button in the left-hand column. Note that this is just a preview – it’s not ‘live’ and no-one else can see it.
There’s still a lot I need to do. I’ll be adding in an option to publish an entry in the holding area, at which point all of the data needed for searching will be generated and stored and the existing live entry (if there is one) will be moved to the ‘history’ table. I also maybe need to extract the earliest date information to display in the preview and in the holding area. This information is only extracted when the data for searching is generated, but I guess it would be good to see it in the holding area / preview too. I also need to add in a preview of cross reference entries as these don’t display yet. I should probably also add in an option to allow the editors to view / download the holding entry XML as they might want to check how the upload process has changed this. So still lots to tackle over the coming weeks.
I lost most of Tuesday this week to root canal surgery, which was uncomfortable and exhausting but thankfully not too painful. Unfortunately my teeth are still not right and I now have a further appointment booked for next week, but at least the severe toothache that I had previously has now stopped.
I continued to work on the requirements document for the redevelopment of the Anglo-Norman Dictionary this week, and managed to send a first completed version of it to Heather Pagan for feedback. It will no doubt need some further work, but it’s good to have a clearer picture of how the new version of the website will function. Also this week I investigated another bizarre situation with the AND’s data. I have access to the full dataset as used to power the existing website, as a single XML file containing all of the entries. The Editors are also working on individual entries as single XML files that are then uploaded to the existing website using a content management system. What we didn’t realise until now is that the structure of the XML files is transformed when an entry is ingested into the online system. For example, the ‘language’ tag is changed from <language lang="M.E."/> to <lbl type="lang">M.E.</lbl>. Similarly, part of speech is transformed from <pos type="s."/> to <gramGrp><pos>s.</pos></gramGrp>. We have no idea why the developer of the system chose to do this, as it seems completely unnecessary and it’s a process that doesn’t appear to be documented anywhere. The crazy thing is that the transformed XML still needs to be further transformed to HTML for display, so what appears on screen is two steps removed from the data the editors work with. It also means that I don’t have access to the data in the form the editors are working with, meaning I can’t just take their edits and use them in the new site.
As we ideally want to avoid the situation where we have two structurally different XML datasets for the dictionary I wanted to try and find a way to transform the data I have into the structure used by the editors. I attempted to do this by looking at the code for the existing content management system to try to decipher where the XML is getting transformed. There is an option for extracting an entry from the online system for offline editing and this transforms the XML into the format used by the editors. I figured that if I can understand how this process works and replicate it then I will be able to apply this to the full XML dictionary file and then I will have the complete dataset in the same format as the editors are working with and we can just use this in the redevelopment.
It was not easy to figure out what the system is up to, but I managed to ascertain that when you enter a headword for export this triggers a Perl script, which in turn uses an XSLT stylesheet; I managed to track down a version of this stylesheet that appears to have been last updated in 2014. I then wrote a little script that takes the XML of the entry for ‘padlock’ as found in the online data and applies this stylesheet to it, in the hope that it would give me an XML file identical to the one exported by the CMS.
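Applying the stylesheet is straightforward with PHP’s XSL extension; my test script was essentially the following (file names are placeholders):

    <?php
    // Apply the CMS export stylesheet to the entry's XML and save the result.
    $xml = new DOMDocument();
    $xml->load('padlock.xml'); // the entry as found in the online data
    $xsl = new DOMDocument();
    $xsl->load('export.xsl');  // the 2014 stylesheet from the CMS
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    file_put_contents('padlock-out.xml', $proc->transformToXML($xml));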
The script successfully executed, but the resulting XML was not quite identical to the file exported by the CMS. There was no ‘doctype’ and DTD reference; the ‘attestation’ ID was the entry ID with an auto-incrementing ‘C’ number appended to it (AND-201-02592CE7-42F65840-3D2007C6-27706E3A-C001) rather than the ID of the <cit> element (C-11c4b015); and <dateInfo> was not processed, with only the contents of the tags within <dateInfo> being output.
I’m not sure why these differences exist. It’s possible I only have access to an older version of the XSLT file. I’m guessing this must be the case because the missing or differently formatted data does not appear to be instated elsewhere (e.g. in the Perl script). What I then did was modify the XSLT file to ensure that the changes are applied: the doctype is added in, the ‘attestation’ ID is correct and the <dateInfo> section contains the full data.
I could try applying this script to every entry in the full data file I have, although I suspect there will be other situations that the XSLT file I have is not set up to successfully process.
I therefore investigated an alternative, which was to write a script that passes the headword of every dictionary item to the ‘Retrieve an entry for editing’ script in the CMS, saving the results of each request. I considered that this might be more likely to work reliably for every entry, but that we might run into issues with the server refusing so many requests. After a few test runs, I set the script loose on all 53,000 or so entries in the system and although it took several hours to run, the process did appear to work for the most part. I now have the data in the same structure the editors work with, which should mean we can standardise on this format and abandon the XML structure used by the existing online system.
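The harvesting loop itself was straightforward; the sketch below shows the general shape of it, though the endpoint URL, parameter name and input file are hypothetical stand-ins rather than the CMS’s actual details.

```php
<?php
// Sketch of the harvesting loop; the endpoint, parameter name and
// headword list are hypothetical stand-ins for the real CMS details.
$headwords = json_decode(file_get_contents('headwords.json'), true);

foreach ($headwords as $hw) {
    $url = 'https://and-cms.example.com/retrieve-entry?headword=' . urlencode($hw);
    $xml = file_get_contents($url);
    if ($xml !== false) {
        file_put_contents('exports/' . md5($hw) . '.xml', $xml);
    } else {
        error_log("Failed to retrieve entry for: $hw");
    }
    usleep(250000); // pause briefly so the server isn't flooded with requests
}
```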
Also this week I fixed an issue with links through to the Bosworth and Toller Old English dictionary from the Thesaurus of Old English. Their site has been redeveloped and they’ve changed the way their URLs work without putting in redirects from the old URLs, meaning all our links from words in the TOE to words on their site were broken. URLs for their entries now just use a unique ID rather than the word (e.g. http://bosworthtoller.com/28286), which seems like a bit of a step backwards. They’ve also got rid of length marks and are using acute accents on characters instead, which is a bit strange. The change to an ID in the URL means we can no longer link to a specific entry, as we can’t possibly know what IDs they’re using for each word. However, we can link to their search results page instead, e.g. http://bosworthtoller.com/search?q=sōfte works, so I updated the TOE to use such links.
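Building these links just means URL-encoding the headword into the search URL, along the lines of the sketch below (the helper function is my own illustration, not part of the TOE codebase).

```php
<?php
// Sketch: build a Bosworth and Toller search link for a TOE headword,
// URL-encoding the word (including any non-ASCII characters).
function bt_search_link(string $word): string {
    return 'http://bosworthtoller.com/search?q=' . rawurlencode($word);
}

echo bt_search_link('sōfte'); // http://bosworthtoller.com/search?q=s%C5%8Dfte
```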
I also continued with the processing of OED dates for use in the Historical Thesaurus, after my date extraction script finished executing over the weekend. This week I investigated OED dates that have a dot in them instead of a full date. There are 4,498 such dates and these mostly have the lowest possible date as the one recorded in the ‘year’ attribute by the OED, e.g. ‘138.’ is 1380 and ‘17..’ is 1700. However, sometimes a specific date is given in the ‘year’ attribute despite the presence of a full stop in the date tag. For example, one entry has ‘1421’ in the ‘year’ attribute but ‘14..’ in the date tag. There are just over a thousand dates where there are two dots but the ‘year’ given does not end in ‘00’. Fraser reckons this is to do with ordering the dates in the OED and I’ll need to do some further work on this next week.
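Deriving the lowest possible year from a dotted date is simple enough; something like the following sketch (my own illustration, not the actual extraction script) covers the straightforward cases.

```php
<?php
// Sketch: derive the lowest possible year from a dotted OED date by
// replacing each dot with a zero, e.g. '138.' => 1380 and '17..' => 1700.
function dotted_date_to_year(string $date): int {
    return (int) str_replace('.', '0', $date);
}

echo dotted_date_to_year('138.'); // 1380
echo dotted_date_to_year('17..'); // 1700
```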
In addition to the above I continued to work on the Books and Borrowing project. I made some tweaks to the CMS to make it easier to edit records. When a borrowing record is edited the page now automatically scrolls down to the record that was edited, and this also happens for books and borrowers when accessed and edited from the ‘Books’ and ‘Borrowers’ tabs in a library. I also wrote an initial script that will help to merge some of the duplicate author records we have in the system due to existing data in different formats being uploaded from different libraries. What it does is strip all of the non-alpha characters from the forename and surname fields, make them lower case and then join them together. So for example, author ID (AID) 111 has ‘Arthur’ as forename and ‘Bedford’ as surname while AID 1896 has nothing for forename and ‘Bedford, Arthur, 1668-1745’ as surname. When stripped and joined together these both become ‘bedfordarthur’ and we have a match.
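The matching key boils down to a couple of lines; a sketch of the idea (with a hypothetical function name) is below.

```php
<?php
// Sketch of the matching key: strip non-alphabetical characters from the
// surname and forename, join them and lower-case the result, so that
// 'Bedford' + 'Arthur' and 'Bedford, Arthur, 1668-1745' both match.
// The function name is a hypothetical stand-in.
function author_key(string $forename, string $surname): string {
    return strtolower(preg_replace('/[^a-zA-Z]/', '', $surname . $forename));
}

echo author_key('Arthur', 'Bedford');              // bedfordarthur
echo author_key('', 'Bedford, Arthur, 1668-1745'); // bedfordarthur
```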
There are 162 matches that have been identified, some consisting of more than two matched author records, and I exported these as a spreadsheet. Each row includes the author’s AID, title, forename, surname, othername, born and died (each including a ‘c’ where given), a count of the number of books the record is associated with and the AID of the record that is set to be retained for the match. This defaults to the first record, which also appears in bold to make it easier to see where a new batch of duplicates begins.
The editors can then go through this spreadsheet and reassign the ‘AID to keep’ field to a different row. E.g. for Francis Bacon the AID to keep is given as 1460. If the second record for Francis Bacon should be kept instead, the editor would just need to change the value in this column for all three Francis Bacons to the AID for that row, which is 163. Similarly, if something has been marked as a duplicate and it’s wrong, the ‘AID to keep’ can be set accordingly. E.g. there are four ‘David Hume’ records, but looking at the dates at least one of these is a different person. To keep the record with AID 1610 separate, the AID 1623 in its ‘AID to keep’ column would be replaced with 1610. It is likely that this spreadsheet will also be used to manually split up the imported authors that just have all their data in the surname column. Someone could, for example, take the record that has ‘Hume, David, 1560?-1630?’ in the surname column and split this into the correct columns.
I also generated a spreadsheet containing all of the authors that appear to be unique. This will also need checking, as there are a few duplicates that haven’t been picked up. For example, AID 1956 ‘Heywood, Thomas, d. 1641’ and 1570 ‘Heywood, Thomas, -1641.’ haven’t been matched because of that ‘d’. Similarly, AID 1598 ‘Buffon, George Louis Leclerc, comte de, 1707-1788’ and 2274 ‘Buffon, Georges Louis Leclerc, comte de, 1707-1788.’ haven’t been matched up because one is ‘George’ and the other ‘Georges’. Accented characters have also not been properly matched, e.g. AID 1457 ‘Beze, Theodore de, 1519-1605’ and 397 ‘Bèze, Théodore de, 1519-1605.’. I could add in a Levenshtein test that matches up keys that are one character apart and update the script to properly take accented characters into account for matching purposes (see the sketch below), or these are things that could just be sorted manually.
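Both checks would be short additions; the sketch below shows one way they might look, though accent transliteration via iconv varies between systems, so treat this as an assumption rather than a tested fix.

```php
<?php
// Sketch of the possible extra checks. Accent folding via iconv is
// locale- and implementation-dependent, so this is an assumption rather
// than a tested fix; the intl extension's Transliterator would be a
// more robust option where it is available.
function fold_accents(string $s): string {
    return iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s);
}

// Treat two matching keys as candidate duplicates if they are at most
// one character apart, catching cases like the stray 'd' in
// 'Heywood, Thomas, d. 1641'.
function near_match(string $key1, string $key2): bool {
    return levenshtein($key1, $key2) <= 1;
}

var_dump(near_match('heywoodthomas', 'heywoodthomasd')); // bool(true)
```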
Ann Fergusson of the DSL got back to me this week after rigorously testing the search facilities of our new test versions of the DSL API (V2 containing data from the original API and V3 containing data that has been edited since the original API was made). Ann had spotted some unexpected behaviour in some of the searches and I spent some time investigating these and fixing things where possible. There were some cases where incorrect results were being returned when a ‘NOT’ search was performed on a selected source dictionary, due to the positioning of the source dictionary in the query; this was thankfully easy to fix. There was also an issue with some exact searches of the full text failing to find entries. When the full text is ingested into Solr all of the XML tags are stripped out, and if there are no spaces between tagged words then the words end up squashed together. For example: ‘Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>’. With the tags (and punctuation) stripped out we’re left with ‘WestminsterB’, so an exact search for ‘westminster’ fails to find this entry. A search for ‘westminsterb’ finds the entry, which confirms this. I suspect this situation is going to crop up quite a lot, so I will need to update the script that prepares content for Solr to add spaces around tags before stripping them and then collapse any multiple spaces between words.
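A sketch of the planned fix follows; it inserts the space before each tag rather than after, which has the same effect, and the function name is a hypothetical stand-in.

```php
<?php
// Sketch of the planned fix: keep tagged words apart by inserting a
// space before each tag, strip the tags, then collapse runs of
// whitespace. The function name is a hypothetical stand-in.
function prepare_for_solr(string $xml): string {
    $spaced   = str_replace('<', ' <', $xml); // stop adjacent words merging
    $stripped = strip_tags($spaced);
    return trim(preg_replace('/\s+/', ' ', $stripped));
}

$snippet = 'Westminster</q></cit></sense><sense><b>B</b>. <i>Attrib</i>';
echo prepare_for_solr($snippet); // 'Westminster B . Attrib'
```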
This was week 18 of Lockdown, which is now definitely easing here. I’m still working from home, though, and will be for the foreseeable future. I took Friday off this week, so it was a four-day week for me. I spent about half of this time on the Books and Borrowing project, during which time I returned to adding features to the content management system, after spending recent weeks importing datasets. I added a number of indexes to the underlying database which should speed up the loading of certain pages considerably, e.g. the browse books, borrowers and author pages (see the sketch after this paragraph). I then updated the ‘Books’ tab when viewing a library (i.e. the page that lists all of the book holdings in the library) so that it now lists the number of book holdings in the library above the table. The table itself now has separate columns for all additional fields that have been created for book holdings in the library and it is now possible to order the table by any of the headings (pressing on a heading a second time reverses the ordering). The count of ‘Borrowing records’ for each book in the table is now a button and pressing on it brings up a popup listing all of the borrowing records that are associated with the book holding record, and from this pop-up you can then follow a link to view the borrowing record you’re interested in. I then made similar changes to the ‘Borrowers’ tab when viewing a library (i.e. the page that lists all of the borrowers the library has). It also now displays the total number of borrowers at the top. This table already allowed reordering by any column, so that’s not new, but as above, the ‘Borrowing records’ count is now a link that when clicked on opens a list of all of the borrowing records the borrower is associated with.
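For the indexes, something along the lines of the sketch below is all that’s involved; the table and column names here are hypothetical stand-ins for the project’s actual schema.

```php
<?php
// Sketch only: add indexes on the columns the browse pages sort and
// filter by. Table and column names are hypothetical stand-ins for
// the project's actual schema.
$db = new PDO('mysql:host=localhost;dbname=borrowing', 'user', 'pass');

$db->exec('ALTER TABLE books     ADD INDEX idx_book_title (title)');
$db->exec('ALTER TABLE borrowers ADD INDEX idx_borrower_surname (surname)');
$db->exec('ALTER TABLE authors   ADD INDEX idx_author_surname (surname)');
```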
The big new feature I implemented this week was borrower cross-references. These can be added when adding or editing a borrower via the ‘Borrowers’ tab within a library. When adding or editing a borrower there is now a section of the form labelled ‘Cross-references to other borrowers’. If there are any existing cross-references these will appear here, with a checkbox beside each that you can tick if you want to delete the cross-reference (tick the box then press ‘Edit’ to save the borrower and the reference will be deleted). Any number of new cross-references can be added by pressing on the ‘Add a cross-reference’ button (multiple times, if required). Doing so adds two fields to the form: one for a ‘description’, which is the text that shows how the current borrower links to the referenced borrower, and one for ‘referenced borrower’, which is an auto-complete. Type in a name or part of a name and any borrower that matches in any library will be listed, with the library appearing in brackets after the borrower’s name to help differentiate records. Select a borrower and then when the ‘Add’ or ‘Edit’ button is pressed for the borrower the cross-reference will be made.
Cross-references work in both directions – if you add a cross reference from Borrower A to Borrower B you don’t then need to load up the record for Borrower B to add a reference back to Borrower A. The description text will sit between the borrower whose form you make the cross reference on and the referenced borrower you select, so if you’re on the edit form for Borrower A and link to Borrower B and the description is ‘is the son of’ then the cross reference will appear as ‘Borrower A is the son of Borrower B’. If you then view Borrower B the cross reference will still be written in this order. I also updated the table of borrowers to add in a new ‘X-Refs’ column that lists all cross-references for a borrower.
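Storing each reference once and displaying it to both parties just needs a query that looks in both columns; a sketch (with hypothetical table and column names) is below.

```php
<?php
// Sketch: fetch all cross-references involving a borrower, whichever
// side of the reference they sit on, so a single stored row serves
// both records. Table and column names are hypothetical stand-ins.
function get_cross_refs(PDO $db, int $borrowerId): array {
    $sql = 'SELECT borrower_a, description, borrower_b
              FROM borrower_xrefs
             WHERE borrower_a = :id1 OR borrower_b = :id2';
    $stmt = $db->prepare($sql);
    $stmt->execute(['id1' => $borrowerId, 'id2' => $borrowerId]);

    // Display always reads left-to-right in the order the reference was
    // created, e.g. 'Borrower A is the son of Borrower B', regardless of
    // which borrower's record is currently being viewed.
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```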
I spent the remainder of my working week completing smaller tasks for a variety of projects, such as updating the spreadsheet output of duplicate child entries for the DSL people, getting an output of the latest version of the Thesaurus of Old English data for Fraser, advising Eleanor Lawson on ‘.ac.uk’ domain names and having a chat with Simon Taylor about the pilot Place-names of Fife project that I worked on with him several years ago. I also wrote a Data Management Plan for a new AHRC proposal the Anglo-Norman Dictionary people are putting together, which involved a lengthy email correspondence with Heather Pagan at Aberystwyth.
Finally, I returned to the ongoing task of merging data from the Oxford English Dictionary with the Historical Thesaurus. We are currently attempting to extract citation dates from OED entries in order to update the dates of usage that we have in the HT. This process uses the new table I recently generated from the OED XML dataset, which contains every citation date for every word in the OED (more than 3 million dates). Fraser had prepared a document listing how he and Marc would like the HT dates to be updated (e.g. if the first OED citation date is earlier than the HT start date by 140 years or more then use the OED citation date as the suggested change). Each rule was to be given its own type, so that we could check through each type individually to make sure the rules were working ok.
It took about a day to write an initial version of the script, which I ran on the first 10,000 HT lexemes as a test. I didn’t split the output into different tables depending on the type, but instead exported everything to a spreadsheet so Marc and Fraser could look through it.
In the spreadsheet, if there is no ‘type’ for a row it means the row didn’t match any of the criteria, but I included these rows anyway so we can check whether there are any other criteria they should match. I also included all of the OED citation dates (rather than just the first and last) for reference. I noted that Fraser’s document doesn’t seem to take labels into consideration: there are some labels in the data, and sometimes there’s a new label for an OED start or end date when nothing else is different, e.g. htid 1479 ‘Shore-going’, a row that has no ‘type’ but does have new data from the OED.
Another issue I spotted is that the same ‘type’ variable is set when a start date matches the criteria and again when an end date matches, so the ‘type’ set for the start date is replaced by the ‘type’ for the end date. I think, therefore, that we might have to split the start and end processes up, or append the end process type to the start process type rather than replacing it (e.g. type 2-13 rather than type 2 being replaced by type 13), as in the sketch below. I also noticed that there are some lexemes where the HT has ‘current’ but the OED has a much earlier last citation date (e.g. htid 73 ‘temporal’ has 9999 in the HT but 1832 in the OED). Such cases are not currently considered.
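The appended-type fix would look something like the following; the two rule-check results are hard-coded stand-ins for the real checks.

```php
<?php
// Sketch of the proposed fix: append the end-date type to the start-date
// type rather than overwriting it. The two results below are hard-coded
// stand-ins for the real rule checks.
$startType = '2';  // type matched by the start-date rules (hypothetical)
$endType   = '13'; // type matched by the end-date rules (hypothetical)

// Keep whichever types were set and join them with a hyphen.
$type = implode('-', array_filter([$startType, $endType]));

echo $type; // '2-13' rather than '13' silently replacing '2'
```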
Finally, according to the document, antes and circas are only considered for update if the OED and HT dates are the same, but there are many cases where the start / end OED date is picked to replace the HT date (because it’s different) and it has an ‘a’ or ‘c’ that would then be lost. Currently I’m including the ‘a’ or ‘c’ in such cases, but I can remove this if need be (e.g. HT 37 ‘orb’ has HT start date 1601 (no ‘a’ or ‘c’) but this is to be replaced with OED 1550, which has an ‘a’). Clearly the script will need to be tweaked based on feedback from Marc and Fraser, but I feel like we’re finally making some decent progress with this after all of the preparatory work that was required to get to this point.
Next Monday is the Glasgow Fair holiday, so I won’t be back to work until the Tuesday.
This was week 8 of Lockdown and I spent the majority of it working on the content management system for the Books and Borrowing project. The project is due to begin at the start of June and I’m hoping to have the CMS completed and ready to use by the project team by then, although there is an awful lot to try and get into place. I can’t really go into too much detail about the CMS, but I have completed the pages to add a library and to browse a list of libraries with the option of deleting a library if it doesn’t have any ledgers. I’ve also done quite a lot with the ‘View library’ page. It’s possible to edit a library record, add a ledger and add / edit / delete additional fields for a library. You can also list all of the ledgers in a library with options to edit the ledger, delete it (if it contains no pages) and add a new page to it. You can also display a list of pages in a ledger, with options to edit the page or delete it (if it contains no records). You can also open a page in the ledger and browse through the next and previous pages.
At the moment I’m in the middle of creating the facility to add a new borrowing record to a page. This is the most complex part of the system, as a record may have multiple borrowers, each of whom may have multiple occupations, and multiple books, each of which may be associated with higher-level book records. Plus the additional fields for the library need to be taken into consideration too. By the end of the week I was at the point of adding in an auto-complete to select an existing borrower record, and I’ll continue with this on Monday; a sketch of the sort of server-side endpoint such an auto-complete needs is below.
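This is a sketch under assumed details: the table and column names are hypothetical, and the front end would call it with a ‘term’ parameter and render the returned JSON as suggestions.

```php
<?php
// Sketch of the server side of the borrower auto-complete; the table
// and column names are hypothetical stand-ins. The front end sends the
// typed text as a 'term' parameter and renders the JSON response.
$db   = new PDO('mysql:host=localhost;dbname=borrowing', 'user', 'pass');
$term = '%' . ($_GET['term'] ?? '') . '%';

$stmt = $db->prepare('SELECT id, surname, forename
                        FROM borrowers
                       WHERE surname LIKE :t1 OR forename LIKE :t2
                       LIMIT 20');
$stmt->execute(['t1' => $term, 't2' => $term]);

header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));
```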
In addition to the B&B project I did some work for other projects as well. For Thomas Clancy’s Place-names of Kirkcudbrightshire project (now renamed Place-names of the Galloway Glens) I had a few tweaks and updates to put in place before Thomas launched the site on Tuesday. I added a ‘Search place-names’ box to the right-hand column of every non-place-names page, which takes you to the quick search results page, and I added a ‘Place-names’ menu item to the site menu so users can access the place-names part of the site. Every place-names page now features a sub-menu with access to the place-names pages (browse, element glossary, advanced search, API, quick search); to return to the place-names introductory page you can press on the ‘Place-names’ link in the main menu bar. I had unfortunately introduced a bug into the ‘edit place-name’ page in the CMS when I changed the ordering of parishes to make KCB parishes appear first, which was preventing any place-names in BMC from having their cross-references, feature type and parishes saved when the form was submitted. This has now been fixed. I also added Google Analytics to the site. The virtual launch on Tuesday went well and the site can now be accessed here: https://kcb-placenames.glasgow.ac.uk/.
I also added links to the DSL’s email and Instagram accounts to the footer of the DSL site and added some new fields to the database and CMS of the Place-names of Mull and Ulva site. I also created a new version of the Burns Supper map for Paul Malgrati that includes more data and a new field for video dimensions, which the video overlay now uses. I also replied to Matthew Creasy about a query regarding the website for his new Scottish Cosmopolitanism project and a query from Jane Roberts about the Thesaurus of Old English, and made a small tweak to the data of Gerry McKeever’s interactive map for Regional Romanticism.