I continued to work on the Books and Borrowing project for a lot of this week, completing some of the tasks I began last week and working on some others. We ran out of server space for digitised page images last week, and although I freed up some space by deleting a bunch of images that were no longer required we still have a lot of images to come. The team estimates that a further 11,575 images will be required. If the images we receive for these pages are comparable to the ones from the NLS, which average around 1.5Mb each, then 30Gb should give us plenty of space. However, after checking through the images we’ve received from other digitisation units it turns out that the NLS images are a vit of an outlier in term of file size and generally 8-10Mb is more usual. If we use this as an estimate then we would maybe require 120Gb-130Gb of additional space. I did some experiments with resizing and changing the image quality of one of the larger images, managing to bring an 8.4Mb image down to 2.4Mb while still retaining its legibility. If we apply this approach to the tens of thousands of larger images we have then this would result in a considerable saving of storage. However, Stirling’s IT people very kindly offered to give us a further 150Gb of space for the images so this resampling process shouldn’t be needed for now at least.
Another task for the project this week was to write a script to renumber the folio numbers for the 14 volumes from the Advocates Library that I noticed had irregular numbering. Each of the 14 volumes had different issues with their handwritten numbering, so I had to tailor my script to each volume in turn, and once the process was complete the folio numbers used to identify page images in the CMS (and eventually in the front-end) entirely matched the handwritten numbers for each volume.
My next task for the project was to import the records for several volumes from the Royal High School of Edinburgh but I ran into a bit of an issue. I had previously been intending to extract the ‘item’ column and create a book holding record and a single book item record for each distinct entry in the column. This would then be associated with all borrowing records in RHS that also feature this exact ‘item’. However, this is going to result in a lot of duplicate holding records due to the contents of the ‘item’ column including information about different volumes of a book and/or sometimes using different spellings.
For example, in SL137142 the book ‘Banier’s Mythology’ appears four times as follows (assuming ‘Banier’ and ‘Bannier’ are the same):
- Banier’s Mythology v. 1, 2
- Banier’s Mythology v. 1, 2
- Bannier’s Myth 4 vols
- Bannier’s Myth. Vol 3 & 4
My script would create one holding and item record for ‘Banier’s Mythology v. 1, 2’ and associate it with the first two borrowing records but the 3rd and 4th items above would end up generating two additional holding / item records which would then be associated with the 3rd and 4th borrowing records.
No script I can write (at least not without a huge amount of work) would be able to figure out that all four of these books are actually the same, or that there are actually 4 volumes for the one book, each requiring its own book item record, and that volumes 1 & 2 need to be associated with borrowing records 1&2 while all 4 volumes need to be associated with borrowing record 3 and volumes 3&4 need to be associated with borrowing record 4. I did wonder whether I might be able to automatically extract volume data from the ‘item’ column but there is just too much variation.
We’re going to have to tackle the normalisation of book holding names and the generation of all required book items for volumes at some point and this either needs to be done prior to ingest via the spreadsheets or after ingest via the CMS.
My feeling is that it might be simpler to do it via the spreadsheets before I import the data. If we were to do this then the ‘Item’ column would become the ‘original title’ and we’d need two further columns, one for the ‘standardised title’ and one listing the volumes, consisting of a number of each volume separated with a comma. With the above examples we would end up with the following (with a | representing a column division):
- Banier’s Mythology v. 1, 2 | Banier’s Mythology | 1,2
- Banier’s Mythology v. 1, 2 | Banier’s Mythology | 1,2
- Bannier’s Myth 4 vols | Banier’s Mythology | 1,2,3,4
- Bannier’s Myth. Vol 3 & 4 | Banier’s Mythology | 3,4
If each sheet of the spreadsheet is ordered alphabetically by the ‘item’ column it might not take too long to add in this information. The additional fields could also be omitted where the ‘item’ column has no volumes or different spellings. E.g. ‘Hederici Lexicon’ may be fine as it is. If the ‘standardised title’ and ‘volumes’ columns are left blank in this case then when my script reaches the record it will know to use ‘Hederici Lexicon’ as both original and standardised titles and to generate one single unnumbered book item record for it. We agreed that normalising the data prior to ingest would be the best approach and I will therefore wait until I receive updated data before I proceed further with this.
Also this week I generated a new version of a spreadsheet containing the records for one register for Gerry McKeever, who wanted borrowers, book items and book holding details to be included in addition to the main borrowing record. I also made a pretty major update to the CMS to enable books and borrower listings for a library to be filtered by year of borrowing in addition to filtering by register. Users can either limit the data by register or year (not both). They need to ensure the register drop-down is empty for the year filter to work, otherwise the selected register will be used as the filter. On either the ‘books’ or ‘borrowers’ tab in the year box they can add either a single year (e.g. 1774) or a range (e.g. 1770-1779). Then when ‘Go’ is pressed the data displayed is limited to the year or years entered. This also includes the figures in the ‘borrowing records’ and ‘Total borrowed items’ columns. Also, the borrowing records listed when a related pop-up is opened will only feature those in the selected years.
I also worked with Raymond in Arts IT Support and Geert, the editor of the Anglo-Norman Dictionary to complete the process of migrating the AND website to the new server. The website (https://anglo-norman.net/) is now hosted on the new server and is considerably faster than it was previously. We also took the opportunity the launch the Anglo-Norman Textbase, which I had developed extensively a few months ago. Searching and browsing can be found here: https://anglo-norman.net/textbase/ and this marks the final major item in my overhaul of the AND resource.
My last major task of the week was to start work on a database of ultrasound video files for the Speech Star project. I received a spreadsheet of metadata and the video files from Eleanor this week and began processing everything. I wrote a script to export the metadata into a three-table related database (speakers, prompts and individual videos of speakers saying the prompts) and began work on the front-end through which this database and the associated video files will be accessed. I’ll be continuing with this next week.
In addition to the above I also gave some advice to the students who are migrating the IJOSTS journal over the WordPress, had a chat with the DSL people about when we’ll make the switch to the new API and data, set up a WordPress site for Joanna Kopaczyk for the International Conference on Middle English, upgraded all of the WordPress sites I manage to the latest version of WordPress, made a few tweaks to the 17th Century Symposium website for Roslyn Potter, spoke to Kate Simpson in Information Studies about speaking to her Digital Humanities students about what I do and arranged server space to be set up for the Speak For Yersel project website and the Speech Star project website. I also helped launch the new Burns website: https://burnsc21-letters-poems.glasgow.ac.uk/ and updated the existing Burns website to link into it via new top-level tabs. So a pretty busy week!
I spent a bit of time this week writing an abstract for the DH2022 conference. I wrote about how I rescued the data for the Anglo-Norman Dictionary in order to create the new AND website. The DH abstracts are actually 750-1000 words long so it took a bit of time to write. I have sent it on to Marc for feedback and I’ll need to run it by the AND editors before submission as well (if it’s worth submitting). I still don’t know whether there would be sufficient funds for me to attend the event, plus the acceptance rate for papers is very low, so I’ll just need to see how this develops.
Also this week I participated in a Zoom call for the DSL about user feedback and redeveloping the DSL website. It was a pretty lengthy call, but it was interesting to be a part of. Marc mentioned a service called Hotjar (https://www.hotjar.com/) that allows you to track how people use your website (e.g. tracking their mouse movements) and this seemed like an interesting way of learning about how an interface works (or doesn’t). I also had a conversation with Rhona about the updates to the DSL DNS that need to be made to improve the security or their email systems. Somewhat ironically, recent emails from their IT people had ended up in my spam folder and I hadn’t realised they were asking me for further changes to be made, which unfortunately has caused a delay.
I spoke to Gerry Carruthers about another new project he’s hoping to set up, and we’ll no doubt be having a meeting about this in the coming weeks. I also gave some advice to the students who are migrating the IJOSTS articles to WordPress too and made some updates to the Iona Placenames website in preparation for their conference.
For the Anglo-Norman Dictionary I fixed an issue with one of the textbase texts that had duplicate notes in one of its pages and then I worked on a new feature for the DMS that enables the editors to search the phrases contained in locutions in entries. Editors can either match locution phrases beginning with a term (e.g. ta*), ending with a term (e.g. *de) or without a wildcard the term can appear anywhere in the phrase. Other options found on the public site (e.g. single character wildcards and exact matches) are not included in this search.
The first time a search is performed the system needs to query all entries to retrieve only those that feature a locution. These results are then stored in the session for use the next time a search is performed. This means subsequent searches in a session should be quicker, and also means if the entries are updated between sessions to add or remove locutions the updates will be taken into consideration.
Search results work in a similar way to the old DMS option: Any matching locution phrases are listed, together with their translations if present (if there are multiple senses / subsenses for a locution then all translations are listed, separated by a ‘|’ character). Any cross references appear with an arrow and then the slug of the cross referenced entry. There is also a link to the entry the locution is part of, which opens in a new tab on the live site. A count of the total number of entries with locutions, the number of entries your search matched a phrase in and the total number of locutions is displayed above the results.
I spent the rest of the week working on the Speak For Yersel project. We had a Zoom call on Monday to discuss the mockups I’d been working on last week and to discuss the user interface that Jennifer and Mary would like me to develop for the site (previous interfaces were just created for test purposes). I spent the rest of my available time developing a further version of the grammar exercise with the new interface, that included logos, new fonts and colour schemes, sections appearing in different orders and an overall progress bar for the full exercise rather than individual ones for the questionnaire and the quiz sections.
I added in UoG and AHRC logos underneath the exercise area and added both an ‘About’ and ‘Activities’ menu items with ‘Activities’ as the active item. The active state of the menu wasn’t mentioned in the document but I gave it a bottom border and made the text green not blue (but the difference is not hugely noticeable). This is also used when hovering over a menu item. I made the ‘Let’s go’ button blue not green to make it consistent with the navigation button in subsequent stages. When a new stage loads the page now scrolls to the top as on mobile phones the content was changing but the visible section remained as it was previously, meaning the user had to manually scroll up. I also retained the ‘I would never say that!’ header in the top-left corner of all stages rather than having ‘activities’ so it’s clearer what activity the user is currently working on. For the map in the quiz questions I’ve added the ‘Remember’ text above the map rather than above the answer buttons as this seemed more logical and on the quiz the map pane scrolls up and scrolls down when the next question loads so as to make it clearer that it’s changed. Also, the quiz score and feedback text now scroll down one after the other and in the final ‘explore’ page the clicked on menu item now remains highlighted to make it clearer which map is being displayed. Here’s a screenshot of how the new interface looks:
I was back at work this week after having a lovely holiday in Northumberland last week. I spent quite a bit of time in the early part of the week catching up with emails that had come in whilst I’d been off. I fixed an issue with Bryony Randall’s https://imprintsarteditingmodernism.glasgow.ac.uk/ site, which was put together by an external contractor, but I have now inherited. The site menu would not update via the WordPress admin interface and after a bit of digging around in the source files for the theme it would appear that the theme doesn’t display a menu anywhere, that is the menu which is editable from the WordPress Admin interface is not the menu that’s visible on the public site. That menu is generated in a file called ‘header.php’ and only pulls in pages / posts that have been given one of three specific categories: Commissioned Artworks, Commissioned Text or Contributed Text (which appear as ‘Blogs’). Any post / page that is given one of these categories will automatically appear in the menu. Any post / page that is assigned to a different category or has no assigned category doesn’t appear. I added a new category to the ‘header’ file and the missing posts all automatically appeared in the menu.
I also updated the introductory texts in the mockups for the STAR websites and replied to a query about making a place-names website from a student at Newcastle. I spoke to Simon Taylor about a talk he’s giving about the place-name database and gave him some information on the database and systems I’d created for the projects I’ve been involved with. I also spoke to the Iona Place-names people about their conference and getting the website ready for this.
I also had a chat with Luca Guariento about a new project involving the team from the Curious Travellers project. As this is based in Critical Studies Luca wondered whether I’d write the Data Management Plan for the project and I said I would. I spent quite a bit of time during the rest of the week reading through the bid documentation, writing lists of questions to ask the PI, emailing the PI, experimenting with different technologies that the project might use and beginning to write the Plan, which I aim to complete next week.
The project is planning on running some pre-digitised images of printed books through an OCR package and I investigated this. Google owns and uses a program called Tesseract to run OCR for Google Books and Google Docs and it’s freely available (https://opensource.google/projects/tesseract). It’s part of Google Docs – if you upload an image of text into Google Drive then open it in Google Docs the image will be automatically OCRed. I took a screenshot of one of the Welsh tour pages (https://viewer.library.wales/4690846#?c=&m=&s=&cv=32&manifest=https%3A%2F%2Fdamsssl.llgc.org.uk%2Fiiif%2F2.0%2F4690846%2Fmanifest.json&xywh=-691%2C151%2C4725%2C3632) and cropped the text and then opened it in Google Docs and even on this relatively low resolution image the OCR results are pretty decent. It managed to cope with most (but not all) long ‘s’ characters and there are surprisingly few errors – ‘Englija’ and ‘Lotty’ are a couple and have been caused by issues with the original print quality. I’d say using Tesseract is going to be suitable for the project.
I spent a bit of time working on the Speak For Yersel project. We had a team meeting on Thursday to go through in detail how one of the interactive exercises will work. This one will allow people to listed to a sound clip and then relisten to it in order to click whenever they hear something that identifies the speaker as coming from a particular location. Before the meeting I’d prepared a document giving an overview of the technical specification of the feature and we had a really useful session discussing the feature and exactly how it should function. I’m hoping to make a start on a mockup of the feature next week.
Also for the project I’d enquired with Arts IT Support as to whether the University held a license for ArcGIS Online, which can be used to publish maps online. It turns out that there is a University-wide license for this which is managed by the Geography department and a very helpful guy called Craig MacDonell arranged for me and the other team members to be set up with accounts for it. I spent a bit of time experimenting with the interface and managed to publish a test heatmap based on data from SCOSYA. I can’t get it to work directly with the SCOSYA API as it stands, but after exporting and tweaking one of the sets of rating data as a CSV I pretty quickly managed to make a heatmap based on the ratings and publish it: https://glasgow-uni.maps.arcgis.com/apps/instant/interactivelegend/index.html?appid=9e61be6879ec4e3f829417c12b9bfe51 This is just a really simple test, but we’d be able to embed such a map in our website and have it pull in data dynamically from CSVs generated in real-time and hosted on our server.
Also this week I had discussions with the Dictionaries of the Scots Language people about how dates will be handled. Citation dates are being automatically processed to add in dates as attributes that can then be used for search purposes. Where there are prefixes such as ‘a’ and ‘c’ the dates are going to be given ranges based on values for these prefixes. We had a meeting to discuss the best way to handle this. Marc had suggested that having a separate prefix attribute rather than hard coding the resulting ranges would be best. I agreed with Marc that having a ‘prefix’ attribute would be a good idea, not only because it means we can easily tweak the resulting date ranges at a later point rather than having them hard-coded, but also because it then gives us an easy way to identify ‘a’, ‘c’ and ‘?’ dates if we ever want to do this. If we only have the date ranges as attributes then picking out all ‘c’ dates (e.g. show me all citations that have a date between 1500 and 1600 that are ‘c’) would require looking at the contents of each date tag for the ‘c’ character which is messier.
A concern was raised that not having the exact dates as attributes would require a lot more computational work for the search function, but I would envisage generating and caching the full date ranges when the data is imported into the API so this wouldn’t be an issue. However, there is a potential disadvantage to not including the full date range as attributes in the XML, and this is that if you ever want to use the XML files in another system and search the dates through it the full ranges would not be present in the XML so would require processing before they could be used. But whether the date range is included in the XML or not I’d say it’s important to have the ‘prefix’ as an attribute, unless you’re absolutely sure that being able to easily identify dates that have a particular prefix isn’t important.
We decided that prefixes would be stored as attributes and that the date ranges for dates with a prefix would be generated whenever the data is exported from the DSL’s editing system, meaning editors wouldn’t have to deal with noting the date ranges and all the data would be fully usable without further processing as soon as it’s exported.
Also this week I was given access to a large number of images of registers from the Advocates Library that had been digitised by the NLS. I downloaded these, batch processed them to add in the register numbers as a prefix to the filenames, uploaded the images to our server, created register records for each register and page records for each page. The registers, pages and associated images can all now be accessed via our CMS.
My final task of the week was to continue work on the Anglo-Norman Dictionary. I completed work on the script identifies which citations have varlists and which may need to have their citation date updated based on one of the forms in the varlist. What the script does is to retrieve all entries that have a <varlist> somewhere in them. It then grabs all of the forms in the <head> of the entry. It then goes through every attestation (main sense and subsense plus locution sense and subsense) and picks out each one that has a <varlist> in it.
For each of these it then extracts the <aform> if there is one, or if there’s not then it extracts the final word before the <varlist>. It runs a Levenshtein test on this ‘aform’ to ascertain how different it is from each of the <head> forms, logging the closest match (0 = exact match of one form, 1 = one character different from one of the forms etc). It then picks out each <ms_form> in the <varlist> and runs the same Levenshtein test on each of these against all forms in the <head>.
If the score for the ‘aform’ is lower or equal to the lowest score for an <ms_form> then the output is added to the ‘varlist-aform-ok’ spreadsheet. If the score for one of the <ms_form> words is lower than the ‘aform’ score the output is added to the ‘varlist-vform-check’ spreadsheet.
My hope is that by using the scores we can quickly ascertain which are ok and which need to be looked at by ordering the rows by score and dealing with the lowest scores first. In the first spreadsheet there are 2187 rows that have a score of 0. This means the ‘aform’ exactly matches one of the <head> forms. I would imagine that these can safely be ignored. There are a further 872 that have a score of 1, and we might want to have a quick glance through these to check they can be ignored, but I suspect most will be fine. The higher the score the greater the likelihood that the ‘aform’ is not the form that should be used for dating purposes and one of the <varlist> forms should instead. These would need to be checked and potentially updated.
The other spreadsheet contains rows where a <varlist> form has a lower score than the ‘aform’ – i.e. one of the <varlist> forms is closer to one of the <head> forms than the ‘aform’ is. These are the ones that are more likely to have a date that needs updated. The ‘Var forms’ column lists each var form and its corresponding score. It is likely that the var form with the lowest score is the form that we would need to pick the date out for.
In terms of what the editors could do with the spreadsheets: My plan was that we’d add an extra column to note whether a row needs updated or not – maybe called ‘update’ – and be left blank for rows that they think look ok as they are and containing a ‘Y’ for rows that need to be updated. For such rows they could manually update the XML column to add in the necessary date attributes. Then I could process the spreadsheet in order to replace the quotation XML for any attestations that needs updated.
For the ‘vform-check’ spreadsheet I could update my script to automatically extract the dates for the lowest scoring form and attempt to automatically add in the required XML attributes for further manual checking, but I think this task will require quite a lot of manual checking from the onset so it may be best to just manually edit the spreadsheet here too.
I headed into the University for the first time this year on Wednesday this week to collect a new iPad that I’d ordered and to get some files from my office. It was great to see the old place again, but it did take quite a chunk out of my day to travel there and back, especially as I’m still home-schooling either a morning or an afternoon each day at the moment too.
As with last week, I mainly divided my time this week between the Dictionary of the Scots Language, the Anglo-Norman Dictionary and the Books and Borrowing project, with a few other bits and bobs added in as well. For the DSL I retrieved the source code for my original Scots School Dictionary app from my office so we can host this somewhere on the DSL website. This is because the DSL have commissioned someone else to make a new School Dictionary app, which launched this week, but doesn’t include an ‘English to Scots’ feature as the old app does, so we’re going to make the old app available as a website for those people who miss the feature. I also made a few minor tweaks to the main DSL site, and then focussed on adding bibliography search facilities to the new version of the API, a task that I’d begun last week.
I created a new table for the bibliographical data that includes the various fields used for DOST (note, author, editor, date, longtitle etc) and a field for the XML data used for SND. I then created two further tables for searching, one that contains every author and editor name for each item (for DOST there may be different names in the author, editor, longauthor and longeditor fields while for SND there may be any number of <author> tags) and the other containing every title for each item (DOST may have different text in title and longtitle while SND items can have any number of <title> tags). These tables allow you to search for any variant author, editor or title and find the item.
I also created two additional fields in the bibliography table that contain the ‘display author’ and ‘display title’. These are the forms that get displayed in the search results before you click on an item to open the full bibliographical entry. I then updated the V4 API to add in facilities to search and retrieve the bibliographies. I didn’t have the time to connect to this API and to implement the search on the Sienna test site, which is something I hope to do next week, but the logic behind the search and display of bibliographies is all there. There is a predictive search that will be used to generate the autocomplete list, similar to how the live site currently works: You will be able to select whether your search is for authors, titles or both and when you start typing in some text a list of matching items will appear, e.g. typing in ‘ham’ for authors in both dictionaries will display the following all items containing ‘ham’ and when you select an item this will then perform a search for the specific text. You will then be able to click on an item to view the full bibliography. This is a bit different to how the live site currently works, as with these if you enter ‘ham’ and select (for example) ‘Hamilton, J,’ from the autocomplete list you are taken directly to a page that lists all of the items for the author. However, we can’t do that any more as we no longer have unique identifiers that group bibliographical items by author. I may be able to do something similar with the page that comes up when you select an author, but this would have to rely on the name to group items together and a name may not be unique.
For the AND I made some tweaks to the website, such as adding a link to the search page if you type some text into the ‘jump to entry’ option and no matching entries are found. I then spent the rest of my time continuing to develop the new content management system, specifically the pages for managing source texts. I finished work on this, adding in facilities to add, edit, browse and delete source texts from the database. I then migrated the DTD to the new site, which is referenced by the editors’ XML editor when they work on the entry XML files. The DTD on the old server referenced several lists of things that are then used to populate drop-down lists of options in the XML editor. I migrated these too, making them dynamically generated from the underlying database rather than statis lists, meaning when (for example) new source texts are added to the CMS these will automatically become available when using the XML editor.
For the Books and Borrowing project I participated in the project’s Zoom call on Monday to discuss the project’s CMS and how to amalgamate the various duplicate author records that resulted from data uploads from different libraries. After the call I made some required changes to the CMS, such as making the editor’s notes fields visible by default again, and worked on the duplicate authors matching script to add in further outputs when comparing the author names with Levenshtein ratings of 1 and 2. I also reviewed some content that was sent to us from another library.
Also this week I responded to an email from James Caudle in Scottish Literature about a potential project he’s setting up, made a couple of changes to the Scots Language Policy website, made some tweaks to the menu structure for the Scots Syntax Atlas project and gave some advice to a post-grad student who had contacted me about setting up a corpus.
This was my first full week back of the year, although it was also the first week of a return to homeschooling, which made working a little trickier than usual. I also had a dentist’s appointment on Tuesday and lost some time to that due to my dentist being near the University rather than where I live. However, despite these challenges I was able to achieve quite a lot this week. I had two Zoom calls, the first on Monday to discuss a new ESRC grant that Jane Stuart-Smith is putting together with colleagues at Strathclyde while the second on Wednesday was with a partner in Joanna Kopaczyk’s new RSC funded project about Scots Language Policy to discuss the project’s website and the survey they’re going to put out. I also made a few tweaks to the DSL website, replied to Kirsteen McCue about the AHRC proposal she’s currently putting together, replied to a query regarding the technologies behind the Scots Syntax Atlas, made a few further updates to the Burns Supper map and replied to a query from Rachel Fletcher in English Language about lemmatising Old English.
Other than these various tasks I split my time between the Anglo-Norman Dictionary and the Books and Borrowing projects. For the former I completed adding explanatory notes to all of the ‘Introducing the AND’ pages. This was a very time consuming task as there were probably about 150 explanatory notes in total to add in, each appearing in a Bootstrap dialog box, and each requiring me to copy the note form the old website, add in any required HTML formatting, find and check all of the links to AND entries on the old site and add these in as required. It was pretty tedious to do, but it feels great to get it done, as the notes were previously just giving 404 errors on the new live site, and I don’t like having such things on a site I’m responsible for. I also migrated the academic articles from the old site to the new one (https://anglo-norman.net/articles/) which also required some manual formatting of the content. There are five other articles that I haven’t managed to migrate yet as they are full of character encoding errors on the old site. Geert is looking for copies of these articles that actually work and I’ll add them in once he’s able to get them to me. I also begin migrating the blog posts to the new site too. Currently the blog is hosted on Blogspot and there are 55 entries, but we’d like these to be an internal part of the new site. Migrating these is going to take some time as it means copying the text (which thankfully retains formatting) and then manually saving and embedding any images in the posts. I’m just going to do a few of these a week until they’re all done and so far I’ve migrated seven. I also needed to look into how the blogs page works in the WordPress theme I created for the AND, as to start with the page was just listing the full text of every post rather than giving summaries and links through to the full text of each. After some investigation I figured out that in my theme there is a script called ‘home.php’ and this is responsible for displaying all of the blog posts on the ‘blog’ page. It in turn calls another template called ‘content-blog.php’ which was previously set to display the full content of the blog. Instead I set it to display the title as a link through to the full post, the date and then an excerpt from the full blog, which can be accessed through a handy WordPress function called ‘the_excerpt()’.
For the Books and Borrowing project I made some improvements and fixes to the Content Management System. I’d been meaning to enhance the CMS for some time, but due to other commitments to other projects I didn’t have the time to delve into it. It felt good to find the time to return to the project this week.
I updated the ‘Books’ and ‘Borrowers’ tabs when viewing a library in the CMS. I added in pagination to speed up the loading of the pages. Pages are now split into 500 record blocks and you can navigate between pages using the links above and below the tables. For some reason the loading of the page is still a bit slow on the Stirling server whereas it was fine on the Glasgow server I was using for test purposes. I’m not entirely sure why as I’d copied the database over too – presumably the Stirling server is slower. However, it is still a massive improvement on the speed of the page previously.
I also changed the way tables scroll horizontally. Previously if a table was wider than the page a scrollbar appeared above and below the table, but this was rather awkward to use if you were looking at the middle of the table (you had to scroll up or down to the beginning or end of the table, then use the horizontal scrollbar to move the table along a bit, then navigate back to the section of the page you were interested in). Now the scrollbar just appears at the bottom of the browser window and can always be accessed no matter where in the table you are.
I also removed the editorial notes from tables by default to reduce clutter, and added in a button for showing / hiding the editors’ notes near the top of each page. I also added a limit option in the ‘Books’ and ‘Borrowers’ pages within a library to limit the displayed records to only those found in a specific ledger. I added in a further option to display those records that are not currently associated with any ledgers too.
I then deleted the ‘original borrowed date’ and ‘original returned date’ fields in St Andrews data as these were no longer required. I deleted these additional fields from the system and all data that were contained in these fields.
It had been noted that the book part numbers were not being listed numerically. As part numbers can contain text as well as numbers (e.g. ‘Vol. II’), this field in the database needed to be set as text rather than an integer. Unfortunately the database doesn’t order numbers correctly when they are contained in a non-numerical field – instead all the ones come first (1, 10, 11) then all the twos (2, 20, 22) etc. However, I managed to find a way to ensure that the numbers are ordered correctly.
I also fixed the ‘Add another Edition/Work to this holding’ button that was not working. This was caused by the Stirling server running a different version of PHP that doesn’t allow functions to have variable numbers of arguments. The autocomplete function was also not working at edition level and I investigated this. The issue was being caused by tab characters appearing in edition titles, and I updated my script to ensure these characters are stripped out before the data is formatted as JSON.
There may be further tweaks to be made – I’ll need to hear back from the rest of the team before I know more, but for now I’m up to date with the project. Next week I intend to get back into some of the larger and more trickier outstanding AND tasks (of which there are, alas, many) and to begin working towards adding the DSL bibliography data into the new version of the API.
Week 19 of Lockdown, and it was a short week for me as the Monday was the Glasgow Fair holiday. I spent a couple of days this week continuing to add features to the content management system for the Books and Borrowing project. I have now implemented the ‘normalised occupations’ part of the CMS. Originally occupations were just going to be a set of keywords, allowing one or more keyword to be associated with a borrower. However, we have been liaising with another project that has already produced a list of occupations and we have agreed to share their list. This is slightly different as it is hierarchical, with a top-level ‘parent’ containing multiple main occupations. E.g. ‘Religion and Clergy’ features ‘Bishop’. However, for our project we needed a third hierarchical level do differentiate types of minister/priest, so I’ve had to add this in too. I’ve achieved this by means of a parent occupation ID in the database, which is ‘null’ for top-level occupations and contains the ID of the parent category for all other occupations.
I completed work on the page to browse occupations, arranging the hierarchical occupations in a nested structure that features a count of the number of borrowers associated with the occupation to the right of the occupation name. These are all currently zero, but once some associations are made the numbers will go up and you’ll be able to click on the count to bring up a list of all associated borrowers, with links through to each borrower. If an occupation has any child occupations a ‘+’ icon appears beside it. Press on this to view the child occupations, which also have counts. The counts for ‘parent’ occupations tally up all of the totals for the child occupations, and clicking on one of these counts will display all borrowers assigned to all child occupations. If an occupation is empty there is a ‘delete’ button beside it. As the list of occupations is going to be fairly fixed I didn’t add in an ‘edit’ facility – if an occupation needs editing I can do it directly through the database, or it can be deleted and a new version created. Here’s a screenshot showing some of the occupations in the ‘browse’ page:
I also created facilities to add new occupations. You can enter an occupation name and optionally specify a parent occupation from a drop-down list. Doing so will add the new occupation as a child of the selected category, either at the second level if a top level parent is selected (e.g. ‘Agriculture’) or at the third level if a second level parent is selected (e.g. ‘Farmer’). If you don’t include a parent the occupation will become a new top-level grouping. I used this feature to upload all of the occupations, and it worked very well.
I then updated the ‘Borrowers’ tab in the ‘Browse Libraries’ page to add ‘Normalised Occupation’ to the list of columns in the table. The ‘Add’ and ‘Edit’ borrower facilities also now feature ‘Normalised Occupation’, which replicates the nested structure from the ‘browse occupations’ page, only features checkboxes beside each main occupation. You can select any number of occupations for a borrower and when you press the ‘Upload’ or ‘Edit’ button your choice will be saved. Deselecting all ticked checkboxes will clear all occupations for the borrower. If you edit a borrower who has one or more occupations selected, in addition to the relevant checkboxes being ticked, the occupations with their full hierarchies also appear above the list of occupations, so you can easily see what is already selected. I also updated the ‘Add’ and ‘Edit’ borrowing record pages so that whenever a borrower appears in the forms the normalised occupations feature also appears.
I also added in the option to view page images. Currently the only ledgers that have page images are the three Glasgow ones, but more will be added in due course. When viewing a page in a ledger that includes a page image you will see the ‘Page Image’ button above the table of records. Press on this and a new browser tab will open. It includes a link through to the full-size image of the page if you want to open this in your browser or download it to open in a graphics package. It also features the ‘zoom and pan’ interface that allows you to look at the image in the same manner as you’d look at a Google Map. You can also view this full screen by pressing on the button in the top right of the image.
Also this week I made further tweaks to the script I’d written to update lexeme start and end dates in the Historical Thesaurus based on citation dates in the OED. I’d sent a sample output of 10,000 rows to Fraser last week and he got back to me with some suggestions and observations. I’m going to have to rerun the script I wrote to extract the more than 3 million citation dates from the OED as some of the data needs to be processed differently, but as this script will take several days to run and I’m on holiday next week this isn’t something I can do right now. However, I managed to change the way the date matching script runs to fix some bugs and make the various processes easier to track. I also generated a list of all of the distinct labels in the OED data, with counts of the number of times these appear. Labels are associated with specific citation dates, thankfully. Only a handful are actually used lots of times, and many of the others appear to be used as a ‘notes’ field rather than as a more general label.
In addition to the above I also had a further conversation with Heather Pagan about the data management plan for the AND’s new proposal, responded to a query from Kathryn Cooper about the website I set up for her at the end of last year, responded to a couple of separate requests from post-grad students in Scottish Literature, spoke to Thomas Clancy about the start date for his Place-Names of Iona project, which got funded recently, helped with some issues with Matthew Creasy’s Scottish Cosmopolitanism website and spoke to Carole Hough about making a few tweaks to the Berwickshire Place-names website for REF.
I’m going to be on holiday for the next two weeks, so there will be no further updates from me for a while.
This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day. I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects. I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records. This week I began by focussing on the data from St Andrews University library.
The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting. However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.
The data I’ve got are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100. The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers. Unfortunately the CSV version of the data doesn’t include the links or the linked to data, and as I wanted to try and pull in the data found on the linked pages I therefore needed to process the HTML instead.
I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn. From the filenames my script could ascertain the ledger volume, its dates and the page number. For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10. The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
The actual data in the file posed further problems. As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system. Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows). I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically. The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’). I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.
I also used this approach to set up books and borrowers too. If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to know which is which. However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows. I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code. There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.
The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system. I created four new ‘Additional Fields’ for St Andrews as follows:
- Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
- Code: This contains the full text of the second column (e.g. J.5.2)
- Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
- Original returned text: This contains the full date of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)
In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links. Where subsequent rows contain data in this column but no code, this data is then appended to the transcription. E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.
The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required. Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes. E.g. for the above the notes field contains:
‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit & au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. ) <a href=”http://library.st-andrews.ac.uk/record=b2447402~S1″>http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>
—– profr_Shaws| <p><a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>
—– James_Key| <p>Possibly James Kay: <a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>
As mentioned earlier, the script also generates book and borrower records based on the linked pages too. I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews. In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with dashes (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field. One book item is created for each holding to be used to link to the corresponding borrowing records.
For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. <p>Possibly Thomas Duncan: <a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’). All borrowers are linked to records as ‘Main’ borrowers.
During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table. I therefore updated my script to check for the existence of this heading row, and if it exists my script then grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page. After my script had finished running we had 11147 borrowing records, 996 borrowers and 6395 book holding records for St Andrew in the system.
I then moved onto looking at the data for Selkirk library. This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books and borrowers and books connected to borrowings via unique identifiers. Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records and these will need to be manually generated. The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.
As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records. There are 612 Holding records for Selkirk. I also processed the borrower records, resulting in 86 borrower records being added. I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’ and the only other additional field is in the Holding records for ‘Subject’, that will eventually be merged with our ‘Genre’ when this feature becomes available.
Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project. I recommended that the filenames shouldn’t have spaces in them as these can be troublesome on some operating systems and that we’d want a character to use as a delimiter between the parts of the filename that wouldn’t appear elsewhere in the filename so it’s easy to split up the filename. I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included. Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.
Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’) but if we’re using a specific delimiting character to separate parts of the filename then using these characters wouldn’t be necessary and I suggested it would be better to not use ‘L’ as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’ which might confuse future human users.
Instead I suggested using a ‘-‘ instead of spaces and a ‘_’ as a delimiter and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filename – no apostrophes, commas, colons, semi-colons, ampersands etc and to make sure the ‘-‘ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office. This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.
Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case. Following these guidelines the filenames would look something like this:
In addition to the Books and Borrowing project I worked on a number of other projects this week. I gave Matthew Creasy some further advice on using forums in his new project website, and ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.
I also worked a bit more on using dates from the OED data in the Historical Thesaurus. Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract these dates so that we could use them to update the dates associated with the lexemes in the HT. I needed to extract the quotation dates as these have ‘ante’ and ‘circa’ notes, plus labels. I noted that in addition to ‘a’ and ‘c’ a question mark is also used, somethings with an ‘a’ or ‘c’ and sometimes without. I decided to process things as follows:
- ?a will just be ‘a’
- ?c will just be ‘c’
- ? without an ‘a’ or ‘c’ will be ‘c’.
I also noticed that a date may sometimes be a range (e.g. 1795-8) so I needed to include a second date column in my data structure to accommodate this. I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary depending on the date – sometimes the content is ‘OE’ and othertimes ‘lOE’ or ‘eOE’. I decided to process any OE dates for a lexeme as being 650 and to have only one OE date stored, so as to align with how OE dates are stored in the HT database (we don’t differentiate between date for OE words).
While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted. This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table. This will no doubt mean we’ll need to update our online OED table, but as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes would be retained without issue (hopefully), but any matching processes we performed would need to be done again for the new lexemes.
I set my extraction script running on the OED XML files on Wednesday and processing took a long time. The script didn’t complete until sometime during Friday night, but after it had finished it had processed 238,699 categories, 754,285 lexemes, generating 3,893,341 date rows. It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.
I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project. The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category. Variant spellings are also removed (these were all tagged with <form> and I removed all of these). I also created a new field to store only the ‘NN_’ tagged words and remove all others.
The scripts generated three datasets, which I saved as spreadsheets for Fraser. The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria. The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a postagged word fully matches the HT category heading. The final (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a postagged word fully matches a lexeme in the HT category. For this table I’ve also added in the full list of lexemes for each HT category too.
I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU and arranged for a PhD student to get access to the TextGrid files that were generated for the audio records for the SCOTS Corpus project.
Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data. This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.
I have the data that was extracted from Thomas’s script last July as two XML files (dost.xml and snd.xml) and I looked through these for the examples Ann had sent. The entry for snd13897 contains the following URLs:
The first is the ID for the main entry and the other two are child entries. If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged. But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content). The pulling in of data into a webpage in the V3 site uses URLs stored in a table linked to entry IDs. These were generated from the URLs in the entries in the XML file (see the <url> tags above). For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sdnns2217. But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.
The entry for dost16606 contains the following URLs:
(in addition to headword URLs). Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content). As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.
What we need to do is remove the child entries that still exist as separate entries in the data. To do this I could is write a script that would go through each entry in the dost.xml and snd.xml files. It would then pick out every <url> that is not the same as the entry ID and search the file to see if any entry exists with this ID. If it does then presumably this is a duplicate that should then be deleted. I’m waiting to hear back from the DSL people to see how we should proceed with this.
As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.
I worked on several different projects this week. One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus. Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels. This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates. I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database. Where this happens I added the text ‘full date mismatch’ with a red background at the end of the date’s section in my script.
Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field. Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say. Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question. I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels). The remaining rows can mostly be split into the following groups:
- Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
- Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number, another lexeme has ‘1340c’ instead of ‘c1340’
- Corrections made to the original fulldate that were not replicated in the actual date columns, E,g, ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-‘ in the ‘con’ column.
- Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
- Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E,g, a lexeme with the fulldate ‘1865 + 1865 rare’.
- Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’. This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date). I tried fixing this but ended up breaking other things so this is something that will need manual intervention. I don’t think it occurs very often, though. It’s a shame the same symbol was used to mean two different things.
It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes. Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields). Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates. After that we’ll be ready to generate the new date fields for real.
I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project. I went through all of the sample data and compiled a list of all of the fields found in each. This is a first step towards identifying a core set of fields and of mapping the analogous fields across different datasets. I also included the GU students and professors from Matthew’s pilot project but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing. With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.
I also continued to work on the Place-Names of Mull and Ulva project. I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text. I also began to work on the project’s API and front end. By the end of the week I managed to get an ‘in development’ version of the quick search working. Markers appear with labels and popups and you can change base map or marker type. Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates). The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.
Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network recourse and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the video page of SCOSYA.
I was on holiday last week and had quite a stack of things to do when I got back in on Monday. This included setting up a new project website for a student in Scottish Literature who had received some Carnegie funding for a project and preparing for an interview panel that I was on, with the interview taking place on Friday. I also responded to Alison Wiggins about the content management system I’d created for her Mary Queen of Scots letters project and had a discussion with someone in Central IT Services about the App and Play Store accounts that I’ve been managing for several years now. It’s looking like responsibility for this might be moving to IT services, which I think makes a lot of sense. I also gave some advice to a PhD student about archiving and preserving her website and engaged in a long email discussion with Heather Pagan of the Anglo-Norman Dictionary about sorting out their data and possibly migrating it to a new system. On Wednesday I had a meeting with the SCOSYA team about further developments of the public atlas. We decided on another few requirements and discussed timescales for the completion of the work. They’re hoping to be able to engage in some user testing in the middle of September, so I need to try and get everything completed before then. I had hoped to start on some of this on Thursday, but I was struck down by a really nasty cold that I’ve still not shaken yet, which made focussing on such tricky tasks as getting questionnaire areas to highlight when clicked on rather difficult.
I spent most of the rest of the week working for DSL in various capacities. I’d put in a request to get Apache Solr installed on a new server, so we could use this for free-text searching and thankfully Arts IT Support agreed to do this. A lot of my week was spent preparing the data from both the ‘v2’ version of the DSL (the data outputted from the original API, but with full quotes and everything pre-generated rather than being created on the fly every time an entry is requested) and the ‘v3’ API (data taken from the editing server and outputted by a script written by Thomas Widmann) so that it could be indexed by Solr. Raymond from Arts IT Support set up an instance of Solr on a new server and I created scripts that went through all 90,000 DSL entries in both versions and generated full-text versions of the entries that stripped out the XML tags. For each set I created three versions – on that was ‘full text’, one that was full text without the quotations and the other that was just the quotations. The script outputted this data in a format that Solr could work with and I sent this on to Raymond for indexing. The first test version I sent Raymond was just the full text, and Solr managed to index this without incident. However, the other views of the text required working with the XML a bit, and this appears to have brought in some issues with special characters that Solr is not liking. I’m still in the middle of sorting this out and will continue to look into it next week, but progress with the free-text searching is definitely being made and it looks like the new API will be able to offer the same level of functionality as the existing API. I also ensured I documented the process of generating all of the data from the XML files outputted by the editing system through to preparing the full-text for indexing by Solr, so next time we come to update the data we will know exactly what to do. This is much better than how things previously stood, as the original API is entirely a ‘black box’ with no documentation whatsoever as to how to update the data contained therein.
Also during this time I engaged in an email conversation about managing the dictionary entries and things like cross references with Ann Ferguson and the people who will be handling the new editor software for the dictionary, and helped to migrate control for the email part of the DSL domain to the control of the DSL’s IT people. We’re definitely making progress with sorting out the DSL’s systems, which is really great.
I’m going to be working for just three days over the next two weeks, and all of these days will be out of the office, so I’ll just need to see how much time I have to continue with the DSL tasks, especially as work for the SCOSYA project is getting rather urgent.
Monday was a holiday this week, so Tuesday was the start of my working week. I spent about half the day completing work on the Data Management Plan that I had been asked to write by the College of Arts research people, and the remainder of the day continuing to write scripts the help in the linkup of HT and OED lexeme data. The latest script gets all unmatched HT words that are monosemous within part of speech in the unmatched dataset. For each of these the script then retrieves all OED words where the stripped form matches, as does POS, but the words are already matched to a different HT lexeme. If there is more than one OED lexeme matched to the HT lexeme I’ve added the information on subsequent rows in the table, so that full OED category information can more easily be read. I’m not entirely sure what this script will be used for, but Fraser seems to think it will be useful in automatically pinpointing certain words that the OED are currently trying to manually track down.
During the week I also made some further updates to a couple of song stories for the RNSN project and had a meeting over the phone with Kirsteen McCue about a new project she’s in the planning stages for at the moment, and which I will be helping with the technical aspects for. I also had a meeting with PhD student Ewa Wanat about a website she’s putting together and gave her some advice about things.
The rest of my week was split between DSL and SCOSYA. For DSL I spent time answering a number of emails. I then went through the SND data that had been outputted by Thomas Widmann’s scripts on the DSL server. I had tried running this data through the script I’d written to take the outputted XML data and insert it into our online MySQL database, but my script was giving errors, stating that the input file wasn’t valid XML. I loaded the file (an almost 90Mb text file) into Oxygen and asked it to validate the XML. It took a while, but managed to find one easily identifiable error, and one error that was trickier to track down.
In the entry for snd22907 there was a closing </sense> tag in the ‘History’ that has no corresponding opening tag. This was easy to track down and manually fix. Entry snd12737 had two opening tags (<entry id=”snd12737″>) one below the <meta> tag. This was trickier to find as I needed to manually track it down by chopping the file in half, checking which half the error was in, chopping this bit in half and so on until I ended up with a very small file in which it was easy to locate the problem.
With the SND data fixed I could then run it through my script. However, I wanted to change the way the script worked based on feedback from Ann last week. Previously I had added new fields to a test version of the main database, and the script found the matching row and inserted new data. I decided instead to create an entirely new table for the new data, to keep things more cleanly divided, and to handle the possibility of there being new entries in the data that were not present in the existing database. I also needed to update the way in which the URL tag was handled, as Ann had explained that there could be any number of URL tags, with them referencing other entries that has been merged with the current one. After updating my test version of the database to make new tables and fields, and updating my script to take these changes into consideration I ran both the DOST and the SND data through the script, resulting in 50,373 entries for DOST and 34,184 entries for SND. This is actually less entries than in the old database. There are 3023 missing SND entries and 1994 missing DOST entries. They are all supplemental entries (with IDs starting ‘snds’ in SND and ‘adds’ in DOST). This leaves just 24 DOST ‘adds’ entries in the Sienna data and 2730 SND ‘snds’ entries. I’m not sure what’s going on with the output – whether the omission of these entries is intentional (because the entries have been merged with regular entries) or whether this is an error, but I have exported information about the missing rows and have sent these on to Ann for further investigation.
For SCOSYA I focussed on adding in the sample sound clips and groupings for location markers. I also engaged with some preparations for the project’s ‘data hack’ that will be taking place in mid June. Adding in sound clips took quite a bit of time, as I needed to update both the Content Management System to allow sound clips to be uploaded and managed, and the API to incorporate links to the uploaded sound clips. This is in addition to incorporating the feature into the front-end.
Now if a member of staff logs into the CMS and goes to the ‘Browse codes’ page they will see a new column that lists the number of sound clips associated with a code. I’ve currently uploaded the four for E1 ‘This needs washed’ for test purposes. From the table, clicking on a code loads its page, which now includes a new section for sound clips. Any previously uploaded ones are listed and can be played or deleted. New clips in MP3 format can also be uploaded here, with files being renamed upon upload, based on the code and the next free auto-incrementing number in the database.
In the API all soundfiles are included in the ‘attributes’ endpoint, which is used by the drop-down list in the atlas. The public atlas has also been updated to include buttons to play any sound clips that are available, as the screenshot towards the end of this post demonstrates.
There is now a new section labelled ‘Listen’ with four ‘Play’ icons. Pressing on one of these plays a different sound clip. Getting these icons to work has been more tricky than you might expect. HTML5 has a tag called <audio> that browsers can interpret in order to create their own audio player which is then embedded in the page. This is what happens in the CMS. Unfortunately an interface designer has no control over the display of the player – it’s different in every browser and generally takes up a lot of room, which we don’t really have. I initially just used the HTML5 audio player but each sound clip then had to appear on a new row and the player in Chrome was too wide for the side panel.
I then moved on to looking at the groups for locations. This will be a fixed list of groups, taken from ones that the team has already created. I copied these groups to a new location in the database and updated the API to create new endpoints for listing groups and bringing back the IDs of all locations that are contained within a specified group. I haven’t managed to get the ‘groups’ feature fully working yet, but the selection options are now in place. There’s a ‘groups’ button in the ‘Examples’ section, and when you click on this a section appears listing the groups. Each group appears as a button. When you click on one a ‘tick’ is added to the button and currently the background turns a highlighted green colour. I’m going to include several different highlighted colours so the buttons light up differently. Although it doesn’t work yet, these colours will then be applied to the appropriate group of points / areas on the map. You can see an example of the buttons below:
The only slight reservation I have is that this option does make the public atlas more complicated to use and more cluttered. I guess it’s ok as the groups are hidden by default, though. There also may be an issue with the sidebar getting too long for narrow screens that I’ll need to investigate.