I was back at work this week after having a lovely holiday in Northumberland last week. I spent quite a bit of time in the early part of the week catching up with emails that had come in whilst I’d been off. I fixed an issue with Bryony Randall’s https://imprintsarteditingmodernism.glasgow.ac.uk/ site, which was put together by an external contractor, but I have now inherited. The site menu would not update via the WordPress admin interface and after a bit of digging around in the source files for the theme it would appear that the theme doesn’t display a menu anywhere, that is the menu which is editable from the WordPress Admin interface is not the menu that’s visible on the public site. That menu is generated in a file called ‘header.php’ and only pulls in pages / posts that have been given one of three specific categories: Commissioned Artworks, Commissioned Text or Contributed Text (which appear as ‘Blogs’). Any post / page that is given one of these categories will automatically appear in the menu. Any post / page that is assigned to a different category or has no assigned category doesn’t appear. I added a new category to the ‘header’ file and the missing posts all automatically appeared in the menu.
I also updated the introductory texts in the mockups for the STAR websites and replied to a query about making a place-names website from a student at Newcastle. I spoke to Simon Taylor about a talk he’s giving about the place-name database and gave him some information on the database and systems I’d created for the projects I’ve been involved with. I also spoke to the Iona Place-names people about their conference and getting the website ready for this.
I also had a chat with Luca Guariento about a new project involving the team from the Curious Travellers project. As this is based in Critical Studies Luca wondered whether I’d write the Data Management Plan for the project and I said I would. I spent quite a bit of time during the rest of the week reading through the bid documentation, writing lists of questions to ask the PI, emailing the PI, experimenting with different technologies that the project might use and beginning to write the Plan, which I aim to complete next week.
The project is planning on running some pre-digitised images of printed books through an OCR package and I investigated this. Google owns and uses a program called Tesseract to run OCR for Google Books and Google Docs and it’s freely available (https://opensource.google/projects/tesseract). It’s part of Google Docs – if you upload an image of text into Google Drive then open it in Google Docs the image will be automatically OCRed. I took a screenshot of one of the Welsh tour pages (https://viewer.library.wales/4690846#?c=&m=&s=&cv=32&manifest=https%3A%2F%2Fdamsssl.llgc.org.uk%2Fiiif%2F2.0%2F4690846%2Fmanifest.json&xywh=-691%2C151%2C4725%2C3632) and cropped the text and then opened it in Google Docs and even on this relatively low resolution image the OCR results are pretty decent. It managed to cope with most (but not all) long ‘s’ characters and there are surprisingly few errors – ‘Englija’ and ‘Lotty’ are a couple and have been caused by issues with the original print quality. I’d say using Tesseract is going to be suitable for the project.
I spent a bit of time working on the Speak For Yersel project. We had a team meeting on Thursday to go through in detail how one of the interactive exercises will work. This one will allow people to listed to a sound clip and then relisten to it in order to click whenever they hear something that identifies the speaker as coming from a particular location. Before the meeting I’d prepared a document giving an overview of the technical specification of the feature and we had a really useful session discussing the feature and exactly how it should function. I’m hoping to make a start on a mockup of the feature next week.
Also for the project I’d enquired with Arts IT Support as to whether the University held a license for ArcGIS Online, which can be used to publish maps online. It turns out that there is a University-wide license for this which is managed by the Geography department and a very helpful guy called Craig MacDonell arranged for me and the other team members to be set up with accounts for it. I spent a bit of time experimenting with the interface and managed to publish a test heatmap based on data from SCOSYA. I can’t get it to work directly with the SCOSYA API as it stands, but after exporting and tweaking one of the sets of rating data as a CSV I pretty quickly managed to make a heatmap based on the ratings and publish it: https://glasgow-uni.maps.arcgis.com/apps/instant/interactivelegend/index.html?appid=9e61be6879ec4e3f829417c12b9bfe51 This is just a really simple test, but we’d be able to embed such a map in our website and have it pull in data dynamically from CSVs generated in real-time and hosted on our server.
Also this week I had discussions with the Dictionaries of the Scots Language people about how dates will be handled. Citation dates are being automatically processed to add in dates as attributes that can then be used for search purposes. Where there are prefixes such as ‘a’ and ‘c’ the dates are going to be given ranges based on values for these prefixes. We had a meeting to discuss the best way to handle this. Marc had suggested that having a separate prefix attribute rather than hard coding the resulting ranges would be best. I agreed with Marc that having a ‘prefix’ attribute would be a good idea, not only because it means we can easily tweak the resulting date ranges at a later point rather than having them hard-coded, but also because it then gives us an easy way to identify ‘a’, ‘c’ and ‘?’ dates if we ever want to do this. If we only have the date ranges as attributes then picking out all ‘c’ dates (e.g. show me all citations that have a date between 1500 and 1600 that are ‘c’) would require looking at the contents of each date tag for the ‘c’ character which is messier.
A concern was raised that not having the exact dates as attributes would require a lot more computational work for the search function, but I would envisage generating and caching the full date ranges when the data is imported into the API so this wouldn’t be an issue. However, there is a potential disadvantage to not including the full date range as attributes in the XML, and this is that if you ever want to use the XML files in another system and search the dates through it the full ranges would not be present in the XML so would require processing before they could be used. But whether the date range is included in the XML or not I’d say it’s important to have the ‘prefix’ as an attribute, unless you’re absolutely sure that being able to easily identify dates that have a particular prefix isn’t important.
We decided that prefixes would be stored as attributes and that the date ranges for dates with a prefix would be generated whenever the data is exported from the DSL’s editing system, meaning editors wouldn’t have to deal with noting the date ranges and all the data would be fully usable without further processing as soon as it’s exported.
Also this week I was given access to a large number of images of registers from the Advocates Library that had been digitised by the NLS. I downloaded these, batch processed them to add in the register numbers as a prefix to the filenames, uploaded the images to our server, created register records for each register and page records for each page. The registers, pages and associated images can all now be accessed via our CMS.
My final task of the week was to continue work on the Anglo-Norman Dictionary. I completed work on the script identifies which citations have varlists and which may need to have their citation date updated based on one of the forms in the varlist. What the script does is to retrieve all entries that have a <varlist> somewhere in them. It then grabs all of the forms in the <head> of the entry. It then goes through every attestation (main sense and subsense plus locution sense and subsense) and picks out each one that has a <varlist> in it.
For each of these it then extracts the <aform> if there is one, or if there’s not then it extracts the final word before the <varlist>. It runs a Levenshtein test on this ‘aform’ to ascertain how different it is from each of the <head> forms, logging the closest match (0 = exact match of one form, 1 = one character different from one of the forms etc). It then picks out each <ms_form> in the <varlist> and runs the same Levenshtein test on each of these against all forms in the <head>.
If the score for the ‘aform’ is lower or equal to the lowest score for an <ms_form> then the output is added to the ‘varlist-aform-ok’ spreadsheet. If the score for one of the <ms_form> words is lower than the ‘aform’ score the output is added to the ‘varlist-vform-check’ spreadsheet.
My hope is that by using the scores we can quickly ascertain which are ok and which need to be looked at by ordering the rows by score and dealing with the lowest scores first. In the first spreadsheet there are 2187 rows that have a score of 0. This means the ‘aform’ exactly matches one of the <head> forms. I would imagine that these can safely be ignored. There are a further 872 that have a score of 1, and we might want to have a quick glance through these to check they can be ignored, but I suspect most will be fine. The higher the score the greater the likelihood that the ‘aform’ is not the form that should be used for dating purposes and one of the <varlist> forms should instead. These would need to be checked and potentially updated.
The other spreadsheet contains rows where a <varlist> form has a lower score than the ‘aform’ – i.e. one of the <varlist> forms is closer to one of the <head> forms than the ‘aform’ is. These are the ones that are more likely to have a date that needs updated. The ‘Var forms’ column lists each var form and its corresponding score. It is likely that the var form with the lowest score is the form that we would need to pick the date out for.
In terms of what the editors could do with the spreadsheets: My plan was that we’d add an extra column to note whether a row needs updated or not – maybe called ‘update’ – and be left blank for rows that they think look ok as they are and containing a ‘Y’ for rows that need to be updated. For such rows they could manually update the XML column to add in the necessary date attributes. Then I could process the spreadsheet in order to replace the quotation XML for any attestations that needs updated.
For the ‘vform-check’ spreadsheet I could update my script to automatically extract the dates for the lowest scoring form and attempt to automatically add in the required XML attributes for further manual checking, but I think this task will require quite a lot of manual checking from the onset so it may be best to just manually edit the spreadsheet here too.
I headed into the University for the first time this year on Wednesday this week to collect a new iPad that I’d ordered and to get some files from my office. It was great to see the old place again, but it did take quite a chunk out of my day to travel there and back, especially as I’m still home-schooling either a morning or an afternoon each day at the moment too.
As with last week, I mainly divided my time this week between the Dictionary of the Scots Language, the Anglo-Norman Dictionary and the Books and Borrowing project, with a few other bits and bobs added in as well. For the DSL I retrieved the source code for my original Scots School Dictionary app from my office so we can host this somewhere on the DSL website. This is because the DSL have commissioned someone else to make a new School Dictionary app, which launched this week, but doesn’t include an ‘English to Scots’ feature as the old app does, so we’re going to make the old app available as a website for those people who miss the feature. I also made a few minor tweaks to the main DSL site, and then focussed on adding bibliography search facilities to the new version of the API, a task that I’d begun last week.
I created a new table for the bibliographical data that includes the various fields used for DOST (note, author, editor, date, longtitle etc) and a field for the XML data used for SND. I then created two further tables for searching, one that contains every author and editor name for each item (for DOST there may be different names in the author, editor, longauthor and longeditor fields while for SND there may be any number of <author> tags) and the other containing every title for each item (DOST may have different text in title and longtitle while SND items can have any number of <title> tags). These tables allow you to search for any variant author, editor or title and find the item.
I also created two additional fields in the bibliography table that contain the ‘display author’ and ‘display title’. These are the forms that get displayed in the search results before you click on an item to open the full bibliographical entry. I then updated the V4 API to add in facilities to search and retrieve the bibliographies. I didn’t have the time to connect to this API and to implement the search on the Sienna test site, which is something I hope to do next week, but the logic behind the search and display of bibliographies is all there. There is a predictive search that will be used to generate the autocomplete list, similar to how the live site currently works: You will be able to select whether your search is for authors, titles or both and when you start typing in some text a list of matching items will appear, e.g. typing in ‘ham’ for authors in both dictionaries will display the following all items containing ‘ham’ and when you select an item this will then perform a search for the specific text. You will then be able to click on an item to view the full bibliography. This is a bit different to how the live site currently works, as with these if you enter ‘ham’ and select (for example) ‘Hamilton, J,’ from the autocomplete list you are taken directly to a page that lists all of the items for the author. However, we can’t do that any more as we no longer have unique identifiers that group bibliographical items by author. I may be able to do something similar with the page that comes up when you select an author, but this would have to rely on the name to group items together and a name may not be unique.
For the AND I made some tweaks to the website, such as adding a link to the search page if you type some text into the ‘jump to entry’ option and no matching entries are found. I then spent the rest of my time continuing to develop the new content management system, specifically the pages for managing source texts. I finished work on this, adding in facilities to add, edit, browse and delete source texts from the database. I then migrated the DTD to the new site, which is referenced by the editors’ XML editor when they work on the entry XML files. The DTD on the old server referenced several lists of things that are then used to populate drop-down lists of options in the XML editor. I migrated these too, making them dynamically generated from the underlying database rather than statis lists, meaning when (for example) new source texts are added to the CMS these will automatically become available when using the XML editor.
For the Books and Borrowing project I participated in the project’s Zoom call on Monday to discuss the project’s CMS and how to amalgamate the various duplicate author records that resulted from data uploads from different libraries. After the call I made some required changes to the CMS, such as making the editor’s notes fields visible by default again, and worked on the duplicate authors matching script to add in further outputs when comparing the author names with Levenshtein ratings of 1 and 2. I also reviewed some content that was sent to us from another library.
Also this week I responded to an email from James Caudle in Scottish Literature about a potential project he’s setting up, made a couple of changes to the Scots Language Policy website, made some tweaks to the menu structure for the Scots Syntax Atlas project and gave some advice to a post-grad student who had contacted me about setting up a corpus.
This was my first full week back of the year, although it was also the first week of a return to homeschooling, which made working a little trickier than usual. I also had a dentist’s appointment on Tuesday and lost some time to that due to my dentist being near the University rather than where I live. However, despite these challenges I was able to achieve quite a lot this week. I had two Zoom calls, the first on Monday to discuss a new ESRC grant that Jane Stuart-Smith is putting together with colleagues at Strathclyde while the second on Wednesday was with a partner in Joanna Kopaczyk’s new RSC funded project about Scots Language Policy to discuss the project’s website and the survey they’re going to put out. I also made a few tweaks to the DSL website, replied to Kirsteen McCue about the AHRC proposal she’s currently putting together, replied to a query regarding the technologies behind the Scots Syntax Atlas, made a few further updates to the Burns Supper map and replied to a query from Rachel Fletcher in English Language about lemmatising Old English.
Other than these various tasks I split my time between the Anglo-Norman Dictionary and the Books and Borrowing projects. For the former I completed adding explanatory notes to all of the ‘Introducing the AND’ pages. This was a very time consuming task as there were probably about 150 explanatory notes in total to add in, each appearing in a Bootstrap dialog box, and each requiring me to copy the note form the old website, add in any required HTML formatting, find and check all of the links to AND entries on the old site and add these in as required. It was pretty tedious to do, but it feels great to get it done, as the notes were previously just giving 404 errors on the new live site, and I don’t like having such things on a site I’m responsible for. I also migrated the academic articles from the old site to the new one (https://anglo-norman.net/articles/) which also required some manual formatting of the content. There are five other articles that I haven’t managed to migrate yet as they are full of character encoding errors on the old site. Geert is looking for copies of these articles that actually work and I’ll add them in once he’s able to get them to me. I also begin migrating the blog posts to the new site too. Currently the blog is hosted on Blogspot and there are 55 entries, but we’d like these to be an internal part of the new site. Migrating these is going to take some time as it means copying the text (which thankfully retains formatting) and then manually saving and embedding any images in the posts. I’m just going to do a few of these a week until they’re all done and so far I’ve migrated seven. I also needed to look into how the blogs page works in the WordPress theme I created for the AND, as to start with the page was just listing the full text of every post rather than giving summaries and links through to the full text of each. After some investigation I figured out that in my theme there is a script called ‘home.php’ and this is responsible for displaying all of the blog posts on the ‘blog’ page. It in turn calls another template called ‘content-blog.php’ which was previously set to display the full content of the blog. Instead I set it to display the title as a link through to the full post, the date and then an excerpt from the full blog, which can be accessed through a handy WordPress function called ‘the_excerpt()’.
For the Books and Borrowing project I made some improvements and fixes to the Content Management System. I’d been meaning to enhance the CMS for some time, but due to other commitments to other projects I didn’t have the time to delve into it. It felt good to find the time to return to the project this week.
I updated the ‘Books’ and ‘Borrowers’ tabs when viewing a library in the CMS. I added in pagination to speed up the loading of the pages. Pages are now split into 500 record blocks and you can navigate between pages using the links above and below the tables. For some reason the loading of the page is still a bit slow on the Stirling server whereas it was fine on the Glasgow server I was using for test purposes. I’m not entirely sure why as I’d copied the database over too – presumably the Stirling server is slower. However, it is still a massive improvement on the speed of the page previously.
I also changed the way tables scroll horizontally. Previously if a table was wider than the page a scrollbar appeared above and below the table, but this was rather awkward to use if you were looking at the middle of the table (you had to scroll up or down to the beginning or end of the table, then use the horizontal scrollbar to move the table along a bit, then navigate back to the section of the page you were interested in). Now the scrollbar just appears at the bottom of the browser window and can always be accessed no matter where in the table you are.
I also removed the editorial notes from tables by default to reduce clutter, and added in a button for showing / hiding the editors’ notes near the top of each page. I also added a limit option in the ‘Books’ and ‘Borrowers’ pages within a library to limit the displayed records to only those found in a specific ledger. I added in a further option to display those records that are not currently associated with any ledgers too.
I then deleted the ‘original borrowed date’ and ‘original returned date’ fields in St Andrews data as these were no longer required. I deleted these additional fields from the system and all data that were contained in these fields.
It had been noted that the book part numbers were not being listed numerically. As part numbers can contain text as well as numbers (e.g. ‘Vol. II’), this field in the database needed to be set as text rather than an integer. Unfortunately the database doesn’t order numbers correctly when they are contained in a non-numerical field – instead all the ones come first (1, 10, 11) then all the twos (2, 20, 22) etc. However, I managed to find a way to ensure that the numbers are ordered correctly.
I also fixed the ‘Add another Edition/Work to this holding’ button that was not working. This was caused by the Stirling server running a different version of PHP that doesn’t allow functions to have variable numbers of arguments. The autocomplete function was also not working at edition level and I investigated this. The issue was being caused by tab characters appearing in edition titles, and I updated my script to ensure these characters are stripped out before the data is formatted as JSON.
There may be further tweaks to be made – I’ll need to hear back from the rest of the team before I know more, but for now I’m up to date with the project. Next week I intend to get back into some of the larger and more trickier outstanding AND tasks (of which there are, alas, many) and to begin working towards adding the DSL bibliography data into the new version of the API.
Week 19 of Lockdown, and it was a short week for me as the Monday was the Glasgow Fair holiday. I spent a couple of days this week continuing to add features to the content management system for the Books and Borrowing project. I have now implemented the ‘normalised occupations’ part of the CMS. Originally occupations were just going to be a set of keywords, allowing one or more keyword to be associated with a borrower. However, we have been liaising with another project that has already produced a list of occupations and we have agreed to share their list. This is slightly different as it is hierarchical, with a top-level ‘parent’ containing multiple main occupations. E.g. ‘Religion and Clergy’ features ‘Bishop’. However, for our project we needed a third hierarchical level do differentiate types of minister/priest, so I’ve had to add this in too. I’ve achieved this by means of a parent occupation ID in the database, which is ‘null’ for top-level occupations and contains the ID of the parent category for all other occupations.
I completed work on the page to browse occupations, arranging the hierarchical occupations in a nested structure that features a count of the number of borrowers associated with the occupation to the right of the occupation name. These are all currently zero, but once some associations are made the numbers will go up and you’ll be able to click on the count to bring up a list of all associated borrowers, with links through to each borrower. If an occupation has any child occupations a ‘+’ icon appears beside it. Press on this to view the child occupations, which also have counts. The counts for ‘parent’ occupations tally up all of the totals for the child occupations, and clicking on one of these counts will display all borrowers assigned to all child occupations. If an occupation is empty there is a ‘delete’ button beside it. As the list of occupations is going to be fairly fixed I didn’t add in an ‘edit’ facility – if an occupation needs editing I can do it directly through the database, or it can be deleted and a new version created. Here’s a screenshot showing some of the occupations in the ‘browse’ page:
I also created facilities to add new occupations. You can enter an occupation name and optionally specify a parent occupation from a drop-down list. Doing so will add the new occupation as a child of the selected category, either at the second level if a top level parent is selected (e.g. ‘Agriculture’) or at the third level if a second level parent is selected (e.g. ‘Farmer’). If you don’t include a parent the occupation will become a new top-level grouping. I used this feature to upload all of the occupations, and it worked very well.
I then updated the ‘Borrowers’ tab in the ‘Browse Libraries’ page to add ‘Normalised Occupation’ to the list of columns in the table. The ‘Add’ and ‘Edit’ borrower facilities also now feature ‘Normalised Occupation’, which replicates the nested structure from the ‘browse occupations’ page, only features checkboxes beside each main occupation. You can select any number of occupations for a borrower and when you press the ‘Upload’ or ‘Edit’ button your choice will be saved. Deselecting all ticked checkboxes will clear all occupations for the borrower. If you edit a borrower who has one or more occupations selected, in addition to the relevant checkboxes being ticked, the occupations with their full hierarchies also appear above the list of occupations, so you can easily see what is already selected. I also updated the ‘Add’ and ‘Edit’ borrowing record pages so that whenever a borrower appears in the forms the normalised occupations feature also appears.
I also added in the option to view page images. Currently the only ledgers that have page images are the three Glasgow ones, but more will be added in due course. When viewing a page in a ledger that includes a page image you will see the ‘Page Image’ button above the table of records. Press on this and a new browser tab will open. It includes a link through to the full-size image of the page if you want to open this in your browser or download it to open in a graphics package. It also features the ‘zoom and pan’ interface that allows you to look at the image in the same manner as you’d look at a Google Map. You can also view this full screen by pressing on the button in the top right of the image.
Also this week I made further tweaks to the script I’d written to update lexeme start and end dates in the Historical Thesaurus based on citation dates in the OED. I’d sent a sample output of 10,000 rows to Fraser last week and he got back to me with some suggestions and observations. I’m going to have to rerun the script I wrote to extract the more than 3 million citation dates from the OED as some of the data needs to be processed differently, but as this script will take several days to run and I’m on holiday next week this isn’t something I can do right now. However, I managed to change the way the date matching script runs to fix some bugs and make the various processes easier to track. I also generated a list of all of the distinct labels in the OED data, with counts of the number of times these appear. Labels are associated with specific citation dates, thankfully. Only a handful are actually used lots of times, and many of the others appear to be used as a ‘notes’ field rather than as a more general label.
In addition to the above I also had a further conversation with Heather Pagan about the data management plan for the AND’s new proposal, responded to a query from Kathryn Cooper about the website I set up for her at the end of last year, responded to a couple of separate requests from post-grad students in Scottish Literature, spoke to Thomas Clancy about the start date for his Place-Names of Iona project, which got funded recently, helped with some issues with Matthew Creasy’s Scottish Cosmopolitanism website and spoke to Carole Hough about making a few tweaks to the Berwickshire Place-names website for REF.
I’m going to be on holiday for the next two weeks, so there will be no further updates from me for a while.
This was week 15 of Lockdown, which I guess is sort of coming to an end now, although I will still be working from home for the foreseeable future and having to juggle work and childcare every day. I continued to work on the Books and Borrowing project for much of this week, this time focussing on importing some of the existing datasets from previous transcription projects. I had previously written scripts to import data from Glasgow University library and Innerpeffray library, which gave us 14,738 borrowing records. This week I began by focussing on the data from St Andrews University library.
The St Andrews data is pretty messy, reflecting the layout and language of the original documents, so I haven’t been able to fully extract everything and it will require a lot of manual correcting. However, I did manage to migrate all of the data to a test version of the database running on my local PC and then updated the online database to incorporate this data.
The data I’ve got are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this: https://arts.st-andrews.ac.uk/transcribe/index.php?title=Page:UYLY205_2_Receipt_Book_1748-1753.djvu/100. The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers. Unfortunately the CSV version of the data doesn’t include the links or the linked to data, and as I wanted to try and pull in the data found on the linked pages I therefore needed to process the HTML instead.
I wrote a script that pulled in all of the files in the ‘HTML’ directory and processed each in turn. From the filenames my script could ascertain the ledger volume, its dates and the page number. For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10. The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.
The actual data in the file posed further problems. As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system. Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows). I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically. The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’). I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.
I also used this approach to set up books and borrowers too. If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to know which is which. However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows. I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code. There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.
The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system. I created four new ‘Additional Fields’ for St Andrews as follows:
- Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
- Code: This contains the full text of the second column (e.g. J.5.2)
- Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
- Original returned text: This contains the full date of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)
In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links. Where subsequent rows contain data in this column but no code, this data is then appended to the transcription. E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.
The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required. Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes. E.g. for the above the notes field contains:
‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit & au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. ) <a href=”http://library.st-andrews.ac.uk/record=b2447402~S1″>http://library.st-andrews.ac.uk/record=b2447402~S1</a></p>
—– profr_Shaws| <p><a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1409683484</a></p>
—– James_Key| <p>Possibly James Kay: <a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1389455860</a></p>
As mentioned earlier, the script also generates book and borrower records based on the linked pages too. I’ve chosen to set up book holding rather than book edition records as the details are all very vague and specific to St Andrews. In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with dashes (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field. One book item is created for each holding to be used to link to the corresponding borrowing records.
For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. <p>Possibly Thomas Duncan: <a href=”https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372″>https://arts.st-andrews.ac.uk/biographical-register/data/documents/1377913372</a></p>’). All borrowers are linked to records as ‘Main’ borrowers.
During the processing I noticed that the fourth ledger had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table. I therefore updated my script to check for the existence of this heading row, and if it exists my script then grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page. After my script had finished running we had 11147 borrowing records, 996 borrowers and 6395 book holding records for St Andrew in the system.
I then moved onto looking at the data for Selkirk library. This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books and borrowers and books connected to borrowings via unique identifiers. Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records and these will need to be manually generated. The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.
As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records. There are 612 Holding records for Selkirk. I also processed the borrower records, resulting in 86 borrower records being added. I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’ and the only other additional field is in the Holding records for ‘Subject’, that will eventually be merged with our ‘Genre’ when this feature becomes available.
Also this week I advised Katie on a file naming convention for the digitised images of pages that will be created for the project. I recommended that the filenames shouldn’t have spaces in them as these can be troublesome on some operating systems and that we’d want a character to use as a delimiter between the parts of the filename that wouldn’t appear elsewhere in the filename so it’s easy to split up the filename. I suggested that the page number should be included in the filename and that it should reflect the page number as it will be written into the database – e.g. if we’re going to use ‘r’ and ‘v’ these would be included. Each page in the database will be automatically assigned an auto-incrementing ID, and the only means of linking a specific page record in the database with a specific image will be via the page number entered when the page is created, so if this is something like ‘23r’ then ideally this should be represented in the image filename.
Katie had wondered about using characters to denote ledgers and pages in the filename (e.g. ‘L’ and ‘P’) but if we’re using a specific delimiting character to separate parts of the filename then using these characters wouldn’t be necessary and I suggested it would be better to not use ‘L’ as a lower case ‘l’ is very easy to confuse with a ‘1’ or a capital ‘I’ which might confuse future human users.
Instead I suggested using a ‘-‘ instead of spaces and a ‘_’ as a delimiter and pointed out that we should ensure that no other non-alphanumeric characters are ever used in the filename – no apostrophes, commas, colons, semi-colons, ampersands etc and to make sure the ‘-‘ is really a minus sign and not one of the fancy dashes (–) that get created by MS Office. This shouldn’t be an issue when entering a filename, but might be if a list of filenames is created in Word and then pasted into the ‘save as’ box, for example.
Finally, I suggested that it might be best to make the filenames entirely lower case, as some operating systems are case sensitive and if we don’t specify all lower case then there may be variation in the use of case. Following these guidelines the filenames would look something like this:
In addition to the Books and Borrowing project I worked on a number of other projects this week. I gave Matthew Creasy some further advice on using forums in his new project website, and ‘Scottish Cosmopolitanism at the Fin de Siècle’ website is now available here: https://scoco.glasgow.ac.uk/.
I also worked a bit more on using dates from the OED data in the Historical Thesaurus. Fraser had sent me a ZIP file containing the entire OED dataset as 240 XML files and I began analysing these to figure out how we’d extract these dates so that we could use them to update the dates associated with the lexemes in the HT. I needed to extract the quotation dates as these have ‘ante’ and ‘circa’ notes, plus labels. I noted that in addition to ‘a’ and ‘c’ a question mark is also used, somethings with an ‘a’ or ‘c’ and sometimes without. I decided to process things as follows:
- ?a will just be ‘a’
- ?c will just be ‘c’
- ? without an ‘a’ or ‘c’ will be ‘c’.
I also noticed that a date may sometimes be a range (e.g. 1795-8) so I needed to include a second date column in my data structure to accommodate this. I also noted that there are sometimes multiple Old English dates, and the contents of the ‘date’ tag vary depending on the date – sometimes the content is ‘OE’ and othertimes ‘lOE’ or ‘eOE’. I decided to process any OE dates for a lexeme as being 650 and to have only one OE date stored, so as to align with how OE dates are stored in the HT database (we don’t differentiate between date for OE words).
While running my date extraction script over one of the XML files I also noticed that there were lexemes in the OED data that were not present in the OED data we had previously extracted. This presumably means the dataset Fraser sent me is more up to date than the dataset I used to populate our online OED data table. This will no doubt mean we’ll need to update our online OED table, but as we link to the HT lexeme table using the OED catid, refentry, refid and lemmaid fields if we were to replace the online OED lexeme table with the data in these XML files the connections from OED to HT lexemes would be retained without issue (hopefully), but any matching processes we performed would need to be done again for the new lexemes.
I set my extraction script running on the OED XML files on Wednesday and processing took a long time. The script didn’t complete until sometime during Friday night, but after it had finished it had processed 238,699 categories, 754,285 lexemes, generating 3,893,341 date rows. It also found 4,062 new words in the OED data that it couldn’t process because they don’t exist in our OED lexeme database.
I also spent a bit more time working on some scripts for Fraser’s Scots Thesaurus project. The scripts now ignore ‘additional’ entries and only include ‘n.’ entries that match an HT ‘n’ category. Variant spellings are also removed (these were all tagged with <form> and I removed all of these). I also created a new field to store only the ‘NN_’ tagged words and remove all others.
The scripts generated three datasets, which I saved as spreadsheets for Fraser. The first (postagged-monosemous-dost-no-adds-n-only) contains all of the content that matches the above criteria. The second (postagged-monosemous-dost-no-adds-n-only-catheading-match) lists those lexemes where a postagged word fully matches the HT category heading. The final (postagged-monosemous-dost-no-adds-n-only-catcontents-match) lists those lexemes where a postagged word fully matches a lexeme in the HT category. For this table I’ve also added in the full list of lexemes for each HT category too.
I also spent a bit of time working on the Data Management Plan for the new project for Jane Stuart-Smith and Eleanor Lawson at QMU and arranged for a PhD student to get access to the TextGrid files that were generated for the audio records for the SCOTS Corpus project.
Finally, I investigated the issue the DSL people are having with duplicate child entries appearing in their data. This was due to something not working quite right in a script Thomas Widmann had written to extract the data from the DSL’s editing system before he left last year, and Ann had sent me some examples of where the issue was cropping up.
I have the data that was extracted from Thomas’s script last July as two XML files (dost.xml and snd.xml) and I looked through these for the examples Ann had sent. The entry for snd13897 contains the following URLs:
The first is the ID for the main entry and the other two are child entries. If I search for the second one (snds3788) this is the only occurrence of the ID in the file, as the child entry has been successfully merged. But if I search for the third one (sndns2217) I find a separate entry with this ID (with more limited content). The pulling in of data into a webpage in the V3 site uses URLs stored in a table linked to entry IDs. These were generated from the URLs in the entries in the XML file (see the <url> tags above). For the URL ‘sndns2217’ the query finds multiple IDs, one for the entry snd13897 and another for the entry sdnns2217. But it finds snd13897 first, so it’s the content of this entry that is pulled into the page.
The entry for dost16606 contains the following URLs:
(in addition to headword URLs). Searching for the second one discovers a separate entry with the ID dost50272 (with more limited content). As with SND, searching the URL table for this URL finds two IDs, and as dost16606 appears first this is the entry that gets displayed.
What we need to do is remove the child entries that still exist as separate entries in the data. To do this I could is write a script that would go through each entry in the dost.xml and snd.xml files. It would then pick out every <url> that is not the same as the entry ID and search the file to see if any entry exists with this ID. If it does then presumably this is a duplicate that should then be deleted. I’m waiting to hear back from the DSL people to see how we should proceed with this.
As you can no doubt gather from the above, this was a very busy week but I do at least feel that I’m getting on top of things again.
I worked on several different projects this week. One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus. Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels. This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates. I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database. Where this happens I added the text ‘full date mismatch’ with a red background at the end of the date’s section in my script.
Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field. Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say. Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question. I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels). The remaining rows can mostly be split into the following groups:
- Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
- Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number, another lexeme has ‘1340c’ instead of ‘c1340’
- Corrections made to the original fulldate that were not replicated in the actual date columns, E,g, ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-‘ in the ‘con’ column.
- Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
- Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E,g, a lexeme with the fulldate ‘1865 + 1865 rare’.
- Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’. This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date). I tried fixing this but ended up breaking other things so this is something that will need manual intervention. I don’t think it occurs very often, though. It’s a shame the same symbol was used to mean two different things.
It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes. Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields). Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates. After that we’ll be ready to generate the new date fields for real.
I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project. I went through all of the sample data and compiled a list of all of the fields found in each. This is a first step towards identifying a core set of fields and of mapping the analogous fields across different datasets. I also included the GU students and professors from Matthew’s pilot project but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing. With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.
I also continued to work on the Place-Names of Mull and Ulva project. I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text. I also began to work on the project’s API and front end. By the end of the week I managed to get an ‘in development’ version of the quick search working. Markers appear with labels and popups and you can change base map or marker type. Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates). The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.
Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network recourse and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the video page of SCOSYA.
I was on holiday last week and had quite a stack of things to do when I got back in on Monday. This included setting up a new project website for a student in Scottish Literature who had received some Carnegie funding for a project and preparing for an interview panel that I was on, with the interview taking place on Friday. I also responded to Alison Wiggins about the content management system I’d created for her Mary Queen of Scots letters project and had a discussion with someone in Central IT Services about the App and Play Store accounts that I’ve been managing for several years now. It’s looking like responsibility for this might be moving to IT services, which I think makes a lot of sense. I also gave some advice to a PhD student about archiving and preserving her website and engaged in a long email discussion with Heather Pagan of the Anglo-Norman Dictionary about sorting out their data and possibly migrating it to a new system. On Wednesday I had a meeting with the SCOSYA team about further developments of the public atlas. We decided on another few requirements and discussed timescales for the completion of the work. They’re hoping to be able to engage in some user testing in the middle of September, so I need to try and get everything completed before then. I had hoped to start on some of this on Thursday, but I was struck down by a really nasty cold that I’ve still not shaken yet, which made focussing on such tricky tasks as getting questionnaire areas to highlight when clicked on rather difficult.
I spent most of the rest of the week working for DSL in various capacities. I’d put in a request to get Apache Solr installed on a new server, so we could use this for free-text searching and thankfully Arts IT Support agreed to do this. A lot of my week was spent preparing the data from both the ‘v2’ version of the DSL (the data outputted from the original API, but with full quotes and everything pre-generated rather than being created on the fly every time an entry is requested) and the ‘v3’ API (data taken from the editing server and outputted by a script written by Thomas Widmann) so that it could be indexed by Solr. Raymond from Arts IT Support set up an instance of Solr on a new server and I created scripts that went through all 90,000 DSL entries in both versions and generated full-text versions of the entries that stripped out the XML tags. For each set I created three versions – on that was ‘full text’, one that was full text without the quotations and the other that was just the quotations. The script outputted this data in a format that Solr could work with and I sent this on to Raymond for indexing. The first test version I sent Raymond was just the full text, and Solr managed to index this without incident. However, the other views of the text required working with the XML a bit, and this appears to have brought in some issues with special characters that Solr is not liking. I’m still in the middle of sorting this out and will continue to look into it next week, but progress with the free-text searching is definitely being made and it looks like the new API will be able to offer the same level of functionality as the existing API. I also ensured I documented the process of generating all of the data from the XML files outputted by the editing system through to preparing the full-text for indexing by Solr, so next time we come to update the data we will know exactly what to do. This is much better than how things previously stood, as the original API is entirely a ‘black box’ with no documentation whatsoever as to how to update the data contained therein.
Also during this time I engaged in an email conversation about managing the dictionary entries and things like cross references with Ann Ferguson and the people who will be handling the new editor software for the dictionary, and helped to migrate control for the email part of the DSL domain to the control of the DSL’s IT people. We’re definitely making progress with sorting out the DSL’s systems, which is really great.
I’m going to be working for just three days over the next two weeks, and all of these days will be out of the office, so I’ll just need to see how much time I have to continue with the DSL tasks, especially as work for the SCOSYA project is getting rather urgent.
Monday was a holiday this week, so Tuesday was the start of my working week. I spent about half the day completing work on the Data Management Plan that I had been asked to write by the College of Arts research people, and the remainder of the day continuing to write scripts the help in the linkup of HT and OED lexeme data. The latest script gets all unmatched HT words that are monosemous within part of speech in the unmatched dataset. For each of these the script then retrieves all OED words where the stripped form matches, as does POS, but the words are already matched to a different HT lexeme. If there is more than one OED lexeme matched to the HT lexeme I’ve added the information on subsequent rows in the table, so that full OED category information can more easily be read. I’m not entirely sure what this script will be used for, but Fraser seems to think it will be useful in automatically pinpointing certain words that the OED are currently trying to manually track down.
During the week I also made some further updates to a couple of song stories for the RNSN project and had a meeting over the phone with Kirsteen McCue about a new project she’s in the planning stages for at the moment, and which I will be helping with the technical aspects for. I also had a meeting with PhD student Ewa Wanat about a website she’s putting together and gave her some advice about things.
The rest of my week was split between DSL and SCOSYA. For DSL I spent time answering a number of emails. I then went through the SND data that had been outputted by Thomas Widmann’s scripts on the DSL server. I had tried running this data through the script I’d written to take the outputted XML data and insert it into our online MySQL database, but my script was giving errors, stating that the input file wasn’t valid XML. I loaded the file (an almost 90Mb text file) into Oxygen and asked it to validate the XML. It took a while, but managed to find one easily identifiable error, and one error that was trickier to track down.
In the entry for snd22907 there was a closing </sense> tag in the ‘History’ that has no corresponding opening tag. This was easy to track down and manually fix. Entry snd12737 had two opening tags (<entry id=”snd12737″>) one below the <meta> tag. This was trickier to find as I needed to manually track it down by chopping the file in half, checking which half the error was in, chopping this bit in half and so on until I ended up with a very small file in which it was easy to locate the problem.
With the SND data fixed I could then run it through my script. However, I wanted to change the way the script worked based on feedback from Ann last week. Previously I had added new fields to a test version of the main database, and the script found the matching row and inserted new data. I decided instead to create an entirely new table for the new data, to keep things more cleanly divided, and to handle the possibility of there being new entries in the data that were not present in the existing database. I also needed to update the way in which the URL tag was handled, as Ann had explained that there could be any number of URL tags, with them referencing other entries that has been merged with the current one. After updating my test version of the database to make new tables and fields, and updating my script to take these changes into consideration I ran both the DOST and the SND data through the script, resulting in 50,373 entries for DOST and 34,184 entries for SND. This is actually less entries than in the old database. There are 3023 missing SND entries and 1994 missing DOST entries. They are all supplemental entries (with IDs starting ‘snds’ in SND and ‘adds’ in DOST). This leaves just 24 DOST ‘adds’ entries in the Sienna data and 2730 SND ‘snds’ entries. I’m not sure what’s going on with the output – whether the omission of these entries is intentional (because the entries have been merged with regular entries) or whether this is an error, but I have exported information about the missing rows and have sent these on to Ann for further investigation.
For SCOSYA I focussed on adding in the sample sound clips and groupings for location markers. I also engaged with some preparations for the project’s ‘data hack’ that will be taking place in mid June. Adding in sound clips took quite a bit of time, as I needed to update both the Content Management System to allow sound clips to be uploaded and managed, and the API to incorporate links to the uploaded sound clips. This is in addition to incorporating the feature into the front-end.
Now if a member of staff logs into the CMS and goes to the ‘Browse codes’ page they will see a new column that lists the number of sound clips associated with a code. I’ve currently uploaded the four for E1 ‘This needs washed’ for test purposes. From the table, clicking on a code loads its page, which now includes a new section for sound clips. Any previously uploaded ones are listed and can be played or deleted. New clips in MP3 format can also be uploaded here, with files being renamed upon upload, based on the code and the next free auto-incrementing number in the database.
In the API all soundfiles are included in the ‘attributes’ endpoint, which is used by the drop-down list in the atlas. The public atlas has also been updated to include buttons to play any sound clips that are available, as the screenshot towards the end of this post demonstrates.
There is now a new section labelled ‘Listen’ with four ‘Play’ icons. Pressing on one of these plays a different sound clip. Getting these icons to work has been more tricky than you might expect. HTML5 has a tag called <audio> that browsers can interpret in order to create their own audio player which is then embedded in the page. This is what happens in the CMS. Unfortunately an interface designer has no control over the display of the player – it’s different in every browser and generally takes up a lot of room, which we don’t really have. I initially just used the HTML5 audio player but each sound clip then had to appear on a new row and the player in Chrome was too wide for the side panel.
I then moved on to looking at the groups for locations. This will be a fixed list of groups, taken from ones that the team has already created. I copied these groups to a new location in the database and updated the API to create new endpoints for listing groups and bringing back the IDs of all locations that are contained within a specified group. I haven’t managed to get the ‘groups’ feature fully working yet, but the selection options are now in place. There’s a ‘groups’ button in the ‘Examples’ section, and when you click on this a section appears listing the groups. Each group appears as a button. When you click on one a ‘tick’ is added to the button and currently the background turns a highlighted green colour. I’m going to include several different highlighted colours so the buttons light up differently. Although it doesn’t work yet, these colours will then be applied to the appropriate group of points / areas on the map. You can see an example of the buttons below:
The only slight reservation I have is that this option does make the public atlas more complicated to use and more cluttered. I guess it’s ok as the groups are hidden by default, though. There also may be an issue with the sidebar getting too long for narrow screens that I’ll need to investigate.
This week I mainly working on three projects: The Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network. For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets. Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday. By Wednesday I had ticked off most of the items, which I’ll summarise here.
Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only advanced searches for words, so I investigated and fixed this. I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to ‘ and ‘the’). This worked well and has bumped matches up to better colour levels, but has resulted in some words getting matched multiple times. E.g. when removing ‘to’, ‘of’ etc this results in a form that then appears multiple times. For example, in one category the OED has ‘bless’ twice (presumably an erro?) and HT has ‘bless’ and ‘bless to’. With ‘to’ removed there then appear to be more matches that there should be. However, this is not an issue when dates are also taken into consideration. I also updated the script so that categories where there are 3 matches and at least 66% of words match have been promoted from orange to yellow.
When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off and wondered whether category heading pattern matching had worked properly. After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is they contain too few words to have reached the criteria for ticking off.
Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but different numbers. I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria). There are currently 19, 17 and 25 green, lime green and yellow matches that could be ticked off. I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ I ticked off the few matches that there were. Most were empty categories and there were less than 15 in total.
Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms. We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories. It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms. This has helped to more accurately identify matched categories. I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.
I also worked on the date fingerprinting script. This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data. I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified. I ticked off all green (1556), lime green (22) and yellow (123) matches.
Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year. The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process. For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier). There are 7148 OED categories that are currently unmatched but were matched in V1. Almost 4000 of these are empty categories. There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match. But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words. It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2. For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02. There are possibly patterns in the 1504 orange matches that could be exploited too.
Finally, I updated the stats page to include information about main and subcats. Here are the current unmatched figures:
Unmatched (with POS): 8629
Unmatched (with POS and not empty): 3414
Unmatched Main Categories (with POS): 5036
Unmatched Main Categories (with POS and not empty): 1661
Unmatched Subcategories (with POS): 3573
Unmatched Subcategories (with POS and not empty): 1753
So we are getting there!
For the Bilingual Thesaurus I completed an initial version of the website this week. I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise. This might be changed again, but for now here is an example of how the resource looks:
The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters. As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’ but ‘bread’ won’t match anything because there are no words or categories with this exact text. You need to use an asterisk wildcard to find text within word or category text: ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.
The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates and languages or origin and citation. Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period. E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.
As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned. Where search types allow multiple options to be selected (i.e. part of speech and languages of origin and citation) if multiple options in each list are selected these are joined by ‘OR’. E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages or origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on powerpoint presentations that had been sent to me. This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations, formatting the text as HTML, reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site), uploading the audio files, adding in the HTML5 audio tags to get the audio files to play, creating the individual pages for each timeline entry / storymap entry. It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 Powerpoint slides. Still, the end result works really well, so I think it’s worth putting the effort in.
In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app. I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create. The app would be interactive and highly dependent on logging user interactions as accurately as possible. I created looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, that the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough. In fact on touchscreens more often than not a ‘tap’ wasn’t even being registered. D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are made in this way, so unfortunately it looks like I won’t be able to help out with this project. Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in this sort of timescale.
Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server. I think that’s just about everything to report.
I spent a fair amount of time this week working on the REELS project, which began last week. I set up a basic WordPress powered project website and got some network drive space set up and then on Wednesday we had a long meeting where we went over some of the technical aspects of the project. We discussed the structure of the project website and also the structure of the database that the project will require in order to record the required place-name data. I spent the best part of Thursday writing a specification document for the database and content management system which I sent to the rest of the project team for comment on Thursday evening. Next week I will update this document based on the team’s comments and will hopefully find the time to start working on the database itself.
I met with a PhD student this week to discuss online survey tools that might be suitable for the research that she was hoping to gather. I heard this week from Bryony Randall in English Literature that an AHRC proposal that I’d given her some technical advice on had been granted funding, which is great news. I had a brief meeting with the SCOSYA team this week too, mainly to discuss development of the project website. We’re still waiting on the domain being activated, but we’re also waiting for a designer to finish work on a logo for the project so we can’t do much about the interface for the project website until we get this anyway.
I also attended the ‘showcase’ session for the Digging into Data conference that was taking place at Glasgow this week. The showcase was an evening session where projects had stalls and could speak to attendees about their work. I was there with the Mapping Metaphor project, along with Wendy, Ellen and Rachael. We had some interesting and at times pretty in-depth discussions with some of the attendees and it was a good opportunity to see the sorts of outputs other projects have created with their data.
Before the event I went through the website to remind myself of how it all worked and managed to uncover a bug in the top-level visualisation: When you click on a category yellow circles appear at the categories the one you’ve clicked on have a connection to. These circles represent the number of metaphorical connections between the two categories. What I noticed was that the size of the circles was not taking into consideration the metaphor strength that had been selected, which was giving confusing results. E.g. if there are 14 connections but only one of these is ‘strong’ and you’ve selected to view only strong metaphors the circle size was still being based on 14 connections rather than one. Thankfully I managed to track down the cause of the error and I fixed it before the event.
I also spent a little bit of time further investigating the problems with the Curious Travellers server, which for some reason is blocking external network connections. I was hoping to install a ‘captcha’ on the contact form to cut down on the amount of spam that was being submitted and the Contact Form 7 plugin has a facility to integrated Google’s ‘reCaptcha’ service. This looked like it was working very well, but for some reason when ‘reCaptcha’ was added to forms these forms failed to submit, instead giving error messages in a yellow box. The Contact Form 7 documentation suggests that a yellow box means the content has been marked as spam and therefore won’t send, but my message wasn’t spam. Removing ‘reCaptcha’ from the form allowed it to submit without any issue. I tried to find out what was causing this but have been unable to find an answer. I can only assume it is something to do with the server blocking external connections and somehow failing to receive a ‘message is not spam’ notification from the service. I think we’re going to have to look at moving the site to a different server unless Chris can figure out what’s different about the settings on the current one.
My final project this week was SAMUELS, for which I am continuing to work on the extraction of the Hansard data. Last week I figured out how to run a test job on the Grid and I split the gigantic Hansard text file into 5000 line chunks for processing. This week I started writing a shell script that will be able to process these chunks. The script needs to do the same tasks as my initial PHP script, but because of the setup of the Grid I need to write a script that will run directly in the Bash shell. I’ve never done much with shell scripting so it’s taken me some time to figure out how to write such a script. So far I have managed to write a script that takes a file as an input, goes through each line at a time, splits the line up into two sections based on the tab character, base64 decodes each section and then extracts the parts of the first section into variables. The second section is proving to be a little trickier as the decoded content includes line breaks which seem to be ignored. Once I’ve figured out how to work with the line breaks I should then be able to isolate each tag / frequency pair, write the necessary SQL insert statement and then write this to an output file. Hopefully I’ll get this sorted next week.