I had two Zoom calls this week, the first on Wednesday with Kirsteen McCue to discuss a new, small project to publish a selection of musical settings to Burns poems and the second on Friday with Joanna Kopaczyk and her RA on the Scots Language Policy project to give a tutorial on how to use WordPress.
The majority of my week was divided between the Anglo-Norman Dictionary, the Dictionary of the Scots Language and the Place-names of Iona projects. For the AND I made a few tweaks to the static content of the site and migrated some more blog posts across to the new site (these are not live yet). I also added commentaries to more than 260 entries, which took some time to test. I also worked on the DTD file that the editors reference from their XML editing software to ensure that all of the elements and attributes found within commentaries are ‘allowed’ in the XML. Without doing this it was possible to add the tags in, but this would give errors in the editing software. I also batch updated all of the entries on the site to reference the new DTD and exported all of the files, zipped them up and sent them to the editors so they can work on them as required. I also began to think about migrating the TextBase from the old site to the new one, and managed to source the XML files that comprise this system. It looks like it may be quite tricky to work with these as there are more than 70 book-length XML files to deal with and so far I have not managed to locate the XSLT that was originally used to process these files.
For the DSL I completed work on the new bibliography search pages that use the new ‘V4’ data. These pages allow the authors and titles of bibliographical items to be searched, results to be viewed and individual items to be displayed. I also made some minor tweaks to the live site and had a discussion with Ann Fergusson about transferring the project’s data to the people who have set up a new editing interface for them, something I’m hoping to be able to tackle next week.
For the Place-names of Iona project I had a discussion about implementing a new ‘work of the month’ feature and spent quite a bit of time investigating using 10-digit OS grid references in the project’s CMS. The team need to use up to 10-digit grid references to get 1m accuracy for individual monuments, but the library I use in the CMS to automatically generate latitude and longitude from the supplied grid reference will only work with a 6-digit NGR. The automatically generated latitude and longitude are then automatically passed to Google Maps to ascertain the altitude of the location and all of this information is stored in the database whenever a new place-name record is created or an existing record is edited.
As the library currently in use will only accept 6-digit NGRs I had to do a bit of research into alternative libraries, and I managed to find one that can accept NGRs of 2,4,6,8 or 10 digits. Information about the library, including text boxes where you can enter an NGR and see the results can be found here: http://www.movable-type.co.uk/scripts/latlong-os-gridref.html along with an awful lot of description about the calculations and some pretty scary looking formulae.
This does mean the person filling out the form can see the generated latitude and longitude and also tweak it if required before submitting the form, which is a potentially useful thing. I may even be able to add a Google Map to the form so you can see (and possibly tweak) the point before submitting the form, but I’ll need to look into this further. I also still need to work on the format of the latitude and longitude as the new library generates them with a compass point (e.g. 6.420848° W) and we need to store them as a purely decimal value (e.g. -6.420848) with ‘W’ and ‘S’ figures being negatives.
However, whilst researching this I discovered a potentially worrying thing that needs discussion with the wider team. The way the Ordnance Survey generates latitude and longitude from their grid references was changed in 2014. Information about this can be found in the page linked to above in the ‘Latitude/longitudes require a datum’ section. Previously the OS used ‘OSGB-36’ to generate latitude and longitude, but in 2014 this was changed to ‘WGS84’, which is used by GPS systems. The difference in the latitude / longitude figures generated by the two systems is about 100 metres, which is quite a lot if you’re intending to pinpoint individual monuments.
The new library has facilities to generate latitude and longitude using either the new or old systems, but defaults to the new system. I’ve checked the output of the library we currently use and it uses the old ‘OSGB-36’ system. This means all of the place-names in the system so far (and all those for the previous projects) have latitudes and longitudes generated using the now obsolete (since 2014) system. To give an example of the difference, the place-name A’ Mhachair in the CMS has this location: https://www.google.com/maps/place/56%C2%B019’33.2%22N+6%C2%B025’11.4%22Wfirstname.lastname@example.org,-6.422022,582m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d56.325885!4d-6.419828 and with the newer ‘WGS84’ system it would have this location: https://www.google.com/maps/place/56%C2%B019’32.7%22N+6%C2%B025’15.1%22Wemail@example.com,-6.4230367,582m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d56.325744!4d-6.420848
So what we need to decide before I replace the old library with the new one in the CMS is whether we switch to using ‘WGS84’ or we keep using ‘OSGB-36’. As I say, this will need further discussion before I implement any changes.
Also this week I responded to a query from Cris Sarg of the Medical Humanities Network project, spoke to Fraser Dallachy about future updates to the HT’s data from the OED, made some tweaks to the structure of the SCOSYA website for Jennifer Smith, added a plugin to the Editing Burns site for Craig Lamont and had a chat with the Books and Borrowing people about cleaning the authors data, importing the Craigston data and how to deal with a lot of borrowers that were excluded from the Selkirk data that I previously imported.
Next week I’ll be on holiday from Monday to Wednesday to cover the school half term.
I headed into the University for the first time this year on Wednesday this week to collect a new iPad that I’d ordered and to get some files from my office. It was great to see the old place again, but it did take quite a chunk out of my day to travel there and back, especially as I’m still home-schooling either a morning or an afternoon each day at the moment too.
As with last week, I mainly divided my time this week between the Dictionary of the Scots Language, the Anglo-Norman Dictionary and the Books and Borrowing project, with a few other bits and bobs added in as well. For the DSL I retrieved the source code for my original Scots School Dictionary app from my office so we can host this somewhere on the DSL website. This is because the DSL have commissioned someone else to make a new School Dictionary app, which launched this week, but doesn’t include an ‘English to Scots’ feature as the old app does, so we’re going to make the old app available as a website for those people who miss the feature. I also made a few minor tweaks to the main DSL site, and then focussed on adding bibliography search facilities to the new version of the API, a task that I’d begun last week.
I created a new table for the bibliographical data that includes the various fields used for DOST (note, author, editor, date, longtitle etc) and a field for the XML data used for SND. I then created two further tables for searching, one that contains every author and editor name for each item (for DOST there may be different names in the author, editor, longauthor and longeditor fields while for SND there may be any number of <author> tags) and the other containing every title for each item (DOST may have different text in title and longtitle while SND items can have any number of <title> tags). These tables allow you to search for any variant author, editor or title and find the item.
I also created two additional fields in the bibliography table that contain the ‘display author’ and ‘display title’. These are the forms that get displayed in the search results before you click on an item to open the full bibliographical entry. I then updated the V4 API to add in facilities to search and retrieve the bibliographies. I didn’t have the time to connect to this API and to implement the search on the Sienna test site, which is something I hope to do next week, but the logic behind the search and display of bibliographies is all there. There is a predictive search that will be used to generate the autocomplete list, similar to how the live site currently works: You will be able to select whether your search is for authors, titles or both and when you start typing in some text a list of matching items will appear, e.g. typing in ‘ham’ for authors in both dictionaries will display the following all items containing ‘ham’ and when you select an item this will then perform a search for the specific text. You will then be able to click on an item to view the full bibliography. This is a bit different to how the live site currently works, as with these if you enter ‘ham’ and select (for example) ‘Hamilton, J,’ from the autocomplete list you are taken directly to a page that lists all of the items for the author. However, we can’t do that any more as we no longer have unique identifiers that group bibliographical items by author. I may be able to do something similar with the page that comes up when you select an author, but this would have to rely on the name to group items together and a name may not be unique.
For the AND I made some tweaks to the website, such as adding a link to the search page if you type some text into the ‘jump to entry’ option and no matching entries are found. I then spent the rest of my time continuing to develop the new content management system, specifically the pages for managing source texts. I finished work on this, adding in facilities to add, edit, browse and delete source texts from the database. I then migrated the DTD to the new site, which is referenced by the editors’ XML editor when they work on the entry XML files. The DTD on the old server referenced several lists of things that are then used to populate drop-down lists of options in the XML editor. I migrated these too, making them dynamically generated from the underlying database rather than statis lists, meaning when (for example) new source texts are added to the CMS these will automatically become available when using the XML editor.
For the Books and Borrowing project I participated in the project’s Zoom call on Monday to discuss the project’s CMS and how to amalgamate the various duplicate author records that resulted from data uploads from different libraries. After the call I made some required changes to the CMS, such as making the editor’s notes fields visible by default again, and worked on the duplicate authors matching script to add in further outputs when comparing the author names with Levenshtein ratings of 1 and 2. I also reviewed some content that was sent to us from another library.
Also this week I responded to an email from James Caudle in Scottish Literature about a potential project he’s setting up, made a couple of changes to the Scots Language Policy website, made some tweaks to the menu structure for the Scots Syntax Atlas project and gave some advice to a post-grad student who had contacted me about setting up a corpus.
This was my first full week back of the year, although it was also the first week of a return to homeschooling, which made working a little trickier than usual. I also had a dentist’s appointment on Tuesday and lost some time to that due to my dentist being near the University rather than where I live. However, despite these challenges I was able to achieve quite a lot this week. I had two Zoom calls, the first on Monday to discuss a new ESRC grant that Jane Stuart-Smith is putting together with colleagues at Strathclyde while the second on Wednesday was with a partner in Joanna Kopaczyk’s new RSC funded project about Scots Language Policy to discuss the project’s website and the survey they’re going to put out. I also made a few tweaks to the DSL website, replied to Kirsteen McCue about the AHRC proposal she’s currently putting together, replied to a query regarding the technologies behind the Scots Syntax Atlas, made a few further updates to the Burns Supper map and replied to a query from Rachel Fletcher in English Language about lemmatising Old English.
Other than these various tasks I split my time between the Anglo-Norman Dictionary and the Books and Borrowing projects. For the former I completed adding explanatory notes to all of the ‘Introducing the AND’ pages. This was a very time consuming task as there were probably about 150 explanatory notes in total to add in, each appearing in a Bootstrap dialog box, and each requiring me to copy the note form the old website, add in any required HTML formatting, find and check all of the links to AND entries on the old site and add these in as required. It was pretty tedious to do, but it feels great to get it done, as the notes were previously just giving 404 errors on the new live site, and I don’t like having such things on a site I’m responsible for. I also migrated the academic articles from the old site to the new one (https://anglo-norman.net/articles/) which also required some manual formatting of the content. There are five other articles that I haven’t managed to migrate yet as they are full of character encoding errors on the old site. Geert is looking for copies of these articles that actually work and I’ll add them in once he’s able to get them to me. I also begin migrating the blog posts to the new site too. Currently the blog is hosted on Blogspot and there are 55 entries, but we’d like these to be an internal part of the new site. Migrating these is going to take some time as it means copying the text (which thankfully retains formatting) and then manually saving and embedding any images in the posts. I’m just going to do a few of these a week until they’re all done and so far I’ve migrated seven. I also needed to look into how the blogs page works in the WordPress theme I created for the AND, as to start with the page was just listing the full text of every post rather than giving summaries and links through to the full text of each. After some investigation I figured out that in my theme there is a script called ‘home.php’ and this is responsible for displaying all of the blog posts on the ‘blog’ page. It in turn calls another template called ‘content-blog.php’ which was previously set to display the full content of the blog. Instead I set it to display the title as a link through to the full post, the date and then an excerpt from the full blog, which can be accessed through a handy WordPress function called ‘the_excerpt()’.
For the Books and Borrowing project I made some improvements and fixes to the Content Management System. I’d been meaning to enhance the CMS for some time, but due to other commitments to other projects I didn’t have the time to delve into it. It felt good to find the time to return to the project this week.
I updated the ‘Books’ and ‘Borrowers’ tabs when viewing a library in the CMS. I added in pagination to speed up the loading of the pages. Pages are now split into 500 record blocks and you can navigate between pages using the links above and below the tables. For some reason the loading of the page is still a bit slow on the Stirling server whereas it was fine on the Glasgow server I was using for test purposes. I’m not entirely sure why as I’d copied the database over too – presumably the Stirling server is slower. However, it is still a massive improvement on the speed of the page previously.
I also changed the way tables scroll horizontally. Previously if a table was wider than the page a scrollbar appeared above and below the table, but this was rather awkward to use if you were looking at the middle of the table (you had to scroll up or down to the beginning or end of the table, then use the horizontal scrollbar to move the table along a bit, then navigate back to the section of the page you were interested in). Now the scrollbar just appears at the bottom of the browser window and can always be accessed no matter where in the table you are.
I also removed the editorial notes from tables by default to reduce clutter, and added in a button for showing / hiding the editors’ notes near the top of each page. I also added a limit option in the ‘Books’ and ‘Borrowers’ pages within a library to limit the displayed records to only those found in a specific ledger. I added in a further option to display those records that are not currently associated with any ledgers too.
I then deleted the ‘original borrowed date’ and ‘original returned date’ fields in St Andrews data as these were no longer required. I deleted these additional fields from the system and all data that were contained in these fields.
It had been noted that the book part numbers were not being listed numerically. As part numbers can contain text as well as numbers (e.g. ‘Vol. II’), this field in the database needed to be set as text rather than an integer. Unfortunately the database doesn’t order numbers correctly when they are contained in a non-numerical field – instead all the ones come first (1, 10, 11) then all the twos (2, 20, 22) etc. However, I managed to find a way to ensure that the numbers are ordered correctly.
I also fixed the ‘Add another Edition/Work to this holding’ button that was not working. This was caused by the Stirling server running a different version of PHP that doesn’t allow functions to have variable numbers of arguments. The autocomplete function was also not working at edition level and I investigated this. The issue was being caused by tab characters appearing in edition titles, and I updated my script to ensure these characters are stripped out before the data is formatted as JSON.
There may be further tweaks to be made – I’ll need to hear back from the rest of the team before I know more, but for now I’m up to date with the project. Next week I intend to get back into some of the larger and more trickier outstanding AND tasks (of which there are, alas, many) and to begin working towards adding the DSL bibliography data into the new version of the API.
Week 16 of Lockdown and still working from home. I continued working on the data import for the Books and Borrowers project this week. I wrote a script to import data from Haddington, which took some time due to the large number of additional fields in the data (15 across Borrowers, Holdings and Borrowings), but are executing it resulted in a further 5,163 borrowing records across 2 ledgers and 494 pages being added, including 1399 book holding records and 717 borrowers.
I then moved onto the datasets from Leighton and Wigtown. Leighton was a much smaller dataset, with just 193 borrowing records over 18 pages in one ledger and involving 18 borrowers and 71 books. As before, I have just created book holding records for these (rather than project-wide edition records), although in this case there are authors for books too, which I have also created. Wigtown was another smaller dataset. The spreadsheet has three sheets, the first is a list of borrowers, the second a list of borrowings and the third a list of books. However, no unique identifiers are used to connect the borrowers and books to the information in the borrowings sheet and there’s no other field that matches across the sheets to allow the data to be automatically connected up. For example, in the Books sheet there is the book ‘History of Edinburgh’ by author ‘Arnot, Hugo’ but in the borrowings tab author surname and forename are split into different columns (so ‘Arnot’ and ‘Hugo’ and book titles don’t match (in this case the book appears as simply ‘Edinburgh’ in the borrowings). Therefore I’ve not been able to automatically pull in the information from the books sheet. However, as there are only 59 books in the books sheet it shouldn’t take too much time to manually add the necessary data when created Edition records. It’s a similar issue with Borrowers in the first sheet – they appear with name in one column (e.g. ‘Douglas, Andrew’) but in the Borrowings sheet the names are split into separate forename and surname columns. There are also instances of people with the same name (e.g. ‘Stewart, John’) but without unique identifiers there’s no way to differentiate these. There are only 110 people listed in the Borrowers sheet, and only 43 in the actual borrowing data, so again, it’s probably better if any details that are required are added in manually.
I imported a total of 898 borrowing records for Wigtown. As there is no page or ledger information in the data I just added these all to one page in a made-up ledger. It does however mean that the page can take quite a while to load in the CMS. There are 43 associated borrowers and 53 associated books, which again have been created as Holding records only and have associated authors. However, there are multiple Book Items created for many of these 53 books – there are actually 224 book items. This is because the spreadsheet contains a separate ‘Volume’ column and a book may be listed with the same title but a different volume. In such cases a Holding record is made for the book (e.g. ‘Decline and Fall of Rome’) and an Item is made for each Volume that appears (in this case 12 items for the listed volumes 1-12 across the dataset). With these datasets imported I have now processed all of the existing data I have access to, other than the Glasgow Professors borrowing records, but these are still being worked on.
I did some other tasks for the project this week as well, including reviewing the digitisation policy document for the project, which lists guidelines for the team to follow when they have to take photos of ledger pages themselves in libraries where no professional digitisation service is available. I also discussed how borrower occupations will be handled in the system with Katie.
In addition to the Books and Borrowers project I found time to work on a number of other projects this week too. I wrote a Data Management Plan for an AHRC Networking proposal that Carolyn Jess-Cooke in English Literature is putting together and I had an email conversation with Heather Pagan of the Anglo-Norman Dictionary about the Data Management Plan she wants me to write for a new AHRC proposal that Glasgow will be involved with. I responded to a query about a place-names project from Thomas Clancy, a query about App certification from Brian McKenna in IT Services and a query about domain name registration from Eleanor Lawson at QMU. Also (outside of work time) I’ve been helping my brother-in-law set up Beacon Genealogy, through which he offers genealogy and family history research services.
Also this week I worked with Jennifer Smith to make a number of changes to the content of the SCOSYA website (https://scotssyntaxatlas.ac.uk/) to provide more information about the project for REF purposes and I added a new dataset to the interactive map of Burns Suppers that I’m creating for Paul Malgrati in Scottish Literature. I also went through all of the WordPress sites I manage and upgraded them to the most recent version of WordPress.
Finally, I spent some time writing scripts for the DSL people to help identify child entries in the DOST and SND datasets that haven’t been properly merged with main entries when exported from their editing software. In such cases the child entries have been added to the main entries, but then they haven’t been removed as separate entries in the output data, meaning the child entries appear twice. When attempting to process the SND data I discovered there were some errors in the XML file (mismatched tags) that prevented my script from processing the file, so I had to spend some time tracking these down and fixing them. But once this had been done my script could do through the entire dataset, look for an ID that appeared as a URL in one entry and as an ID of another entry and in such cases pull out the IDs and the full XML of each entry and export it into an HTML table. There were about 180 duplicate child entries in DOST but a lot more in SND (the DOST file is about 1.5mb, the SND one is about 50mb). Hopefully once the DSL people have analysed the data we can then strip out the unnecessary child entries and have a better dataset to import into the new editing system the DSL is going to be using.
During week 11 of Lockdown I continued to work on the Books and Borrowing project, but also spent a fair amount of time catching up with other projects that I’d had to put to one side due to the development of the Books and Borrowing content management system. This included reading through the proposal documentation for Jennifer Smith’s follow-on funding application for SCOSYA, and writing a new version of the Data Management Plan based on this updated documentation and making some changes to the ‘export data for print publication’ facility for Carole Hough’s REELS project. I also spent some time creating as new export facility to format the place-name elements and any associated place-names for print publication too.
During this week a number of SSL certificates expired for a bunch of websites, which meant browsers were displaying scary warning messages when people visited the sites. I had to spend a bit of time tracking these down and passing the details over to Arts IT Support for them to fix as it is not something I have access rights to do myself. I also liaised with Mike Black to migrate some websites over from the server that houses many project websites to a new server. This is because the old server is running out of space and is getting rather temperamental and freeing up some space should address the issue.
I also made some further tweaks to Paul Malgrati’s interactive map of Burns’ Suppers and created a new WordPress-powered project website for Matthew Creasy’s new ‘Scottish Cosmopolitanism at the Fin de Siècle’ project. This included the usual choosing a theme, colour schemes and fonts, adding in header images and footer logos and creating initial versions of the main pages of the site. I’d also received a query from Jane Stuart-Smith about the audio recordings in the SCOTS Corpus so I did a bit of investigation about that.
Fraser Dallachy had got back to me with some further tasks for me to carry out on the processing of dates for the Historical Thesaurus, and I had intended to spend some time on this towards the end of the week, but when I began to look into this I realised that the scripts I’d written to process the old HT dates (comprising 23 different fields) and to generate the new, streamlined date system that uses a related table with just 6 fields were sitting on my PC in my office at work. Usually all the scripts I work on are located on a server, meaning I can easily access them from anywhere by connecting to the server and downloading them. However, sometimes I can’t run the scripts on the server as they may need to be left running for hours (or sometimes days) if they’re processing large amounts of data or performing intensive tasks on the data. In these cases the scripts run directly on my office PC, and this was the situation with the dates script. I realised I would need to get into my office at work on retrieve the scripts, so I put in a request to be allowed into work. Staff are not currently allowed to just go into work – instead you need to get approval from your Head of School and then arrange a time that suits security. Thankfully it looks like I’ll be able to go in early next week.
Other than these issues, I spent my time continuing to work for the Books and Borrowing project. On Tuesday we had a Zoom call with all six members of the core project team, during which I demonstrated the CMS as it currently stands. This gave me an opportunity to demonstrate the new Author association facilities I had created last week. The demonstration all went very smoothly and I think the team are happy with how the system works, although no doubt once they actually begin to use it there will be bugs to fix and workflows to tweak. I also spent some time before the meeting testing the system again, and fixing some issues that were not quite right with the author system.
I spent the remainder of my time on the project completing work on the facility to add, edit and view book holding records directly via the library page, as opposed to doing so whilst adding / editing a borrowing record. I also implemented a similar facility for borrowers as well. Next week I will begin to import some of the sample data from various libraries into the system and will allow the team to access the system to test it out.
This was the first full week of the Coronavirus lockdown and as such I was working from home and also having to look after my nine year-old son who is also at home on lockdown. My wife and I have arranged to split the days into morning and afternoon shifts, with one of us home-schooling our son while the other works during each shift and extra work squeezed in before and after these shifts. The arrangement has worked pretty well for all of us this week and I’ve managed to get a fair amount of work done.
This included spotting and requesting fixes for a number of other sites that had started to display scary warnings about their SSL certificates, working on an updated version of the Data Management Plan for the SCOSYA follow-on proposal, fixing some log-in and account related issues for the DSL people and helping Carolyn Jess-Cooke in English Literature with some technical issues relating to a WordPress blog she has set up for a ‘Stay at home’ literary festival (https://stayathomefest.wordpress.com/). I also had a conference call with Katie Halsey and Matt Sangster about the Books and Borrowers project, which is due to start at the beginning of June. It was my first time using the Zoom videoconferencing software and it worked very well, other than my cat trying to participate several times. We had a good call and made some plans for the coming weeks and months. I’m going to try and get an initial version of the content management system and database for the project in place before the official start of the project so that the RAs will be able to use this straight away. This is of even greater importance now as they are likely to be limited in the kinds of research activities they can do at the start of the project because of travel restrictions and will need to work with digital materials.
Other than these issues I divided my time between three projects. The first was the Burns Supper map for Paul Malgrati in Scottish Literature. Paul had sent me some images that are to be used in the map and I spent some time integrating these. The image appears as a thumbnail with credit text (if available) appearing underneath. If there is a link to the place the image was taken from the credit text appears as a link. Clicking on the image thumbnail opens the full image in a new tab. I also added links to the videos where applicable too, but I decided not to embed the videos in the page as I think these would be too small and there would be just too much going on for locations that have both videos and an image. Paul also wanted clusters to be limited by areas (e.g. a cluster for Scotland rather than these just being amalgamated with a big cluster for Europe when zooming out) and I investigated this. I discovered that it is possible to create groups of locations. E.g. have a new column in the spreadsheet named ‘cluster’ or something like that and all the ‘Scotland’ ones could have ‘Scotland’ here, or all the South American ones could have ‘South America’ here. These will then be the top level clusters and they will not be further amalgamated on zoom out. Once Paul gets back to me with the clusters he would like for the data I’ll update things further. Below is an image of the map with the photos embedded:
The second major project I worked on was the interactive map for Gerry McKeever’s Regional Romanticism project. Gerry had got back to me with a new version of the data he’d been working on and some feedback from other people he’d sent the map to. I created a new version of the map featuring the new data and incorporated some changes to how the map worked based on feedback, namely I moved the navigation buttons to the top of the story pane and have made them bigger, with a new white dividing line between the buttons and the rest of the pane. This hopefully makes them more obvious to people and means the buttons are immediately visible rather than people potentially having to scroll to see them. I’ve also replaced the directional arrows with thicker chevron icons and have changed the ‘Return to start’ button to ‘Restart’. I’ve also made the ‘Next’ button on both the overview and the first slide blink every few seconds, at Gerry’s request. Hopefully this won’t be too annoying for people. Finally I made the slide number bigger too. Here’s a screenshot of how things currently look:
I then decided to chain several questions together to make the quiz more fun. Once the correct answer is given a ‘Next’ button appears, leading to a new question. I set up a ‘max questions’ variable that controls how many questions there are (e.g. 3, 5 or 10) and the questions keep coming until this number is reached. When the number is reached the user can then view a summary that tells them which words and (correct) categories were included, provides links to the categories and gives the user an overall score. I decided that if the user guesses correctly the first time they should get one star. If they guess correctly a second time they get half a star and any more guesses get no stars. The summary and star ratings for each question are also displayed as the following screenshot shows:
It’s shaping up pretty nicely, but I still need to work on the script that exports data from the database. Identifying random categories that contain at least one non-OE word and are of the same part of speech as the first randomly chosen category currently means hundreds or even thousands of database calls before a suitable category is returned. This is inefficient and occasionally the script was getting caught in a loop and timing out before it found a suitable category. I managed to catch this by having some sample data that loads if a suitable category isn’t found after 1000 attempts, but it’s not ideal. I’ll need to work on this some more over the next few weeks as time allows.
Last week was a full five-day strike and the end of the current period of UCU strike action. This week I returned to work, but the Coronavirus situation, which has been gradually getting worse over the past few weeks ramped up considerably, with the University closed for teaching and many staff working from home. I came into work from Monday to Wednesday but the West End was deserted and there didn’t seem much point in me using public transport to come into my office when there was no-one else around so from Thursday onwards I began to work from home, as I will be doing for the foreseeable future.
Despite all of these upheavals and also suffering from a pretty horrible cold I managed to get a lot done this week. Some of Monday was spend catching up with emails that had come in whilst I had been on strike last week, including a request from Rhona Alcorn of SLD to send her the data and sound files from the Scots School Dictionary and responding to Alan Riach from Scottish Literature about some web pages he wanted updated (these were on the main University site and this is not something I am involved with updating). I also noticed that the version of this site that was being served up was the version on the old server, meaning my most recent blog posts were not appearing. Thankfully Raymond Brasas in Arts IT Support was able to sort this out. Raymond had also emailed me about some WordPress sites I mange that had out of date versions of the software installed. There were a couple of sites that I’d forgotten about, a couple that were no longer operational and a couple that had legitimate reasons for being out of date, so I got back to him about those, and also updated my spreadsheet of WordPress sites I manage to ensure the ones I’d forgotten about would not be overlooked again. I also became aware of SSL certificate errors on a couple of websites that were causing the sites to display scary warning messages before anyone could reach the sites, so asked Raymond to fix these. Finally, Fraser Dallachy, who is working on a pilot for a new Scots Thesaurus, contacted me to see if he could get access to the files that were used to put together the first version of the Concise Scots Dictionary. We had previously established that any electronic files relating to the printed Scots Thesaurus have been lost and he was hoping that these old dictionary files may contain data that was used in this old thesaurus. I managed to track the files down, but alas there appeared to be no semantic data in the entries found therein. I also had a chat with Marc Alexander about a little quiz he would like to develop for the Historical Thesaurus.
I spoke to Jennifer Smith on Monday about the follow-on funding application for her SCOSYA project and spent a bit of time during the week writing a first draft of a Data Management Plan for the application, after reviewing all of the proposal materials she had sent me. Writing the plan raised some questions and I will no doubt have to revise the plan before the proposal is finalised, but it was good to get a first version completed and sent off.
I also finished work on the interactive map for Gerry McKeever’s Regional Romanticism project this week. Previously I’d started to use a new plugin to get nice curved lines between markers and all appeared to be working well. This week I began to integrate the plugin with my map, but unfortunately I’m still encountering unusable slowdown with the new plugin. Everything works fine to begin with, but after a bit of scrolling and zooming, especially round an area with lots of lines, the page becomes unresponsive. I wondered whether the issue might be related to the midpoint of the curve being dynamically generated from a function I took from another plugin so instead made a version that generated and then saved these midpoints that could then be used without needing to be calculated each time. This would also have meant that we could have manually tweaked the curves to position them as desired, which would have been great as some lines were not ideally positioned (e.g. from Scotland to the US via the North Pole), but even this seems to have made little impact on the performance issues. I even tried turning everything else off (e.g. icons, popups, the NLS map) to see if I could identify another cause of the slowdown but nothing has worked. I unfortunately had to admit defeat and resort to using straight lines after all. These are somewhat less visually appealing, but they result in no performance issues. Here’s a screenshot of this new version:
With these updates in place I made a version of the map that would run directly on the desktop and sent Gerry some instructions on how to update the data, meaning he can continue to work on it and see how it looks. But my work on this is now complete for the time being.
I was supposed to meet with Paul Malgrati from Scottish Literature on Wednesday to discuss an interactive map of Burns Suppers he would like me to create. We decided to cancel our meeting due to the Coronavirus, but continued to communicate via email. Paul had sent me a spreadsheet containing data relating to the Burns Suppers and I spent some time working on some initial versions of the map, reusing some of the code from the Regional Romanticism map, which in turn used code from the SCOSYA map.
I migrated the spreadsheet to an online database and then wrote a script that exports this data in the JSON format that can be easily read into the map. The initial version uses OpenStreetMap.HOT as a basemap rather than the .DE one that Paul had selected as the latter displays all place-names in German where these are available (e.g. Großbritannien). The .HOT map is fairly similar, although for some reason parts of South America look like they’re underwater. We can easily change to an alternative basemap in future if required. In my initial version all locations are marked with red icons displaying a knife and fork. We can use other colours or icons to differentiate types if or when these are available. The map is full screen with an introductory panel in the top right. Hovering over an icon displays the title of the event while clicking on it replaces the introductory panel with a panel containing the information about the supper. The content is generated dynamically and only displays fields that contain data (e.g. very few include ‘Dress Code’). You can always return to the intro by clicking on the ‘Introduction’ button at the top.
I spotted a few issues with the latitude and longitude of some locations that will need fixed. E.g. St Petersburg has Russia as the country but it is positioned in St Petersburg in Florida while Bogota Burns night in Colombia is positioned in South Sudan. I also realised that we might want to think about grouping icons as when zoomed out it’s difficult to tell where there are multiple closely positioned icons – e.g. the two in Reykjavik and the two in Glasgow. However, grouping may be tricky if different locations are assigned different icons / types.
After further email discussions with Paul (and being sent a new version of the spreadsheet) I created an updated version of my initial map. This version incorporates the data from the spreadsheet and incorporates the new ‘Attendance’ field into the pop-up where applicable. It is also now possible to zoom further out, and also scroll past the international dateline and still see the data (in the previous version if you did this the data would not appear). I also integrated the Leaflet Plugin MarkerCluster (see https://github.com/Leaflet/Leaflet.markercluster) that very nicely handles clustering of markers. In this new version of my map markers are now grouped into clusters that split apart as you zoom in. I also added in an option to hide and show the pop-up area as on small screens (e.g. mobile phones) the area takes up a lot of space, and if you click on a marker that is already highlighted this now deselects the marker and closes the popup. Finally, I added a new ‘Filters’ section in the introduction that you can show or hide. This contains options to filter the data by period. The three periods are listed (all ‘on’ be default’) and you can deselect or select any of them. Doing so automatically updates the map to limit the markers to those that meet the criteria. This is ‘remembered’ as you click on other markers and you can update your criteria by returning to the introduction. I did wonder about adding a summary of the selected filters to the popup of every marker, but I think this will just add too much clutter, especially when viewing the map on smaller screens (these days most people access websites on tablets or phones). Here is an example of the map as it currently looks:
The main things left to do are adding more filters and adding in images and videos, but I’ll wait until Paul sends me more data before I do anything further. That’s all for this week. I’ll just need to see how work progresses over the next few weeks as with the schools now shut I’ll need to spent time looking after my son in addition to tackling my usual work.
I met with Fraser Dallachy on Monday to discuss his ongoing pilot Scots Thesaurus project. It’s been a while since I’ve been asked to do anything for this project and it was good to meet with Fraser and talk through some of the new automated processes he wanted me to try out. One thing he wanted to try was tag the DSL dictionary definitions for part of speech to see if we could then automatically pick out word forms that we could query against the Historical Thesaurus to try and place the headword within a category. I adapted a previous script I’d created that picked out random DSL entries. This script targetted main entries (i.e. not supplements) that were nouns, were monosemous and had one sense, had fewer than 5 variant spellings, single word headwords and ‘short’ definitions, with the option to specify what is meant by ‘short’ in terms of the number of characters. I updated the script to bring back all DOST entries that met these criteria and had definitions that were less than 1000 characters in length, which resulted in just under 18,000 rows being returned (but I will rerun the script with a smaller character count if Fraser wants to focus on shorter entries). The script also stripped out all citations and tags from the definition to prepare it for POS tagging. With this dataset exported as a CSV I then began experimenting with a POS Tagger. I decided to use the Stanford POS Tagger (https://nlp.stanford.edu/software/tagger.html) which can be run at the command line, and I created a PHP script that went through each row of the CSV, passed the prepared definition text to the Tagger, pulled in the output and stored it in a database. I left the process running overnight and it had completed the following morning. I then outputted the rows as a spreadsheet and sent them on to Fraser for feedback. Fraser also wanted to see about using the data from the Scots School Dictionary so I sent that on to him too.
I also did a little bit of work for the DSL, investigating why some geographically tagged information was not being displayed in the citations, and replied to a few emails from Heather Pagan of the Anglo-Norman Dictionary as she began to look into uploading new data to their existing and no longer supported dictionary management system. I also gave some feedback on a proposal written by Rachel Douglas, a lecturer in French. Although this is not within Critical Studies and should be something Luca Guariento looks at, he is currently on holiday so I offered to help out. I also set up an initial WordPress site for Matthew Creasey’s new project. This still needs some further work, but I’ll need further information from Matthew before I can proceed. On Wednesday I met with Jennifer Smith and E Jamieson to discuss a possible follow-on project for the Scots Syntax Atlas. We talked through some of the possibilities and I think the project has huge potential. I’ll be helping to write the Data Management Plan and other such technical things for the proposal in due course.
I met with Marc and Fraser on Friday to discuss our plans for updating the way dates are stored in the Historical Thesaurus, which will make it much more easy to associate labels with specific dates and to update the dates in future as we align the data with revisions from the OED. I’d previously written a script that generated the new dates and from these generated a new ‘full date’ field which I then matched against the original ‘full date’ to spot errors. The script identified 1,116 errors, but this week I updated my script to change the way it handled ‘b’ dates. These are the dates that appear after a slash and where the date after the slash is in the same decade as the main date only one digit should be displayed (e.g. 1975/6), but this is not done so consistently, with dates sometimes appearing as 1975/76. Where this happened my script was noting the row as an error, but Marc wanted these to be ignored. I updated my script to take this into consideration, and this has greatly reduced the number of rows that will need to be manually checked, reducing the output to just 284 rows.
I spent the rest of my time this week working on the Books and Borrowers project. Although this doesn’t officially begin until the summer I’m designing the data structure at the moment (as time allows) so that when the project does start the RAs will have a system to work with sooner rather than later. I mapped out all of the fields in the various sample datasets in order to create a set of ‘core’ fields, mapping the fields from the various locations to these ‘core’ fields. I also designed a system for storing additional fields that may only be found at one or two locations, are not ‘core’ but still need to be recorded. I then created the database schema needed to store the data in this format and wrote a document that details all of this which I sent to Katie Halsey and Matt Sangster for feedback.
Matt also sent me a new version of the Glasgow Student borrowings spreadsheet he had been working on, and I spent several hours on Friday getting this uploaded to the pilot online resource I’m working on. I experimented with a new method of extracting the data from Excel to try and minimise the number of rows that were getting garbled due to Excel’s horrible attempts to save files as HTML. As previously documented, the spreadsheet uses formatting in a number of columns (e.g. superscript, strikethrough). This formatting is lost if the contents of the spreadsheet are copied in a plain text way (so no saving as a CSV, or opening the file in Google Docs or just copying the contents). The only way to extract the formatting in a way that can be used is to save the file as HTML in Excel and then work with that. But the resulting HTML produced by Excel is awful, with hundreds of tags and attributes scattered across the file used in an inconsistent and seemingly arbitrary way.
For example, this is the HTML for one row:
<tr height=23 style=’height:17.25pt’>
<td height=23 width=64 style=’height:17.25pt;width:48pt’></td>
<td width=143 style=’width:107pt’>Charles Wilson</td>
<td width=187 style=’width:140pt’>Charles Wilson</td>
<td width=86 style=’width:65pt’>Charles</td>
<td width=158 style=’width:119pt’>Wilson<span
<td width=88 style=’width:66pt’>Nat. Phil.</td>
<td width=129 style=’width:97pt’>Natural Philosophy</td>
<td width=64 style=’width:48pt’>B</td>
<td class=xl69 width=81 style=’width:61pt’>10</td>
<td class=xl70 width=81 style=’width:61pt’>3</td>
<td width=250 style=’width:188pt’>Wells Xenophon vol. 3<font class=”font6″><sup>d</sup></font></td>
<td width=125 style=’width:94pt’>Mr Smith</td>
<td width=124 style=’width:93pt’></td>
<td width=124 style=’width:93pt’></td>
<td width=124 style=’width:93pt’>Adam Smith</td>
<td width=124 style=’width:93pt’></td>
<td width=124 style=’width:93pt’></td>
<td width=89 style=’width:67pt’>22 Mar 1757</td>
<td width=89 style=’width:67pt’>10 May 1757</td>
<td align=right width=56 style=’width:42pt’>2</td>
<td width=64 style=’width:48pt’>4r</td>
<td class=xl71 width=64 style=’width:48pt’>1</td>
<td class=xl70 width=64 style=’width:48pt’>007</td>
<td class=xl65 width=325 style=’width:244pt’><a
<td width=293 style=’width:220pt’>Xenophon.</td>
<td width=392 style=’width:294pt’>Opera quae extant omnia; unà cum
chronologiâa Xenophonteâ <span style=’display:none’>cl. Dodwelli, et quatuor
tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</span></td>
<td width=110 style=’width:83pt’>Wells, Edward, 16<span style=’display:none’>67-1727.</span></td>
<td colspan=2 width=174 style=’mso-ignore:colspan;width:131pt’>Sp Coll
<td align=right width=64 style=’width:48pt’>1</td>
<td align=right width=121 style=’width:91pt’>1</td>
<td width=64 style=’width:48pt’>T111427</td>
<td width=64 style=’width:48pt’></td>
<td width=64 style=’width:48pt’></td>
<td width=64 style=’width:48pt’></td>
Previously I tried to fix this by running through several ‘find and replace’ passes to try and strip out all of the rubbish, while retaining what I needed, which was <tr>, <td> and some formatting tags such as <sup> for superscript.
This time I found a regular expression that removes all attributes from HTML tags, so for example <td width=64 style=’width:48pt’> becomes <td> (see it here: https://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag). I could then pass the resulting contents of every <td> through PHP’s strip_tags function to remove any remaining tags that were not required (e.g. <span>) while specifying the tags to retain (e.g. <sup>).
This approach seemed to work very well until I analysed the resulting rows and realised that the columns of many rows were all out of synchronisation, meaning any attempt at programmatically extracting the data and inserting it into the correct field in the database would fail. After some further research I realised that Excel’s save as HTML feature was to blame yet again. Without there being any clear reason, Excel sometimes expands a cell into the next cell or cells if these cells are empty. An example of this can be found above and I’ve extracted it here:
<td colspan=2 width=174 style=’mso-ignore:colspan;width:131pt’>Sp Coll Bi2-g.19-23</td>
The ‘colspan’ attribute means that the cell will stretch over multiple columns, in this case 2 columns, but elsewhere in the output file it was 3 and sometimes 4 columns. Where this happens the following cells simply don’t appear in the HTML. As my regular expression removed all attributes this ‘colspan’ was lost and the row ended up with subsequent cells in the wrong place.
Once I’d identified this I could update my script to check for the existence of ‘colspan’ before removing attributes, adding in the required additional empty cells as needed (so in the above case an extra <td></td>).
With all of this in place the resulting HTML was much cleaner. Here is the above row after my script had finished:
<td>Wells Xenophon vol. 3<sup>d</sup></td>
<td>22 Mar 1757</td>
<td>10 May 1757</td>
<td>Opera quae extant omnia; unà cum
chronologiâa Xenophonteâ cl. Dodwelli, et quatuor
tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</td>
<td>Wells, Edward, 16<span>67-1727.</td>
I then updated my import script to pull in the new fields (e.g. normalised class and professor names), set it up so that it would not import any rows that had ‘Yes’ in the first column, and updated the database structure to accommodate the new fields too. The upload process then ran pretty smoothly and there are now 8145 records in the system. After that I ran the further scripts to generate dates, students, professors, authors, book names, book titles and classes and updated the front end as previously discussed. I still have the old data stored in separate database tables as well, just in case we need it, but I’ve tested out the front-end and it all seems to be working fine to me.
I worked on several different projects this week. One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus. Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels. This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates. I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database. Where this happens I added the text ‘full date mismatch’ with a red background at the end of the date’s section in my script.
Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field. Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say. Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question. I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels). The remaining rows can mostly be split into the following groups:
- Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
- Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number, another lexeme has ‘1340c’ instead of ‘c1340’
- Corrections made to the original fulldate that were not replicated in the actual date columns, E,g, ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-‘ in the ‘con’ column.
- Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
- Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E,g, a lexeme with the fulldate ‘1865 + 1865 rare’.
- Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’. This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date). I tried fixing this but ended up breaking other things so this is something that will need manual intervention. I don’t think it occurs very often, though. It’s a shame the same symbol was used to mean two different things.
It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes. Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields). Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates. After that we’ll be ready to generate the new date fields for real.
I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project. I went through all of the sample data and compiled a list of all of the fields found in each. This is a first step towards identifying a core set of fields and of mapping the analogous fields across different datasets. I also included the GU students and professors from Matthew’s pilot project but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing. With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.
I also continued to work on the Place-Names of Mull and Ulva project. I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text. I also began to work on the project’s API and front end. By the end of the week I managed to get an ‘in development’ version of the quick search working. Markers appear with labels and popups and you can change base map or marker type. Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates). The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.
Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network recourse and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the video page of SCOSYA.
I divided most of my time between three projects this week. For the Place-Names of Mull and Ulva my time was spent working with the GB1900 dataset. On Friday last week I’d created a script that would go through the entire 2.5 million row CSV file and extract each entry, adding it to a database for more easy querying. This process had finished on Monday, but unfortunately things had gone wrong during the processing. I was using the PHP function ‘fgetcsv’ to extract the data a line at a time. This splits the CSV up based on a delimiting character (in this case a comma) and adds each part into an array, thus allowing the data to be inserted into my database. Unfortunately some of the data contained commas. In such cases the data was enclosed in double quotes, which is the standard way of handling such things, and I had thought the PHP function would automatically handle this, but alas it didn’t, meaning whenever a comma appeared in the data the row was split up into incorrect chunks and the data was inserted incorrectly into the database. After realising this I added another option to the ‘fgetcsv’ command to specify a character to be identified as the ‘enclosure’ character and set the script off running again. It had completed the insertion by Wednesday morning, but when I came to query the database again I realised that the process had still gone wrong. Further investigation revealed the cause to be the GB1900 CSV file itself, which was encoded with UCS-2 character encoding rather than the more usual UTF-8. I’m not sure why the data was encoded in this way, as it’s not a current standard and it results in a much larger file size than using UTF-8. It also meant that my script was not properly identifying the double quote characters, which is why my script failed a second time. However, after identifying this issue I converted the CSV to UTF-8, picked out a section with commas in the data, tested my script, discovered things were working this time, and let the script loose on the full dataset yet again. Thankfully it proved to be ‘third time lucky’ and all 2.5 million rows had been successfully inserted by Friday morning.
After that I was then able to extract all of the place-names for the three parishes we’re interested in, which is a rather more modest (3908 rows. I then wrote another script that would take this data and insert it into the project’s place-name table. The place-names are a mixture of Gaelic and English (e.g. ‘A’ Mhaol Mhòr’ is pretty clearly Gaelic while ‘Argyll Terrace’ is not) and for now I set the script to just add all place-names to the ‘English’ rather then ‘Gaelic’ field. The script also inserts the latitude and longitude values from the GB1900 data, and associates the appropriate parish. I also found a bit of code that takes latitude and longitude figures and generates a 6 figure OS grid reference from them. I tested this out and it seemed pretty accurate, so I also added this to my script, meaning all names also have the grid reference field populated.
The other thing I tried to do was to grab the altitude for each name via the Google Maps service. This proved to be a little tricky as the service blocks you if you make too many requests all at once. Also, our server was blacklisting my computer for making too many requests in a short space of time too, meaning for a while afterwards I was unable to access any page on the site or the database. Thankfully Arts IT Support managed to stop me getting blocked and I managed to set the script to query Google Maps at a rate that was acceptable to it, so I was able to grab the altitudes for all 3908 place-names (although 16 of them are at 0m so may look like it’s not worked for these). I also added in a facility to upload, edit and delete one or more sound files for each place-name, together with optional captions for them in English and Gaelic. Sound files must be in the MP3 format.
The second project I worked on this week was my redevelopment of the ‘Digital Humanities at Glasgow’ site. I have now finished going through the database of DH projects, trimming away the irrelevant or broken ones and creating new banners, icons, screenshots, keywords and descriptions for the rest. There are now 75 projects listed, including 15 that are currently set as ‘Showcase’ projects, meaning they appear in the banner slideshow and on the ‘showcase’ page. I also changed the site header font and fixed an issue with the banner slideshow and images getting too small on narrow screens. I’ve asked Marc Alexander and Lorna Hughes to give me some feedback on the new site and I hope to be able to launch it in two weeks or so.
My third major project of the week was the Historical Thesaurus. Marc, Fraser and I met last Friday to discuss a new way of storing dates that I’ve been wanting to implement for a while, and this week I began sorting this out. I managed to create a script that can process any date, including associating labels with the appropriate date. Currently the script allows you to specify a category (or to load a random category) and the dates for all lexemes therein are then processed and displayed on screen. As of yet nothing is inserted into the database. I have also updated the structure of the (currently empty) dates table to remove the ‘date order’ field. I have also changed all date fields to integers rather than varchars to ensure that ordering of the columns is handled correctly. At last Friday’s meeting we discussed replacing ‘OE’ and ‘_’ with numerical values. We had mentioned using ‘0000’ for OE, but I’ve realised this isn’t a good idea as ‘0’ can easily be confused with null. Instead I’m using ‘1100’ for OE and ‘9999’ for ‘current’. I’ve also updated the lexeme table to add in new fields for ‘firstdate’ and ‘lastdate’ that will be the cached values of the first and last dates stored in the new dates table.
The script displays each lexeme in a category with its ‘full date’ column. It then displays what each individual entry in the new ‘date’ table would hold for the lexeme in boxes beneath this, and then finishes off with displaying what the new ‘firstdate’ and ‘lastdate’ fields would contain. Processing all of the date variations turned out to be somewhat easier than it was for generating timeline visualisations, as the former can be treated as individual dates (an OE, a first, a mid, a last, a current) while the latter needed to transform the dates into ranges, meaning the script had to check how each individual date connected to the next, had to possibly us ‘b’ dates etc.
I’ve tested the script out and I have so far only encountered one issue, and that is there are 10 rows that have first dates and mid dates and last dates but instead of the ‘firmidcon’ field joining the first and the mid dates together the ‘firlastcon’ field is used instead. Then the ‘midlascon’ field is used to join the mid date to the last. This is an error as ‘firlastcon’ should not be used to join first and mid dates. An example of this happening is htid 28903 in catid 8880 where the ‘full date’ is ‘1459–1642/7 + 1856’. There may be other occasions where the wrong joining column has been used, but I haven’t checked for these so far.
After getting the script to sort out the dates I then began the look at labels. I started off using the ‘label’ field in order to figure out where in the ‘full date’ the label appeared. However, I noticed that where there are multiple labels these appear all joined together in the label field, meaning in such cases the contents of the label field will never be matched to any text in the ‘full date’ field. E.g. htid 6463 has the full date ‘1611 Dict. + 1808 poet. + 1826 Dict.’ And the label field is ‘Dict. poet. Dict.’ which is no help at all.
Instead I abandoned the ‘label’ field and just used the ‘full date’ field. Actually, I still use the ‘label’ field to check whether the script needs to process labels or not. Here’s a description of the logic for working out where a label should be added:
The dates are first split up into their individual boxes. Then, if there is a label for the lexeme I go through each date in turn. I split the full date field and look at the part after the date. I go through each character of this in turn. If the character is a ‘+’ then I stop. If I have yet to find label text (they all start with an a-z character) and the character is a ‘-‘ and the following character is a number then I stop. Otherwise if the character is a-z I note that I’ve started the label. If I’ve started the label and the current character is a number then I stop. Otherwise I add the current character to the label and proceed to the next character until all remaining characters are processed or a ‘stop’ criteria is reached. After that if there is any label text it’s added to the date. This process seems to work. I did, however, have to fix how labels applied to ‘current’ dates are processed. For a current date my algorithm was adding the label to the final year and not the current date (stored as 9999) as the label is found after the final year and ‘9999’ isn’t found in the full date string. I added a further check for ‘current’ dates after the initial label processing that moves labels from the penultimate date to the current date in such cases.
In addition to these three big projects I also had an email conversation with Jane Roberts about some issues she’d been having with labels in the Thesaurus of Old English, I liaised with Arts IT Support to get some server space set up for Rachel Smith and Ewa Wanat’s project, I gave some feedback on a job description for an RA for the Books and Borrowing project, helped Carole Hough with an issue with a presentation of the Berwickshire Place-names resource, gave the PI response for Thomas’s Iona project a final once-over, gave a report on the cost of Mapbox to Jennifer Smith for the SCOSYA project and arranged to meet Matthew Creasey next week to discuss his Decadence and Translation project.