This was the second week of the Coronavirus lockdown and I followed a similar arrangement to last week, managing to get a pretty decent amount of work done in between home-schooling sessions for my son. I spent most of my time working for the Books and Borrowing project. I had a useful conference call with the PI Katie Halsey and Co-I Matt Sangster last week, and the main outcome of that meeting for me was that I’d further expand upon the data design document I’d previously started in order to bring it into line with our understanding of the project’s requirements. This involved some major reworking of the entity-relationship diagram I had previously designed based on my work with the sample datasets, with the database structure increasing from 11 related tables to 21, incorporating a new system to trace books and their authors across different libraries, adding borrower cross-references and greatly increasing the data recorded about libraries. I engaged in many email conversations with Katie and Matt over the course of the week as I worked on the document, and on Friday I sent them a finalised version consisting of 34 pages and more than 7,000 words. This is still an ‘in progress’ version and will no doubt need further tweaks based on feedback and as I build the system, but I’d say it’s a pretty solid starting point. My next step will be to add a new section to the document that describes the various features of the content management system that will connect to the database and enable the project’s RAs to add and edit data in a streamlined and efficient way.
Also this week I did some further work for the DSL people, who have noticed some inconsistencies with the way their data is stored in their own records compared to how it appears in the new editing system that they are using. I wasn’t directly involved in the process of getting their data into the new editing system but spent some time going through old emails, looking at the data and trying to figure out what might have happened. I also had a conference call with Marc Alexander and the Anglo-Norman Dictionary people to discuss the redevelopment of their website. It looks like this will be going ahead and I will be doing the redevelopment work. I’ll try to start on this after Easter, with my first task being the creation of a design document that will map out exactly what features the new site will include and how these relate to the existing site. I also need to help the AND people to try and export the most recent version of their data from the server as the version they have access to is more than a year old. We’re going to aim to relaunch the site in November, all being well.
I also had a chat with Fraser Dallachy about the new quiz I’m developing for the Historical Thesaurus. Fraser had a couple of good ideas about the quiz (e.g. making versions for Old and Middle English) that I’ll need to see about implementing in the coming weeks. I also had an email conversation with the other developers in the College of Arts about documenting the technologies that we use or have used in the past for projects and made a couple of further tweaks to the Burns Supper map based on feedback from Paul Malgrati.
I’m going to be on holiday next week and won’t be back to work until Wednesday the 15th of April so there won’t be any further updates from me for a while.
This was the first full week of the Coronavirus lockdown and as such I was working from home while also having to look after my nine-year-old son, who is at home on lockdown too. My wife and I have arranged to split the days into morning and afternoon shifts, with one of us home-schooling our son while the other works during each shift, and extra work squeezed in before and after these shifts. The arrangement has worked pretty well for all of us this week and I’ve managed to get a fair amount of work done.
This included spotting and requesting fixes for a number of other sites that had started to display scary warnings about their SSL certificates, working on an updated version of the Data Management Plan for the SCOSYA follow-on proposal, fixing some log-in and account related issues for the DSL people and helping Carolyn Jess-Cooke in English Literature with some technical issues relating to a WordPress blog she has set up for a ‘Stay at home’ literary festival (https://stayathomefest.wordpress.com/). I also had a conference call with Katie Halsey and Matt Sangster about the Books and Borrowers project, which is due to start at the beginning of June. It was my first time using the Zoom videoconferencing software and it worked very well, other than my cat trying to participate several times. We had a good call and made some plans for the coming weeks and months. I’m going to try and get an initial version of the content management system and database for the project in place before the official start of the project so that the RAs will be able to use this straight away. This is of even greater importance now as they are likely to be limited in the kinds of research activities they can do at the start of the project because of travel restrictions and will need to work with digital materials.
Other than these issues I divided my time between three projects. The first was the Burns Supper map for Paul Malgrati in Scottish Literature. Paul had sent me some images that are to be used in the map and I spent some time integrating these. The image appears as a thumbnail with credit text (if available) appearing underneath. If there is a link to the place the image was taken from, the credit text appears as a link. Clicking on the image thumbnail opens the full image in a new tab. I also added links to the videos where applicable, but I decided not to embed the videos in the page as I think these would be too small and there would be just too much going on for locations that have both videos and an image. Paul also wanted clusters to be limited by areas (e.g. a cluster for Scotland rather than these just being amalgamated with a big cluster for Europe when zooming out) and I investigated this. I discovered that it is possible to create groups of locations: for example, we could add a new column in the spreadsheet named ‘cluster’ or something like that, with all the Scottish locations given ‘Scotland’ here and all the South American ones ‘South America’. These would then be the top-level clusters and they would not be further amalgamated on zoom out. Once Paul gets back to me with the clusters he would like for the data I’ll update things further. Below is an image of the map with the photos embedded:
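To make this concrete, here’s a minimal sketch of how a ‘cluster’ column could drive the grouping, assuming the spreadsheet rows arrive as plain objects (the field names and the ‘other’ fallback are my own illustration, not the final design):

```javascript
// Group spreadsheet rows into named top-level clusters using a
// hypothetical 'cluster' column; rows without one fall back to a
// catch-all group that the map can continue to auto-cluster.
function groupByCluster(rows) {
  const groups = {};
  for (const row of rows) {
    const key = row.cluster || 'other';
    if (!groups[key]) groups[key] = [];
    groups[key].push(row);
  }
  return groups;
}
```

Each named group could then be given its own L.markerClusterGroup() so that, for example, Scottish markers never merge into a wider European cluster on zoom out.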
The second major project I worked on was the interactive map for Gerry McKeever’s Regional Romanticism project. Gerry had got back to me with a new version of the data he’d been working on and some feedback from other people he’d sent the map to. I created a new version of the map featuring the new data and incorporated some changes to how the map worked based on feedback: namely, I moved the navigation buttons to the top of the story pane and made them bigger, with a new white dividing line between the buttons and the rest of the pane. This hopefully makes them more obvious to people and means the buttons are immediately visible rather than people potentially having to scroll to see them. I’ve also replaced the directional arrows with thicker chevron icons and have changed the ‘Return to start’ button to ‘Restart’. I’ve also made the ‘Next’ button on both the overview and the first slide blink every few seconds, at Gerry’s request. Hopefully this won’t be too annoying for people. Finally I made the slide number bigger too. Here’s a screenshot of how things currently look:
I then decided to chain several questions together to make the quiz more fun. Once the correct answer is given a ‘Next’ button appears, leading to a new question. I set up a ‘max questions’ variable that controls how many questions there are (e.g. 3, 5 or 10) and the questions keep coming until this number is reached, at which point the user can view a summary that tells them which words and (correct) categories were included, provides links to the categories and gives the user an overall score. I decided that if the user guesses correctly the first time they should get one star. If they guess correctly on a second attempt they get half a star and any further guesses get no stars. The summary and star ratings for each question are also displayed, as the following screenshot shows:
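The star rule is simple enough to sketch directly; this is just the scoring logic as described, with the function names being my own:

```javascript
// One star for a correct first guess, half a star for a correct
// second guess, nothing after that.
function starsForGuess(guessNumber) {
  if (guessNumber === 1) return 1;
  if (guessNumber === 2) return 0.5;
  return 0;
}

// The overall score for a finished quiz is just the sum per question,
// given how many guesses each question took.
function totalScore(guessCounts) {
  return guessCounts.reduce((sum, g) => sum + starsForGuess(g), 0);
}
```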
It’s shaping up pretty nicely, but I still need to work on the script that exports data from the database. Identifying random categories that contain at least one non-OE word and are of the same part of speech as the first randomly chosen category currently means hundreds or even thousands of database calls before a suitable category is returned. This is inefficient and occasionally the script was getting caught in a loop and timing out before it found a suitable category. I managed to catch this by having some sample data that loads if a suitable category isn’t found after 1000 attempts, but it’s not ideal. I’ll need to work on this some more over the next few weeks as time allows.
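As a sketch of the capped retry loop described above (the callback names are placeholders for the real database call and suitability checks):

```javascript
// Keep drawing random categories until one satisfies the predicate,
// falling back to known-good sample data after a maximum number of
// attempts. 'drawCategory' and 'isSuitable' stand in for the real
// database call and the part-of-speech / non-OE-word checks.
function findCategory(drawCategory, isSuitable, fallback, maxAttempts = 1000) {
  for (let i = 0; i < maxAttempts; i++) {
    const candidate = drawCategory();
    if (isSuitable(candidate)) return candidate;
  }
  return fallback;
}
```

A better long-term fix might be to push the part-of-speech and word-count conditions into a single SQL query so the database returns only suitable categories in the first place.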
Last week was a full five-day strike and the end of the current period of UCU strike action. This week I returned to work, but the Coronavirus situation, which has been gradually getting worse over the past few weeks, ramped up considerably, with the University closed for teaching and many staff working from home. I came into work from Monday to Wednesday but the West End was deserted and there didn’t seem much point in me using public transport to come into my office when there was no-one else around, so from Thursday onwards I began to work from home, as I will be doing for the foreseeable future.
Despite all of these upheavals and also suffering from a pretty horrible cold I managed to get a lot done this week. Some of Monday was spent catching up with emails that had come in whilst I had been on strike last week, including a request from Rhona Alcorn of SLD to send her the data and sound files from the Scots School Dictionary and responding to Alan Riach from Scottish Literature about some web pages he wanted updated (these were on the main University site and this is not something I am involved with updating). I also noticed that the version of this site that was being served up was the version on the old server, meaning my most recent blog posts were not appearing. Thankfully Raymond Brasas in Arts IT Support was able to sort this out. Raymond had also emailed me about some WordPress sites I manage that had out-of-date versions of the software installed. There were a couple of sites that I’d forgotten about, a couple that were no longer operational and a couple that had legitimate reasons for being out of date, so I got back to him about those, and also updated my spreadsheet of the WordPress sites I manage to ensure the ones I’d forgotten about would not be overlooked again. I also became aware of SSL certificate errors on a couple of websites that were causing them to display scary warning messages before anyone could reach the sites, so I asked Raymond to fix these. Finally, Fraser Dallachy, who is working on a pilot for a new Scots Thesaurus, contacted me to see if he could get access to the files that were used to put together the first version of the Concise Scots Dictionary. We had previously established that any electronic files relating to the printed Scots Thesaurus have been lost and he was hoping that these old dictionary files may contain data that was used in this old thesaurus. I managed to track the files down, but alas there appeared to be no semantic data in the entries found therein.
I also had a chat with Marc Alexander about a little quiz he would like to develop for the Historical Thesaurus.
I spoke to Jennifer Smith on Monday about the follow-on funding application for her SCOSYA project and spent a bit of time during the week writing a first draft of a Data Management Plan for the application, after reviewing all of the proposal materials she had sent me. Writing the plan raised some questions and I will no doubt have to revise the plan before the proposal is finalised, but it was good to get a first version completed and sent off.
I also finished work on the interactive map for Gerry McKeever’s Regional Romanticism project this week. Previously I’d started to use a new plugin to get nice curved lines between markers and all appeared to be working well. This week I began to integrate the plugin with my map, but unfortunately I’m still encountering unusable slowdown with the new plugin. Everything works fine to begin with, but after a bit of scrolling and zooming, especially round an area with lots of lines, the page becomes unresponsive. I wondered whether the issue might be related to the midpoint of the curve being dynamically generated from a function I took from another plugin, so I instead made a version that generated and then saved these midpoints, which could then be used without needing to be calculated each time. This would also have meant that we could have manually tweaked the curves to position them as desired, which would have been great as some lines were not ideally positioned (e.g. from Scotland to the US via the North Pole), but even this seems to have made little impact on the performance issues. I even tried turning everything else off (e.g. icons, popups, the NLS map) to see if I could identify another cause of the slowdown but nothing has worked. I unfortunately had to admit defeat and resort to using straight lines after all. These are somewhat less visually appealing, but they result in no performance issues. Here’s a screenshot of this new version:
With these updates in place I made a version of the map that would run directly on the desktop and sent Gerry some instructions on how to update the data, meaning he can continue to work on it and see how it looks. But my work on this is now complete for the time being.
I was supposed to meet with Paul Malgrati from Scottish Literature on Wednesday to discuss an interactive map of Burns Suppers he would like me to create. We decided to cancel our meeting due to the Coronavirus, but continued to communicate via email. Paul had sent me a spreadsheet containing data relating to the Burns Suppers and I spent some time working on some initial versions of the map, reusing some of the code from the Regional Romanticism map, which in turn used code from the SCOSYA map.
I migrated the spreadsheet to an online database and then wrote a script that exports this data in the JSON format that can be easily read into the map. The initial version uses OpenStreetMap.HOT as a basemap rather than the .DE one that Paul had selected as the latter displays all place-names in German where these are available (e.g. Großbritannien). The .HOT map is fairly similar, although for some reason parts of South America look like they’re underwater. We can easily change to an alternative basemap in future if required. In my initial version all locations are marked with red icons displaying a knife and fork. We can use other colours or icons to differentiate types if or when these are available. The map is full screen with an introductory panel in the top right. Hovering over an icon displays the title of the event while clicking on it replaces the introductory panel with a panel containing the information about the supper. The content is generated dynamically and only displays fields that contain data (e.g. very few include ‘Dress Code’). You can always return to the intro by clicking on the ‘Introduction’ button at the top.
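The dynamic popup logic can be sketched roughly like this, assuming each supper is a plain object keyed by the spreadsheet column names (the HTML shape shown is illustrative rather than the actual markup):

```javascript
// Build the popup body from only the fields that contain data,
// mirroring the dynamic content generation described above: fields
// such as 'Dress Code' simply don't appear when empty.
function buildPopupHtml(supper, fields) {
  return fields
    .filter(f => supper[f] !== undefined && supper[f] !== '')
    .map(f => `<p><strong>${f}:</strong> ${supper[f]}</p>`)
    .join('');
}
```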
I spotted a few issues with the latitude and longitude of some locations that will need fixed. E.g. St Petersburg has Russia as the country but it is positioned in St Petersburg in Florida while Bogota Burns night in Colombia is positioned in South Sudan. I also realised that we might want to think about grouping icons as when zoomed out it’s difficult to tell where there are multiple closely positioned icons – e.g. the two in Reykjavik and the two in Glasgow. However, grouping may be tricky if different locations are assigned different icons / types.
After further email discussions with Paul (and being sent a new version of the spreadsheet) I created an updated version of my initial map. This version incorporates the data from the new spreadsheet and adds the new ‘Attendance’ field to the pop-up where applicable. It is also now possible to zoom further out, and also to scroll past the international dateline and still see the data (in the previous version if you did this the data would not appear). I also integrated the Leaflet plugin MarkerCluster (see https://github.com/Leaflet/Leaflet.markercluster), which very nicely handles clustering of markers. In this new version of my map markers are now grouped into clusters that split apart as you zoom in. I also added in an option to hide and show the pop-up area as on small screens (e.g. mobile phones) the area takes up a lot of space, and if you click on a marker that is already highlighted this now deselects the marker and closes the popup. Finally, I added a new ‘Filters’ section in the introduction that you can show or hide. This contains options to filter the data by period. The three periods are listed (all ‘on’ by default) and you can deselect or select any of them. Doing so automatically updates the map to limit the markers to those that meet the criteria. This is ‘remembered’ as you click on other markers and you can update your criteria by returning to the introduction. I did wonder about adding a summary of the selected filters to the popup of every marker, but I think this would just add too much clutter, especially when viewing the map on smaller screens (these days most people access websites on tablets or phones). Here is an example of the map as it currently looks:
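The period filter could be sketched as follows; the period names here are placeholders, as the real ones come from the spreadsheet:

```javascript
// The filter selection lives in a simple state object so it is
// 'remembered' as the user clicks around other markers.
// Period names are placeholders for whatever the data defines.
const filterState = { periods: new Set(['historic', 'revival', 'contemporary']) };

// Only markers whose period is currently selected are shown;
// deselecting a period in the introduction removes its markers.
function visibleMarkers(markers, state) {
  return markers.filter(m => state.periods.has(m.period));
}
```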
The main things left to do are adding more filters and adding in images and videos, but I’ll wait until Paul sends me more data before I do anything further. That’s all for this week. I’ll just need to see how work progresses over the next few weeks as with the schools now shut I’ll need to spend time looking after my son in addition to tackling my usual work.
This was another strike week, and I only worked on Friday. Next week will be a full five-day strike. On Friday I caught up with a few emails about future projects from Paul Malgrati and Sourit Bhattacharya, read through some of the documentation for the SCOSYA follow-on funding proposal that is currently in development and had a further email conversation with Heather Pagan regarding the redevelopment of the Anglo-Norman Dictionary. Other than that I focussed on the interactive map for Gerry McKeever’s Regional Romanticism project. Last week I’d imported the new data and made a number of changes to the map, but had discovered that the plugin I was using to give nice Bezier curve lines between markers was not scaling well, and was making the map unusably slow. This week I experimented a bit more with the plugin, trying to find a more efficient means of making the lines appear, but although the alternative I came up with was slightly less inefficient it was still pretty much unusable. I therefore began looking into alternative libraries. There is another called leaflet.curve (https://github.com/elfalem/Leaflet.curve) that is looking promising, but the only downside is you need to specify a latitude and longitude for the midpoint of the curve in addition to the start and end points. I didn’t want to do this for all 80-odd locations without seeing if the result would be usable so instead I created a test map that uses the same midpoint for all lines, which you can view below:
Even with all of the lines displayed the map loads without any noticeable lag, so I’d say this is looking promising. After doing this I looked at the first plugin I’d used again and realised that it included a function that automatically calculates the latitude and longitude of the midpoint of a curve between two lat/lon pairings that are passed to it. I wondered whether I could just rip this function out of the first plugin and incorporate it into the second to avoid having to manually work out where the midpoint of each line should go. I wasn’t sure if this would work but it looks like it has, as you can see here:
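For reference, a function along these lines can be written from scratch fairly easily; this sketch offsets the straight-line midpoint perpendicular to the line, with the 10% bulge factor being an arbitrary assumption rather than the value the plugin uses:

```javascript
// Compute a control point for a quadratic Bezier curve between two
// lat/lng pairs by offsetting the straight-line midpoint
// perpendicular to the segment. Leaflet.curve would receive this
// as the control point of its 'Q' path command.
function curveMidpoint(from, to, bulge = 0.1) {
  const midLat = (from.lat + to.lat) / 2;
  const midLng = (from.lng + to.lng) / 2;
  // Perpendicular offset proportional to the segment length.
  const dLat = to.lat - from.lat;
  const dLng = to.lng - from.lng;
  return {
    lat: midLat + dLng * bulge,
    lng: midLng - dLat * bulge
  };
}
```

Because the output is just a lat/lng pair, the same values could equally be computed once and saved as static midpoints, which would also allow badly placed curves to be tweaked by hand.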
This second map does seem to have the occasional performance issues and I’ll need to test it out more fully, but it is a considerable improvement on the original map. If it does prove to be too laggy once I update the actual interface I can still use the function to output the midpoint values and save them as static lat/lon values. I’m still going to have to change quite a bit of the map code to replace the first plugin with this new version, and I didn’t have time to do so this week, but I’d say things are looking very promising.
It was another strike week, and this time I only worked on Thursday and Friday. I spent a lot of Thursday catching up with emails and thinking about the technical implications of potential projects that people had contacted me about. I responded to a PhD student who had asked me for advice about a Wellcome Trust application and I spent quite a bit of time going through the existing Anglo-Norman Dictionary website to get a better understanding of how it works, the features it offers and how it could be improved. I’d also been contacted by two members of Critical Studies staff who wanted some technical work doing. The first was Paul Malgrati, who is wanting to put together an interactive map of Burns Suppers across the world. He’d sent me a spreadsheet containing the data he’s so far compiled and I spent some time going through this and replying to his email with some suggestions. The second was Sourit Bhattacharya, who is submitting a Carnegie application to develop an online bibliography. I considered his requirements and replied to him with some ideas.
I spent most of Friday working on the interactive map that plots important locations in a novel for Gerry McKeever’s Regional Romanticism project. I created an initial version of this back in January and since then Gerry has been working on his data, extending it to cover all three volumes of the novel and locations across the globe. He’d also made some changes to the structure of the data (e.g. page numbers had been separated out from the extracts to a new column in the spreadsheet) and had greatly enhanced the annotations made for each item. He had also written some new introductory text. The new data consisted of 120 entries across 88 distinct locations, and I needed to convert this from an Excel spreadsheet into a JSON file, with locations separated out and linked to multiple entries to enable map markers to be associated with multiple items. It took several hours to do this. I did consider writing a script to handle the conversion, but that would have taken some time in itself and this is the last time the entire dataset will be migrated, so the script would never be needed again. Plus undertaking the task manually gave me the opportunity to check the data and to ensure that the same locations mentioned in different entries all matched up.
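Had I written the conversion script, it might have looked something like this sketch, which dedupes locations and links each one to multiple entries (the field names are illustrative):

```javascript
// Build the structure described above: distinct locations keyed by
// name, each carrying its coordinates and a list of entries, so one
// map marker can be associated with several items.
function buildLocationIndex(rows) {
  const locations = {};
  for (const row of rows) {
    if (!locations[row.location]) {
      locations[row.location] = { lat: row.lat, lng: row.lng, entries: [] };
    }
    locations[row.location].entries.push({
      volume: row.volume,
      page: row.page,
      extract: row.extract
    });
  }
  return locations;
}
```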
I also created a new version of the interactive map that used the new data and incorporated some other updates too. This version has the new intro slide and hopefully has all of the updates to the existing data plus all of the new data for all three volumes. I changed the display slightly to include page numbers as a separate field (the information now appears in the ‘volume’ section) and also to include the notes. To differentiate these from the extracts I’ve enclosed the extracts in large speech marks.
One issue I encountered is that with all of the new data in place the loading of new items and panning to them slowed to a crawl, making the map absolutely unusable. After some investigation I realised that the plugin that makes the nice curved lines between locations was to blame. It appears not to be at all scalable, and I had to disable the grey dotted lines as with all of the markers visible the map ground to a halt. I may be able to fix this, or I may have to switch to an alternative plugin for the lines as the one that’s currently used appears to be horribly inefficient. However, the yellow line connecting the current marker with the previous one is still visible as you scroll around the locations and I think this is the most important thing. Below is a screenshot of the map as it currently stands. Next week I will be continuing with the UCU strike action and will therefore only be working on the Friday.
This was a three-day week for me as I was participating in the UCU strike action on Thursday and Friday. I spent the majority of Monday to Wednesday tying up loose ends and finishing off items on my ‘to do’ list. The biggest task I tackled was to relaunch the Digital Humanities at Glasgow site. This involved removing all of the existing site and moving the new site from its test location to the main URL. I had to write a little script that changed all of the image URLs so they would work in the new location and I needed to update WordPress so it knew where to find all of the required files. Most of the migration process went pretty smoothly, but there were some slightly tricky things, such as ensuring the banner slideshow continued to work. I also needed to tweak some of the static pages (e.g. the ‘Team’ and ‘About’ pages) and I added in a ‘contact us’ form. I also put in redirects from all of the old pages so that any bookmarks or Google links will continue to work. As yet I’ve had no feedback from Marc or Lorna about the redesign of the site, so I can only assume that they are happy with how things are looking. The final task in setting up the new site was to migrate my blog over from its old standalone URL to become integrated in the new DH site. I exported all of my blog posts and categories and imported them into the new site using WordPress’s easy-to-use tools, and that all went very smoothly. The only thing that didn’t transfer over using this method was the media: all of the images embedded in my blog posts still pointed to image files located on the old server, so I had to manually copy these images over and then write a script that went through every blog post and found and replaced all image URLs. All now appears to be working, and as this is the first blog post I’m making using the new site I’ll know for definite once I add it.
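The URL-replacement step can be sketched as a simple rewrite over each post’s HTML; the hostnames here are placeholders for the old and new servers:

```javascript
// Rewrite image src attributes that point at the old server so they
// use the new location, leaving other URLs (e.g. hrefs) untouched.
// Both base URLs are placeholders, not the real server names.
function rewriteImageUrls(html, oldBase, newBase) {
  return html.replace(
    /src="([^"]*)"/g,
    (match, url) =>
      `src="${url.startsWith(oldBase) ? newBase + url.slice(oldBase.length) : url}"`
  );
}
```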
Other than setting up this new resource I made a further tweak to the new data for Matthew Sangster’s 18th century student borrowing records that I was working on last week. I had excluded rows from his spreadsheet that had ‘Yes’ in the ‘omit’ column, assuming that these were not to be processed by my upload script, but actually Matt wanted these to be in the system and displayed when browsing pages, but omitted from any searches. I therefore updated the online database to include a new ‘omit’ column, updated my upload script to only process these omitted rows and then changed the search facilities to ignore any rows that have a ‘Y’ in the ‘omit’ column.
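The resulting behaviour is easy to sketch: omitted rows remain browsable but are excluded from searches (the function and field names here are my own illustration):

```javascript
// Rows flagged 'Y' in the 'omit' column stay visible when browsing
// pages but are excluded from any search results.
function searchRecords(records, matches) {
  return records.filter(r => r.omit !== 'Y' && matches(r));
}

function browseRecords(records) {
  return records; // browsing shows everything, omitted or not
}
```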
I also responded to a query from Alasdair Whyte regarding parish boundaries for his Place-names of Mull and Ulva project, and investigated an issue that Marc Alexander was experiencing with one of the statistics scripts for the HT / OED matching process (it turned out to be an issue with a bad internet connection rather than an issue with the script). I’d had a request from Paul Malgrati in Scottish Literature about creating an online resource that maps Burns Suppers so I wrote a detailed reply discussing the various options.
I also spent some time fixing the issue of citation links not displaying exactly as the DSL people were wanting in the DSL website. This was surprisingly difficult to implement because the structure can vary quite a bit. The ID needed for the link is associated with the ‘cref’ that wraps the whole reference, but the link can’t be applied to the full contents as only authors and titles should be links, not geo and date tags or other non-tagged text that appears in the element. There may be multiple authors or no author so sometimes the link needs to start before the first (and only the first) author whereas other times the link needs to start before the title. As there is often text after the last element that needs to be linked the closing tag of the link can’t just be appended to the text but instead the script needs to find where this last element ends. However, it looks like I’ve figured out a way to do it that appears to work.
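The span-finding logic can be sketched abstractly like this, modelling the cref’s children as an ordered list of typed elements rather than real XML (a simplification of the actual parsing):

```javascript
// The link opens before the first author element (or, if there is
// no author, before the title) and closes after the last author or
// title element, leaving geo/date tags and trailing text outside
// the link. Returns null if nothing is linkable.
function linkSpan(elements) {
  const linkable = t => t === 'author' || t === 'title';
  let start = elements.findIndex(e => e.type === 'author');
  if (start === -1) start = elements.findIndex(e => e.type === 'title');
  let end = -1;
  elements.forEach((e, i) => { if (linkable(e.type)) end = i; });
  return start === -1 ? null : { start, end };
}
```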
I devoted a few spare hours on Wednesday to investigating the Anglo-Norman Dictionary. Heather was trying to figure out where on the server the dataset that the website uses is located, and after reading through the documentation I managed to figure out that the data is stored in a Berkeley DB, and it looks like the data the system uses is stored in a file called ‘entry_hash’. There is a file with this name in ‘/var/data’ and it’s just over 200MB in size, which suggests it contains a lot of data. Software that can read this file can be downloaded from here: https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html and the Java edition should work on a Windows PC if it has Java installed. Unfortunately you have to register to access the files and I haven’t done so yet, but I’ve let Heather know about this.
I then experimented with setting up a simple AND website using the data from ‘all.xml’, which (more or less) contains all of the data for the dictionary. My test website consists of a database, one script that goes through the XML file and inserts each entry into the database, and one script that displays a page allowing you to browse and view all entries. My AND test is really very simple (and is purely for test purposes): it replicates the ‘browse’ facility from the live site (which can be viewed here: http://www.anglo-norman.net/gate/), only my version currently displays the full list in its entirety, which makes the page a little slow to load. Cross references are in yellow, main entries in white. Clicking on a cross reference loads the corresponding main entry, while clicking on a regular headword loads its entry. Currently only some of the entry page is formatted, and some elements don’t always display correctly (e.g. the position of some notes). The full XML is displayed below the formatted text. Here’s an example:
Clearly there would still be a lot to do to develop a fully functioning replacement for AND, but it wouldn’t take a huge amount of time (I’d say it could be measured in days not weeks). It just depends whether the AND people want to replace their old system.
I’ll be on strike again next week from Monday to Wednesday.
I met with Fraser Dallachy on Monday to discuss his ongoing pilot Scots Thesaurus project. It’s been a while since I’ve been asked to do anything for this project and it was good to meet with Fraser and talk through some of the new automated processes he wanted me to try out. One thing he wanted to try was tagging the DSL dictionary definitions for part of speech to see if we could then automatically pick out word forms that we could query against the Historical Thesaurus to try and place the headword within a category. I adapted a previous script I’d created that picked out random DSL entries. This script targeted main entries (i.e. not supplements) that were nouns, were monosemous and had one sense, had fewer than 5 variant spellings, single-word headwords and ‘short’ definitions, with the option to specify what is meant by ‘short’ in terms of the number of characters. I updated the script to bring back all DOST entries that met these criteria and had definitions that were less than 1000 characters in length, which resulted in just under 18,000 rows being returned (but I will rerun the script with a smaller character count if Fraser wants to focus on shorter entries). The script also stripped out all citations and tags from the definition to prepare it for POS tagging. With this dataset exported as a CSV I then began experimenting with a POS tagger. I decided to use the Stanford POS Tagger (https://nlp.stanford.edu/software/tagger.html), which can be run at the command line, and I created a PHP script that went through each row of the CSV, passed the prepared definition text to the tagger, pulled in the output and stored it in a database. I left the process running overnight and it had completed by the following morning. I then outputted the rows as a spreadsheet and sent them on to Fraser for feedback. Fraser also wanted to see about using the data from the Scots School Dictionary so I sent that on to him too.
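The selection criteria can be expressed as a single predicate over a hypothetical entry record; the thresholds come from the description above, while the field names are assumptions:

```javascript
// Decide whether a DSL entry qualifies for the POS-tagging export:
// a main (non-supplement) entry, a monosemous noun with one sense,
// fewer than 5 variant spellings, a single-word headword and a
// 'short' definition (configurable character limit).
function isCandidateEntry(entry, maxDefLength = 1000) {
  return entry.source === 'main' &&
    entry.pos === 'noun' &&
    entry.senseCount === 1 &&
    entry.variantSpellings < 5 &&
    !entry.headword.includes(' ') &&
    entry.definition.length < maxDefLength;
}
```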
I also did a little bit of work for the DSL, investigating why some geographically tagged information was not being displayed in the citations, and replied to a few emails from Heather Pagan of the Anglo-Norman Dictionary as she began to look into uploading new data to their existing and no longer supported dictionary management system. I also gave some feedback on a proposal written by Rachel Douglas, a lecturer in French. Although this is not within Critical Studies and should be something Luca Guariento looks at, he is currently on holiday so I offered to help out. I also set up an initial WordPress site for Matthew Creasey’s new project. This still needs some further work, but I’ll need further information from Matthew before I can proceed. On Wednesday I met with Jennifer Smith and E Jamieson to discuss a possible follow-on project for the Scots Syntax Atlas. We talked through some of the possibilities and I think the project has huge potential. I’ll be helping to write the Data Management Plan and other such technical things for the proposal in due course.
I met with Marc and Fraser on Friday to discuss our plans for updating the way dates are stored in the Historical Thesaurus, which will make it much easier to associate labels with specific dates and to update the dates in future as we align the data with revisions from the OED. I’d previously written a script that generated the new dates and from these generated a new ‘full date’ field, which I then matched against the original ‘full date’ to spot errors. The script identified 1,116 errors, but this week I updated my script to change the way it handled ‘b’ dates. These are the dates that appear after a slash; where the date after the slash is in the same decade as the main date, only one digit should be displayed (e.g. 1975/6), but this convention has not been applied consistently, with dates sometimes appearing as 1975/76. Where this happened my script was noting the row as an error, but Marc wanted these to be ignored. I updated my script to take this into consideration, which has greatly reduced the number of rows that will need to be manually checked, reducing the output to just 284 rows.
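One way to tolerate the ‘b’ date inconsistency is to normalise both fulldate strings before comparing them. A small Python sketch of the idea (the real comparison script was PHP, and the function name is mine):

```python
import re

def normalize_b_dates(fulldate):
    """Collapse two-digit 'b' dates to one digit (1975/76 -> 1975/6) where
    the date after the slash falls in the same decade as the main date.
    Four-digit dates after a slash (e.g. 1731/1800) are left alone."""
    def collapse(m):
        year, b = m.group(1), m.group(2)
        full_b = year[:2] + b              # e.g. '1975' + '76' -> '1976'
        if full_b[:3] == year[:3]:         # same decade: keep the last digit only
            return year + "/" + b[-1]
        return m.group(0)
    return re.sub(r"(\d{4})/(\d{2})(?!\d)", collapse, fulldate)
```

Comparing `normalize_b_dates(original)` against `normalize_b_dates(generated)` would then treat ‘1430/31–1630’ and ‘1430/1–1630’ as equivalent rather than flagging them as errors.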
I spent the rest of my time this week working on the Books and Borrowing project. Although this doesn’t officially begin until the summer I’m designing the data structure at the moment (as time allows) so that when the project does start the RAs will have a system to work with sooner rather than later. I mapped out all of the fields in the various sample datasets in order to create a set of ‘core’ fields, mapping the fields from the various locations to these ‘core’ fields. I also designed a system for storing additional fields that may only be found at one or two locations and are not ‘core’ but still need to be recorded. I then created the database schema needed to store the data in this format and wrote a document that details all of this, which I sent to Katie Halsey and Matt Sangster for feedback.
Matt also sent me a new version of the Glasgow Student borrowings spreadsheet he had been working on, and I spent several hours on Friday getting this uploaded to the pilot online resource I’m working on. I experimented with a new method of extracting the data from Excel to try and minimise the number of rows that were getting garbled due to Excel’s horrible attempts to save files as HTML. As previously documented, the spreadsheet uses formatting in a number of columns (e.g. superscript, strikethrough). This formatting is lost if the contents of the spreadsheet are copied in a plain text way (so no saving as a CSV, or opening the file in Google Docs or just copying the contents). The only way to extract the formatting in a way that can be used is to save the file as HTML in Excel and then work with that. But the resulting HTML produced by Excel is awful, with hundreds of tags and attributes scattered across the file used in an inconsistent and seemingly arbitrary way.
For example, this is the HTML for one row:
<tr height=23 style='height:17.25pt'>
<td height=23 width=64 style='height:17.25pt;width:48pt'></td>
<td width=143 style='width:107pt'>Charles Wilson</td>
<td width=187 style='width:140pt'>Charles Wilson</td>
<td width=86 style='width:65pt'>Charles</td>
<td width=158 style='width:119pt'>Wilson<span
<td width=88 style='width:66pt'>Nat. Phil.</td>
<td width=129 style='width:97pt'>Natural Philosophy</td>
<td width=64 style='width:48pt'>B</td>
<td class=xl69 width=81 style='width:61pt'>10</td>
<td class=xl70 width=81 style='width:61pt'>3</td>
<td width=250 style='width:188pt'>Wells Xenophon vol. 3<font class="font6"><sup>d</sup></font></td>
<td width=125 style='width:94pt'>Mr Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'>Adam Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=89 style='width:67pt'>22 Mar 1757</td>
<td width=89 style='width:67pt'>10 May 1757</td>
<td align=right width=56 style='width:42pt'>2</td>
<td width=64 style='width:48pt'>4r</td>
<td class=xl71 width=64 style='width:48pt'>1</td>
<td class=xl70 width=64 style='width:48pt'>007</td>
<td class=xl65 width=325 style='width:244pt'><a
<td width=293 style='width:220pt'>Xenophon.</td>
<td width=392 style='width:294pt'>Opera quae extant omnia; unà cum
chronologiâa Xenophonteâ <span style='display:none'>cl. Dodwelli, et quatuor
tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</span></td>
<td width=110 style='width:83pt'>Wells, Edward, 16<span style='display:none'>67-1727.</span></td>
<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>
<td align=right width=64 style='width:48pt'>1</td>
<td align=right width=121 style='width:91pt'>1</td>
<td width=64 style='width:48pt'>T111427</td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
</tr>
Previously I tried to fix this by running through several ‘find and replace’ passes to try and strip out all of the rubbish, while retaining what I needed, which was <tr>, <td> and some formatting tags such as <sup> for superscript.
This time I found a regular expression that removes all attributes from HTML tags, so for example <td width=64 style=’width:48pt’> becomes <td> (see it here: https://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag). I could then pass the resulting contents of every <td> through PHP’s strip_tags function to remove any remaining tags that were not required (e.g. <span>) while specifying the tags to retain (e.g. <sup>).
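In Python the same two-step clean-up might look like this (the whitelist of tags to keep is an assumption based on the formatting mentioned above, and the second function only approximates PHP’s strip_tags):

```python
import re

# Tags allowed to survive; everything else is removed but its text is kept.
KEEP = {"tr", "td", "sup"}

def strip_attributes(html):
    """Reduce every tag to its bare name, e.g. <td width=64 style='…'> -> <td>."""
    return re.sub(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>", r"<\1\2>", html)

def strip_unwanted_tags(html):
    """Drop any remaining tags (font, span, a…) not on the whitelist while
    keeping their text content - roughly PHP's strip_tags($html, $allowed)."""
    def keep_or_drop(m):
        return m.group(0) if m.group(2).lower() in KEEP else ""
    return re.sub(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>", keep_or_drop, html)

def clean(html):
    return strip_unwanted_tags(strip_attributes(html))
```

Running the attribute-stripping pass first means the whitelist check only ever sees bare tag names, which keeps the second regex simple.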
This approach seemed to work very well until I analysed the resulting rows and realised that the columns of many rows were all out of synchronisation, meaning any attempt at programmatically extracting the data and inserting it into the correct field in the database would fail. After some further research I realised that Excel’s save as HTML feature was to blame yet again. Without there being any clear reason, Excel sometimes expands a cell into the next cell or cells if these cells are empty. An example of this can be found above and I’ve extracted it here:
<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>
The ‘colspan’ attribute means that the cell will stretch over multiple columns, in this case 2 columns, but elsewhere in the output file it was 3 and sometimes 4 columns. Where this happens the following cells simply don’t appear in the HTML. As my regular expression removed all attributes this ‘colspan’ was lost and the row ended up with subsequent cells in the wrong place.
Once I’d identified this I could update my script to check for the existence of ‘colspan’ before removing attributes, adding in the required additional empty cells as needed (so in the above case an extra <td></td>).
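A sketch of that colspan fix, assuming it runs on the raw HTML before the attributes are stripped (Python here; the real script was PHP):

```python
import re

def expand_colspans(row_html):
    """Replace each <td colspan=N …>…</td> with the cell itself plus N-1
    empty <td></td> cells, so every row keeps the same number of columns."""
    def expand(m):
        n = int(m.group(1))
        return m.group(0) + "<td></td>" * (n - 1)
    # match a td whose attributes include colspan=N, through its closing tag
    return re.sub(r"<td[^>]*\bcolspan=['\"]?(\d+)['\"]?[^>]*>.*?</td>",
                  expand, row_html, flags=re.S)
```

Done this way, the later attribute-stripping pass can safely discard the colspan attribute because the missing cells have already been put back.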
With all of this in place the resulting HTML was much cleaner. Here is the above row after my script had finished:
<td>Wells Xenophon vol. 3<sup>d</sup></td>
<td>22 Mar 1757</td>
<td>10 May 1757</td>
<td>Opera quae extant omnia; unà cum
chronologiâa Xenophonteâ cl. Dodwelli, et quatuor
tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</td>
<td>Wells, Edward, 1667-1727.</td>
I then updated my import script to pull in the new fields (e.g. normalised class and professor names), set it up so that it would not import any rows that had ‘Yes’ in the first column, and updated the database structure to accommodate the new fields too. The upload process then ran pretty smoothly and there are now 8145 records in the system. After that I ran the further scripts to generate dates, students, professors, authors, book names, book titles and classes and updated the front end as previously discussed. I still have the old data stored in separate database tables as well, just in case we need it, but I’ve tested out the front-end and it all seems to be working fine to me.
I worked on several different projects this week. One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus. Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels. This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc. that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates. I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database. Where this happens the script now adds the text ‘full date mismatch’ with a red background at the end of the date’s section.
Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field. Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say. Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question. I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels). The remaining rows can mostly be split into the following groups:
- Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
- Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number, another lexeme has ‘1340c’ instead of ‘c1340’
- Corrections made to the original fulldate that were not replicated in the actual date columns. E.g. ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-‘ in the ‘con’ column.
- Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
- Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E.g. a lexeme with the fulldate ‘1865 + 1865 rare’.
- Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’. This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date). I tried fixing this but ended up breaking other things so this is something that will need manual intervention. I don’t think it occurs very often, though. It’s a shame the same symbol was used to mean two different things.
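For the first group, where the two strings look identical but fail the comparison, a small diagnostic like this can expose the invisible culprit (a Python sketch, not part of the actual scripts):

```python
import unicodedata

def show_hidden(s):
    """Replace any non-ASCII or non-printing character with its Unicode name,
    making invisible differences between two 'identical' strings visible."""
    out = []
    for ch in s:
        if ch.isascii() and ch.isprintable():
            out.append(ch)
        else:
            out.append("[" + unicodedata.name(ch, "U+%04X" % ord(ch)) + "]")
    return "".join(out)
```

Running both the original and generated fulldate through this would, for example, reveal a no-break space masquerading as an ordinary space.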
It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes. Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields). Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates. After that we’ll be ready to generate the new date fields for real.
I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project. I went through all of the sample data and compiled a list of all of the fields found in each. This is a first step towards identifying a core set of fields and of mapping the analogous fields across different datasets. I also included the GU students and professors from Matthew’s pilot project but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing. With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.
I also continued to work on the Place-Names of Mull and Ulva project. I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text. I also began to work on the project’s API and front end. By the end of the week I managed to get an ‘in development’ version of the quick search working. Markers appear with labels and popups and you can change base map or marker type. Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates). The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.
Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network resource and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the video page of SCOSYA.
I divided most of my time between three projects this week. For the Place-Names of Mull and Ulva my time was spent working with the GB1900 dataset. On Friday last week I’d created a script that would go through the entire 2.5 million row CSV file and extract each entry, adding it to a database for easier querying. This process had finished on Monday, but unfortunately things had gone wrong during the processing. I was using the PHP function ‘fgetcsv’ to extract the data a line at a time. This splits the CSV up based on a delimiting character (in this case a comma) and adds each part into an array, thus allowing the data to be inserted into my database. Unfortunately some of the data contained commas. In such cases the data was enclosed in double quotes, which is the standard way of handling such things, and I had thought the PHP function would automatically handle this, but alas it didn’t, meaning whenever a comma appeared in the data the row was split up into incorrect chunks and the data was inserted incorrectly into the database. After realising this I added another option to the ‘fgetcsv’ command to specify a character to be identified as the ‘enclosure’ character and set the script off running again. It had completed the insertion by Wednesday morning, but when I came to query the database again I realised that the process had still gone wrong. Further investigation revealed the cause to be the GB1900 CSV file itself, which was encoded with UCS-2 character encoding rather than the more usual UTF-8. I’m not sure why the data was encoded in this way, as it’s not a current standard and it results in a much larger file size than using UTF-8. It also meant that my script was not properly identifying the double quote characters, which is why it failed a second time.
However, after identifying this issue I converted the CSV to UTF-8, picked out a section with commas in the data, tested my script, discovered things were working this time, and let the script loose on the full dataset yet again. Thankfully it proved to be ‘third time lucky’ and all 2.5 million rows had been successfully inserted by Friday morning.
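For comparison, here is how the same file could be read in Python — the encoding is declared when the file is opened, and the csv module then handles the quoted commas itself (analogous to setting the enclosure character for fgetcsv). The function name and use of a generator are my own:

```python
import csv

def read_gb1900(path):
    """Iterate over rows of the GB1900 CSV, which ships in UCS-2 (readable
    as UTF-16) rather than UTF-8, with commas inside double-quoted fields."""
    with open(path, newline="", encoding="utf-16") as f:
        # default delimiter "," and quotechar '"' match the file's format
        for row in csv.reader(f):
            yield row
```

Because the rows are yielded one at a time, a 600Mb file can be processed without ever loading it all into memory.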
After that I was then able to extract all of the place-names for the three parishes we’re interested in, which is a rather more modest 3,908 rows. I then wrote another script that would take this data and insert it into the project’s place-name table. The place-names are a mixture of Gaelic and English (e.g. ‘A’ Mhaol Mhòr’ is pretty clearly Gaelic while ‘Argyll Terrace’ is not) and for now I set the script to just add all place-names to the ‘English’ rather than the ‘Gaelic’ field. The script also inserts the latitude and longitude values from the GB1900 data, and associates the appropriate parish. I also found a bit of code that takes latitude and longitude figures and generates a 6-figure OS grid reference from them. I tested this out and it seemed pretty accurate, so I also added this to my script, meaning all names also have the grid reference field populated.
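The final step of such a conversion — from OSGB36 easting/northing to a 6-figure reference — can be sketched as follows. This is not the code I found, just an illustration of the standard grid-letter calculation; the initial projection from latitude/longitude to easting/northing needs a proper OSGB36 transformation first (e.g. via the EPSG:27700 projection):

```python
def os_grid_ref(easting, northing):
    """Convert OSGB36 easting/northing (in metres) to a 6-figure OS grid
    reference such as 'NN166712' (100m precision)."""
    e100k, n100k = int(easting // 100000), int(northing // 100000)
    # work out the two grid letters (the 500km and 100km squares)
    l1 = (19 - n100k) - (19 - n100k) % 5 + (e100k + 10) // 5
    l2 = (19 - n100k) * 5 % 25 + e100k % 5
    # the letter 'I' is skipped in the OS grid alphabet
    if l1 > 7:
        l1 += 1
    if l2 > 7:
        l2 += 1
    letters = chr(65 + l1) + chr(65 + l2)
    # three digits each of easting and northing give 100m precision
    return "%s%03d%03d" % (letters, (easting % 100000) // 100, (northing % 100000) // 100)
```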
The other thing I tried to do was to grab the altitude for each name via the Google Maps service. This proved to be a little tricky as the service blocks you if you make too many requests all at once. Also, our server was blacklisting my computer for making too many requests in a short space of time, meaning for a while afterwards I was unable to access any page on the site or the database. Thankfully Arts IT Support managed to stop me getting blocked and I managed to set the script to query Google Maps at a rate that was acceptable to it, so I was able to grab the altitudes for all 3,908 place-names (although 16 of them are at 0m, so it may look as if the process hasn’t worked for these). I also added in a facility to upload, edit and delete one or more sound files for each place-name, together with optional captions for them in English and Gaelic. Sound files must be in the MP3 format.
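The throttling amounts to pausing between requests, along these lines (a hedged sketch: the delay value and function names are illustrative, and the real script was PHP querying Google Maps):

```python
import time

def fetch_altitudes(points, fetch, delay=0.5):
    """Query an elevation service politely, one request at a time with a
    pause between each, to avoid being rate-limited or blacklisted.
    `fetch` is whatever function performs the actual HTTP request; the
    0.5s default is an illustrative value, not the rate actually used."""
    altitudes = {}
    for i, (lat, lon) in enumerate(points):
        if i:
            time.sleep(delay)  # pause between requests, not before the first
        altitudes[(lat, lon)] = fetch(lat, lon)
    return altitudes
```

Keeping the HTTP call injectable also makes the throttling logic trivial to test without hitting the live service.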
The second project I worked on this week was my redevelopment of the ‘Digital Humanities at Glasgow’ site. I have now finished going through the database of DH projects, trimming away the irrelevant or broken ones and creating new banners, icons, screenshots, keywords and descriptions for the rest. There are now 75 projects listed, including 15 that are currently set as ‘Showcase’ projects, meaning they appear in the banner slideshow and on the ‘showcase’ page. I also changed the site header font and fixed an issue with the banner slideshow and images getting too small on narrow screens. I’ve asked Marc Alexander and Lorna Hughes to give me some feedback on the new site and I hope to be able to launch it in two weeks or so.
My third major project of the week was the Historical Thesaurus. Marc, Fraser and I met last Friday to discuss a new way of storing dates that I’ve been wanting to implement for a while, and this week I began sorting this out. I managed to create a script that can process any date, including associating labels with the appropriate date. Currently the script allows you to specify a category (or to load a random category) and the dates for all lexemes therein are then processed and displayed on screen. As of yet nothing is inserted into the database. I have also updated the structure of the (currently empty) dates table to remove the ‘date order’ field. I have also changed all date fields to integers rather than varchars to ensure that ordering of the columns is handled correctly. At last Friday’s meeting we discussed replacing ‘OE’ and ‘_’ with numerical values. We had mentioned using ‘0000’ for OE, but I’ve realised this isn’t a good idea as ‘0’ can easily be confused with null. Instead I’m using ‘1100’ for OE and ‘9999’ for ‘current’. I’ve also updated the lexeme table to add in new fields for ‘firstdate’ and ‘lastdate’ that will be the cached values of the first and last dates stored in the new dates table.
The script displays each lexeme in a category with its ‘full date’ column. It then displays what each individual entry in the new ‘date’ table would hold for the lexeme in boxes beneath this, and then finishes off with displaying what the new ‘firstdate’ and ‘lastdate’ fields would contain. Processing all of the date variations turned out to be somewhat easier than it was for generating timeline visualisations, as the former can be treated as individual dates (an OE, a first, a mid, a last, a current) while the latter needed to transform the dates into ranges, meaning the script had to check how each individual date connected to the next, had to possibly use ‘b’ dates etc.
I’ve tested the script out and so far I have only encountered one issue: there are 10 rows that have first, mid and last dates, but instead of the ‘firmidcon’ field joining the first and mid dates together the ‘firlastcon’ field is used, with the ‘midlascon’ field then joining the mid date to the last. This is an error, as ‘firlastcon’ should not be used to join first and mid dates. An example of this happening is htid 28903 in catid 8880, where the ‘full date’ is ‘1459–1642/7 + 1856’. There may be other occasions where the wrong joining column has been used, but I haven’t checked for these so far.
After getting the script to sort out the dates I then began to look at labels. I started off using the ‘label’ field in order to figure out where in the ‘full date’ the label appeared. However, I noticed that where there are multiple labels these appear all joined together in the label field, meaning in such cases the contents of the label field will never be matched to any text in the ‘full date’ field. E.g. htid 6463 has the full date ‘1611 Dict. + 1808 poet. + 1826 Dict.’ and the label field is ‘Dict. poet. Dict.’, which is no help at all.
Instead I abandoned the ‘label’ field and just used the ‘full date’ field. Actually, I still use the ‘label’ field to check whether the script needs to process labels or not. Here’s a description of the logic for working out where a label should be added:
The dates are first split up into their individual boxes. Then, if there is a label for the lexeme I go through each date in turn. I split the full date field and look at the part after the date. I go through each character of this in turn. If the character is a ‘+’ then I stop. If I have yet to find label text (they all start with an a-z character) and the character is a ‘-‘ and the following character is a number then I stop. Otherwise if the character is a-z I note that I’ve started the label. If I’ve started the label and the current character is a number then I stop. Otherwise I add the current character to the label and proceed to the next character until all remaining characters are processed or a ‘stop’ criteria is reached. After that if there is any label text it’s added to the date. This process seems to work. I did, however, have to fix how labels applied to ‘current’ dates are processed. For a current date my algorithm was adding the label to the final year and not the current date (stored as 9999) as the label is found after the final year and ‘9999’ isn’t found in the full date string. I added a further check for ‘current’ dates after the initial label processing that moves labels from the penultimate date to the current date in such cases.
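The character-walking rules described above can be sketched like this (a Python illustration of logic that was actually implemented in PHP; it takes the part of the full date string that follows one date):

```python
def extract_label(after_date):
    """Walk the text following a date in the 'full date' string, applying
    the stopping rules described above, and return any label found."""
    label = []
    started = False
    for i, ch in enumerate(after_date):
        if ch == "+":
            break  # the next date's connector: no label for this date
        if not started and ch in "-–" and i + 1 < len(after_date) and after_date[i + 1].isdigit():
            break  # a dash leading straight into another date
        if ch.isalpha():
            started = True  # labels start with a letter
        if started and ch.isdigit():
            break  # a number ends the label
        if started:
            label.append(ch)
    return "".join(label).strip()
```

So for ‘1611 Dict. + 1808 poet.’ the walk after ‘1611’ yields ‘Dict.’, while after ‘OE’ in ‘OE–1614’ the dash-then-digit rule stops the walk immediately and no label is returned.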
In addition to these three big projects I also had an email conversation with Jane Roberts about some issues she’d been having with labels in the Thesaurus of Old English, I liaised with Arts IT Support to get some server space set up for Rachel Smith and Ewa Wanat’s project, I gave some feedback on a job description for an RA for the Books and Borrowing project, helped Carole Hough with an issue with a presentation of the Berwickshire Place-names resource, gave the PI response for Thomas’s Iona project a final once-over, gave a report on the cost of Mapbox to Jennifer Smith for the SCOSYA project and arranged to meet Matthew Creasey next week to discuss his Decadence and Translation project.
I spent quite a bit of time this week continuing to work on the systems for the Place-names of Mull and Ulva project. The first thing I did was to figure out how WordPress sets the language code. It has a function called ‘get_locale()’ that brings back the code (e.g. ‘En’ or ‘Gd’). Once I knew this I could update the site’s footer to display a different logo and text depending on the language the page is in. So now if the page is in English the regular UoG logo and English text crediting the map and photo are displayed, whereas if the page is in Gaelic the Gaelic UoG logo and credit text are displayed. I think this is working rather well.
I managed to get all of the new Gaelic fields added into the CMS and fully tested by Thursday and asked Alasdair to test things out. I also had a discussion with Rachel Opitz in Archaeology about incorporating LIDAR data into the maps and started to look at how to incorporate data from the GB1900 project for the parishes we are covering. GB1900 (http://www.gb1900.org/) was a crowdsourced project to transcribe every place-name that appears on OS maps from 1888-1914, which resulted in more than 2.5 million transcriptions. The dataset is available to download as a massive CSV file (more than 600Mb). It includes place-names for the three parishes on Mull and Ulva and Alasdair wanted to populate the CMS with this data as a starting point. On Friday I started to investigate how to access the information. Extracting the data manually from such a large CSV file wasn’t feasible so instead I created a MySQL database and wrote a little PHP script that iterated through each line of the CSV and added it to the database. I left this running over the weekend and will continue to work with it next week.
Also this week I continued to add new project records to the new Digital Humanities at Glasgow site. I only have about 30 more sites to add now, and I think it’s shaping up to be a really great resource that we will hopefully be able to launch in the next month or so.
I also spent a bit of further time on the SCOSYA project. I’d asked the university’s research data management people whether they had any advice on how we could share our audio recording data with other researchers around the world. The dataset we have is about 117GB, and originally we’d planned to use the University’s file transfer system to share the files. However, this can only handle files that are up to 20GB in size, which meant splitting things up. And it turned out to take an awfully long time to upload the files, a process we would have to repeat each time the data was requested. The RDM people suggested we use the University’s OneDrive system instead. This is part of Office365 and gives each member of staff 1TB of space, and it’s possible to share uploaded files with others. I tried this out and the upload process was very swift. It was also possible to share the files with users based on their email addresses, and to set expiration dates and passwords for file access. It looks like this new method is going to be much better for the project and for any researchers who want to access our data. We also set up a record about the dataset in the Enlighten Research Data repository: http://researchdata.gla.ac.uk/951/ which should help people find the data.
Also for SCOSYA we ran into some difficulties with Google’s reCAPTCHA service, which we were using to protect the contact forms on our site from spam submissions. There was an issue with version 3 of Google’s reCAPTCHA system when integrated with the contact form plugin: it works fine if Google thinks you’re not a spammer, but if you somehow fail its checks it doesn’t give you the option of proving you’re a real person, it just blocks the submission of the form. I haven’t been able to find a solution for this using v3, but thankfully there is a plugin that allows the contact form plugin to revert back to using reCAPTCHA v2 (the ‘I am not a robot’ tickbox). I got this working and have applied it to both the contact form and the spoken corpus form, and it works both for me as someone Google seems to trust and when using IE via remote desktop, where Google makes me select features in images before the form submits.
Also this week I met with Marc and Fraser to discuss further developments for the Historical Thesaurus. We’re going to look at implementing the new way of storing and managing dates that I originally mapped out last summer and so we met on Friday to discuss some of the implications of this. I’m hoping to find some time next week to start looking into this.
We received the reviews for the Iona place-name project this week and I spent some time during the week and over the weekend going through them, responding to any technical matters that were raised and helping Thomas Clancy with the overall response, which needed to be submitted the following Monday. I also spoke to Ronnie Young about the Burns Paper Database, which we may now be able to make publicly available, and made some updates to the NME digital ode site for Bryony Randall.