I again split my time mostly between REELS, Linguistic DNA and the Historical Thesaurus this week. For REELS, Carole had sent an email with lots of feedback and suggestions, so I spent some time addressing these. This included replacing the icon I’d chosen for settlements, and updating the default map zoom level to be a bit further out, so that the entire county fits on screen initially. I also updated the elements glossary ordering so that Old English “æ” and “þ” appear as if they were ‘ae’ and ‘th’ rather than at the end of the lists, and set the ordering to ignore diacritics, which were messing up the ordering a little. I also took the opportunity to update the display of the glossary so that the whole entry box for each item isn’t a link. This is because I’ve realised that some entries (e.g. St Leonard) have their own ‘find out more’ link and having a link within a link is never a good idea. Instead, there is now a ‘Search’ button at the bottom right of the entry, and if the ‘find out more’ button is present it appears next to this. I’ve changed the styling of the number of place-names and historical forms in the top right to make them look less like buttons too.
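The glossary ordering rule can be sketched in a few lines. This is only an illustration in Python, not the code the site runs: the function name, the mapping table and the sample elements are all made up for the example, and I'm assuming that lower-casing and stripping combining marks is all that 'ignoring diacritics' needs to mean.

```python
import unicodedata

# Letters that have no Unicode decomposition, so they need an explicit mapping.
# Only the two letters mentioned in the post are covered (assumption: no others are needed).
SPECIAL_LETTERS = {"æ": "ae", "Æ": "ae", "þ": "th", "Þ": "th"}


def glossary_sort_key(element: str) -> str:
    """Return the string an element should be sorted by (hypothetical helper)."""
    # Treat 'æ' as 'ae' and 'þ' as 'th' so they no longer sort after 'z'.
    replaced = "".join(SPECIAL_LETTERS.get(ch, ch) for ch in element)
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFD", replaced)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Sample elements, chosen only to exercise the rule.
elements = ["þorn", "tūn", "æcer", "brōc", "āc", "burh"]
sorted_elements = sorted(elements, key=glossary_sort_key)
print(sorted_elements)
```

The displayed form of each element is untouched; only the key used for comparison changes.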
I also updated the default view of the map so that the ‘unselected’ data doesn’t appear on the map by default. You now have to manually tick the checkbox in the legend to add these in if you want them. When they are added in they appear ‘behind’ the other map markers rather than appearing on top of them, which was previously happening if you turned off the grey dots then turned them on again.
Leaflet has a method called ‘bringToBack’, which can be used to change the ordering of markers. Unfortunately you can’t apply this to an entire layer group (i.e. apply it to all grey dots in my grey dots group with one call). It took me a bit of time to figure out why this wasn’t working, but eventually I figured out I needed to call the ‘eachLayer’ method on my layer group to iterate over the contents and apply the ‘bringToBack’ method to each individual grey dot.
In addition to this update, I also set it so that changing marker categorisation in the ‘Display Options’ section now keeps the ‘Unselected’ dots off unless you choose to turn them on. I think this will be better for most users. I know when testing the map and changing categorisation the first thing I always then did was turn off the grey dots to reduce the clutter.
Carole had also pointed out an issue with the browse for sources, in that one source was appearing out of its alphabetical order and with more associated place-names than it should have. It turned out that this was a bug introduced when I’d previously added a new field for the browse list that strips out all tags (e.g. italics) from the title. This field gets populated when the source record is created or edited in the CMS. Unfortunately, I’d forgotten that sources can be added and edited directly through the add / edit historical forms page too, and I hadn’t added in the code to populate the field in these places. This meant that the field was being left blank, resulting in strange ordering and place-name numbering in the browse source page.
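One way to stop this kind of bug recurring is to populate the tag-free field inside a single save routine that every form calls. The sketch below is Python for illustration only: `browse_title`, `save_source` and the record layout are hypothetical names, not the CMS's real code, and a simple regular expression stands in for whatever tag stripping the CMS actually uses.

```python
import re
from html import unescape

# Matches anything that looks like a tag, e.g. <i> or </i>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def browse_title(title: str) -> str:
    """Return the title with markup removed, for use in ordering the browse list."""
    return unescape(TAG_PATTERN.sub("", title)).strip()


def save_source(record: dict) -> dict:
    """Hypothetical single save routine shared by every page that can add or edit a source."""
    # Populating the field here means no form can leave it blank by accident.
    record["browse_title"] = browse_title(record["title"])
    return record


# Made-up source title, for illustration only.
saved = save_source({"title": "<i>Example Charters</i>, vol. 2"})
print(saved["browse_title"])
```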
The biggest change that Carole had suggested was to the way in which date searches work. Rather than having the search and browse options allow the user to find place-names that have historical forms with a start / end date within the selected date or date range, Carole reckoned that identifying the earliest date for a place-name would be more useful. This was actually a pretty significant change, requiring a rewrite of large parts of the API, but I managed to get it all working. End dates have now been removed from the search and browse. The ‘browse start date’ looks for the earliest recorded start date rather than bringing back a count of place-names that have any historical form with the specified year, which I agree is much more useful. The advanced search now allows you to specify a single year, a range of years, or you can use ‘<’ and ‘>’ to search for place-names whose earliest historical form has a start date before or after a particular date.
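To make the new search behaviour concrete, here is a rough Python sketch of matching a query against each place-name's earliest start date. It is not the API code: the function names and sample data are invented, and I'm assuming that ranges are inclusive and that ‘<’ and ‘>’ are strict.

```python
import re
from typing import Callable, Dict, List


def parse_year_query(query: str) -> Callable[[int], bool]:
    """Turn '1296', '1200-1400', '<1300' or '>1300' into a test on a year."""
    query = query.strip()
    before = re.fullmatch(r"<\s*(\d{1,4})", query)
    if before:
        limit = int(before.group(1))
        return lambda year: year < limit
    after = re.fullmatch(r">\s*(\d{1,4})", query)
    if after:
        limit = int(after.group(1))
        return lambda year: year > limit
    span = re.fullmatch(r"(\d{1,4})\s*-\s*(\d{1,4})", query)
    if span:
        low, high = int(span.group(1)), int(span.group(2))
        return lambda year: low <= year <= high
    if re.fullmatch(r"\d{1,4}", query):
        exact = int(query)
        return lambda year: year == exact
    raise ValueError(f"Unrecognised date query: {query!r}")


def search(place_names: Dict[str, List[int]], query: str) -> List[str]:
    """Return names whose EARLIEST historical form start date satisfies the query."""
    matches = parse_year_query(query)
    return sorted(
        name for name, start_years in place_names.items()
        if start_years and matches(min(start_years))
    )


# Hypothetical data: place-name -> start dates of its historical forms.
sample = {
    "Place A": [1150, 1296, 1590],
    "Place B": [1300, 1450],
    "Place C": [1755],
}
print(search(sample, "1200-1400"))
```

Note that ‘Place A’ has a form dated 1296 but is not returned for ‘1200-1400’, because its earliest form is from 1150. That is the difference from the old behaviour.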
I also finally got round to replacing the base maps with free alternatives this week. I was previously using MapBox maps for all but one of our base maps, but as MapBox only allows 50,000 map views in a month, and I’d managed almost 10,000 myself, we agreed that we couldn’t rely so heavily on the service, as the project has no ongoing funds. Thanks to some very useful advice from Chris Fleet at the NLS, I managed to switch to some free alternatives, including three that are hosted by the NLS Maps people themselves. The Default view is now Esri Topomap, the satellite view is now Esri WorldImagery (both free). Satellite with labels is still MapBox (the only one now). I’ve also included modern OS maps, courtesy of NLS, OS maps 1840-1880 from NLS and OS maps 1920-1933 as before. We now have six base maps to choose from, and I think the resource is looking pretty good. Here’s an example with OS Maps from the 1840s to 1880s selected:
For Linguistic DNA this week I continued to monitor my script that I’d set running last week to extract frequency data about the usage of Thematic Headings per decade in all of the EEBO data I have access to. I had hoped that the process would have completed by Monday, and it probably would have done, were it not for the script running out of memory as it tried to tackle the category ‘AP:04 Number’. This category is something of an outlier, and contains significantly more data than the other categories. It contains more than 2,600,000 rows, of which almost 200,000 are unique. My script stores all unique words in an associative array, with frequencies for each decade then added to it. The more unique words the larger the array and the more memory required. I skipped over the category and my script successfully dealt with the remaining categories, finishing the processing on Wednesday. I then temporarily updated the PHP settings to remove memory restrictions and set my script to deal with ‘AP:04’, which took a while but completed successfully, resulting in a horribly large spreadsheet containing almost 200,000 rows. I zipped the resulting 2,077 CSV files up and sent them on to the DHI people in Sheffield, who are going to incorporate this data into the LDNA resource.
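The structure that ran out of memory is easy to picture. The following is a simplified Python stand-in for my PHP associative array, with invented names and sample rows. It assumes each input row carries a word, a year and a count, which may not match the real table layout.

```python
FIRST_DECADE, LAST_DECADE = 1470, 1700


def accumulate(rows):
    """Build {word: {decade: frequency}} from (word, year, count) rows."""
    table = {}
    for word, year, count in rows:
        decade = (year // 10) * 10
        if not FIRST_DECADE <= decade <= LAST_DECADE:
            continue  # outside the decades being exported
        # One entry per unique word: this is what grows with the size of the category.
        per_decade = table.setdefault(word, {})
        per_decade[decade] = per_decade.get(decade, 0) + count
    return table


# Invented sample rows.
rows = [("one", 1475, 3), ("one", 1479, 2), ("two", 1601, 1), ("one", 1700, 4), ("one", 1800, 9)]
frequencies = accumulate(rows)
print(frequencies)
```

The number of rows matters less than the number of unique words, since every unique word gets its own entry holding up to one counter per decade.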
For the Historical Thesaurus I continued to work on the new Timeline feature, this time adding in mini-timelines that will appear beside each word on the category page. Marc suggested using the ‘Bullet Chart’ option that’s available in the jQuery Sparkline library found here: https://omnipotent.net/jquery.sparkline/#s-about and I’ve been looking into this.
Initially I ran into some difficulty with the limited number of options available. E.g. you can’t specify a start value for the chart, only an end value (although I later discovered that there is an undocumented setting for this in the source code), and individual blocks also don’t have start and end points but instead are single points that take their start value from the previous block. Also, data needs to be added in reverse order or things don’t display properly.
I must admit that trying to figure out how to hack about with our data to fit it in as the library required gave me a splitting headache and I eventually abandoned the library and wondered whether I could just make a ‘mini’ version using the D3 timeline plugin I was already using. After all there are lots of examples of single bar timelines in the documentation: https://github.com/denisemauldin/d3-timeline. However, after more playing around with this library I realised that it just wasn’t very well suited to being shrunk to an inline size. Things started to break in weird ways when the dimensions were made very small and I didn’t want to have to furtle about with the library’s source code too much, after already having had to do so for the main timeline.
So, after taking some ibuprofen I returned to the ‘Bullet Chart’ and finally managed to figure out how to make our data work and get it all added in reverse order. As the start has to be zero, I made the end of the chart 1000, and all data has 1000 years taken off it. If I hadn’t done this then OE would have started midway through the chart. Individual years were not displaying due to being too narrow so I’ve added a range of 50 years on to them, which I later reduced to 20 years after feedback from Fraser. I also managed to figure out how to reduce the thickness of the bar running along the middle of the visualisation. This wasn’t entirely straightforward as the library uses HTML Canvas rather than SVG. This means you can’t just view the source of the visualisation using the browser’s ‘select element’ feature and tinker with it. Instead I had to hack about with the library’s source code to change the coordinates of the rectangle that gets created. Here’s an example of where I’d got to during the week:
I positioned the timelines to the right of each word’s section, next to the magnifying glass. There’s a tooltip that displays the fulldate field on hover. I figured out how to position the ‘line’ at the bottom of the timeline, rather than in the middle, and I’ve disabled highlighting of sections on mouse over and have made the background look transparent. It’s not, actually. I tried this but the ‘white’ blocks actually cover up unwanted sections of the other colour so setting things to transparent messed up the timeline. Instead the code works out if the row is odd or even and grabs the row colour based on this. I had to remove the shades of grey from the subcat backgrounds to make this work. But actually I think the page looks better without the subcats being in grey. So, here is an example of the mini timelines in the category test page:
I think it’s looking pretty good. The only downside is these mini-timelines sort of make my original full timeline a little obsolete.
I worked on a few other projects this week as well. I sorted out access to the ‘Editing Burns’ website for a new administrator who has started, and I investigated some strange errors with the ‘Seeing Speech’ website whereby the video files were being blocked. It turned out to be down to a new security patch that had been installed on the server and after Chris updated this things started working again.
I also met with Megan Coyer to discuss her ‘Hogg in Fraser’s Magazine’ project. She had received XML files containing OCR text and metadata for all of the Fraser’s Magazine issues and wanted me to process the files to convert them to a format that she and her RA could more easily use. Basically she wanted the full OCR text, plus the Record ID, title, volume, issue, publication date and contributor information to be added to one Word file.
There were 17,072 XML files and initially I wrote a script that grabbed the required data and generated a single HTML file, which I was then going to convert into DOCX format. However, the resulting file was over 600MB in size, which was too big to work with. I decided therefore to generate individual documents for each volume in the data. This results in 81 files (including one for all of the XML files that don’t seem to include a volume). The files are a more manageable size, but are still thousands of pages long in Word. This seemed to suit Megan’s needs and I moved the Word files to her shared folder for her to work with.
Monday this week was the May Day holiday, so it was a four-day week for me. I divided my time primarily between REELS and Linguistic DNA and updates to the Historical Thesaurus timeline interface. For REELS I contacted Chris Fleet at the NLS about using one of their base maps in our map interface. I’d found one that was apparently free to use and I wanted to check we had the details and attribution right. Thankfully we did, and Chris very helpfully suggested another base map of theirs that we might be able to incorporate too. He also pointed me towards an amazing crowdsourced resource that they had set up that has gathered more than 2 million map labels from the OS six-inch to the mile, 1888-1913 maps (see http://geo.nls.uk/maps/gb1900/). It’s very impressive.
I also tackled the issue of adding icons to the map for classification codes rather than just having coloured spots. This is something I’d had in mind from the very start of the project, but I wasn’t sure how feasible it would be to incorporate. I started off by trying to add in Font Awesome icons, which is pretty easy to do with a Leaflet plugin. However, I soon realised that Font Awesome just didn’t have the range of icons that I required for things like ‘coastal’, ‘antiquity’, ‘ecclesiastical’ and the like. Instead I found some more useful icons: https://mapicons.mapsmarker.com/category/markers/. The icons are released under a Creative Commons license and are free to use. Unfortunately they are PNG rather than SVG icons, so they won’t scale quite as nicely, but they don’t look too bad on an iPad’s ‘retina’ display, so I think they’ll do. I created custom markers for each icon and gave them additional styling with CSS. I updated the map legend to incorporate them as well, and I think they’re looking pretty good. It’s certainly easier to tell at a glance what each marker represents. Here’s a screenshot of how things currently look (but this of course still might change):
I also slightly changed all of the regular coloured dots on the map to give them a dark grey border, which helps them stand out a bit more on the maps, and I have updated the way map marker colours are used for the ‘start date’ and ‘altitude’ maps. If you categorise the map by start date the marker colours now have a fixed gradient, ranging from dark blue for 1000-1099 to red for after 1900 (the idea being things that are in the distant past are ‘cold’ and more recent things are still ‘hot’). Hopefully this will make it easier to tell at a glance which names are older and which are more recent. Here’s an example:
For the ‘categorised by altitude’ view I made the fixed gradient use the standard way of representing altitude on maps – ranging from dark green for low altitude, through browns and dark reds for high altitude, as this screenshot shows:
From the above screenshots you can see that I’ve also updated the map legend so that the coloured areas match the map markers, and I also added a scale to the map, with both metric and imperial units shown, which is what the team wanted. There are still some further changes to be made, such as updating the base maps, and I’ll continue with this next week.
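The fixed start-date gradient boils down to putting each year into a century bucket and giving each bucket a fixed colour. Here's a Python sketch of the idea. Only the two end colours come from the description above; the exact RGB values, the straight linear blend between them and the treatment of dates before 1000 are my assumptions for the example, not what the map necessarily uses.

```python
DARK_BLUE = (0, 0, 139)   # coldest bucket: 1000-1099 (RGB value is an assumption)
RED = (255, 0, 0)         # hottest bucket: 1900 onwards (RGB value is an assumption)
BUCKET_STARTS = list(range(1000, 2000, 100))  # 1000, 1100, ..., 1900


def bucket_index(year: int) -> int:
    """Return the century bucket for a year, clamped to the first and last buckets."""
    index = (year - BUCKET_STARTS[0]) // 100
    return max(0, min(index, len(BUCKET_STARTS) - 1))


def bucket_colour(year: int) -> str:
    """Return a fixed hex colour for the bucket the year falls into."""
    position = bucket_index(year) / (len(BUCKET_STARTS) - 1)
    rgb = tuple(round(a + (b - a) * position) for a, b in zip(DARK_BLUE, RED))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


print(bucket_colour(1050), bucket_colour(1950))
```

Because the colour depends only on the bucket and not on which names happen to be in the current result set, the same date always gets the same colour from one search to the next.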
For Linguistic DNA and the Historical Thesaurus I met with Marc and Fraser on Wednesday morning to discuss updates. We agreed that I would return to working on the sparklines in the next few weeks and I received a few further suggestions regarding the Historical Thesaurus timeline feature. Marc had noticed that if your cursor was over the timeline then it wasn’t possible to scroll the page, even though a long timeline might go off the bottom of the screen. If you moved your cursor to the sides of the timeline graphic scrolling worked normally, though. It turned out that the SVG image was grabbing all of the pointer events so the HTML in the background never knew the scroll event was happening. By setting the SVG to ‘pointer-events: none’ in the CSS the scroll events cascade down to the HTML and scrolling can take place. However, this then stops the SVG being able to process click events, meaning the tooltips break. Thankfully adding in ‘pointer-events: all’ to the bars, spots and OE label fixes this, apart from one oddity: if your cursor is positioned over a bar, spot or the OE label and you try to scroll then nothing happens. This is a relatively minor thing, though. I also updated the timeline font so that it uses the font we use elsewhere on the site.
I also made the part of speech in the timeline heading lower-case to match the rest of the site, and I also realised that the timeline wasn’t using the newer versions of the abbreviations we’d decided upon (e.g. ‘adj.’ rather than ‘aj.’) so I updated this, and also added in the tooltip. Finally, I addressed another bug whereby very short timelines were getting cut off. I added extra height to the timeline when there are only a few rows, which stops this happening.
I had a Skype meeting with Mike Pidd and his team at DHI about the EEBO frequency data for Linguistic DNA on Wednesday afternoon. We agreed that I would write a script that would output the frequency data for each Thematic Heading per decade as a series of CSV files that I would then send on to the team. We also discussed the Sparkline interface and the HT’s API a bit more, and I gave some further explanation as to how the sparklines work. After the meeting I started work on the export script, which does the following:
- It goes through every thematic heading that is up to the third hierarchy down.
- If the heading in question is a third level one then all lexemes from any lower levels are added into the output for this level
- Each CSV is given the heading number as a filename, but with dashes instead of colons as colons are bad characters for filenames
- Columns one and two are the heading and title of the thematic heading. This is the same for every row in a file – i.e. words from lower down the hierarchy do not display their actual heading. E.g. words from ‘AA:03:e:01 Volcano’ will display ‘AA:03:e High/rising ground’
- Column 3 contains the word and 4 the POS.
- Column 5 contains the number of senses in the HT. I had considered excluding words that had zero senses in the HT as a means of cutting out a lot of noise from the data, but decided against this in the end, as it would also remove a lot of variant spellings and proper names, which might turn out to be useful at some point. It will be possible to filter the data to remove all zero sense rows at a later date.
- The next 24 columns contain the data per decade, starting at 1470-1479 and ending with 1700-1709
- The final column contains the total frequency count
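The file naming and row layout listed above can be sketched as follows. This is Python for illustration rather than my actual PHP export script: the helper names are invented, and the word, part of speech and counts in the example row are made up.

```python
import csv
import io

# 24 decades: 1470-1479 up to 1700-1709.
DECADES = list(range(1470, 1710, 10))


def third_level(heading: str) -> str:
    """Truncate a heading code to at most three levels, e.g. 'AA:03:e:01' -> 'AA:03:e'."""
    return ":".join(heading.split(":")[:3])


def csv_filename(heading: str) -> str:
    """Use the heading number as the filename, with dashes in place of colons."""
    return heading.replace(":", "-") + ".csv"


def build_row(heading, title, word, pos, senses, per_decade):
    """Heading, title, word, POS, HT senses, 24 decade counts, then the total."""
    counts = [per_decade.get(decade, 0) for decade in DECADES]
    return [heading, title, word, pos, senses, *counts, sum(counts)]


# A word from a lower level is written under its third-level heading.
heading = third_level("AA:03:e:01")
row = build_row(heading, "High/rising ground", "volcano", "NN1", 1, {1600: 2, 1700: 5})

buffer = io.StringIO()  # stands in for the real output file
csv.writer(buffer).writerow(row)
print(csv_filename(heading))
print(buffer.getvalue())
```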
I started my script running on Thursday, and left my PC on overnight to try and get the processing complete. I left it running when I went home at 5pm, expecting to find several hundred CSV files had been outputted. Instead, Windows had automatically installed an update and restarted my PC at 5:30, thus cancelling the script, which was seriously annoying. It does not seem to be possible to stop Windows doing such things, as although there are plenty of Google results about how to stop Windows automatically restarting when installing updates, Microsoft changes Windows so often that all the listed ways I’ve looked at no longer work. It’s absolutely ridiculous as it means running batch processes that might take a few days is basically impossible to do with any reliability on a Windows machine.
Moving on to other tasks I undertook this week: I sorted out payment for the annual Apple Developer subscription, which is necessary for our apps to continue to be listed on the App Store. I also responded to a couple of app related queries from an external developer who is making an app for the University. I also sorted out the retention period for user statistics for Google Analytics for all of the sites we host, after Marc asked me to look at this.
I returned to work after my Easter holiday on Tuesday this week, making it another four-day week for me. On Tuesday I spent some time going through my emails and dealing with some issues that had arisen whilst I’d been away. This included sorting out why plain text versions of the texts in the Corpus of Modern Scottish Writing were giving 403 errors (it turned out the server was set up to not allow plain text files to be accessed and an email to Chris got this sorted). I also spent some time going through the Mapping Metaphor data for Wendy. She wanted me to structure the data to allow her to easily see which metaphors continued from Old English times and I wrote a script that gave a nice colour-coded output to show those that continued or didn’t. I also created another script that lists the number (and the details of) metaphors that begin in each 50-year period across the full range. In addition, I spoke to Gavin Miller about an estimate of my time for a potential follow-on project he’s putting together.
The rest of my week was split between two projects: LinguisticDNA and REELS. For LinguisticDNA I continued to work on the search facilities for the semantically tagged EEBO dataset. Chris gave me a test server on Tuesday (just an old desktop PC to add to the several others I now have in my office) and I managed to get the database and the scripts I’d started working on before Easter transferred onto it. With everything set up I continued to add new features to the search facility. I completed the second search option (Choose a Thematic Heading and a specific book to view the most frequent words) which allows you to specify a Thematic Heading, a book, a maximum number of returned words and whether the theme selection includes lower levels. I also made it so that you can miss out the selection of a thematic heading to bring back all of the words in the specified book listed by frequency. If you do this each word’s thematic heading is also listed in the output, and it’s a useful way of figuring out which thematic headings you might want to focus on.
I also added a new option to both searches 1 and 2 that allows you to amalgamate the different noun and verb types. There are several different types (e.g. NN1 and NN2 for singular and plural forms of nouns) and it’s useful to join these together as single frequency counts rather than having them listed separately.
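Amalgamating the noun and verb types is essentially a matter of collapsing the part-of-speech tag before summing. A small Python sketch of the idea follows. The prefix rule, the ‘VV’ verb prefix and the sample rows are assumptions for illustration (only NN1 and NN2 are mentioned above), and the real option may well be implemented in the database query instead.

```python
from collections import Counter

# Tag prefix -> merged label. Treating every 'NN…' tag as a noun and every
# 'VV…' tag as a verb is an assumption made for this example.
POS_GROUPS = {"NN": "noun", "VV": "verb"}


def merged_pos(tag: str) -> str:
    """Collapse a detailed tag to its group label, leaving other tags unchanged."""
    for prefix, label in POS_GROUPS.items():
        if tag.startswith(prefix):
            return label
    return tag


def amalgamate(rows):
    """Sum (word, tag, count) rows into one frequency per (word, merged POS)."""
    totals = Counter()
    for word, tag, count in rows:
        totals[(word, merged_pos(tag))] += count
    return totals


# Invented sample rows.
rows = [("sheep", "NN1", 4), ("sheep", "NN2", 3), ("run", "VV0", 2), ("run", "VVD", 1), ("run", "NN1", 5)]
totals = amalgamate(rows)
print(totals.most_common())
```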
I also completed search option 3 (Choose a specific book to view the most frequent Thematic Headings). This allows the user to select a book from an autocomplete list and optionally provide a limit to the returned headings. The results display the thematic headings found in the book listed in order of frequency. The returned headings are displayed as links that perform a ‘search 2’ for the heading in the book, allowing you to more easily ‘drill down’ into the data. For all results I have added in a count column, so you can easily see how many results are returned or reference a specific result, and I also added titles to the search results pages that tell you exactly what it is you’ve searched for. I also created a list of all thematic headings, as I thought it might be handy to be able to see what’s what. When looking at this list you can perform a ‘search 1’ for any of the headings by clicking on one, and similarly, I created an option to list all of the books that form the dataset. This list displays each book’s ID, author, title, terms and number of pages, and you can perform a ‘search 3’ for a book by clicking on its ID.
On Friday I participated in the Linguistic DNA project conference call, following which I wrote a document describing the EEBO search facilities, as project members outside of Glasgow can’t currently access the site I’ve put together.
For REELS I continued to work on the public interface for the place-name data, which included the following:
- The number of returned place-names is now displayed in the ‘you searched for…’ box
- The textual list of results now features two buttons for each result, one to view the record and one to view the place-name on the map. I’m hoping the latter might be quite useful as I often find an interesting name in the textual list and wonder which dot on the map it actually corresponds to. Now with one click I can find it.
- Place-name labels on the map now appear when you zoom in past a certain level (currently set to zoom level 12). Note that only results rather than grey spots get the visible labels as otherwise there’s too much clutter and the map takes ages to load too.
- The record page now features a map with the place-name at the centre, and all other place-names as grey dots. The marker label is automatically visible.
- Returning back to the search results from a record when you’ve done a quick search now works – previously this was broken.
- The map zoom controls have been moved to the bottom right, and underneath them is a new icon for making the map ‘full screen’. Pressing on this will make the map take up the whole of your screen. Press ‘Esc’ or on the icon again to return to the regular view. Note that this feature requires a modern web browser, although I’ve just tested it in IE on Windows 10 and it works. Using full screen mode makes working with the map much more pleasant. Note, however, that navigating away from the map (e.g. if you click a ‘view record’ button) will return you to the regular view.
- There is a new ‘menu’ icon in the top-left of the map. Press on this and a menu slides out from the left. This presents you with options to change how the results are categorised on the map. In addition to the ‘by classification code’ option that has always been there, you can now categorise and colour code the markers by start date, altitude and element language. As with code, you can turn particular levels on and off using the legend in the top right, e.g. if you only want to display markers that have an altitude of 300m or more.
As Friday this week was Good Friday this was a four-day week for me. I’ll be on holiday all next week too, so I won’t be posting for a while. I focussed on two projects this week: REELS and Linguistic DNA. For REELS I continued to implement features for the front-end of the website, as I had defined in the specification document I wrote a few months ago. I spent about a day working on the Element Glossary feature. First of all I had to update the API in order to add in the queries required to bring back the place-name element data in a format that the glossary required. This included not just bringing back information about the elements (e.g. language, part of speech) but also adding in queries that brought back the number of available current place-names and historical forms that the element appears in. This was slightly tricky, but I managed to get the queries working in the end, and my API now spits out some nicely formatted JSON data for the elements that the front-end can use. With this in place I could create the front-end functionality. The element glossary functions as described in my specification document, displaying all available information about the element, including the number of place-names and historical forms it has been associated with. There’s an option to limit the list of elements by language and clicking on an entry in the glossary performs a search for the item, leading through to the map / textual list of place-names. I also embedded IDs in the list entries that allow the list to be loaded at a specific element, which will be useful for other parts of the site, such as the full place-name record.
The full place-name record page was the other major feature I implemented this week, and is really the final big piece of the front-end that needed to be implemented (but having said that there are still many other smaller pieces still to tackle). First of all I updated the API to add in an endpoint that allows you to pass a place-name ID and to return all of the data about the place-name as JSON or CSV data (I still need to update the CSV output to make it a bit more usable, though – currently all data is presented on one long row, with headings in the row above and having this vertically rather than horizontally arranged would make more sense). With the API endpoint in place I then created the page to display all of this data. This included adding in links to allow users to download the data as CSV or JSON, making searchable parts of the data links that lead through to the search results (e.g. parish, classification codes), adding in the place-name elements and links through to the glossary, and adding in all of the historical forms, together with their sources and elements. It’s coming along pretty well, but I still need to work a bit more on the layout (e.g. maybe moving the historical forms to another tab and adding in a map showing the location of the place-name).
For Linguistic DNA I continued to work on the EEBO thematic heading frequency data. Chris is going to set me up with access to a temporary server for my database and queries for this, but didn’t manage to make it available this week, so I continued to work on my own desktop PC. I added in the thematic heading metadata, to make the outputted spreadsheets easier to understand (i.e. instead of just displaying a thematic heading code (e.g. ‘AA’) the spreadsheet can include the full heading names too (e.g. ‘The World’)). I also noticed that we have some duplicate heading codes in the system, which was causing problems when I tried to use the concatenated codes as a primary key. I notified Fraser about this and we’ll have to fix this later. I also integrated all of the TCP metadata, and then stripped out all of the records for books that are not in our dataset, leaving about 25,000 book records. With this in place I will be able to join the records up to the frequency data in order to limit the queries, e.g. based on the year the books were published, or limiting to specific book titles.
I then created a search facility that lets a user query the full 103 million row dataset in order to bring back frequency data for specific thematic headings or years, or books. I created the search form (as you can see below), with certain fields such as thematic heading and book title being ‘autocomplete’ fields, bringing up a list of matching items as you type. You can also choose whether to focus on a specific thematic heading, or to include all lower levels in the hierarchy as well as the one you enter, so for example ‘AA’ will also bring back the data for AA:01, AA:02 etc.
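The ‘include all lower levels’ option is really just a prefix test on the heading code. Here is a Python sketch of the rule, with an invented function name and a handful of sample codes. The real search does this in the database query, not in application code like this.

```python
def heading_matches(heading: str, selected: str, include_lower: bool) -> bool:
    """True if `heading` is the selected heading or, optionally, one of its descendants."""
    if heading == selected:
        return True
    # Append the colon so that 'AA' matches 'AA:01' but not a sibling such as 'AB'.
    return include_lower and heading.startswith(selected + ":")


# Sample heading codes for illustration.
headings = ["AA", "AA:01", "AA:02", "AA:03:e", "AB", "AB:01"]

with_lower = [h for h in headings if heading_matches(h, "AA", include_lower=True)]
exact_only = [h for h in headings if heading_matches(h, "AA", include_lower=False)]
print(with_lower)
print(exact_only)
```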
With the form in place I set to work on the queries that will run when the form is submitted. At this stage I still wasn’t sure whether it would be feasible to run the queries in a browser or if they might take hours to execute. By the end of the week I had completed the first query option, and thankfully the query only took a few seconds to execute so it will be possible to make the query interface available to researchers to use themselves via their browser. It’s now possible to do things like find the top 20 most common words within a specific thematic heading for one decade and then compare these results with the output for another decade, which I think will be hugely useful. I still need to implement the other two search types as shown in the above screenshot, and get all of this working on a server rather than my own desktop PC, but it’s all looking rather promising.
With the strike action over (for now, at least) I returned to a full week of work, and managed to tackle a few items that had been pending for a while. I’d been asked to write a Technical Plan for an AHRC application for Faye Hammill in English Literature, but since then the changeover from four-page, highly structured Technical Plans to two-page more free-flowing Data Management Plans has taken place. This was a good opportunity to write an AHRC Data Management Plan, and after following the advice on the AHRC website (http://www.ahrc.ac.uk/documents/guides/data-management-plan/) and consulting the additional documentation on the DCC’s DMPonline tool (https://dmponline.dcc.ac.uk/) I managed to write a plan that covered all of the points. There are still some areas where I need further input from Faye, but we do at least have a first draft now.
I also created a project website for Anna McFarlane’s British Academy funded project. The website isn’t live yet, so I can’t include the URL here, but Anna is happy with how it looks, which is good. After sorting that out I then returned to the REELS project. I created the endpoints in the API that would allow the various browse facilities we had agreed upon to function, and then built these features in the front-end. It’s now possible to (for example) list all sources and see which has the most place-names associated with it, or bring up a list of all of the years in which historical forms were first attested.
I spent quite a bit of time this week working on the extraction of words and their thematic headings from EEBO for the Linguistic DNA project. Before the strike I’d managed to write a script that went through a single file and counted up all of the occurrences of words, parts of speech and associated thematic headings, but I was a little confused that there appeared to be thematic heading data in column 6 and also column 10 of the data files. Fraser looked into this and figured out that the most likely thematic heading appeared in column 10, while other possible ones appeared in column 6. This was a rather curious way to structure the data, but once I knew about it I could set my script to focus on column 10, as we’re only interested in the most likely thematic heading.
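The counting step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script that was actually written: only the use of column 10 for the most likely thematic heading comes from the description, while the positions of the word and part-of-speech columns (and whether columns are counted from one) are assumptions made for the example.

```python
from collections import Counter

# Hypothetical zero-based column positions. Only "column 10" for the most
# likely thematic heading comes from the text; the rest are assumed.
WORD_COL = 0
POS_COL = 2
HEADING_COL = 9  # column 10 when counting from one


def count_file(lines):
    """Count (word, part of speech, heading) combinations in one tagged file."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) <= HEADING_COL:
            continue  # skip short or malformed lines
        counts[(cols[WORD_COL], cols[POS_COL], cols[HEADING_COL])] += 1
    return counts
```

Passing an open file handle as `lines` processes one file; the alternative headings in column 6 are simply never read.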
I updated my script to insert data into a database rather than just hold things temporarily in an array, and I also wrapped the script in another function that then applied the processing to every file in a directory rather than just a single file. With this in place I set the script running on the entire EEBO directory. I was unsure whether running this on my desktop PC would be fast enough, but thankfully the entire dataset was processed in just a few hours.
My script finished processing all 14,590 files that I had copied from the J drive to my local PC, resulting in a whopping 70,882,064 rows entered into my database. Everything seemed to be going very well, but Fraser wasn’t sure I had all of the files, and he was correct. Having checked the J drive, there were 25,368 items, so when I had copied the files across the process must have silently failed at some point. And even more annoyingly it didn’t fail in an orderly manner. For example, the earliest file I have on my PC is A00018 while there are several earlier ones on the J drive.
I copied all of the files over again and decided that rather than dropping the database and starting from scratch I’d update my script to check to see whether a file had already been processed, meaning that only the missing 10,000 or so would be dealt with. However, in order to do this the script would need to query a 70 million row database for the ‘filename’ column, which didn’t have an index. I began the process of creating an index, but indexing 70 million rows took a long time – several hours, in fact. I almost gave up and inserted all the data again from scratch, but the thing is I knew I would need this index in order to query the data anyway, so I decided to persevere. Thankfully the index finally finished building and I could then run my script to insert the missing 10,000 files, a process that took a bit longer as the script now had to query the database and also update the index as well as insert the data. But finally all 25,368 files were processed, resulting in 103,926,008 rows in my database.
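The 'skip files that are already in the database' check might look something like the sketch below. It uses Python with SQLite purely for illustration, rather than the PHP and MySQL used for the real script, and the table name, column names and index name are all hypothetical (only the 'filename' column is mentioned above).

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and make sure the table and filename index exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "filename TEXT, word TEXT, pos TEXT, heading TEXT, freq INTEGER)"
    )
    # Without this index, every 'already processed?' check scans the table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_word_counts_filename "
        "ON word_counts (filename)"
    )
    return conn


def already_processed(conn, filename):
    """Return True if any row exists for this file."""
    row = conn.execute(
        "SELECT 1 FROM word_counts WHERE filename = ? LIMIT 1", (filename,)
    ).fetchone()
    return row is not None


def insert_file(conn, filename, counts):
    """Insert one file's counts unless present. Returns True if inserted."""
    if already_processed(conn, filename):
        return False
    with conn:  # commit all of the file's rows as a single transaction
        conn.executemany(
            "INSERT INTO word_counts VALUES (?, ?, ?, ?, ?)",
            [(filename, w, p, h, n) for (w, p, h), n in counts.items()],
        )
    return True
```

Here `counts` is a dictionary keyed by (word, part of speech, heading). Inserting each file in one transaction means an interrupted run should not leave a half-inserted file that would later be mistaken for a finished one.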
The script and the data are currently located on my desktop PC, but if Fraser and Marc want to query it I’ll need to get this migrated to a web server of some sort, so I contacted Chris about this. Chris said he’d sort a temporary solution out for me, which is great. I then set to work writing another script that would extract summary information for the thematic headings and insert this into another table. After running the script this table now contains a total count of each word / part of speech / thematic heading across the entire EEBO collection. Where a lemma appears with multiple parts of speech these are treated as separate entities and are not added together. For example, ‘AA Creation NN1’ has a total count of 4609 while ‘AA Creation NN2’ has a total count of 19, and these are separate rows in the table.
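The summary step amounts to a grouped sum over the per-file rows. A minimal sketch, again in Python with SQLite rather than the real PHP and MySQL setup, and with hypothetical table and column names:

```python
import sqlite3


def build_summary(conn):
    """Rebuild a table of collection-wide totals, one row per combination."""
    conn.execute("DROP TABLE IF EXISTS heading_totals")
    # Grouping by pos as well keeps e.g. NN1 and NN2 as separate rows.
    conn.execute(
        "CREATE TABLE heading_totals AS "
        "SELECT word, heading, pos, SUM(freq) AS total "
        "FROM word_counts "
        "GROUP BY word, heading, pos"
    )
    conn.commit()
```

Because the part of speech is part of the grouping key, a word that appears with two parts of speech under the same heading produces two rows and its counts are never added together.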
Whilst working with the data I noticed that a significant amount of it is unusable. Of the almost 104 million rows of data, over 20 million have been given the heading ’04:10’ and a lot of these are words that probably could have been cleaned up before the data was fed into the tagger. A lot of these are mis-classified words that have an asterisk or a dash at the start. If the asterisk / dash had been removed then the word could have been successfully tagged. E.g. there are 88 occurrences of ‘*and’ that have been given the heading ’04:10’ and part of speech ‘FO’. Basically about a fifth of the dataset has an unusable thematic heading, and much of this is data that could have been useful if the data had been pre-processed a little more thoroughly.
Anyway, after tallying up the frequencies across all texts I then wrote a script to query this table and extract a ‘top 10’ list of lemma / pos combinations for each of the 3,972 headings that are used. The output has one row per heading and a column for each of the top 10 (or fewer if there are fewer than 10). This currently has the lemma, then the pos in brackets and the total frequency across all 25,000 texts after a bar, as follows: christ (NP1) | 1117625. I’ve sent this to Fraser and once he gets back to me I’ll proceed further.
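The kind of pre-processing that would have rescued words like ‘*and’ is very simple. The following is only a sketch of what such a clean-up step could look like in Python; it was not part of the actual pipeline, and the function name is made up for the example.

```python
def clean_token(token: str) -> str:
    """Remove leading asterisks and dashes, e.g. '*and' becomes 'and'."""
    stripped = token.lstrip("*-")
    # Keep the original if stripping would leave nothing (e.g. a bare '-').
    return stripped or token
```

Only leading characters are removed, so hyphenated words such as ‘well-known’ pass through unchanged.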
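The ‘top 10 per heading’ output could be produced along these lines. This is a hedged Python sketch rather than the script that was actually used; the input is assumed to be rows of (heading, lemma, pos, total) from the summary table, and the function name is hypothetical.

```python
from collections import defaultdict


def top_n_per_heading(rows, n=10):
    """Map each heading to its n most frequent 'lemma (pos) | total' strings."""
    by_heading = defaultdict(list)
    for heading, lemma, pos, total in rows:
        by_heading[heading].append((total, lemma, pos))
    result = {}
    for heading, items in by_heading.items():
        # Highest total first; ties broken alphabetically for stable output.
        items.sort(key=lambda item: (-item[0], item[1], item[2]))
        result[heading] = [
            f"{lemma} ({pos}) | {total}" for total, lemma, pos in items[:n]
        ]
    return result
```

Headings with fewer than `n` combinations simply get a shorter list, which matches the ‘fewer than 10’ case in the output described above.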
In addition to the above big tasks, I also dealt with a number of smaller issues. Thomas Widmann of SLD had asked me to get some DSL data from the API for him, so I sent that on to him. I updated the ‘favicon’ for the SPADE website, I fixed a couple of issues for the Medical Humanities Network website, and I dealt with a couple of issues with legacy websites: For SWAP I deleted the input forms as these were sending spam to Carole. I also fixed an encoding issue with the Emblems websites that had crept in when the sites had been moved to a new server.
I also heard this week that IT Services are going to move all project websites to HTTPS from HTTP. This is really good news as Google has started to rank plain HTTP sites lower than HTTPS sites, plus Firefox and Chrome give users warnings about HTTP websites. Chris wanted to try migrating one of my sites to HTTPS and we did this for the Scots Corpus. There were some initial problems with the certificate not working for the ‘www’ subdomain but Chris quickly fixed this and everything appeared to be working fine. Unfortunately, although everything was fine within the University network, the University’s firewall was blocking HTTPS requests from external users, meaning no-one outside of the University network could access the site. Thankfully someone contacted Wendy about this and Chris managed to get the firewall updated.
I also did a couple of tasks for the SCOSYA project, and spoke to Gary about the development of the front-end, which I think is going to need to start soon. Gary is going to try and set up a meeting with Jennifer about this next week. On Friday afternoon I attended a workshop about digital editions that Sheila Dickson in German had organised. There were talks about the Cullen project, the Curious Travellers project, and Sheila’s Magazin zur Erfahrungsseelenkunde project. It was really interesting to hear about these projects and their approaches to managing transcriptions.
This was the third week of the strike action and I therefore only worked on Friday. I started the day making a couple of further tweaks to the ‘Storymap’ for the RNSN project. I’d inadvertently uploaded the wrong version of the data just before I left work last week, which meant the embedded audio players weren’t displaying, so I fixed that. I also added a new element language to the REELS database and added the new logo to the SPADE project website (see http://spade.glasgow.ac.uk/).
With these small tasks out of the way I spent the rest of the day on Historical Thesaurus and Linguistic DNA duties. For the HT I had previously created a ‘fixed’ header that appears at the top of the page if you start scrolling down, so you can always see what it is you’re looking at, and also quickly jump to other parts of the hierarchy. You can also click on a subcategory to select it, which adds the subcategory ID to the URL, allowing you to quickly bookmark or cite a specific subcategory. I made this live today, and you can test it out here: http://historicalthesaurus.arts.gla.ac.uk/category/#id=157035. I also fixed a layout bug that was making the quick search box appear in less than ideal places on certain screen widths and I also updated the display of the category and tree on narrow screens: Now the tree is displayed beneath the category information and a ‘jump to hierarchy’ button appears. This in combination with the ‘top’ button makes navigation much easier on narrow screens.
I then started looking at the tagged EEBO data. This is a massive dataset (about 50GB of text files) that contains each word in a subset of EEBO that has been semantically tagged. I need to extract frequency data from this dataset – i.e. how many times each tag appears both in each text and overall. I have initially started to tackle this using PHP and MySQL as these are the tools I know best. I’ll see how feasible it is to use such an approach and if it’s going to take too long to process the whole dataset I’ll investigate using parallel computing and shell scripts, as I did for the Hansard data. I got a test script working that went through one of the files in about a second, which is encouraging. I did encounter a bit of a problem processing the lines, though. Each line is tab delimited and rather annoyingly, PHP’s fgetcsv function doesn’t treat ‘empty’ tabs as separate columns. This was giving me really weird results, because if a row had any empty tabs the data I was expecting to appear in particular columns wasn’t there. Instead I had to use the ‘explode’ function on each line, splitting it up by the tab character (\t), and this thankfully worked. I still need confirmation from Fraser that I’m extracting the right columns, as strangely there appear to be thematic heading codes in multiple columns. Once I have confirmation I’ll be able to set the script running on the whole dataset (once I’ve incorporated the queries for inserting the frequency data into the database I’ve created).
This week was the second week of the UCU strike action, meaning I only worked on Thursday and Friday. Things were further complicated by the heavy snow, meaning the University was officially closed on Wednesday to Friday. However, I usually work from home on Thursdays anyway, so just worked as I would normally do. And on Friday I travelled into work without too much difficulty in order to participate in some meetings that had been scheduled.
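The same ‘split each line on the tab character’ idea looks like this in Python, shown here only as an illustration of the technique (the real script used PHP’s ‘explode’). The function name is made up for the example.

```python
def split_tagged_line(line: str) -> list[str]:
    """Split one tab-delimited line into columns, keeping empty columns."""
    # Strip only the line ending, so trailing empty columns are preserved.
    return line.rstrip("\r\n").split("\t")


row = split_tagged_line("word\t\tNN1\t\t04:10\n")
print(row)  # ['word', '', 'NN1', '', '04:10']
```

Because a plain split keeps every empty field, a value that belongs in a given column always ends up at the same index, however many empty columns come before it.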
I spent most of Thursday working on the REELS project, making tweaks to the database and content management system and working on the front end. I updated the ‘parts of speech’ list that’s used for elements, adding in ‘definite article’ and ‘preposition’, and also added in the full text in addition to the abbreviations to avoid any confusion. Last week I added ‘unknown’ to the elements database, with ‘na’ for the language. Carole pointed out that ‘na’ was appearing as the language when ‘unknown’ was selected, which it really shouldn’t do, so I updated the CMS and the front-end to ensure that this is hidden. I also wrote a blog post about the technical development of the front end. It’s not gone live yet but once it has I’ll link through to it. I also updated the quick search so that it only searches current place-names, elements and grid references, and I’ve fixed the ‘altitude’ field in the advanced search so that you can enter more than 4 characters into it.
In addition to this I spent some of the day catching up with emails and I also gave Megan Coyer detailed instructions on how to use Google Docs to perform OCR on an image based PDF file. This is a pretty handy trick to know and it works very well, even on older printed documents (so long as the print quality is pretty good). Here’s how you go about it:
You need to go to Google Drive (https://drive.google.com) then drag and drop the PDF into there, which keeps it as a PDF. Then right click on the thumbnail of the PDF and select ‘Open With…’ and then select Google Docs and it converts it into text (a process which can take a while depending on the size of your PDF). You can then save the file, download it as a Word file etc.
After trudging through the snow on Friday morning I managed to get into my office for 9am, and worked through until 5 without a lunch break as I had so much to try and do. At 10:30 I had a meeting with Jane Stuart-Smith and Eleanor Lawson about revamping the Seeing Speech website. I spent about an hour before this meeting going through the website and writing down a list of initial things I’d like to improve, and during our very useful two-hour meeting we went through this list, and discussed some other issues as well. It was all very helpful and I think we all have a good idea of how to proceed with the developments. Jane is going to try and apply for some funding to do the work, so it’s not something that will be tackled straight away, but I should be able to make good progress with it once I get the go-ahead.
I went straight from this meeting to another one with Marc and Fraser about updates to the Historical Thesaurus and work on the Linguistic DNA project. This was another useful and long meeting, lasting at least another two hours. I can’t really go into much detail about what was discussed here, but I have a clearer idea now of what needs to be done for LDNA in order to get frequency data from the EEBO texts, and we have a bit of a roadmap for future Historical Thesaurus updates, which is good.
After these meetings I spent the rest of the day working on an updated ‘Storymap’ for Kirsteen’s RNSN project. This involved stitching together four images of sheet music to use as a ‘map’ for the story, updating the position of all of the ‘pins’ so they appeared in the right places, updating the images used in the pop-ups, embedding some MP3 files in the pop-ups and other such things. Previously I was using the ‘make a storymap’ tools found here: https://storymap.knightlab.com/ which meant all our data was stored on a Google server and referenced files on the Knightlab servers. This isn’t ideal for longevity, as if anything changes either at Google or Knightlab then our feature breaks. Also, I wanted to be able to tweak the code and the data. For these reasons I instead downloaded the source code and added it to our server, and grabbed the JSON datafile generated by the ‘make a’ tool and added this to our server too. This allowed me to update the JSON file to make an HTML5 Audio player work in the pop-ups and it will hopefully allow me to update the code to make images in the pop-ups clickable too.
I returned to a full week of work this week, after the horribleness of last week’s flu. I was still feeling pretty exhausted by the end of each working day, but managed to make it through the week. I’d had several meetings scheduled for last week, and rescheduled them all for this week. On Monday I met with Kirsteen and Brianna to discuss the website for the Romantic National Song Network. I’d been working on some interactive stories, one timeline based, the other map based, and we talked about how we were going to proceed with these. The team seem pretty happy with how things are developing, and the next step will be to take the proof of concept that I created and add in some higher resolution images, more content and try to make the overall interface a bit larger so as to enable embedded images to be viewed more clearly. On Tuesday I met with Faye Hammill in English Literature to discuss a project she is currently putting together with colleagues at the University of Birmingham. I have agreed to write a technical plan for this project, although it’s still not clear exactly when the AHRC are going to replace the technical plans with their new data management plans. We also discussed a couple of older project websites she has that are currently based at Strathclyde University but will need to be migrated to Glasgow. The sites are currently ASP based and we’d need to migrate them to something else as we don’t support ASP here.
On Wednesday I had three meetings. The first was with Honor Riley of The People’s Voice project. This project launched on Thursday so we met to discuss making all of the online resources live. This included the database of poems I had been developing, which can now be viewed here: http://thepeoplesvoice.glasgow.ac.uk/poems/. I spent some time on Wednesday making final tweaks to the site, and attended the project’s launch event, which lasted all day Thursday. It was an interesting event, held at the Trades Hall in the Merchant City. It was great to see the online resources being launched at the event, and it was also good to learn more about the subject and hear the various talks that took place. The event concluded with a performance of some of the political songs by Bill Adair, which was excellent.
The remaining meetings I had on Wednesday were with Matthew Creasey and Megan Coyer. Matthew has an AHRC leadership fellowship starting up soon and I’m going to help him put a project website together. This will include an online resource, a sort of mini digital edition of some poems. Megan wanted to discuss some potential research tools that might be used to help her in her studying of British periodicals, specifically tools that might help with annotation and note taking. We discussed a few options and considered how a new tool might be developed, if there was a gap in the market. Developing such a tool would not be something I’d be able to manage myself, though. My final meeting of the week was with Stuart Gillespie on Friday. I’d put together a website that will accompany Stuart’s recent publication, and we met to discuss some final tweaks to the website and to discuss how to handle updates to it in future years. The website is now currently available here: http://www.nrect.glasgow.ac.uk/
Other than attending these various meetings and the launch event on Thursday, I managed to squeeze in some other work too. I had an email conversation with Thomas Widmann of the DSL about the API that was developed for the DSL website, and I also helped Ann Ferguson to get access to the WordPress version of the DSL website that I created in November last year. I also spent a bit of time updating all of the WordPress sites I manage, as yet another new version of WordPress had been released (the third so far this year, which is a bit ridiculous). I’d had an email from someone at the Mitchell Library to say that some of the images on TheGlasgowStory weren’t working so I spent a small amount of time investigating and fixing this. It turned out that some images had upper case extensions (JPG instead of jpg) and as Linux is case sensitive the images weren’t getting found. I also had an email chat with Fraser about some of the outstanding work that needs to be done for the Linguistic DNA project. We’re going to meet with Marc in the next few weeks to discuss this further. Fraser also gave me access to the tagged EEBO resource, from which I will need to extract some frequency data.
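For the upper case extension problem, a quick way to spot affected files on a case-sensitive file system is to list everything whose extension is not all lower case. This is just a small Python sketch of one way to find such files, not a description of how the fix was actually made, and the function name is hypothetical.

```python
from pathlib import Path


def find_uppercase_extensions(folder):
    """Return files under folder whose extension is not all lower case."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix != p.suffix.lower()
    )
```

The resulting list can then be checked against the file names the site actually references before anything is renamed.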
I spent the remainder of the week working on the front end and API for the REELS project. I managed to complete several new endpoints for accessing the data in the API. The most important of these was the advanced search endpoint, which allows any combination of up to 16 different fields to be submitted in order to return results as JSON or CSV data. I also created other endpoints that will be used for autocomplete features and lists of things, such as sources, parishes, classification codes and elements. With all of this in place I could start working on the actual advanced search form in the front end, and although I haven’t managed to complete this yet I am making pretty good progress with it. Hopefully I’ll have this completed before the REELS team meeting on Tuesday next week.
On Monday this week I attended the Corpus Linguistics in Scotland event, which took place in the STELLA lab on the ground floor of the building I work in, which was very handy. It was a useful event to attend, as the keynote talk was about the SPADE project, which I’m involved with, and it was very helpful to listen to an overview of the project and also to get a glimpse at some of the interesting research that is already going on in the project. The rest of the day was split into two short paper sessions, one about corpus linguistics and the arts and humanities and the other about medical humanities. There were some interesting talks in both sessions and it was great to hear a little about some of the research that’s going on at the moment. It was also good to speak to some of the other attendees at the event, including Rhona Alcorn from SLD, Joanna Copaczyk, who I’m currently helping out with a research proposal, and Stevie Barrett from DASG.
I spent a lot of the rest of the week involved in research proposals. I’d been given another AHRC technical review to do, and this one was a particularly tricky one to get right, which took rather a lot of time. I also started working on a first draft of a Technical Plan for Joanna’s project. I read through all of the materials she’d previously sent me and spent some time thinking through some of the technical implications. I started to write the plan and completed the first couple of sections, by which point I had another series of questions to fire off to Joanna. I also spoke to Graeme about some TEI XML and OCR issues, wondering if he had encountered a similar sort of workflow in a previous project. Graeme’s advice was very helpful, as it usually is. I hope to get a first draft of the plan completed early next week. On Friday I had an email from the AHRC to say that my time as a technical reviewer would end at the end of the month. I have apparently been a technical reviewer for three years now. They also stated that from February next year the AHRC will be dropping Technical Plans and the technical review process. Instead they will just have a data management plan and will integrate the reviewing of technical details within the more general review process. I’m in two minds about this. On the one hand it is clear that the AHRC just don’t have enough technical reviewers which makes having such a distinct focus on the technical aspects of reviews difficult to sustain. But on the other hand, I worry that Arts and Humanities reviewers who are experts in a particular research area may lack the technical knowledge to ascertain whether a research proposal is at all technically feasible, which will almost certainly result in projects getting funded that are simply not viable. It’s going to be interesting to see how this all works out, and also to see how the new data management plans will be structured.
On Friday afternoon I attended a Skype call for the Linguistic DNA project, along with Marc and Fraser. It was good to hear a bit more about how the project is progressing, and to be taken through a presentation about some of the project’s research that the team in Sheffield are putting together. I’m afraid I didn’t have anything to add to the proceedings, though, as I haven’t done any work for the project since the last Skype meeting. There doesn’t really seem to be anything anyone wants me to do for the project at this stage.
Also this week I had a phone conversation with Pauline Mackay about collaborative tools that might be useful for the Burns project to use. I suggested Basecamp as a possible tool, as I know this has been used successfully by others in the School, for example Jane Stuart-Smith has used it to keep track of several projects. I also had a chat with Luca about some work he’s been doing on topic modelling and sentiment analysis, which all sounds really interesting. Hopefully I can arrange another meeting of Arts developers in the new year and Luca can tell us all a bit more about this. I also spent a bit of time updating the Romantic National Song Network website for Kirsteen McCue, and I read through an abstract for a paper Fraser is submitting that will involve the sparklines we created for the Historical Thesaurus.
It was another week of working on fairly small tasks for lots of different projects. I helped Gerry McKeever to put the finishing touches to his new project website, and this has now gone live and can be accessed here: http://regionalromanticism.glasgow.ac.uk/. I also spent some further time making updates to the Burns Paper Database website for Ronnie Young. This included adding in a site menu to facilitate navigation, adding a subheader to the banner, creating new pages for ‘about’ and ‘contact’, adding some new content, making repositories appear with their full names rather than acronyms, updating the layout of the record page and tweaking how the image pop-up works. It’s all pretty much done and dusted now, although I can’t share the URL as the site is password protected due to the manuscript images being under copyright restrictions.
I spent about a day this week on AHRC review duties and also spent some time working on the new interface for Kirsteen McCue’s ‘Romantic National Song Network’ project website. This took up a fair amount of time as I had to try out a few different designs, work with lots of potential images, set up a carousel, and experiment with fonts for the site header. I’m pretty pleased with how things are looking now, although there are four different font styles that we still need to choose one from.
I had a couple of conference calls and a meeting with Marc and Fraser about the Linguistic DNA project. I met with Marc and Fraser first, in order to go over the work Fraser is currently doing and how my involvement in the project might proceed. Fraser and I then had a Skype call with Iona and Seth in Sheffield about the work the researchers are currently doing and some of the issues they are encountering when dealing with the massive dataset they’re working with. After the call Fraser sent me a sample of the data, which really helped me to understand some of the technical issues that are cropping up. On Friday afternoon the whole project had a Skype call. This included the DHI people in Sheffield and it was useful to hear something about the technical work they are currently doing.
I had a couple of other meetings this week too. On Wednesday morning I had a meeting with Jennifer Smith about a new pilot project she’s putting together in order to record Scots usage in schools. We talked through a variety of technical solutions and I was able to give some advice on how the project might be managed from a technical point of view. On Wednesday afternoon I had a meeting for The People’s Voice project, at which I met with the new project RA, who has taken over from Michael Shaw as he’s now moved to a different institution. I helped the new RA get up to speed with the database and how to update the front-end.
Also this week I had an email conversation with the SPADE people about how we will set up a server for the project’s infrastructure at Glasgow. I’m going to be working on this the week after next. I also made a few further updates to the DSL website and had a chat with Thomas Widmann about a potential reworking of some of the SLD’s websites.
There’s not a huge amount more to say about the work I did this week. I was feeling rather unwell all week and it was a bit of a struggle getting through some days during the middle of the week, but I made it through to the end. I’m on holiday all of next week so there won’t be an update from me until the week after.