Week Beginning 26th March 2018

As Friday this week was Good Friday, this was a four-day week for me.  I'll be on holiday all next week too, so I won't be posting for a while.  I focussed on two projects this week: REELS and Linguistic DNA.

For REELS I continued to implement features for the front-end of the website, as defined in the specification document I wrote a few months ago.  I spent about a day working on the Element Glossary feature.  First of all I had to update the API in order to add the queries required to bring back the place-name element data in the format the glossary requires.  This meant not just returning information about each element (e.g. language, part of speech) but also adding queries that return the number of current place-names and historical forms in which each element appears.  This was slightly tricky, but I got the queries working in the end, and the API now spits out nicely formatted JSON data for the elements that the front-end can use.  With this in place I could create the front-end functionality.  The element glossary functions as described in my specification document, displaying all available information about each element, including the number of place-names and historical forms it has been associated with.  There's an option to limit the list of elements by language, and clicking on an entry in the glossary performs a search for the item, leading through to the map / textual list of place-names.  I also embedded IDs in the list entries that allow the list to be loaded at a specific element, which will be useful for other parts of the site, such as the full place-name record.
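
To give a flavour of the sort of data involved, here's a very rough Python sketch of the kind of element record the glossary consumes, the limit-by-language option and the embedded IDs.  The field names and figures are hypothetical rather than the project's actual schema.

```python
# Hypothetical element records of the sort the glossary API might return;
# field names and counts are made up purely for illustration.
elements = [
    {"id": 12, "element": "baile", "language": "Gaelic", "part_of_speech": "noun",
     "placename_count": 14, "historical_form_count": 37},
    {"id": 27, "element": "kirk", "language": "Scots", "part_of_speech": "noun",
     "placename_count": 9, "historical_form_count": 21},
]

def glossary(records, language=None):
    """Return glossary entries, optionally limited to a single language."""
    if language is None:
        return records
    return [r for r in records if r["language"] == language]

for entry in glossary(elements, language="Gaelic"):
    # Embedding the ID in each list entry means the page can be loaded at a
    # specific element, e.g. via an anchor such as #element12.
    print(f'#element{entry["id"]}: {entry["element"]} ({entry["language"]}), '
          f'{entry["placename_count"]} place-names, '
          f'{entry["historical_form_count"]} historical forms')
```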

The full place-name record page was the other major feature I implemented this week, and it's really the final big piece of the front-end that needed to be implemented (having said that, there are still many smaller pieces to tackle).  First of all I updated the API to add an endpoint that allows you to pass a place-name ID and get back all of the data about the place-name as JSON or CSV.  I still need to update the CSV output to make it a bit more usable, though: currently all of the data is presented on one long row, with the headings in the row above, and arranging this vertically rather than horizontally would make more sense.  With the API endpoint in place I then created the page to display all of this data.  This included adding links to allow users to download the data as CSV or JSON, making searchable parts of the data (e.g. parish, classification codes) links that lead through to the search results, adding in the place-name elements with links through to the glossary, and adding in all of the historical forms, together with their sources and elements.  It's coming along pretty well, but I still need to work a bit more on the layout (e.g. maybe moving the historical forms to another tab and adding a map showing the location of the place-name).
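
The CSV reshaping mentioned above is fairly straightforward; here's a minimal Python sketch of turning a single-record 'one long row' export into vertical field / value pairs.  The field names and values are invented for illustration.

```python
import csv
import io

# A single-record export in the current 'one long row' shape: headings in
# the first row, values in the second.  Fields here are invented.
horizontal = "name,parish,classification,grid_ref\nCarfrae,Lauder,S,NT505555\n"

reader = csv.reader(io.StringIO(horizontal))
headers, values = next(reader), next(reader)

# Transpose into field,value pairs so the record reads down the page.
vertical = io.StringIO()
writer = csv.writer(vertical, lineterminator="\n")
for field, value in zip(headers, values):
    writer.writerow([field, value])

print(vertical.getvalue())
# name,Carfrae
# parish,Lauder
# classification,S
# grid_ref,NT505555
```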

For Linguistic DNA I continued to work on the EEBO thematic heading frequency data.  Chris is going to set me up with access to a temporary server for my database and queries, but didn't manage to make it available this week, so I continued to work on my own desktop PC.  I added in the thematic heading metadata to make the outputted spreadsheets easier to understand (i.e. instead of just displaying a thematic heading code such as 'AA', the spreadsheet can include the full heading name too, e.g. 'The World').  I also noticed that we have some duplicate heading codes in the system, which was causing problems when I tried to use the concatenated codes as a primary key.  I notified Fraser about this and we'll have to fix it later.  I also integrated all of the TCP metadata, and then stripped out the records for books that are not in our dataset, leaving about 25,000 book records.  With this in place I will be able to join the book records to the frequency data in order to limit the queries, e.g. to books published in a particular year or with specific titles.
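
As a sketch of the kind of join this makes possible, limiting the frequency data by publication year might look something like the snippet below.  The table and column names are hypothetical, and I've used SQLite here purely for a self-contained example rather than whatever the eventual server will run.

```python
import sqlite3

# Tiny stand-in tables: real TCP metadata and frequency data are far larger.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE books (book_id TEXT PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE frequencies (book_id TEXT, heading_code TEXT, word TEXT, freq INTEGER);
INSERT INTO books VALUES ('A00001', 'An example treatise', 1605),
                         ('A00002', 'Another example', 1652);
INSERT INTO frequencies VALUES ('A00001', 'AA', 'world', 12),
                               ('A00002', 'AA', 'earth', 7);
""")

# Join the book metadata to the frequency rows and keep only books
# published within a chosen range of years.
rows = con.execute("""
    SELECT b.title, b.year, f.heading_code, f.word, f.freq
    FROM frequencies f
    JOIN books b ON b.book_id = f.book_id
    WHERE b.year BETWEEN 1600 AND 1609
""").fetchall()
print(rows)  # [('An example treatise', 1605, 'AA', 'world', 12)]
```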

I then created a search facility that lets a user query the full 103 million row dataset in order to bring back frequency data for specific thematic headings, years or books.  I created the search form (as you can see below), with certain fields such as thematic heading and book title being 'autocomplete' fields, bringing up a list of matching items as you type.  You can also choose whether to focus on a specific thematic heading or to include all lower levels in the hierarchy as well as the one you enter, so for example 'AA' will also bring back the data for AA:01, AA:02 and so on.
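
The 'include lower levels' option essentially treats the entered code as a prefix.  Here's a small Python sketch of how that matching behaves, with made-up heading codes:

```python
# Made-up heading codes purely to illustrate the prefix behaviour.
headings = ["AA", "AA:01", "AA:02", "AA:02:aa", "AB", "AB:01"]

def matching_headings(query, include_lower_levels=True):
    """Return the heading itself, plus all of its descendants if requested."""
    if include_lower_levels:
        return [h for h in headings if h == query or h.startswith(query + ":")]
    return [h for h in headings if h == query]

print(matching_headings("AA"))         # ['AA', 'AA:01', 'AA:02', 'AA:02:aa']
print(matching_headings("AA", False))  # ['AA']
```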

With the form in place I set to work on the queries that run when the form is submitted.  At this stage I still wasn't sure whether it would be feasible to run the queries in a browser or whether they might take hours to execute.  By the end of the week I had completed the first query option, and thankfully it only took a few seconds to execute, so it will be possible to make the query interface available for researchers to use themselves via their browser.  It's now possible to do things like finding the 20 most common words within a specific thematic heading for one decade and then comparing these results with the output for another decade, which I think will be hugely useful.  I still need to implement the other two search types shown in the above screenshot, and to get all of this working on a server rather than my own desktop PC, but it's all looking rather promising.
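
For what it's worth, the 'most common words in a heading for a decade' query boils down to a grouped count with a limit.  A rough sketch, reusing the hypothetical books / frequencies tables from the earlier snippet (so not the actual schema):

```python
# Rough sketch of the 'most common words for a heading within a decade'
# query, using the same hypothetical books / frequencies tables as above.
TOP_WORDS_SQL = """
    SELECT f.word, SUM(f.freq) AS total
    FROM frequencies f
    JOIN books b ON b.book_id = f.book_id
    WHERE (f.heading_code = :heading OR f.heading_code LIKE :heading || ':%')
      AND b.year BETWEEN :decade_start AND :decade_start + 9
    GROUP BY f.word
    ORDER BY total DESC
    LIMIT 20
"""

# e.g. con.execute(TOP_WORDS_SQL, {"heading": "AA", "decade_start": 1600})
# run once per decade, with the two result sets then compared side by side.
```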