Week Beginning 16th February 2015

I spent roughly half this week working on the Scots Thesaurus project. I’d previously developed a version of a tool that queried the DSL XML files stored within a BaseX database and Susan wanted the facility to be able to select DSL entries and automatically associate them with thesaurus categories. This wasn’t really possible because the DSL XML data I was working with didn’t include unique identifiers for each dictionary entry. At my meeting with the SND people a couple of weeks ago Peter Bell suggested that I use the API that powers the new DSL website in order to get the data, as the data stored within this is properly separated out with unique identifiers. The API allows the same search that the tool was previously carrying out on the XML files (searching the full text of each entry without the citations) so I decided to rebuild the tool to use the API data rather than the older XML data. This has really helped to get a much more useful and usable tool for managing categories and words. The new tool I created this week allows staff to do the following:

  1. Browse the categories that have already been created

I’ve taken a lot of the code that powers the HTE for this. Users can browse up and down the hierarchy, view subcategories, view words associated with categories and follow links to the DSL entry for each word. Users can also jump to a specific category by entering its number and selecting the part of speech. Users can also select a category in order to add words to it (see below).

  2. Add a new category

Using this facility, staff users can add new main categories and subcategories to the database.

  3. Add words to a category

I’ve updated the ‘search for words’ facility that I’d previously created. If a staff user has selected a category to add words to from the ‘browse category’ page and then visits the ‘add words’ page s/he can search the HTE for words and categories, view and edit the search terms that were created and then search for these terms in the DSL. The search queries the DSL API and returns results split into separate lists for DOST and SND. The user can follow links from each search result to the entries on the DSL website. The user can also tick a checkbox beside one or more results and then press the ‘add selected words to category’ button at the bottom of the page to automatically associate the selected words with the category that was chosen. These words are then stored in the database. If a category is selected and the user performs a word search that returns any words that are already associated with the category, these are listed with their checkboxes already ticked. If any of these are unticked and the user presses the ‘add selected words…’ button, the association between the word and the category will be removed.
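The tick / untick behaviour described above boils down to comparing three sets of entry IDs. The following is only a rough Python sketch of that logic, not the code the tool actually uses; the function name and the example IDs are made up for illustration.

```python
def plan_category_changes(already_linked, shown_in_results, ticked):
    """Work out which entry IDs to link to, or unlink from, the selected category.

    already_linked   -- IDs currently associated with the category
    shown_in_results -- IDs listed on the current search results page
    ticked           -- IDs whose checkbox was ticked when the form was submitted
    (All three names are hypothetical; they are not taken from the real tool.)
    """
    already_linked = set(already_linked)
    shown = set(shown_in_results)
    ticked = set(ticked) & shown  # ignore anything that was not on the page

    # Newly ticked results become new associations.
    to_add = ticked - already_linked
    # Results that were pre-ticked but are now unticked lose their association.
    to_remove = (already_linked & shown) - ticked
    return sorted(to_add), sorted(to_remove)


to_add, to_remove = plan_category_changes(
    already_linked=["snd1", "dost7"],
    shown_in_results=["snd1", "snd2", "dost7"],
    ticked=["snd2", "dost7"],
)
print(to_add, to_remove)
```

Restricting removals to the IDs that were actually shown means that words linked to the category but absent from the current search are left alone.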

I think these facilities should allow words and categories for the project to be managed pretty effectively. I’ll need to hear back from Susan before I do anything further with the tool.

Other than Friday, the rest of my week was spent on Mapping Metaphor duties. I spent a bit of time going through the feedback from the two testing sessions that Ellen had arranged last week and noting any issues that had been raised that were not already in my ‘bug list’ document. There were only a few of these issues, which is relatively encouraging. The rest of my time was spent developing a content management system for the project. I created a table to hold staff user accounts, created the logging in and out logic (taken from previous projects so not particularly tricky to implement) and created the facilities that Ellen and Wendy had requested. The staff interface presents the staff user with a view of the metaphor categories pretty much identical to the general ‘browse’ interface. From this the user can select a category and this brings up a page listing the category name, its descriptors and a table containing all of the connections this category has to other categories. From this page the user can access a form where the category name and descriptors can be updated.

There is also another form where a new metaphorical connection involving the category can be created. This presents the selected category as ‘Category 1’ and has a drop-down list containing all of the categories, enabling the connection to be established. The user can also specify direction and strength and also supply a ‘first lexeme’ and other lexemes associated with the connection. These lexemes have to be supplied as their unique identifiers from the historical thesaurus database. Upon creation the system creates the new connection and generates a ‘first date’ for the connection based on the ‘first lexeme’ ID that was supplied.
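As a rough illustration of the ‘first date’ step, here is a minimal Python sketch. It assumes the historical thesaurus lookup can be represented as a simple mapping from lexeme ID to first attested date; the function name, the field names, the direction value and the IDs are all hypothetical rather than taken from the real system.

```python
def create_connection(cat1, cat2, direction, strength,
                      first_lexeme_id, other_lexeme_ids, ht_first_dates):
    """Build a new connection record, deriving its first date from the first lexeme.

    ht_first_dates is a stand-in for a query against the historical thesaurus
    database: a dict mapping lexeme ID -> first attested date.
    """
    if first_lexeme_id not in ht_first_dates:
        raise ValueError(f"Unknown first lexeme ID: {first_lexeme_id}")
    return {
        "cat1": cat1,
        "cat2": cat2,
        "direction": direction,
        "strength": strength,
        "first_lexeme": first_lexeme_id,
        "other_lexemes": list(other_lexeme_ids),
        # The connection's first date is simply the first lexeme's date.
        "first_date": ht_first_dates[first_lexeme_id],
    }


# Made-up lexeme IDs and dates, purely for demonstration.
demo_dates = {1001: 1340, 1002: 1596}
connection = create_connection("1A01", "1E05", "1>2", "Strong", 1001, [1002], demo_dates)
print(connection["first_date"])
```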

From the table of metaphor connections a staff user may also choose to edit a connection (which presents them with a form similar to the ‘add new’ form but with details already filled in) or use a ‘delete connection’ function. This option doesn’t actually completely remove the connection from the database but copies the connection and its associated information (e.g. sample lexemes) to a newly created ‘deleted’ table, thus enabling accidentally deleted connections to be retrieved easily. I’ve sent Ellen the details and she’s going to test it all out so I’ll see if any further refinements are required after that.
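The ‘copy to a deleted table’ approach can be sketched as follows. This is a minimal illustration using Python’s built-in sqlite3 module rather than the project’s actual database, and every table and column name is an assumption. It also assumes that the live rows are removed once they have been copied, and for simplicity it keeps the deleted lexemes in a second archive table instead of a single ‘deleted’ table.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (all table names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE connection (id INTEGER PRIMARY KEY, cat1 TEXT, cat2 TEXT, strength TEXT);
        CREATE TABLE lexeme (connection_id INTEGER, lexeme_id INTEGER);
        CREATE TABLE deleted_connection (id INTEGER, cat1 TEXT, cat2 TEXT, strength TEXT);
        CREATE TABLE deleted_lexeme (connection_id INTEGER, lexeme_id INTEGER);
        INSERT INTO connection VALUES (1, '1A01', '1E05', 'Strong'), (2, '1A01', '1J02', 'Weak');
        INSERT INTO lexeme VALUES (1, 101), (1, 102), (2, 201);
    """)
    return conn


def soft_delete_connection(conn, connection_id):
    """Archive a connection and its lexemes, then remove them from the live tables."""
    with conn:  # one transaction: either every statement applies or none does
        conn.execute(
            "INSERT INTO deleted_connection (id, cat1, cat2, strength) "
            "SELECT id, cat1, cat2, strength FROM connection WHERE id = ?",
            (connection_id,),
        )
        conn.execute(
            "INSERT INTO deleted_lexeme (connection_id, lexeme_id) "
            "SELECT connection_id, lexeme_id FROM lexeme WHERE connection_id = ?",
            (connection_id,),
        )
        conn.execute("DELETE FROM lexeme WHERE connection_id = ?", (connection_id,))
        conn.execute("DELETE FROM connection WHERE id = ?", (connection_id,))


demo = make_demo_db()
soft_delete_connection(demo, 1)
print(demo.execute("SELECT id, cat1, cat2 FROM deleted_connection").fetchall())
```

Retrieving an accidentally deleted connection would then be the same pair of copies run in the opposite direction.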

On Friday this week I attended a workshop on Statistics in Corpus Linguistics at Stirling University. It was led by Lancaster University and it was a very interesting and useful day. Statistics is not an area I know much about and as I’m involved with the Scots Corpus and other projects that have a statistical element to them I figured it would be useful to learn a bit more about statistical methods. There was a two hour lecture in the morning which covered the basics of statistics, some details about the best sort of visualisations to use and some specifically corpus based statistics stuff such as collocations. It was really useful to learn about it. After lunch there was a two hour hands-on session that utilised some web-based tools that Lancaster had produced. These were web-based front ends to the ‘R’ statistics package and allowed us to feed data to the tool in order to produce graphs and other statistics. We also got to play with the BNC64 corpus. It was a really useful day, although I was suffering from a rather nasty cold which made it a bit of an effort to get through!

Week Beginning 9th February 2015

I continued with Mapping Metaphor duties this week, as well as starting with the Dictionary of the Scots Language post-launch tasks and also getting back into Scots Thesaurus work. For Mapping Metaphor I updated the way in which sample lexemes work. Ellen had sent me some data for a couple of categories, featuring real sample lexemes and directionality (with start dates being inferred from the sample lexemes). I uploaded this data and updated the system to ensure that real start dates and sample lexemes are used if these have been specified, while the existing random ones will continue to be used where no real data exists. I also updated the way in which the sample lexemes link through to the Historical Thesaurus. Previously a link from a word simply performed a search for that word in the HT, which was not very accurate as a user may then have ended up looking at a category that was completely different to the metaphorical one they were looking at. Now the link takes the user to the specific usage of the word, loading the relevant HT category and highlighting the word in question. It links the two resources together very nicely.

I’ve also updated the top-level timeline slightly – it is now possible to view it without having a category selected (previously it was giving an error) and I’ve updated the counts in each column so they show the total number of L3 to L3 connections found rather than the number of squares in each column. I’ve updated the ‘drill down’ timeline so that if you’re looking at the connections to / from a specific L3 category then none of the dots are highlighted, unless you’ve selected one of the other categories linked to this category in the round visualisation, which results in only those dots involving this category being highlighted in the timeline. I also tweaked the top level timeline further so that highlighted squares are highlighted in yellow with a thinner line, which means it is easier to see the colour in the middle. The timeline now looks a bit like a series of skyscrapers with lights on. My final Metaphor task of the week was to create a list of known bugs that I aim to squash before the end of March. There are 25 items on it at the moment. Ellen was running two user testing sessions with some students this week so it may be that further bugs and issues are identified as a result of these.

For DSL I set about tackling issue 2 on our ‘to do’ list, which was to completely overhaul how the search results were displayed and how the entry page links back to the search results page. In the live site the search results are displayed as two columns – one for SND and the other for DOST. Users can scroll through these results lists independently, with 20 items being shown at any one time. When a user navigates to an entry the search results are displayed to the left of the entry, and the idea originally was that people would not need to return to the main search results page. Returning to the page places the user back at the start of the list of results.

It turns out that this arrangement isn’t really all that useful for people, who want an easier mechanism for jumping between sections of the results rather than scrolling through 20 at a time. People were also really wanting a way to jump back to the section of the results they were looking at when returning from an entry page to the results.

I’ve addressed all of these issues in the ‘development version’ of the site now, and once the changes have been approved I’ll add them to the main site. I updated the results page so that the SND and DOST sections are now separate tabs, with the inactive tab not visible and the active one taking up the full width of the page. This is much less cluttered than the previous layout. It also gives me more room to add in a new navigation bar. This appears both above and below the search results and presents users with ‘next’ and ‘previous’ buttons plus ‘jump to page’ buttons for pages in between. If there are more than 10 pages some are hidden and replaced with an ellipsis, but clicking on this expands the list to show all pages. It’s now much easier for a user to jump to the last page of the results and work back (for example). I also updated the entry page so that returning to the search results from it automatically loads the page where the entry that was being viewed is found. I think it works a lot better.
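The two pieces of logic here (which ‘jump to page’ buttons to show, and which page to reload when coming back from an entry) can be sketched roughly as below. This is an illustrative Python version only, not the site’s code. The choice to keep the first page, the last page and two pages either side of the current one is my assumption, as is the page size of 20, which is borrowed from the live site’s list length.

```python
def page_links(current, total, window=2, max_visible=10):
    """Return the page numbers to show as buttons; None marks an ellipsis."""
    if total <= max_visible:
        return list(range(1, total + 1))
    # Always keep the first and last pages plus a small window around the current one.
    keep = {1, total}
    keep.update(p for p in range(current - window, current + window + 1) if 1 <= p <= total)
    links, previous = [], 0
    for page in sorted(keep):
        if page - previous > 1:
            links.append(None)  # a hidden run of pages collapses into one ellipsis
        links.append(page)
        previous = page
    return links


def page_for_result(result_index, per_page=20):
    """Page (1-based) containing the result at a 0-based position in the list."""
    return result_index // per_page + 1


print(page_links(7, 30))    # [1, None, 5, 6, 7, 8, 9, None, 30]
print(page_for_result(45))  # 3
```

Clicking an ellipsis would simply swap this list for the full `range(1, total + 1)`.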

My work on the Scots Thesaurus project this week led on from my meeting at SND last week. Previously for the Thesaurus project I had created a tool that allows a researcher to search the HTE for words and then search the XML of DSL for all occurrences of these words. I was working directly with the DSL XML files through a BaseX native XML database but the data I had didn’t have any sort of unique identifiers for entries, so it wasn’t possible for the system to log which entry the researcher thought was of interest. After speaking to Peter Bell last week I realised that it made a lot more sense to connect to the DSL API to query the data rather than using my own XML database. This means the Thesaurus project is using the most up to date dictionary data and more importantly the data actually has unique identifiers.

It took some time to get access to the API from my test scripts, but I managed to update the tool I had created to connect to the API in order to display DSL results. This allows the tool to ‘know’ which entry it is displaying (so I will be able to update it in future to automatically log the IDs in the Thesaurus database) and it also enables the researcher to follow links to the entries on the DSL website.

I also began to work on a mechanism to allow researchers to manage thesaurus categories, which is the first step in expanding the tool to enable words to automatically be associated with categories. I’ll continue with this next week.

Also this week I had a few meetings with people. I met with Mark Herraghty to discuss the Pennant project which he has recently started working on. It was great to catch up with this project again and it seems like the technical side of things is being well managed by Mark. I also met with Bryony Randall in English Literature to discuss a project she is putting together and I had a meeting with Ellen to discuss Old English data in Mapping Metaphor. We will need to create a dedicated Old English version of the site once the data is available and we agreed that I would get this up and running in June.


Week Beginning 2nd February 2015

This week was mostly split between three projects – Mapping Metaphor, SAMUELS / Historical Thesaurus and the Dictionary of the Scots Language. For Mapping Metaphor I replaced all of the existing metaphor data with the newly reworked data. This version of the data has duplicates stripped out and ‘relevant’ and ‘noise’ connections removed. It took a little bit of time to get the data uploaded due to the changes we’d made to category numbers – the new data still had the old ‘A01’ style numbers and these had to be mapped onto the new ‘1A01’ style numbers. This caused some issues with the H27 category, which had been split into 6 smaller categories (H27A-H27F) that then needed to be mapped onto differently numbered categories. With Ellen’s help I got there in the end and the data now comprises 18,179 rows, 5,949 of which are classed as ‘Strong’ metaphors.

Once the new data was uploaded and tested I returned once again to the timeline visualisations. It had previously been agreed that we would not make a timeline available whenever L2 categories are not expanded. Previously these had appeared on the timeline as squares representing the aggregate L2 category but their appearance was rather confusing (as I discussed in last week’s post). I wanted to try making some sort of timeline available when non-expanded L2 categories are visible and decided to see if it would be feasible to just expand all of these categories and squeeze them into the timeline view. It turned out that the timeline view was big enough to accommodate this data. I’ve also set it up so that if a particular L3 category is selected then all of the connections across the time periods that involve this category are highlighted in pink. The screenshot below shows all of the connections to / from categories within the L2 category ‘1E Animals’, with the L3 category ‘Birds’ selected. You’ll also see that it’s possible to find information about the selected L3 category through the left-hand info panel.


I spent some time this week trying to find out how to over-ride the default width of a column on the x-axis of the timeline so as to make the Old English section wider than the others. I’m afraid that so far I have been unable to find out how to do this as each axis is generated and subdivided programmatically by D3. I’ll continue to investigate this issue, but it’s not looking too promising.

The rest of the Mapping Metaphor time this week was spent on the top-level timeline view. I had an idea as to how to display all of the connections throughout time at this level: Display the number of connections between each L2 category that have a start date within each time period – e.g. all of the connections between 1A and 1J that take place between 1150 and 1199 would be represented by one square in the timeline. Hovering over the square would bring up information about how many connections between the two categories began during this time while clicking on the square would bring up the top-level metaphor card view as with the regular visualisation. I realised that there would be an awful lot of squares in the timeline and so began to investigate how these could be arranged. I worked out a way to have multiple columns of squares within each time period and figured out how to squeeze five squares in a row. This resulted in the overall impression of a bar chart with it actually being made up of individual squares of different colours. I think it looks quite nice, but I have some reservations about how useful such a visualisation might actually be. An example of the top-level timeline can be viewed below:
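The grouping and layout behind this view can be sketched along the following lines. The real timeline is drawn with D3, so this Python version only illustrates the data preparation and is not the code used on the site. It assumes fixed 50-year periods (suggested by the 1150 to 1199 example), treats a connection between 1A and 1J as the same pair in either direction, and uses invented data.

```python
from collections import Counter


def period_start(year, period_length=50):
    """First year of the period containing `year`, e.g. 1199 -> 1150."""
    return (year // period_length) * period_length


def build_squares(connections, squares_per_row=5):
    """Turn (L2 category, L2 category, start year) tuples into timeline squares.

    Each square stands for every connection between one pair of L2 categories
    that begins in one time period, and is given a row / column slot so that
    the squares in a period stack up five to a row.
    """
    counts = Counter()
    for cat_a, cat_b, start_year in connections:
        pair = tuple(sorted((cat_a, cat_b)))  # assumption: direction is ignored
        counts[(period_start(start_year), pair)] += 1

    squares = []
    placed = Counter()  # number of squares already placed in each period
    for (period, pair), count in sorted(counts.items()):
        index = placed[period]
        placed[period] += 1
        squares.append({
            "period": period,
            "pair": pair,
            "count": count,  # shown when hovering over the square
            "row": index // squares_per_row,
            "col": index % squares_per_row,
        })
    return squares


demo = [("1A", "1J", 1150), ("1J", "1A", 1199), ("1A", "1B", 1160), ("1A", "1J", 1200)]
for square in build_squares(demo):
    print(square)
```

Because every period fills its rows five squares at a time, busier periods grow taller, which is what gives the bar chart impression.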


For SAMUELS / the Historical Thesaurus this week I met with Fraser to discuss the possibility of fixing the version 1 to current number converter. We’ve altered the category numbers quite significantly from the printed version of the HT and the converter (see http://historicalthesaurus.arts.gla.ac.uk/versions-and-changes/#converter) allows you to convert numbers from one form to another. Unfortunately there were several thousand categories that didn’t have a version 1 number specified in the database, leading to ‘not found’ warnings. Fraser had found a way to fix the problem and generated a spreadsheet with the necessary mappings. I wrote a little script to firstly check the data and then apply the changes and we have managed to plug pretty much all of the gaps.

For DSL I travelled through to Edinburgh on Thursday to meet with Ann and Peter to discuss the post-launch updates that Ann would like Peter and me to implement. It was a really useful meeting and very good to catch up with Peter and Ann again as it was the first time since the launch that I had seen them. We went through Ann’s list and discussed everything and in most cases formulated a plan of action for getting things sorted. Hopefully over the coming weeks I will be able to devote a bit of time to these tasks.