Week Beginning 16th February 2015

I spent roughly half this week working on the Scots Thesaurus project. I’d previously developed a version of a tool that queried the DSL XML files stored within a BaseX database, and Susan wanted the facility to select DSL entries and automatically associate them with thesaurus categories. This wasn’t really possible because the DSL XML data I was working with didn’t include unique identifiers for each dictionary entry. At my meeting with the SND people a couple of weeks ago, Peter Bell suggested that I use the API that powers the new DSL website to get the data, as the data stored there is properly separated out, with unique identifiers. The API allows the same search that the tool was previously carrying out on the XML files (searching the full text of each entry without the citations), so I decided to rebuild the tool to use the API data rather than the older XML data. This has resulted in a much more useful and usable tool for managing categories and words. The new tool I created this week allows staff to do the following:

  1. Browse the categories that have already been created

I’ve reused a lot of the code that powers the HTE for this. Users can browse up and down the hierarchy, view subcategories, view the words associated with categories and follow links to the DSL entry for each word. They can also jump to a specific category by entering its number and selecting the part of speech, and can select a category in order to add words to it (see below).

  2. Add a new category

Using this facility, staff users can add new main categories and subcategories to the database.

  3. Add words to a category

I’ve updated the ‘search for words’ facility that I’d previously created. If a staff user has selected a category to add words to from the ‘browse category’ page and then visits the ‘add words’ page, s/he can search the HTE for words and categories, view and edit the search terms that were created, and then search for these terms in the DSL. The search queries the DSL API and returns results split into separate lists for DOST and SND. The user can follow links from each search result to the entries on the DSL website, or tick a checkbox beside one or more results and press the ‘add selected words to category’ button at the bottom of the page to automatically associate the selected words with the chosen category. These words are then stored in the database. If a category is selected and a word search returns any words that are already associated with that category, these are listed with their checkboxes already ticked; if these are ‘unticked’ and the user presses the ‘add selected words…’ button, the association between the word and the category is removed.
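This add/remove behaviour boils down to a set comparison between the words the user has ticked and the associations already stored. Here’s a minimal sketch of that toggle logic in Python; the use of SQLite and the table and column names are my own assumptions for illustration, not the project’s actual schema.

```python
import sqlite3

def save_selections(conn, category_id, selected_word_ids):
    """Sync a category's word associations with the checkboxes the user
    submitted: insert newly ticked words, remove unticked ones.
    Table and column names are illustrative, not the project's."""
    cur = conn.cursor()
    cur.execute("SELECT word_id FROM category_words WHERE category_id = ?",
                (category_id,))
    existing = {row[0] for row in cur.fetchall()}
    selected = set(selected_word_ids)

    # Ticked now but not yet associated -> create the association.
    for word_id in selected - existing:
        cur.execute("INSERT INTO category_words (category_id, word_id) "
                    "VALUES (?, ?)", (category_id, word_id))

    # Previously associated but now unticked -> remove the association.
    for word_id in existing - selected:
        cur.execute("DELETE FROM category_words "
                    "WHERE category_id = ? AND word_id = ?",
                    (category_id, word_id))
    conn.commit()

# e.g. save_selections(sqlite3.connect("thesaurus.db"), 42, [101, 102, 105])
```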

I think these facilities should allow words and categories for the project to be managed pretty effectively. I’ll need to hear back from Susan before I do anything further with the tool.

Other than Friday, the rest of my week was spent on Mapping Metaphor duties. I spent a bit of time going through the feedback from the two testing sessions that Ellen had arranged last week, noting any issues that had been raised that were not already in my ‘bug list’ document. There were only a few of these, which is relatively encouraging. The rest of my time was spent developing a content management system for the project. I created a table to hold staff user accounts, created the logging in and out logic (taken from previous projects, so not particularly tricky to implement) and created the facilities that Ellen and Wendy had requested. The staff interface presents the staff user with a view of the metaphor categories that is pretty much identical to the general ‘browse’ interface. From this the user can select a category, which brings up a page listing the category name, its descriptors and a table containing all of the connections this category has to other categories. From this page the user can access a form where the category name and descriptors can be updated.
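As for the logging in and out logic, it follows the standard pattern of checking submitted credentials against stored salted password hashes. A minimal sketch, assuming a simple ‘staff_users’ table; the schema, hashing choices and function names are my own illustrations rather than the actual implementation (which, as noted above, was carried over from previous projects).

```python
import hashlib
import hmac
import os

def create_staff_user(conn, username, password):
    """Store a staff account with a salted PBKDF2 password hash.
    The 'staff_users' schema is an assumption for illustration."""
    salt = os.urandom(16)
    pw_hash = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    conn.execute("INSERT INTO staff_users (username, salt, pw_hash) "
                 "VALUES (?, ?, ?)", (username, salt, pw_hash))
    conn.commit()

def check_login(conn, username, password):
    """Return True if the submitted credentials match a stored account."""
    row = conn.execute("SELECT salt, pw_hash FROM staff_users "
                       "WHERE username = ?", (username,)).fetchone()
    if row is None:
        return False
    salt, stored = row
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # Constant-time comparison to avoid leaking information via timing.
    return hmac.compare_digest(attempt, stored)
```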

There is also a form through which a new metaphorical connection involving the category can be created. This presents the selected category as ‘Category 1’ and provides a drop-down list containing all of the categories, enabling the connection to be established. The user can also specify the direction and strength of the connection, and supply a ‘first lexeme’ and other lexemes associated with it. These lexemes have to be supplied as their unique identifiers from the historical thesaurus database. Upon creation the system creates the new connection and generates a ‘first date’ for it based on the ‘first lexeme’ ID that was supplied.
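In outline, creating a connection is an insert plus a lookup: the ‘first date’ comes from whatever date is recorded against the supplied ‘first lexeme’ ID. A rough sketch, with all table and column names assumed for illustration:

```python
def add_connection(conn, cat1_id, cat2_id, direction, strength,
                   first_lexeme_id, other_lexeme_ids):
    """Create a metaphorical connection and derive its 'first date' from
    the first lexeme's record. All names here are assumptions."""
    # Look up the date recorded against the supplied 'first lexeme' ID
    # in the (assumed) historical thesaurus lexeme table.
    row = conn.execute("SELECT first_date FROM ht_lexemes WHERE id = ?",
                       (first_lexeme_id,)).fetchone()
    first_date = row[0] if row else None

    cur = conn.execute(
        "INSERT INTO connections (category1_id, category2_id, direction, "
        "strength, first_lexeme_id, first_date) VALUES (?, ?, ?, ?, ?, ?)",
        (cat1_id, cat2_id, direction, strength, first_lexeme_id, first_date))
    connection_id = cur.lastrowid

    # Attach any further sample lexemes to the new connection.
    for lexeme_id in other_lexeme_ids:
        conn.execute("INSERT INTO connection_lexemes (connection_id, "
                     "lexeme_id) VALUES (?, ?)", (connection_id, lexeme_id))
    conn.commit()
    return connection_id
```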

Using the table of metaphor connections, a staff user may also choose to edit a connection (which presents a form similar to the ‘add new’ form but with the details already filled in) or to delete it. The ‘delete connection’ option doesn’t actually remove the connection from the database completely; instead it copies the connection and its associated information (e.g. sample lexemes) to a newly created ‘deleted’ table, enabling accidentally deleted connections to be retrieved easily. I’ve sent Ellen the details and she’s going to test it all out, so I’ll see if any further refinements are required after that.
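For reference, the delete function amounts to a soft delete: copy first, then remove. Something along these lines, with the table names once again being my own assumptions:

```python
def delete_connection(conn, connection_id):
    """'Delete' a connection by copying it and its sample lexemes into
    parallel 'deleted_*' tables before removing the live rows, so an
    accidental deletion can be undone. Table names are assumptions."""
    conn.execute("INSERT INTO deleted_connections "
                 "SELECT * FROM connections WHERE id = ?", (connection_id,))
    conn.execute("INSERT INTO deleted_connection_lexemes "
                 "SELECT * FROM connection_lexemes WHERE connection_id = ?",
                 (connection_id,))
    conn.execute("DELETE FROM connection_lexemes WHERE connection_id = ?",
                 (connection_id,))
    conn.execute("DELETE FROM connections WHERE id = ?", (connection_id,))
    conn.commit()
```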

On Friday this week I attended a workshop on Statistics in Corpus Linguistics at Stirling University. It was led by Lancaster University and it was a very interesting and useful day. Statistics is not an area I know much about, and as I’m involved with the Scots Corpus and other projects that have a statistical element to them, I figured it would be useful to learn a bit more about statistical methods. There was a two-hour lecture in the morning which covered the basics of statistics, some details about the best sorts of visualisations to use, and some specifically corpus-based statistics such as collocations. After lunch there was a two-hour hands-on session that utilised some web-based tools that Lancaster had produced. These were web-based front ends to the ‘R’ statistics package and allowed us to feed data to the tool in order to produce graphs and other statistics. We also got to play with the BNC64 corpus. It was a really useful day, although I was suffering from a rather nasty cold which made it a bit of an effort to get through!
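As an aside, one collocation statistic that often comes up in sessions like this is Dunning’s log-likelihood (G²), which compares how often two words actually co-occur with what independence would predict. The workshop used Lancaster’s R-based tools; the Python sketch below is just my own illustration of the calculation, not a record of what was taught:

```python
from math import log

def log_likelihood(joint, f_node, f_colloc, corpus_size):
    """Dunning's log-likelihood (G2) score for a node/collocate pair,
    from the joint count, each word's frequency and the corpus size."""
    n = corpus_size
    # Observed 2x2 contingency table.
    observed = [joint,                          # node with collocate
                f_node - joint,                 # node without collocate
                f_colloc - joint,               # collocate without node
                n - f_node - f_colloc + joint]  # neither word
    # Expected counts under the assumption of independence.
    expected = [f_node * f_colloc / n,
                f_node * (n - f_colloc) / n,
                (n - f_node) * f_colloc / n,
                (n - f_node) * (n - f_colloc) / n]
    # G2 = 2 * sum(O * ln(O / E)); zero observed cells contribute nothing.
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
```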