Week Beginning 31st August 2015

This week I returned to working a full five days, after the previous two part-time weeks. It was good to have a bit more time to work on the various projects I’m involved with, and to be able to actually get stuck into some development work again. On Monday and Tuesday and a bit of Thursday this week I focussed on the Scots Thesaurus project. The project is ending at the end of September so there’s going to be a bit of a final push over the coming weeks to get all of the outstanding tasks completed.

I spent quite a bit of time continuing to try to get an option to enable multiple parts of speech to be represented in the visualisations at the same time, but unfortunately I had to abandon this due to the limitations of my available time. It’s quite difficult to explain why allowing multiple parts of speech to appear in the same visualisation is tricky, but I’ll try. The difficulty is caused by the way parts of speech and categories are handled in the thesaurus database. A category for each part of speech is considered to be a completely separate entity, with a different unique identifier, different lexemes and subcategories. For example, there isn’t just one category ‘01.01.11.02.08.02.02 Rain’, and then certain lexemes within it that are nouns and others that are verbs. Instead, ‘01.01.11.02.08.02.02n Rain’ is one category (ID 398) and ‘01.01.11.02.08.02.02v Rain’ is another, different category (ID 401). This is useful because categories of different parts of speech can then have different names (e.g. ‘Dew’ (n) and ‘Cover with dew’ (v)), but it also means building a multiple part of speech visualisation is tricky because the system is based around the IDs.
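To make the ID issue a bit more concrete, here is a minimal sketch in Python (not the project's actual code, which is part of a WordPress plugin). The `Category` class and the `ids_by_number` function are names invented purely for illustration; only the category number, the parts of speech and the two IDs come from the ‘Rain’ example above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """Hypothetical record layout: one row per (number, part of speech) pair."""
    id: int        # unique identifier the system is built around
    number: str    # hierarchical category number
    pos: str       # part of speech, e.g. "n" or "v"
    heading: str   # heading, which may differ between parts of speech


# The two 'Rain' categories described above.
CATEGORIES = [
    Category(398, "01.01.11.02.08.02.02", "n", "Rain"),
    Category(401, "01.01.11.02.08.02.02", "v", "Rain"),
]


def ids_by_number(categories):
    """Group category IDs by the category number they share."""
    grouped = defaultdict(dict)
    for cat in categories:
        grouped[cat.number][cat.pos] = cat.id
    return {number: dict(by_pos) for number, by_pos in grouped.items()}


print(ids_by_number(CATEGORIES))
```

A single category number maps to several IDs, one per part of speech, so anything keyed on the ID alone only ever sees one part of speech at a time.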

The tree-based visualisations we’re using expect every element to have one parent category, and if we try to include multiple parts of speech things get a bit confused: we no longer have a single top-level parent category, as the noun categories have a different parent from the verbs, etc. I thought of trying to get around this by just taking the category for one part of speech to be the top category, but this is a little confusing if the multiple top categories have different names. It also makes it confusing to know where the ‘browse up’ link goes to if multiple parts of speech are displayed.
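The single-parent problem can be sketched roughly as follows. This is an illustrative Python snippet under stated assumptions rather than the real plugin code: the row layout, the `find_roots` function, and the parent IDs 390 and 395 (with their placeholder headings) are all hypothetical; only IDs 398 and 401 are real.

```python
# Hypothetical rows: (id, parent_id, part of speech, heading).
# 398 and 401 are the two 'Rain' categories; 390 and 395 are invented parents.
ROWS = [
    (390, None, "n", "Parent (n)"),
    (398, 390, "n", "Rain"),
    (395, None, "v", "Parent (v)"),
    (401, 395, "v", "Rain"),
]


def find_roots(rows, parts_of_speech):
    """Return the IDs of selected rows whose parent is not itself selected."""
    selected = [row for row in rows if row[2] in parts_of_speech]
    selected_ids = {row[0] for row in selected}
    return [row[0] for row in selected if row[1] not in selected_ids]


print(find_roots(ROWS, {"n"}))        # one root: fine for a tree layout
print(find_roots(ROWS, {"n", "v"}))   # two roots: no single top-level parent
```

With one part of speech selected there is exactly one root to hand to the tree layout; with two selected there are two, and something has to decide which one (if either) sits at the top.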

There is also the potential for confusion relating to the display of categories that are at the same level but with a different part of speech. It’s not currently possible to tell by looking at the visualisation which category ‘belongs’ to which part of speech when multiple parts of speech are selected, so for example if looking at both ‘n’ and ‘v’ we end up with two circles for ‘Rain’ but no way of telling which is ‘n’ and which is ‘v’. We could amalgamate these into one circle but that brings other problems if the categories have different names, like the ‘Dew’ example. Also, what then should happen with subcategories? If an ‘n’ category has 3 subcategories and a ‘v’ category has 2 subcategories and these are amalgamated, it’s not possible to tell which main category the subcategories belong to. Subcategory numbers can also be the same in different categories, so the ‘n’ category may have a subcategory ‘01’ and a further one ‘01.01’ while the ‘v’ category may also have ones with the same numbers, and it would be difficult to get these to display as separate subcategories.

There is also a further issue with us ending up with too much information in the right-hand column, where the lexemes in each category are displayed. If the user selects 2 or 3 parts of speech we then have to display the category headings and the words for each of these in the right-hand column, which can result in far too much data being displayed.

None of these issues are completely insurmountable, but I decided that given the limited amount of time I have left on the project it would be risky to continue to pursue this approach for the time being. Instead what I implemented is a feature that allows users to select a single part of speech to view from a list of available options. Users are able to, for example, switch from viewing ‘n’ to viewing ‘v’ and back again, but can’t view both ‘n’ and ‘v’ at the same time. I think this facility works well enough and considerably cuts down on the potential for confusion.

After completing the part of speech facility I moved on to some of the other outstanding, non-visualisation tasks I still have to tackle, namely a ‘browse’ facility and the search facilities. Using WordPress shortcodes I created an option that lists all of the top-level main categories in the system – i.e. those categories that have no parent category. This option provides a pathway into the thesaurus data and is a handy reference showing which semantic areas the project has so far tackled. I also began work on the search facilities, which will work in a very similar manner to those offered by the Historical Thesaurus of English. So far I’ve managed to create the required search forms but not the search that these need to connect to.
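The logic behind the ‘browse’ list is simple enough to sketch. The real version is a WordPress shortcode, so the Python below is only an illustration of the idea; the row layout, the function names and all of the sample data are hypothetical placeholders rather than real thesaurus content.

```python
# Hypothetical (id, parent_id, number, pos, heading) rows; not real project data.
CATEGORY_ROWS = [
    (1, None, "01", "n", "Example top-level category A"),
    (2, 1, "01.01", "n", "Example child category"),
    (3, None, "02", "n", "Example top-level category B"),
]


def top_level_categories(rows):
    """Keep only categories with no parent, ordered by number then part of speech."""
    top = [row for row in rows if row[1] is None]
    return sorted(top, key=lambda row: (row[2], row[3]))


def render_browse_list(rows):
    """Render the top-level categories as one line of text per category."""
    return "\n".join(
        f"{number}{pos} {heading}"
        for _, _, number, pos, heading in top_level_categories(rows)
    )


print(render_browse_list(CATEGORY_ROWS))
```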

After making this progress with non-visualisation features I returned to the visualisations. The visualisation style we had adopted was a radial tree, based on this example: http://bl.ocks.org/mbostock/4063550. This approach worked well for representing the hierarchical nature of the thesaurus, but it was quite hard to read the labels. I decided instead to investigate a more traditional tree approach, initially hoping to get a workable vertical tree, with the parent node at the top and levels down the hierarchy from this expanding down the page. Unfortunately our labels are rather long and this approach meant that there were a lot of categories on the same horizontal line of the visualisation, leading to a massive amount of overlap of labels. So instead I went for a horizontal tree approach, and adapted a very nice collapsible tree style similar to the one found here: http://mbostock.github.io/d3/talk/20111018/tree.html. I continued to work on this on Thursday and I have managed to get a first version integrated with the WordPress plugin I’m developing.

Also on Thursday I met with Susan and Magda to discuss the project and the technical tasks that are still outstanding. We agreed on what I should focus on in my remaining time and we also discussed the launch at the end of the month. We also had a further meeting with Wendy, as a representative of the steering group, and showed her what we’d been working on.

On Wednesday this week I focussed on Medical Humanities. I spent a few hours adding a new facility to the SciFiMedHums database and WordPress plugin to enable bibliographical items to cross-reference any number of other items. This facility adds such a connection in both directions, allowing (for example) ‘Blade Runner’ to have an ‘adapted from’ relationship with ‘Do Androids Dream of Electric Sheep?’ and for the relationship in the other direction to then automatically be recorded with an ‘adapted into’ relationship.
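The two-way recording can be sketched like this. Again, this is a Python illustration of the idea rather than the actual database and plugin code; the `INVERSE` table, the `add_cross_reference` function and the use of a set of tuples are assumptions made for the example, and only the two relationship names and the two titles come from the description above.

```python
# Hypothetical lookup of each relationship type and its inverse.
INVERSE = {
    "adapted from": "adapted into",
    "adapted into": "adapted from",
}


def add_cross_reference(relations, source, relation, target):
    """Record the relationship and, automatically, its inverse."""
    relations.add((source, relation, target))
    relations.add((target, INVERSE[relation], source))


relations = set()
add_cross_reference(
    relations,
    "Blade Runner",
    "adapted from",
    "Do Androids Dream of Electric Sheep?",
)

for source, relation, target in sorted(relations):
    print(f"{source} -> {relation} -> {target}")
```

Storing both directions explicitly means either item can list its cross-references without having to work out the inverse at display time, at the cost of keeping the two records in step when one is edited or deleted.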

I spent the remainder of Wednesday and some other bits of free time continuing to work on the Medical Humanities Network website and CMS. I have now completed the pages and the management scripts for managing people and projects and have begun work on Keywords. There should be enough in place now to enable the project staff to start uploading content and I will continue to add in the other features (e.g. collections, teaching materials) over the next few weeks.

On Friday I met with Stuart Gillespie to discuss some possibilities for developing an online resource out of a research project he is currently in the middle of. We had a useful discussion and hopefully this will develop into a great resource if funding can be secured. The rest of my available time this week was spent on the Hansard materials again. After discussions with Fraser I think I now have a firmer grasp on where the metadata that we require for search purposes is located. I managed to get access to information about speeches from one of the files supplied by Lancaster and also access to the metadata used in the Millbanksystems website relating to constituencies, offices and things like that. The only thing we don’t seem to have access to is which party a member belonged to, which is a shame as this would be hugely useful information. Fraser is going to chase this up, but in the meantime I have the bulk of the required information. On Friday I wrote a script to extract the information relating to speeches from the file sent by Lancaster. This will allow us to limit the visualisations by speaker, and also hopefully by constituency and office too. I also worked some more with the visualisation, writing a script that created output files for each thematic heading in the two-year sample data I’m using, to enable these to be plugged into the visualisation. I also started to work on facilities to allow a user to specify which thematic headings to search for, but I didn’t quite manage to get this working before the end of the day. I’ll continue with this next week.