I decided this week to devote some time to redeveloping the Thesaurus of Old English, to bring it into line with the work I’ve been doing to redevelop the main Historical Thesaurus website. I had thought I wouldn’t have time to do this before next week’s ‘Kay Day’ event but I decided that it would be better to tackle the redevelopment whilst the changes I’d made for the main site were still fresh in my mind, rather than coming back to it in possibly a few months’ time, having forgotten how I implemented the tree browse and things like that. It actually took me less time than I had anticipated to get the new version up and running, and by the end of Tuesday I had a new version in place that was structurally similar to the new HT site. We will hopefully be able to launch this alongside the new HT site towards the end of next week.
I sent the new URL to Carole Hough for feedback as I was aware that she had some issues with the existing TOE website. Carole sent me some useful feedback, which led to me making some additional changes to the site – mainly to the tree browse structure. The biggest issue is that the hierarchical structure of TOE doesn’t quite make sense. There are 18 top-level categories but, for some reason I am not at all clear about, each top-level category isn’t a ‘parent’ category but is in fact a sibling of the categories that are one level down. E.g. logically ’04 Consumption of food/drink’ would be the parent category of ’04.01’, ’04.02’ etc. but in the TOE this isn’t the case; rather, ’04.01’, ’04.02’ should sit alongside ‘04’. This really confuses both me and my tree browse code, which expects categories ‘xx.yy’ to be child categories of ‘xx’. This led to the tree browse putting categories where they logically belong but where, within the confines of the TOE, they make no sense – e.g. we ended up with ’04.04 Weaving’ within ’04 Consumption of food/drink’!
To confuse matters further, there are some additional ‘super categories’ that I didn’t have in my TOE database but apparently should be used as the real 18 top-level categories. Rather confusingly these have the same numbers as the other top-level categories. So we now have ’04 Material Needs’ that has a child category ’04 Consumption of food/drink’ that then has ’04.04 Weaving’ as a sibling and not as a child as the number would suggest. This situation is a horrible mess that makes little sense to a user, but is even harder for a computer program to make sense of. Ideally we should renumber the categories in a more logical manner, but apparently this isn’t an option. Therefore I had to hack about with my code to try and allow it to cope with these weird anomalies. I just about managed to get it all working by the end of the week but there are a few issues that I still need to clear up next week. The biggest one is that all of the ‘xx.yy’ categories and their child categories are currently appearing in two places – within ‘xx’ where they logically belong and beside ‘xx’ where this crazy structure says they should be placed.
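To make the mismatch concrete, here is a minimal Python sketch (not the actual TOE code) contrasting the parent that the number alone implies with the parent the TOE layout seems to want. The function names, the ‘super:’ prefix used to tell a super category apart from the same-numbered category, and the rule that deeper levels still follow their number are all assumptions made for illustration.

```python
from typing import Optional


def logical_parent(catnum: str) -> Optional[str]:
    """Parent implied by the number alone: '04.04' -> '04', '04' -> None."""
    if "." not in catnum:
        return None
    return catnum.rsplit(".", 1)[0]


def toe_parent(catnum: str) -> Optional[str]:
    """Parent under the TOE layout as described above (assumed rule).

    Both 'xx' and 'xx.yy' sit directly under the super category numbered
    'xx', so they end up as siblings. Deeper levels are assumed to follow
    their number as normal.
    """
    depth = catnum.count(".")
    if depth <= 1:
        # 'super:' is a made-up marker, needed because the super category
        # shares its number with an ordinary category.
        return "super:" + catnum.split(".")[0]
    return catnum.rsplit(".", 1)[0]


if __name__ == "__main__":
    for cat in ("04", "04.04", "04.04.01"):
        print(cat, "| by number:", logical_parent(cat), "| in TOE:", toe_parent(cat))
```

With a rule like this, ’04.04 Weaving’ would be attached beside ’04 Consumption of food/drink’ rather than inside it, which is the placement the TOE structure asks for.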
In addition to all this TOE madness I also spent some further time tweaking the new HT website, including updating the quick search box so the display doesn’t mess up on narrow screens, making some further tweaks to the photo gallery and making alterations to the interface. I also responded to a request from Fraser to update one of the scripts I’d written for the HT OED data migration that we’re still in the process of working through.
In terms of non-thesaurus related tasks this week, I was involved in a few other projects. I had to spend some time on some AHRC review duties. I also fixed an issue that had crept into the SCOTS and CMSW Corpus websites since their migration: the ‘download corpus as a zip’ feature was no longer working because the PHP code used an old class to create the zip that was not compatible with the new server. I spent some time investigating this and finding a new way of using PHP to create zip files. I also locked down the SPADE website admin interface to IP address ranges of our partner institutions and fixed an issue with the SCOSYA questionnaire upload facility. I also responded to a request for information about TEI XML training from a PhD student and made a tweak to a page of the DSL website.
I spent the remainder of my week looking at some app issues. We are hopefully going to be releasing a new and completely overhauled version of the ARIES app by the end of the summer and I had been sent a document detailing the overall structure of the new site. I spent a bit of time creating a new version of the web-based ARIES app that reflected this structure, in preparation for receiving content. I also returned to the Metre app, which I’ve not done anything with since last year. I added in some explanatory text and I am hopefully going to be able to start wrapping this app up and deploying it to the App and Play stores soon, but possibly not until after my summer holiday, which starts the week after next.
I spent quite a bit of time this week on the Historical Thesaurus. A few tweaks ahead of Kay Day have now turned into a complete website redevelopment, so things are likely to get a little hectic over the next couple of weeks. Last week I implemented an initial version of a new HT tree-based browse mechanism but at the start of this week I still wasn’t sure how best to handle different parts of speech and subcategories. Originally I had thought we’d have a separate tree for each part of speech, but I came to realise that this was not going to work as the non-noun hierarchy has more gaps than actual content. There are also issues with subcategories, as ones with the same number but different parts of speech have no direct connection. Main categories with the same number but different parts of speech always refer to the same thing – e.g. 01.02aj is the adjective version of 01.02.n. But subcategories just fill out the numbers, meaning 01.02|01.aj can be something entirely different to 01.02|01.n. This means providing an option to jump from a subcategory in one part of speech to another wouldn’t make sense.
Initially I went with the idea of having noun subcategories represented in the tree and the option to switch part of speech in the right-hand pane after the user selected a category in the tree (if a main category was selected). When a non-noun main category was selected then the subcategories for this part of speech would then be displayed under the main category words. This approach worked, but I felt that it was too inconsistent. I didn’t like that subcategories were handled differently depending on their part of speech. I therefore created two additional versions of the tree browser in addition to the one I created last week.
The second one has [+] and [-] instead of chevrons. It has the catnum in grey before the heading. The tree structure is the same as the first version (i.e. includes all noun categories and noun subcats). When you open a category the different parts of speech now appear as tabs, with ‘noun’ open by default. Hover over a tab to see the full part of speech and the heading for that part of speech. The good thing about the tabs is the currently active PoS doesn’t disappear from the list, as happens with the other view. When viewing a PoS that isn’t ‘noun’ and there are subcategories the full contents of these subcategories are visible underneath the maincat words. Subcats are indented and coloured to reflect their level, as with the ‘live’ site’s subcats, but here all lexemes are also displayed. As ‘noun’ subcats are handled differently and this could be confusing a line of text explains how to access these when viewing a non-noun category.
For the third version I removed all subcats from the tree and it only features noun maincats. It is therefore considerably less extensive, and no doubt less intimidating. In the category pane, the PoS selector is the same as the first version. The full subcat contents as in v2 are displayed for every PoS including nouns. This does make for some very long pages, but does at least mean all parts of speech are handled in the same way.
Marc, Fraser and I met to discuss the HT on Wednesday. It was a very productive meeting and we formed a plan about how to proceed with the revamp of the site. Marc showed us some new versions of the interface he has been working on too. There is going to be a new colour scheme and new fonts will be used too. Following on from the meeting I updated the navigation structure of the HT site, replaced all icons used in the site with Font Awesome icons, added in the facility to reload the ’random category’ that gets displayed on the homepage, moved the ‘quick search’ to the navigation bar of every page and made some other tweaks to the interface.
I spent more time towards the end of the week on the tree browser. I’ve updated the ‘parts of speech’ section so that the current PoS is also included. I’ve also updated the ordering to reflect the order in the printed HT and updated the abbreviations to match these too. Tooltips now give text as found in the HT PDF. The PoS beside the cat number is also now a tooltip. I’ve updated the ‘random category’ to display the correct PoS abbreviation too. I’ve also added in some default text that appears on the ‘browse’ page before you select a category.
- If it’s a subcat we don’t just want to display this, we need to grab its maincat, all of the maincat’s subcats but then ensure the passed subcat is displayed on screen.
- We need to build up the tree hierarchy, which is for nouns, so if the passed catid is not a noun category we need to also then find the appropriate noun category
I have sorted out point 1 now. If you pass a subcat ID to the page the maincat record is loaded and the page scrolls until the subcat is in view. I will also highlight the subcat as well, but haven’t done this yet. I’m still in the middle of addressing the second point. I know where and how to add in the grabbing of the noun category, I just haven’t had the time to do it yet. I also need to properly build up the tree structure and have the relevant parts open. This is still to do as currently only the tree from the maincat downwards is loaded in. It’s potentially going to be rather tricky to get the full tree represented and opened properly so I’ll be focussing on this next week. Also, T7 categories are currently giving an error in the tree. They all appear to have children and when you click on the [+] then an error occurs. I’ll get this fixed next week too. After that I’ll focus on integrating the search facilities with the tree view. Here’s a screenshot of how the tree currently looks:
I was pretty busy with other projects this week as well. I met with Thomas Clancy and Simon Taylor on Tuesday to discuss a new place-names project they are putting together. I will hopefully be able to be involved in this in some capacity, despite it not being based in the School of Critical Studies. I also helped Chris to migrate the SCOTS Corpus websites to a new server. This caused some issues with the PostgreSQL database that took us several hours to get to the bottom of. These were causing the search facilities to be completely broken, but thankfully I figured out what was causing this and by the end of the week the site was sitting on a new server. I also had an AHRC review to undertake this week.
On Friday I met with Marc and the group of people who are working on a new version of the ARIES app. I will be implementing their changes so it was good to speak to them and learn what they intend to do. The timing of this is going to be pretty tight as they want to release a new version by the end of August, so we’ll just need to see how this goes. I also made some updates to the ‘Burns and the Fiddle’ section of the Burns website. It’s looking like this new section will now launch in July.
Finally, I spent several hours on The People’s Voice project, implementing the ‘browse’ functionality for the database of poems. This includes a series of tabs for different ways of browsing the data. E.g. you can browse the titles of poems by initial letter, you can browse a list of authors, years of publication etc. Each list includes the items plus the number of poems that are associated with the item – so for example in the list of archives and libraries you can see that Aberdeen Central Library has 70 associated poems. You can then click on an item and view a list of all of the matching poems. I still need to create the page for actually viewing the poem record. This is pretty much the last thing I need to implement for the public database and all being well I’ll get this finished next Friday.
I wasn’t feeling very well at the start of the week, but instead of going home sick I managed to struggle through by focussing on some fairly unchallenging tasks, namely continuing to migrate the STARN materials to the University’s T4 system. I’m still ploughing through the Walter Scott novels, but I made a bit of progress. I also spent a little more time this week on AHRC duties.
I had a few meetings this week. I met with the Heads of School Administration, Wendy Burt and Nikki Axford on Tuesday to discuss some potential changes to my job, and then had a meeting with Marc on Wednesday to discuss this further. The outcome of these meetings is that actually there won’t be any changes after all, which is disappointing but at least after several months of the possibility hanging there it’s all decided now.
On Tuesday I also had a meeting with Bryony Randall to discuss her current AHRC project about editing modernist texts. I have a few days of effort assigned to this project, to help create a digital edition of a short story by Virginia Woolf and to lead a session on transcribing texts at a workshop in April, so we met to discuss how all this will proceed. We’ve agreed that I will create the digital edition, comprising facsimile images and multiple transcriptions with various features visible or hidden. Users will be able to create their own edition by deciding which features to include or hide, thus making users the editors of their own edition. Bryony is going to make various transcriptions in Word and I am then going to convert this into TEI text. The short story is only 6 pages long so it’s not going to be too onerous a task and it will be good experience to use TEI and Oxygen for a real project. I’ll get started on this next week.
I met with Fraser on Wednesday to discuss the OED updates for the Historical Thesaurus and also to talk about the Hansard texts again. We returned to the visualisations I’d made for the frequency of Thematic headings in the two-year sample of Hansard that I was working with. I should really try to find the time to return to this again as I had made some really good progress with the interface previously. Also this week I arranged to meet with Catriona Macdonald about The People’s Voice project and published this week’s new song of the week on the Burns website (http://burnsc21.glasgow.ac.uk/robert-bruces-address-to-his-army-at-bannockburn/).
Gary had also stated that some ‘or’ searches were not showing multiple icons when different attributes were selected. However, after some investigation I think this may just be because without supplying limits an ‘or’ search for two attributes will often result in the attributes both being present at every location, therefore all markers will be the same. E.g. a search for ‘D3 or A9’. There are definitely some combinations of attributes that do give multiple markers, e.g. ‘Q6 or D32’. And if you supply limits you generally get lots of different icons, e.g. ‘D3, young, 4-5 or A9, old, 4-5’. Gary is going to check this again for any specific examples that don’t seem right.
After that I began to think about the new Atlas search options that Gary would like me to implement, such as being able to search for entire groups of attributes (e.g. an entire parent category) rather than individual ones. At the moment I’m not entirely sure how this should work, specifically how the selected attributes should be joined. For example, if I select the parent ‘AFTER’ with limits ‘old’ and ‘rating 4-5’ would the atlas only then show me those locations where all ‘AFTER’ attributes (D3 and D4) are present with these limits? This would basically be the same as an individual attribute search for D3 and D4 joined by ‘and’. Or would it be an ‘or’ search? I’ve asked Gary for clarification but I haven’t heard back from him yet.
I also made a couple of minor cosmetic changes to the atlas. Attributes within parent categories are now listed alphabetically by their code rather than their name, and selected buttons are now yellow to make it clearer which are selected and to differentiate from the ‘hover over’ purple colour. I then further reworked the ‘Atlas display options’ so that the different search options are now housed in an ‘accordion’. This hopefully helps to declutter the section a little. As well as accordion sections for ‘Questionnaire Locations’ and ‘Attribute Search’ I have added in new sections for ‘Advanced Attribute Search’ and ‘My Map Data’. These don’t have anything useful in them yet but eventually ‘Advanced Attribute Search’ will feature the more expanded options that are available via the ‘consistency data’ view – i.e. options to select groups of attributes and alternative ways to select ratings. ‘My Map Data’ will be where users can upload their own CSV files and possibly access previously uploaded datasets. See the following screenshot for an idea of how the new accordion works.
I also started to think about how to implement the upload and display of a user’s CSV files and realised that a lot of the information about how points are displayed on the map is not included in the CSV file. For example, there’s no indication of the joins between attributes or the limiting factors that were used to generate the data contained in the file. This would mean when uploading the data the system wouldn’t be able to tell whether the points should be displayed as an ‘and’ map or an ‘or’ map. I have therefore updated the ‘download map data’ facility to add in the URL used to generate the file in the first row. This actually serves two useful purposes. Firstly it means on re-uploading the file the system can tell which limits and Boolean joins were used and display an appropriate map and secondly it means there is a record in the CSV file of where the data came from and what it contains. A user would be able to copy the URL into their browser to re-download the same dataset if (for example) they messed up their file. I’ll continue to think about the implementation of the CSV upload facility next week.
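As a rough illustration of the ‘URL in the first row’ idea, here is a minimal Python sketch rather than the actual atlas code. The example URL, its parameter names (‘attributes’, ‘join’) and the column names are hypothetical; the only thing taken from the description above is that the first row of the CSV holds the URL that generated the data, so that an upload can recover the joins and limits from it.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs


def write_map_csv(source_url, header, rows):
    """Write the generating URL as the first row, then the usual data."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([source_url])  # provenance row
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def read_map_csv(text):
    """Read the file back, returning the URL, its query options and the data."""
    reader = csv.reader(io.StringIO(text))
    source_url = next(reader)[0]
    header = next(reader)
    rows = [dict(zip(header, row)) for row in reader]
    # The query string tells the uploader which joins / limits were used.
    params = parse_qs(urlparse(source_url).query)
    return source_url, params, rows


if __name__ == "__main__":
    # Hypothetical URL and columns, for illustration only.
    url = "https://example.org/atlas/data?attributes=D3,A9&join=or"
    text = write_map_csv(url, ["location", "code"], [["Location A", "YN"]])
    print(read_map_csv(text))
```

Because the URL travels with the data, the file documents itself: the same row that lets the system pick an ‘and’ or ‘or’ display also lets a person re-download the dataset.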
I probably spent the best part of a day on administrative tasks this week, including some relating to my role that I can’t really go into details about here. I also arranged to meet with Bryony Randall next week to discuss her text encoding project, and arranged a time for the Arts Developers to meet, which will be the week after next. I spoke with Gerry McKeever about a proposal he’s putting the finishing touches to and I helped Luca with a WordPress issue he wanted my advice with. I also had a chat with Graeme about a new proposal he’s helping to put together. I read through the materials he sent me and gave him some advice about technical approaches and costings. I also arranged a meeting with Catriona Macdonald next month regarding the People’s Voice project and helped Carole with a spam issue with one of her websites. I also spent about a further half a day on AHRC review duties, and will have to continue with this into next week too.
Other than the above I also spent a little bit of time on the Historical Thesaurus OED data import, ticking off another bunch of rows that Fraser had checked and speaking with Fraser about the next steps. I also launched this week’s Burns ‘song of the week’ (see http://burnsc21.glasgow.ac.uk/o-logan-sweetly-didst-thou-glide/). I had a hospital appointment on Friday morning so unfortunately lost a bit of time because of this. The remainder of my week was spent continuing with the migration of the ‘STARN’ resource to T4. As previously mentioned, it’s a pretty tedious task but it will be great to get it all done. I’m now about half-way through the final section: prose. This is a particularly long section, however, containing as it does a bunch of novels by Sir Walter Scott. It’s still going to be quite a while before I can get all of this finished as I am only doing a few pages here and there between other commitments.
My time this week was mostly spent on the SCOSYA Atlas again, continuing to work on the atlas search facilities, and specifically the Boolean ‘or’ search that has been proving rather tricky to get working properly. Last week I reworked my initial version of the ‘or’ search so that it would properly function when the same attribute with different limit options was selected, but during testing I realised that the search was behaving in unexpected ways when more than two attributes (or the same attribute with different limit options) were joined by an ‘or’: Rather than having the expected range of icons representing the different combinations of attributes at each location on the atlas many of the combinations were being given the same icon.
After quite a lot of head scratching I figured out what the problem was. The section of my code that loops through the locations and works out which attributes are present (or not present) at each had some flaws in its logic, which caused combinations that should have had two ‘yeses’ in them to be incorrectly assigned as all ‘nos’. E.g. when three attributes are searched for and a location has the first two but not the third the combination should be ‘YYN’ but instead the code was quitting out with a ‘NNN’. Once the flaw was identified, a bit of tweaking and further testing corrected the issue, and now a much broader selection of icons gets displayed when three or more attributes are joined by an ‘or’, as the screenshot below demonstrates.
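The corrected behaviour can be sketched in a few lines of Python. This is an illustration of the idea rather than the atlas code itself, and the function name is made up: every searched attribute is checked in turn and contributes its own ‘Y’ or ‘N’, so a missing attribute can never wipe out the ones already found.

```python
def combination_code(searched, present):
    """Return one 'Y' or 'N' per searched attribute, in search order.

    `searched` is the ordered list of attribute codes in the 'or' search and
    `present` is the set of codes found at a single location.
    """
    # No early exit: each attribute is tested independently of the others.
    return "".join("Y" if code in present else "N" for code in searched)


if __name__ == "__main__":
    # A location with the first two attributes but not the third.
    print(combination_code(["D3", "A9", "Q6"], {"D3", "A9"}))
```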
The next thing I tackled was the map legend. As you can see in the above screenshot, the legend that’s displayed on an ‘or’ search map bears no relation to the icons found on the map. The legend instead displays the average ratings (between 1 and 5) found at each location, which is appropriate for the ‘and’ search but not an ‘or’ search with all its different icons. Adding the various ‘or’ icons to the legend actually required rather a lot of reworking of the atlas code. The icons on the map are all grouped into layers and then each layer is added to the map. These layers correspond to an item in the legend. So the ‘or’ map was still adding locations to a layer depending on its average rating rather than grouping locations based on which attribute combinations were present or absent. To have a legend that listed all the different icons and allowed the user to switch an icon on or off I needed to ensure that layers were set up for each of these attribute combinations.
It took quite a bit of reworking, but I managed to get the layer code updated so that a new layer was dynamically added for each present combination of attributes. So for example, with two attributes joined with an ‘or’ there could potentially be 4 layers (YN, YY, NY and NN), although in reality there may be fewer than this depending on the data. With this code in place I had a legend that listed the layers with their code combinations and the handy Leaflet checkboxes that allow you to show / hide each layer. This was all working great, but I then had to tackle my next problem: How to get the various shapes and colours of the icons represented in the legend.
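The grouping step can be sketched in Python as follows. This is only an illustration of creating a layer on demand for each combination that actually occurs; the real atlas does this with Leaflet layers in the browser, and the function name and location names here are made up.

```python
from collections import defaultdict


def build_layers(searched, locations):
    """Group location names by their attribute combination code.

    `locations` maps a location name to the set of attribute codes found
    there. A layer only exists if at least one location falls into it, so
    there can be fewer than 2 ** len(searched) layers.
    """
    layers = defaultdict(list)
    for name, present in locations.items():
        code = "".join("Y" if attr in present else "N" for attr in searched)
        layers[code].append(name)
    return dict(layers)


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    data = {
        "Location A": {"D3"},
        "Location B": {"D3", "A9"},
        "Location C": {"D3"},
    }
    print(build_layers(["D3", "A9"], data))
```

Each key of the resulting dictionary would then correspond to one legend entry with its own show / hide checkbox.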
The map icons I use are not actually ‘image’ files like PNGs or anything like that. They are actually markers that are dynamically generated as SVG images using the Leaflet DVF plugin. This works great as it allows me to create polygons or stars that have a random number of lines or points and can be assigned a random colour. However, replicating these dynamic shapes in the legend was not so easy. The Leaflet legend can only work with HTML by default so adding in SVG XML to make the shapes appear was not possible (or at least not possible without a massive amount of work). I spent some time trying to figure out how to get SVG shapes to appear in the legend but didn’t make much progress. I went to the gym at lunchtime and whilst jogging on the treadmill I had a brainwave: Could I not just make PNG versions of each possible shape, leaving the actual shape area transparent and then give the HTML image tag a background colour to match the required random colour?
My ‘random shape and colour’ generator only had a few possible shape options: Polygons with between 3 and 8 sides and stars with between 5 and 10 points. Any possible colour could then be applied to these. I created PNG files for each shape with the shape part transparent and the surrounding square white. I then updated the part of my code that generated the legend content to pull in the correct image (e.g. if the particular randomly generated marker was a polygon shape with three sides then it would reference the image file ‘poly-3.png’) and the randomly generated colour for the marker was then added to the HTML ‘img’ tag via CSS as a background colour. When displayed this then gave the effect of the shape being the background colour, with the white part of the PNG looking just like the white page background. It all worked very well, as the following screenshot demonstrates:
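The legend trick described above can be sketched like this in Python. It is an illustration rather than the atlas code: the function name is made up, and the ‘star-N.png’ file naming is an assumption that simply mirrors the ‘poly-3.png’ example. The shape ranges are the ones given above.

```python
# Allowed sizes per shape: polygons have 3-8 sides, stars have 5-10 points.
SHAPE_RANGES = {"poly": (3, 8), "star": (5, 10)}


def legend_icon_html(shape, n, colour):
    """Build an <img> tag whose background colour shows through the PNG.

    The PNG is white with a transparent shape cut out of it, so the CSS
    background colour is what the viewer sees as the colour of the shape.
    """
    low, high = SHAPE_RANGES[shape]
    if not low <= n <= high:
        raise ValueError(f"{shape} must have between {low} and {high} sides/points")
    return f'<img src="{shape}-{n}.png" style="background-color: {colour}" alt="">'


if __name__ == "__main__":
    print(legend_icon_html("poly", 3, "#ff0000"))
```

The appeal of this approach is that only 12 image files are needed (6 polygons and 6 stars), however many random colours the markers use.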
There is still much work to do with the atlas, though. For a start there is still some weird behaviour when a search combines ‘and’ and ‘or’. I’ll tackle this another week, though.
Other than SCOSYA work I had a chat with Graeme about some Leaflet things he’s beginning to experiment with. I had a chat with Luca about his projects and I also emailed all the other developers in the College of Arts to see if they’d like to meet up some time. I also did some administrative work I can’t really divulge here and replied to a query from Marc about the Hansard data. I made this week’s Burns ‘Song of the Week’ live (http://burnsc21.glasgow.ac.uk/i-love-my-jean/) and helped Carole out with a couple of issues. Other than that I continued to migrate the old STARN resource to the University’s T4 system. It’s pretty tedious work but I’m making some good progress with it. I’ve got through the bulk of the ‘poetry’ section now, at least.
I had a fairly easy first week back after the Christmas holidays as I was only working on the Thursday and Friday. On Thursday I spent some time catching up with emails and other such administrative tasks. I also spent some time preparing for a meeting I had on Friday with Alice Jenkins. She is putting together a proposal for a project that has a rather large and complicated digital component and before the meeting I read through the materials she had sent me and wrote a few pages of notes about how the technical aspects might be tackled. We then had a good meeting on Friday and we will be taking the proposal forward during the New Year, all being well. I can’t say much more about it here at this stage, though.
I spent some further time on Thursday and on Friday updating the content of the rather ancient ‘Learning with the Thesaurus of Old English’ website for Carole Hough. The whole website needs a complete overhaul but its exercises are built around an old version of the thesaurus that forms part of the resource and is quite different in its functionality from the new TOE online resource. So for now Carole just wanted some of the content of the existing website updated and we’ll leave the full redesign for later. This meant going through a list of changes Carole had compiled and making the necessary updates, which took a bit of time but wasn’t particularly challenging to do – so a good way to start back after the hols.
Other than these tasks I spent the remainder of the week going through the old STELLA resource STARN and migrating it to T4. Before Christmas I had completed ‘Criticism and commentary’ and this week I completed ‘Journalism’ and made a start on ‘Language’. However, this latter section actually has a massive amount of content tucked away in subsections and it is going to take rather a long time to get this all moved over. Luckily there’s no rush to get this done and I’ll just keep pegging away at it whenever I have a free moment or two over the next few months.
This is the last working week of 2016 and I worked the full five days. The University was very quiet by Friday. I spent most of the week working on the SCOSYA project, working through my list of outstanding items and meeting a few times with Gary, which resulted in more items being added to the list. At the project meeting a couple of weeks ago Gary had pointed out that some places like Kirkcaldy were not appearing on the atlas and I managed to figure out why this was. There are 6 questionnaires for Kirkcaldy in the system and the ‘Questionnaire Locations’ map splits the locations into different layers based on the number of questionnaires completed. There are only four layers as each location was supposed to have 4 questionnaires. If a location had more than this then there was no layer to add the location to, so it was getting missed off. The same issue applied to ‘Fintry’, ‘Ayr’ and ‘Oxgangs’ too as they all have five questionnaires. Once I identified the problem I updated the atlas so that locations with more than 4 questionnaires do now appear in the location view. These are marked with a black box so you can tell they might need fixing. Thankfully the data for these locations was already being used as normal in the ‘attribute’ atlas searches.
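The fix amounts to giving the over-full locations somewhere to go. A minimal Python sketch of that idea follows; it is not the atlas code, the layer labels and function name are made up, and it assumes every location has at least one questionnaire.

```python
def layer_for(questionnaire_count, expected_max=4):
    """Pick the map layer for a location based on its questionnaire count.

    Counts from 1 to `expected_max` each get their own layer. Anything above
    that goes into an extra catch-all layer instead of being dropped, which
    is what happened before the fix.
    """
    if questionnaire_count > expected_max:
        return f"more than {expected_max}"  # shown with a black box
    return str(questionnaire_count)


if __name__ == "__main__":
    # Counts as given above: Kirkcaldy has 6, Fintry has 5.
    for place, count in (("Kirkcaldy", 6), ("Fintry", 5)):
        print(place, "->", layer_for(count))
```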
With that out of the way I tackled a bigger item on my list: Adding in facilities to allow staff to record information about the usage of codes in the questionnaire transcripts. I created a spreadsheet consisting of all of the codes through which Gary can note whether a code is expected to be identifiable in the transcripts or not and I updated the database and the CMS to add in fields for recording this. I then updated the ‘view questionnaire’ page in the CMS to add in facilities to add / view information about the use of the codes in the transcripts.
Codes that have ‘Y’ or ‘M’ for whether they appear in recordings are highlighted in the ‘view questionnaire’ page with a blue border and the ‘code ratings’ table now has four new columns for number of examples found in the transcript, Examples, whether this matches expectation and transcript notes (there is no data for these columns in the system yet, though). You can add data to these columns by pressing on the ‘edit’ button at the top of the ‘view questionnaire’ page and then finding the highlighted code rows, which will be the only ones that have text boxes and things in the four new columns. Add the required data and press the ‘update’ button and the information will be saved.
After that I started to work on a new ‘interviewed by’ limit for the Atlas that will allow a user to only show data where the interview was conducted by a fieldworker or not. I didn’t get very far with this, however, as Gary instead wanted me to create a new feature that will help him and Jennifer analyse the data. It’s a script that allows the interview data in the database to be exported in CSV format for further analysis in Excel.
It allows you to select an age group, select whether to include spurious data or not and limit the output to particular codes / code parents or view all. Note that ‘view all’ also includes codes that don’t have parents assigned.
The resulting CSV file lists one column per interviewee, arranged alphabetically by location. For each interviewee there are then rows for their age group and location. If you’ve included spurious data a further row gives you a count of the number of spurious ratings for the interviewee.
After these rows there are rows for each code that you’ve asked to include. Codes are listed with their parent and attributes to make it easier to tell what’s what. With ‘all codes’ selected there are a lot of empty rows at the top as codes with no parent are listed first. Note that if you want to exclude codes that don’t have parents in the code selection list simply deselect and reselect the checkbox for parent ‘AFTER’. This means all parents are selected but the ‘All’ box is unselected.
For each code for each interviewee the rating is entered if there was one. If you’ve selected to include spurious data these ratings are marked with an asterisk. Where a code wasn’t present in the interview the cell is left blank.
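To illustrate the layout described above, here is a minimal Python sketch of such an export. It is not the actual script: the dictionary structure, the field names (`name`, `location`, `age_group`, `ratings`, `score`, `spurious`), the header row of interviewee names and the function name `export_ratings` are all assumptions made for the example, and the parent / attribute labelling of codes is left out for brevity.

```python
import csv
import io


def export_ratings(interviewees, codes, include_spurious=False):
    """Return CSV text with one column per interviewee and one row per code."""
    # Columns are arranged alphabetically by location (then by name, an assumption).
    people = sorted(interviewees, key=lambda p: (p["location"], p["name"]))

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["Interviewee"] + [p["name"] for p in people])
    writer.writerow(["Age group"] + [p["age_group"] for p in people])
    writer.writerow(["Location"] + [p["location"] for p in people])

    if include_spurious:
        # Extra row: number of spurious ratings per interviewee.
        counts = [
            sum(1 for r in p["ratings"].values() if r["spurious"]) for p in people
        ]
        writer.writerow(["Spurious ratings"] + counts)

    for code in codes:
        row = [code]
        for p in people:
            rating = p["ratings"].get(code)
            if rating is None:
                row.append("")  # code not present in this interview
            elif rating["spurious"]:
                # Spurious ratings are only output on request, marked with an asterisk.
                row.append(f"{rating['score']}*" if include_spurious else "")
            else:
                row.append(str(rating["score"]))
        writer.writerow(row)

    return out.getvalue()
```

The returned text can be written straight to a `.csv` file and opened in Excel.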
Other than SCOSYA duties I did a bit more Historical Thesaurus work this week, creating the ‘Levenshtein’ script that Fraser wanted, as discussed last week. I started to implement a PHP version of the Levenshtein algorithm on that page I linked to in my previous post but thankfully my text editor highlighted the word ‘Levenshtein’, as it does with existing PHP functions it recognises. Thank goodness it did as it turns out PHP has its own ready-to-use Levenshtein function! See http://php.net/manual/en/function.levenshtein.php
All you have to do is pass it two strings and it spits out a number showing you how similar or different they are. I therefore updated my script to incorporate this as an option. You can specify a threshold level and also state whether you want to view those that are under and equal to the threshold or over the threshold. Add the threshold by adding ‘lev=n’ to the URL (where n is the threshold). By default it will display those categories that are over the threshold but to view those that are under or equal instead then add ‘under=y’ to the URL.
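The threshold logic can be sketched roughly as follows. This is a Python illustration rather than the PHP script itself: the function names `levenshtein` and `filter_pairs` are made up for the example, the compact distance function stands in for PHP's built-in one, and comparing headings case-insensitively is an assumption carried over from the earlier matching. Only the ‘lev’ and ‘under’ URL parameters come from the description above.

```python
from urllib.parse import parse_qs


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def filter_pairs(pairs, query_string):
    """Keep (HT heading, OED heading) pairs according to 'lev' and 'under'."""
    params = parse_qs(query_string)
    if "lev" not in params:
        return list(pairs)  # no threshold given: show everything

    threshold = int(params["lev"][0])
    under = params.get("under", ["n"])[0] == "y"

    result = []
    for ht, oed in pairs:
        distance = levenshtein(ht.lower(), oed.lower())
        # Default: over the threshold; with under=y: under or equal to it.
        keep = distance <= threshold if under else distance > threshold
        if keep:
            result.append((ht, oed))
    return result
```

For example, `filter_pairs(pairs, "lev=3")` keeps only the pairs whose headings are more than three edits apart, while `filter_pairs(pairs, "lev=3&under=y")` keeps the close matches instead.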
The test seems to work rather well when you set the threshold to 3 with punctuation removed and look for everything over that. That gives just 3838 categories that are considered different, compared with the 5770 without the Levenshtein test. Hopefully after Christmas Fraser will be able to put the script to good use.
I spent the remainder of the week continuing to migrate some of the old STELLA resources to the University’s T4 system. I completed the migration of the ‘Bibliography of Scottish Literature’, which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/bibliography-of-scottish-literature/. I then worked through the ‘Analytical index to the publications of the International Phonetic Association 1886-2006’, which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/ipa-index/. I then began working through the STARN resource (see http://www.arts.gla.ac.uk/stella/STARN/index.html) and managed to complete work on the first section (Criticism & Commentary). It’s going to take a long time to get the resource fully migrated over, though, as there’s a lot of content. The migrated site won’t ‘go live’ until all of the content has been moved.
And that’s pretty much it for this week and this year!
I spent a fair amount of this week overhauling some of the STELLA resources and migrating them to the University’s T4 system. This has been pretty tedious and time consuming, but it’s something that will only have to be done once and if I don’t do it no-one else will. I completed the migration of the pages about Jane Stuart-Smith’s ‘Accent Change in Glaswegian’ project (which can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/accent-change-in-glaswegian/). I ran into some issues with linking to images in the T4 media library and had to ask the web team to manually approve some of the images. It would appear that linking to images before they have been approved by the system by guessing what their filename will be somehow causes the system to block the approval of the images, so I’ll need to make sure I’m not being too clever in future. I also worked my way through the old STELLA resource ‘A Bibliography of Scottish Literature’ but I haven’t quite finished this yet. I have one section left to do, so hopefully I’ll be able to make this ‘live’ before Christmas.
Other than the legacy STELLA work I spent some time on another AHRC review that I’d been given, made another few tweaks to Carolyn Jess-Cooke’s project website and had an email conversation with Alice Jenkins about a project she is putting together. I’m going to meet with her in the first week of January to discuss this further. I also had some App management duties to attend to, namely giving some staff in MVLS access to app analytics.
Other than these tasks, I spent some time working on the Historical Thesaurus, as Fraser and I are still trying to figure out the best strategy for incorporating the new data from the OED. I created a new script that attempts to work out which categories in the two datasets match up based on their names. First of all it picks out all of the categories that are nouns that match between HT and OED. ‘Match’ means that our ‘oedmaincat’ field (combined with ‘subcat’ where appropriate) matches the OED’s ‘path’ field (combined with ‘sub’ where appropriate). Our ‘oedmaincat’ field is the ‘v1maincat’ field that has had some additional reworking done to it based on the document of changes Fraser had previously sent to me.
These categories can be split into three groups:
- 1. Ones where the HT and OED headings are identical (case insensitive)
- 2. Ones where the HT and OED headings are not identical (case insensitive)
- 3. Ones where there is no matching OED category for the HT category (likely due to our ‘empty categories’)
For our current purposes we’re most interested in number 2 in this list. I therefore created a version of the script that only displayed these categories, outputting a table containing the columns Fraser had requested. I also put the category heading string that was actually searched for in brackets after the heading as it appears in the database.
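The three-way split can be sketched like this in Python. The data shapes are assumptions for illustration only: each HT category is a dictionary with a `path` (standing in for ‘oedmaincat’ plus ‘subcat’) and a `heading`, and `oed_by_path` maps the OED’s ‘path’ (plus ‘sub’) to the OED heading.

```python
def classify(ht_categories, oed_by_path):
    """Split HT noun categories into the three groups described above."""
    groups = {"identical": [], "different": [], "no_oed_match": []}
    for cat in ht_categories:
        oed_heading = oed_by_path.get(cat["path"])
        if oed_heading is None:
            # Group 3: no OED category with this path.
            groups["no_oed_match"].append(cat)
        elif cat["heading"].lower() == oed_heading.lower():
            # Group 1: headings identical, ignoring case.
            groups["identical"].append(cat)
        else:
            # Group 2: the paths match but the headings differ.
            groups["different"].append(cat)
    return groups
```

The ‘different’ list corresponds to number 2 in the list, the group of most interest here.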
At the bottom of the script I also outputted some statistics: How many noun categories there are in total (124355), how many there are that don’t match (21109) and how many HT noun categories don’t have a corresponding OED category (6334). I also created a version of the script that outputs all categories rather than just number 2 in the list above. And made a further version that strips out punctuation when comparing headings too. This converts dashes to spaces, removes commas, full-stops and apostrophes and replaces a slash with ‘ or ‘. This has rather a good effect on the categories that don’t match, reducing this down to 5770. At least some of these can be ‘fixed’ by further rules – e.g. a bunch starting at ID 40807 that have the format ‘native/inhabitant’ can be matched by ensuring ‘of’ is added after ‘inhabitant’.
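The punctuation rules are simple enough to show as a short Python sketch. The function name `normalise` is hypothetical, and lowercasing and collapsing repeated spaces are extra assumptions on top of the rules listed above (dashes to spaces, commas, full-stops and apostrophes removed, slash replaced with ‘ or ’).

```python
def normalise(heading):
    """Normalise a category heading before comparing HT and OED versions."""
    # Dashes become spaces and a slash becomes ' or '.
    heading = heading.replace("-", " ").replace("/", " or ")
    # Commas, full-stops and apostrophes (straight and curly) are removed.
    for ch in ",.'’":
        heading = heading.replace(ch, "")
    # Assumption: compare case-insensitively and ignore repeated spaces.
    return " ".join(heading.lower().split())
```

With this in place ‘score-book’ and ‘score book’ compare as equal, while ‘native/inhabitant’ becomes ‘native or inhabitant’ and would still need the further ‘of’ rule mentioned above.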
Fraser wanted to run some sort of Levenshtein test on the remaining categories to see which ones are closely matched and which ones are clearly very different. I was looking at this page about Levenshtein tests: http://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Fall2006/Assignments/editdistance/Levenshtein%20Distance.htm which includes a handy algorithm for testing the similarity or difference of two strings. The algorithm isn’t available in PHP, but the Java version looks fairly straightforward to migrate to PHP. The algorithm discussed on this page allows you to compare two strings and to be given a number reflecting how similar or different the strings are, based on how many changes would be required to convert one string into another. E.g. a score of zero means the strings are identical. A score of 2 means two changes would be required to turn the first string into the second one (either changing a character or adding / subtracting a character).
I could incorporate the algorithm on this page into my script, running the 5770 heading pairings through it. We could then set a threshold where we consider the headings to be ‘the same’ or not. E.g. ID 224446 ‘score-book’ and ‘score book’ would give a score of 1 and could therefore be considered ‘the same’, while ID 145656 would give a very high score as the HT heading is ‘as a belief’ while the OED heading is ‘maintains what is disputed or denied’(!).
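For reference, the standard dynamic-programming form of the Levenshtein distance looks roughly like this in Python. It is a generic textbook version written for illustration, not a port of the code on the linked page, and the function names are made up for the example.

```python
def levenshtein_matrix(source, target):
    """Build the full edit-distance table for two strings."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string (or back) costs one edit per character.
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # remove a character
                d[i][j - 1] + 1,         # add a character
                d[i - 1][j - 1] + cost,  # change a character (free if they match)
            )
    return d


def levenshtein(source, target):
    """Number of single-character changes needed to turn source into target."""
    return levenshtein_matrix(source, target)[-1][-1]
```

Here `levenshtein("score-book", "score book")` gives 1, since only the hyphen has to change, whereas the pair for ID 145656 differs in length by 25 characters and so scores at least 25.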
I met with Fraser on Wednesday and we agreed that I would update my script accordingly. I will allow the user (i.e. Fraser) to pass a threshold number to the script that will then display only those categories that are above or below this threshold (depending on what is selected). I’m going to try and complete this next week.
I spent the majority of this week working for the SCOSYA project, in advance of our all-day meeting on Friday. I met with Gary on Monday to discuss some additional changes he wanted made to the ‘consistency data’ view and other parts of the content management system. The biggest update was to add a new search facility to the ‘consistency data’ page that allows you to select whether data is ‘consistent’ or ‘mixed’ based on the distance between the ratings. Previously to work out ‘mixed’ scores you specified which scores were considered ‘low’ and which were considered ‘high’ and everything else was ‘mixed’, but this new way provides a more useful means of grouping the scores. E.g. you can specify that a ‘mixed’ score is anything where the ratings for a location are separated by 3 or more points. So ratings of 1 and 2 are consistent but ratings of 1 and 4 are mixed. In addition users can state whether a pairing of ‘2’ and ‘4’ is always considered ‘mixed’. This is because ‘2’ is generally always a ‘low’ score and ‘4’ is always a ‘high’ score, even though there are only two rating points between the scores.
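The new ‘distance’ rule boils down to something like the following Python sketch. The function and parameter names are hypothetical, and it assumes a non-empty list of numeric ratings for one attribute at one location.

```python
def classify_ratings(ratings, mixed_gap=3, two_four_is_mixed=False):
    """Label one location's ratings for an attribute as 'consistent' or 'mixed'."""
    # Optional override: a '2' alongside a '4' always counts as mixed.
    if two_four_is_mixed and 2 in ratings and 4 in ratings:
        return "mixed"
    # Otherwise compare the gap between the lowest and highest rating.
    spread = max(ratings) - min(ratings)
    return "mixed" if spread >= mixed_gap else "consistent"
```

So with a gap of 3, ratings of 1 and 2 come out as consistent and 1 and 4 as mixed, while 2 and 4 only count as mixed when the override is switched on.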
I also updated the system to allow users to focus on locations and attributes where a specific rating has been given. Users can select a rating (e.g. 2) and the table of results only shows which attributes at each location have one or more rating of 2. The matching cells just say ‘present’ while other attributes at each location have blank cells in the table. Instead of %mixed, %high etc there is %present – the percentage of each location and attribute where this rating is found.
I also added in the option to view all of the ‘score groups’ for ratings – i.e. the percentage of each combination of scores for each attribute. E.g. 10% of the ratings for Attribute A are ‘1 and 2’, 50% are ‘4 and 5’.
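One way to work out such ‘score groups’ is sketched below in Python. The input shape (a dictionary of location to list of ratings for a single attribute) and the function name are assumptions for the example, and it treats each location’s distinct ratings as one combination.

```python
from collections import Counter


def score_groups(ratings_by_location):
    """Percentage of locations showing each combination of ratings for one attribute."""
    # Reduce each location to its sorted set of distinct ratings, e.g. (1, 2).
    counts = Counter(
        tuple(sorted(set(ratings)))
        for ratings in ratings_by_location.values()
        if ratings
    )
    total = sum(counts.values())
    return {
        " and ".join(str(score) for score in combo): 100 * n / total
        for combo, n in counts.items()
    }
```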
With these changes in place I then updated the narrowing of a consistency data search to specific attributes. Previously the search facility allowed staff to select one or more ‘code parents’ to focus on rather than viewing the data for all attributes at once. I’ve now extended this so that users can open up each code parent and select / deselect the individual attributes contained within. This greatly extends the usefulness of the search tool. I also added in another limiting facility, this time allowing the user to select or deselect questionnaires. This can be used to focus on specific locations or to exclude certain questionnaires from a query if these are considered problematic questionnaires.
When I met with Gary on Monday he was keen to have access to the underlying SCOSYA database to maybe try running some queries directly on the SQL himself. We agreed that I would give him an SQL dump of the database and will help him get this set up on his laptop. I realised that we don’t have a document that describes the structure of the project database, which is not very good as without such a document it would be rather difficult for someone else to work with the system. I therefore spent a bit of time creating an entity-relationship diagram showing the structure of the database and writing a document that describes each table, the fields contained in them and the relationships between them. I feel much better knowing this document exists now.
On Friday we had a team meeting, involving the Co-Is for the project: David Adger and Caroline Heycock, in addition to Jennifer and Gary. It was a good meeting, and from a technical point of view it was particularly good to be able to demonstrate the atlas to David and Caroline and receive their feedback on it. For example, it wasn’t clear to either of them whether the ‘select rating’ buttons were selected or deselected, which led to confusing results (e.g. thinking 4-5 was selected but actually having 1-3 selected). This is something I will have to make a lot clearer. We also discussed alternative visualisation styles and the ‘pie chart’ map markers I mentioned in last week’s post. Jennifer thinks these will be just too cluttered on the map so we’re going to have to think of alternative ways of displaying the data – e.g. have a different icon for each combination of selected attribute, or have different layers that allow you to transition between different views of attributes so you can see what changes are introduced.
Other than SCOSYA related activities I completed a number of other tasks this week. I had an email chat with Carole about the Thesaurus of Old English teaching resource. I have now fixed the broken links in the existing version of the resource. However, it looks like there isn’t going to be an updated version any time soon as I pointed out that the resource would have to work with the new TOE website and not the old search options that appear in a frameset in the resource. As the new TOE functions quite differently from the old resource this would mean a complete rewrite of the exercises, which Carole understandably doesn’t want to do. Carole also mentioned that she and others find the new TOE website difficult to use, so we’ll have to see what we can do about that too.
I also spent a bit more time working through the STELLA resources. I spoke to Marc about the changes I’ve been making and we agreed that I should be added to the list of STELLA staff too. I’m going to be ‘STELLA Resources Director’ now, which sounds rather grand. I made a start on migrating the old ‘Bibliography of Scottish Literature’ website to T4 and also Jane’s ‘Accent change in Glaswegian’ resource too. I’ll try and get these completed next week.
I also completed work on the project website for Carolyn Jess-Cooke, and I’m very pleased with how this is looking now. It’s not live yet so I can’t link to it from here at the moment. I also spoke with Fraser about a further script he would like me to write to attempt to match up the historical thesaurus categories and the new data we received from the OED people. I’m going to try to create the script next week and we’re going to meet to discuss it.
I worked on rather a lot of different projects this week. I made some updates to the WordPress site I set up last week for Carolyn Jess-Cooke’s project, such as fixing an issue with the domain’s email forwarding. I replaced the website design of The People’s Voice project website with the new one I was working on last week, and this is now live: http://thepeoplesvoice.glasgow.ac.uk/. I think it looks a lot more visually appealing than the previous design, and I also added in a twitter feed to the right-hand column. I also had a phone conversation with Maria Dick about her research proposal and we have now agreed on the amount of technical effort she should budget for.
I received some further feedback about the Metre app this week from a colleague of Jean Anderson’s who very helpfully took the time to go through the resource. As a result of this feedback I made the following changes to the app:
- I’ve made the ‘Home’ button bigger
- I’ve fixed the erroneous syllable boundary in ‘ivy’
- When you’re viewing the first or last page in a section a ‘home’ button now appears where otherwise there would be a ‘next’ or ‘previous’ button
- I’ve removed the ‘info’ icon from the start of the text.
Jean also tried to find some introductory text about the app but was unable to do so. She’s asked if Marc can supply some, but I would imagine he’s probably too busy to do so. I’ll have to chase this up or maybe write some text myself as it would be good to be able to get the app completed and published soon.
Also this week I had a phone conversation with Sarah Jones of the Digital Curation Centre about some help and documentation I’ve given her about AHRC Technical Plans for a workshop she’s running. I also helped out with two other non-SCS issues that cropped up. Firstly, the domain for TheGlasgowStory (http://theglasgowstory.com/), which was one of the first websites I worked on had expired and the website had therefore disappeared. As it’s been more than 10 years since the project ended no-one was keeping track of the domain subscription, but thankfully after some chasing about we’ve managed to get the domain ownership managed by IT Services and the renewal fee has now been paid. Secondly, IT Services were wanting to delete a database that belonged to Archive Services (who I used to work for) and I had to check on the status of this.
I also spent a little bit of time this week creating a few mock-up logos / banners for the Survey of Scottish Place-Names, which I’m probably going to be involved with in some capacity and I spoke to Carole about the redevelopment of the Thesaurus of Old English Teaching Package.
Also this week I finally got round to completing training in the University Web Site’s content management system, T4. After completing training I was given access to the STELLA pages within T4 and I’ve started to rework these. I went through the outdated list of links on the old STELLA site and have checked each one, updating URLs or removing links entirely where necessary. I’ve added in a few new ones too (e.g. to the BYU Corpus in the ‘Corpora’ section). This updated content now appears on the ‘English & Scots Links’ page in the University website (http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/englishscotslinks/).
I also moved ‘Staff’ from its own page into a section on the STELLA T4 page to reduce left-hand navigation menu clutter. For the same reason I’ve removed the left-hand links to SCOTS, STARN, The Glasgow Review and the Bibliography of Scottish Literature, as these are all linked to elsewhere. I then renamed the ‘Teaching Packages’ page ‘Projects’ and have updated the content to provide direct links to the redeveloped resources first of all, and then links to all other ‘legacy’ resources, thus removing the need for the separate ‘STELLA digital resources’ page. See here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/. I updated the links to STELLA from my redeveloped resources so they go to this page now too. With all of this done I decided to migrate the old ‘The Glasgow Review’ collection of papers to T4. This was a long and tedious process, but it was satisfying to get it done. The resource can now be found here: http://www.gla.ac.uk/schools/critical/aboutus/resources/stella/projects/glasgowreview/
In addition to the above I also worked on the SCOSYA project, looking into alternative map marker possibilities, specifically how we can show information about multiple attributes through markers on the map. At our team meeting last week I mentioned the possibility of colour coding the selected attributes and representing them on the map using pie charts rather than circles for each map point and this week I found a library that will allow us to do such a thing, and also another form of marker called a ‘coxcomb chart’. I created a test version with both forms, that you can see below:
Note that the map is dark because that’s how the library’s default base map looks. Our map wouldn’t look like this. The library is pretty extensive and has other marker types available too, as you can see from this example page: http://humangeo.github.io/leaflet-dvf/examples/html/markers.html.
So in the above example, there are four attributes selected, and these are displayed for four locations. The coxcomb chart splits the circle into the number of attributes and then the depth of each segment reflects the average score for each attribute. E.g. looking at ‘Arbroath’ you can see at a glance that the ‘red’ attribute has a much higher average score than the ‘green’ attribute, while the marker for ‘Airdrie’ (to the east of Glasgow) has an empty segment where ‘pink’ should be, indicating that this attribute is not present at this location.
The two pie chart examples (Barrhead and Dumbarton) are each handled differently. For Barrhead the average score of each attribute is not taken into consideration at all. The ‘pie’ simply shows which attributes are present. All four are present so the ‘pie’ is split into quarters. If one attribute wasn’t found at this location then it would be omitted and the ‘pie’ would be split into thirds. For Dumbarton average scores are taken into consideration, which changes the size of each segment. You can see that the ‘pink’ attribute has a higher average than the ‘red’ one. However, I think this layout is rather confusing as at a glance it seems to suggest that there are more of the ‘pink’ attribute, rather than the average being higher. It’s probably best not to go with this one.
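The difference between the two pie approaches comes down to how the segment angles are calculated, which can be sketched in Python as below. This only illustrates the arithmetic and does not use the mapping library’s API; the function name and the convention of using `None` for an attribute that is absent at a location are assumptions, as is the expectation that average scores are always above zero.

```python
def pie_segments(averages, weighted=False):
    """Return {attribute: segment angle in degrees} for one map marker.

    `averages` maps each selected attribute to its average score at the
    location, or None where the attribute was not found there.
    """
    present = {attr: avg for attr, avg in averages.items() if avg is not None}
    if not present:
        return {}
    if weighted:
        # 'Dumbarton' style: segment size is proportional to the average score.
        total = sum(present.values())
        return {attr: 360 * avg / total for attr, avg in present.items()}
    # 'Barrhead' style: equal segments for every attribute that is present.
    return {attr: 360 / len(present) for attr in present}
```

With four attributes selected and one missing, the unweighted version gives three 120 degree segments, matching the ‘split into thirds’ case described above.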
Both the pies and coxcombs are a fixed size no matter what the zoom level, so when you zoom far out they stay big rather than getting too small to make out. On one hand this is good as it addresses a concern Gary raised about not being able to make out the circles when zoomed out. However, when zoomed out the map is potentially going to get very cluttered, which will introduce new problems. Towards the end of the week I heard back from Gary and Jennifer, and they wanted to meet with me to discuss the possibilities before I proceeded any further with this. We have an all-day team meeting planned for next Friday, which seems like a good opportunity to discuss this.