I got stuck into the re-importing of the Historical Thesaurus data this week, a massive undertaking with lots of different steps involved. Thankfully the last time I performed this task I documented the steps that were involved, which made things easier, although a number of these steps needed to be updated. We’d previously noticed a strange situation whereby Old English words that had initial ashes and thorns were losing these characters somewhere between the export from Access and the import into MySQL. I managed to track down what was causing this, which turned out to be a bug in PHP itself (see https://bugs.php.net/bug.php?id=55507). The function used to process CSV files doesn’t like unusual characters at the beginning of fields, and these characters simply vanish. This was a bit of a problem as I was relying on this function to get access to the data. I fixed things with a simple hack – I added some extra characters to the beginning of the necessary fields, then stripped these out again after PHP had done its processing.
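The workaround boils down to ‘pad, parse, strip’. A minimal sketch of the pattern, in Python rather than PHP purely for illustration (the sentinel string and the sample field values are invented):

```python
import csv
import io

SENTINEL = "##"  # padding added before export so no field starts with an ash or thorn

def parse_padded_csv(text):
    """Parse CSV text whose fields were prefixed with a sentinel,
    stripping the sentinel back off after the CSV parser has run."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        rows.append([field[len(SENTINEL):] if field.startswith(SENTINEL) else field
                     for field in row])
    return rows

padded = "##æsc,##a word form\n##þing,##another form\n"
print(parse_padded_csv(padded))
# [['æsc', 'a word form'], ['þing', 'another form']]
```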
I also tackled and completed most of the other data-related tasks this week, such as stripping out the initial full stops from the subcategory names, ensuring search term variants were generated with and without apostrophes, and ensuring variants were generated for words with hyphens (additional forms were created with a space instead of the hyphen and also with the hyphen removed entirely – e.g. fire-place, fire place, fireplace).
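The variant generation is simple enough to sketch. Something like the following (Python for illustration; the real scripts run in PHP as part of the import):

```python
def hyphen_variants(word):
    """Generate search variants for a hyphenated word: the original,
    a spaced form, and a form with the hyphen removed entirely."""
    if "-" not in word:
        return [word]
    return [word, word.replace("-", " "), word.replace("-", "")]

def apostrophe_variants(word):
    """Generate variants with and without apostrophes."""
    if "'" not in word:
        return [word]
    return [word, word.replace("'", "")]

print(hyphen_variants("fire-place"))
# ['fire-place', 'fire place', 'fireplace']
```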
I also managed to get a bit of XSLT working with the XML representation of the HT data in order to extract an up-to-date list of categories that have no words. These are not present in the Access database but are needed for the website in order to allow users to browse up and down the hierarchy. The XSLT worked well and I managed to get my hands on the more than 10,000 empty categories that exist. Running these through a little PHP script resulted in the categories getting added to the database.
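The same ‘find categories with no words’ check can be sketched with Python’s standard XML tools – the element and attribute names below (‘category’, ‘word’, ‘id’) are invented for illustration and the real HT XML schema will differ:

```python
import xml.etree.ElementTree as ET

xml_data = """
<thesaurus>
  <category id="01.01"><word>tree</word></category>
  <category id="01.02"/>
  <category id="01.03"><word>river</word></category>
</thesaurus>
"""

root = ET.fromstring(xml_data)
# A category is 'empty' if it contains no word elements
empty = [c.get("id") for c in root.findall("category")
         if c.find("word") is None]
print(empty)  # ['01.02']
```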
I also did quite a bit of further work with search term variants. I ensured the full word appears in the search term table, in addition to any forms split off from this. I also completely reworked the scripts that deal with brackets so that every permutation of bracketed letters gets saved as a search term – not just ‘all bracketed letters’ and ‘no bracketed letters’. It was pretty tricky to develop this script but it appears to work well. I also reworked the script that deals with hyphens in words and managed to automatically build up full versions of words where only partial sections were located – e.g. wood-sear/-seer/-sere is split up to give three variants: wood-sear, wood-seer and wood-sere. There are still just over 300 hyphen words that will need to be manually fixed, but that’s not too bad considering there are almost 100,000 hyphenated words in the system.
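Both reworked scripts can be roughly sketched as follows (Python for illustration; the entry format for the partial-hyphen forms is simplified from what is actually in the data):

```python
import itertools
import re

def bracket_variants(word):
    """Expand every permutation of bracketed letters, e.g.
    'a(b)c(d)' -> abcd, abc, acd, ac -- not just all-in / all-out."""
    parts = re.split(r"\((.*?)\)", word)
    # parts alternates: fixed text at even indices, bracketed groups at odd ones
    groups = parts[1::2]
    variants = []
    for keep in itertools.product([True, False], repeat=len(groups)):
        out, g = [], 0
        for i, part in enumerate(parts):
            if i % 2 == 0:
                out.append(part)
            else:
                if keep[g]:
                    out.append(part)
                g += 1
        variants.append("".join(out))
    return variants

def expand_partial_hyphens(entry):
    """Expand forms like 'wood-sear/-seer/-sere' into full words,
    reusing the stem of the first full form for each '-xxx' fragment."""
    forms = entry.split("/")
    stem = forms[0].rsplit("-", 1)[0]      # e.g. 'wood'
    out = [forms[0]]
    for f in forms[1:]:
        out.append(stem + f if f.startswith("-") else f)
    return out

print(sorted(bracket_variants("colo(u)r")))      # ['color', 'colour']
print(expand_partial_hyphens("wood-sear/-seer/-sere"))
# ['wood-sear', 'wood-seer', 'wood-sere']
```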
I also began looking into the stop words and I created a script that can process these – removing words such as ‘the’ and storing a new variant without it. However, I need feedback from Marc and Christian as to how this should work before I execute it. I also looked into updating the date fields, but again I need to speak to Marc about these before I proceed. I did make a start updating the layout of the date search boxes, but I haven’t finished this yet.
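The stop-word script follows the same shape as the other variant generators – something like this sketch, where the stop-word list itself is an assumption pending the feedback from Marc and Christian:

```python
STOP_WORDS = ("the", "a", "an")  # assumed list; the real one needs sign-off

def strip_leading_stopword(term):
    """Return an extra search variant with a leading stop word removed,
    or None if the term doesn't start with one."""
    first, _, rest = term.partition(" ")
    if first.lower() in STOP_WORDS and rest:
        return rest
    return None

print(strip_leading_stopword("the morning star"))  # 'morning star'
print(strip_leading_stopword("moonrise"))          # None
```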
Some non-HT things I did this week included attending a Burns seminar, which was hugely interesting, very enjoyable and a good opportunity to catch up with the Burns people. I also had a meeting with Charlotte Methuen about a possible project she is putting together. I had an email conversation with Nigel Leask about a project he is developing and I made a couple of minor tweaks to the ICOS 2014 website for Daria too.
It was a busy old week this week, starting on Monday with the Digital Humanities Network launch event. The day-long event was a huge success and everything went very smoothly. There were two half-hour talks in the morning, given by Marilyn Deegan and Bill Kretzschmar, each of which was very interesting and very different; the first was a general talk about Digital Humanities while the second looked in more detail at a specific digital humanities project. The demo of the Digital Humanities Network website (http://www.digital-humanities.glasgow.ac.uk/) went well, and I think it’s looking pretty impressive now. In the afternoon there was a series of five-minute talks about a wide range of Digital Humanities projects within the University, and the format of these talks worked very well indeed – each speaker kept within their allotted time and it was a great way to learn more about the projects without getting too bogged down in detailed information about them. I did a five-minute talk about the redevelopment of the STELLA learning and teaching applications, which went fine too.
On Tuesday there was a DROG meeting – the first in fact since February! As with previous meetings it was a good opportunity to discuss the work I have been undertaking, plus upcoming projects and priorities. It looks very much like it will be possible for me to redevelop the SCOTS corpus website at some point, which will be a nice big project to sink my teeth into. Dave Beavan, who was responsible for developing the SCOTS website, was at the DHN event on Monday and I had a brief chat with him about the site and the possibility of it being redeveloped, which was useful.
On Thursday I had arranged to meet the Scottish Language Dictionaries people in Edinburgh to discuss the redevelopment of the Dictionary of the Scots Language website. Ann Ferguson had previously sent me through some requirements documents and Word-based mock-ups of the desired user interface elements, plus access to two test versions of the website, so I spent a good deal of the remainder of Tuesday and also the Wednesday going through all of these and preparing a document of discussion points for the meeting. The meeting itself went very well. All of my questions were answered and it was great to meet Peter Bell, the developer of the test versions and the guy who is going to develop the API for the new version of the website. The meeting lasted almost 3 hours but I think we all had a clearer idea of what would be happening with the redevelopment of the website afterwards. It was agreed that I would design the front end and Peter would develop the API. I will get some server space set up in Glasgow and will then develop some static HTML mock-ups of possible interfaces for discussion and further refinement.
On Friday I wrote up my notes from the meeting and did some investigation about server possibilities. I spent the rest of the day further tweaking the HT website, and next week I will settle down to the big task of re-importing all of the data from Access to MySQL. It’s going to take quite a while and possibly be quite tricky to get everything working as it should do.
This week I continued to work through my ‘to do’ list for the Historical Thesaurus website. The biggest thing I tackled was to make subcategories properly hierarchical within the category page. I had previously implemented a nice little indented list of subcategories, with a greater indentation representing a lower-level subcategory, and this appears (when the user asks for it) on the category page when viewing a main category that has one or more subcategories. But if the user clicked to view a subcategory that contained lower-level subcategories, or was itself a child of a higher-level subcategory, none of this information was represented – instead all subcategories were flattened and treated as if they were one level down from the main category. After quite a bit of reworking of the category page I managed to sort this. Now when viewing a subcategory any parent subcategories appear in the ‘Up hierarchy to’ bar (with a darker background colour to differentiate these from main categories) and any child categories appear in the subcategory section, as happens with a main category. An example of this can be seen here:
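The fix essentially involves walking back up the parent links rather than assuming everything sits one level below the main category. A hypothetical sketch of building the ‘Up hierarchy to’ trail (the field names and sample data are invented, not the actual HT database schema):

```python
# Each category records its parent; the trail is built by walking
# parent links until we reach the main category at the top.
categories = {
    1: {"name": "Main category", "parent": None},
    2: {"name": "Subcategory", "parent": 1},
    3: {"name": "Sub-subcategory", "parent": 2},
}

def up_hierarchy(cat_id):
    """Return the names of all ancestors of cat_id, topmost first."""
    trail = []
    parent = categories[cat_id]["parent"]
    while parent is not None:
        trail.append(categories[parent]["name"])
        parent = categories[parent]["parent"]
    return list(reversed(trail))

print(up_hierarchy(3))  # ['Main category', 'Subcategory']
```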
I made some further updates to the user interface as well this week, including adding in an option to select all verb forms to the Part of Speech selection section of the Advanced Search page, and a facility to allow users to easily enter ashes, thorns and yoghs into the word search box by simply clicking on a button. I also looked some more into optimising the advanced search and have succeeded in speeding up the search algorithm significantly in most cases.
I also spent quite a bit of time this week finalising the Digital Humanities website before next Monday’s launch. The university web design was altered slightly this week so I had to update the DHN template to accurately reflect this. I also made some further tweaks to the layout and added some further content. I also had to prepare my 5 minute talk on the redevelopment of the STELLA teaching applications, which I will give at the event, and I met with Jeremy to discuss how we will jointly demo the content of the website. Hopefully all will go smoothly.
I met with Jean on Tuesday to discuss the Digital Humanities event and to finalise a few details. I also had a phone conversation with Ann Ferguson at Scottish Dictionaries about the redevelopment of the Dictionary of the Scots Language, which I will be involved with in the coming months. We will be having a meeting in Edinburgh to discuss things further next Thursday, and Ann sent me some mock-ups and requirements information about the proposed redesign, which was very useful.
I also had a chat with Pauline MacKay about the Burns timeline that we were hoping to publish this month. It would appear that there have been some delays with the Prose Volume and it now looks like the timeline won’t need to go live until January instead, but we had a useful chat about some of the other outstanding issues, including maps and the podcasts that need to go live next month.