
Category: Helsinki Corpus
Week Beginning 26th October 2015
I returned to a more normal working week this week, after having spent the previous one at a conference and the one before that on holiday. I probably spent about a day catching up with emails, submitting my expenses claim and writing last week’s rather extensive conference report / blog post. I also decided it was about time that I gathered all of my outstanding tasks together into one long ‘to do’ list as I seem to have a lot going on at the moment. The list currently has 47 items on it split across more than 12 different projects, not including other projects that will be starting up in the next month or two. It is good to have everything written down in one place so I don’t forget anything. I also had some AHRC review duties to perform this week, which took up some further time.
With these tasks out of the way I could get stuck into working on some of my outstanding projects again. I met with Hannah Tweed on Tuesday to go through the Medical Humanities Network website with her. She had begun to populate the content management system with projects and people and had encountered a few bugs and areas of confusion, so we went through the system and I made a note of things that needed fixing. These were all thankfully small issues and all easily fixable, such as suppressing the display of fields when the information isn’t available, and it was good to get things working properly. I also returned to the SciFiMedHums bibliographical database. I updated the layout of the ‘associated information’ section of the ‘view item’ page to make it look nicer and I created the ‘advanced search’ form, which enables users to search for things like themes, mediums, dates, people and places. I also reworked the search results page to add in pagination, with results currently getting split over multiple pages when more than 10 items are returned. I’ve pretty much finished all I can do on this project now until I get some feedback from Gavin. I also helped Zanne to get some videos reformatted and uploaded to the Academic Publishing website, which will probably be my final task for this project.
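As a rough illustration of the pagination described above, here is a minimal sketch of splitting a result list into pages of 10. It is written in Python purely for illustration; the language the site itself uses is not stated here, and the function name `paginate` and the constant `PAGE_SIZE` are hypothetical.

```python
import math

PAGE_SIZE = 10  # assumed from the "more than 10 items" threshold mentioned above


def paginate(results, page, page_size=PAGE_SIZE):
    """Return (items for the requested page, total number of pages).

    Page numbers start at 1; out-of-range page numbers are clamped
    to the nearest valid page rather than raising an error.
    """
    total_pages = max(1, math.ceil(len(results) / page_size))
    page = min(max(page, 1), total_pages)
    start = (page - 1) * page_size
    return results[start:start + page_size], total_pages


# Example: 25 hypothetical search results spread over three pages.
items = list(range(25))
last_page_items, page_count = paginate(items, 3)
```

Clamping the page number is a design choice for the sketch: it means a hand-edited URL with a silly page number still shows something sensible instead of an empty page.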
Wendy contacted me this week to say that she’d spotted some slightly odd behaviour with the Scots Corpus website. The advanced search was saying that there were 1317 documents in the system but a search returning all of them was saying that it matched 99.92% of the corpus. The regular search stated that there were 1316 documents. We figured out that this was being caused by a request we had earlier this year to remove a document from the corpus. I had figured out a way to delete it but evidently there was some data somewhere that hadn’t been successfully updated. I managed to track this down: it turned out that the number of documents and the total number of words were being stored statically in a database table, and the advanced search was referencing this. Having discovered this I updated the static table and everything was sorted. Wendy also asked me about further updates to the Corpus that she would like to see in place before a new edition of a book goes to the printers in January. We agreed that it would be good to rework the advanced search criteria selection as the options are just too confusing as they stand. There is also a slight issue with the concordance ordering that I need to get sorted too.
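The 99.92% figure is exactly what you get when the 1316 documents actually in the corpus are divided by the stale total of 1317. A minimal sketch of the arithmetic, in Python for illustration only (the function and variable names are hypothetical and this is not the site's actual code):

```python
def match_percentage(matched_docs, total_docs):
    """Return the share of the corpus matched, as a percentage to 2 decimal places."""
    return round(100 * matched_docs / total_docs, 2)


stale_total = 1317  # figure held in the static table before it was updated
live_total = 1316   # documents actually in the corpus after the deletion

# A search matching every remaining document, measured against each total.
stale_result = match_percentage(live_total, stale_total)
fixed_result = match_percentage(live_total, live_total)
```

With the stale total the result is 99.92, and with the corrected total it is 100.0, which matches the behaviour before and after the static table was updated.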
At the conference last week Marc, Fraser and I met with Terttu Nevalainen and Matti Rissanen to discuss Glasgow hosting the Helsinki Corpus, which is currently only available on CD. This week I spent some time looking through the source code and getting a bit of server space set aside for hosting the resource. The scripts that power the corpus are Python-based and I’ve not had a massive amount of experience with Python, but looking through the source code it all seemed fairly easy to understand. I managed to get the necessary scripts and the data (mostly XML and some plain text) uploaded to the server and the scripts executing. The only change I have made to the code so far is to remove the ‘Exit’ tab as this is no longer applicable. We will need to update some of the site text and also add in a ‘hosted by Glasgow’ link somewhere. The corpus now seems to work online in the same way as it does on the CD, which is great. The only problem is the speed of the search facilities. The search is very slow, and can take up to 30 seconds to run. Without delving into the code I can’t say why this is the case, but I would suspect it is because the script has to run through every XML file in the system each time the search runs. There doesn’t appear to be any caching or indexing of the data (e.g. using an XML database) and I would imagine that without using such facilities we won’t be able to do much to improve the speed. The test site isn’t publicly accessible yet as I need to speak to Marc about it before we take things further.
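To illustrate the kind of indexing that might help, here is a minimal sketch of a simple inverted index: each XML document is read once, and a lookup table from word to document names is built so that later searches don't have to re-read every file. This is not the Helsinki Corpus code, which I haven't examined in detail; the sample documents, their markup and all of the function names are hypothetical, and it only handles single-word lookups.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-ins for corpus files; the real data and markup will differ.
SAMPLE_DOCS = {
    "doc1.xml": "<text><p>The king rode to the town</p></text>",
    "doc2.xml": "<text><p>A town by the sea</p></text>",
}


def tokens(xml_string):
    """Extract lower-cased word tokens from the text content of an XML string."""
    root = ET.fromstring(xml_string)
    text = " ".join(root.itertext())
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Parse every document once and map each word to the documents containing it."""
    index = defaultdict(set)
    for name, xml_string in docs.items():
        for word in tokens(xml_string):
            index[word].add(name)
    return index


def search(index, word):
    """Look a word up in the prebuilt index instead of re-reading every file."""
    return sorted(index.get(word.lower(), set()))


# Built once (e.g. at start-up or saved to disk), then reused for every search.
INDEX = build_index(SAMPLE_DOCS)
town_docs = search(INDEX, "town")
```

The trade-off is that the index has to be rebuilt whenever the data changes, but for a corpus that is effectively fixed that would be a one-off cost. An XML database would be the more complete version of the same idea.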