As with last week, I spent most of this week working on the Historical Thesaurus redevelopment. The focus this week was on the search options, firstly generating scripts that would be able to extract all individual word variants and store these as separate entries in a database for search purposes, and secondly working on the search front end.
In addition to extracting forms separated by a slash, the script also looks for brackets and generates versions of words with and without the bracketed letters – so, for example, hono(u)r results in two variants: honour and honor. This allows exact words to be matched and also supports wildcard searches. The script works well in most instances, but there are some situations where the way the information has been stored makes automated extraction difficult, for example ‘weather-gleam/glim’, ‘champian/-ion’ and ‘(ge)hawian (on/to)’. In these cases the full version of the word or phrase is not repeated after the slash, and it would be very difficult to establish rules to determine what the script should do with the part after the slash. Christian, Marc and I met on Thursday to discuss what might be done about this, including using a list of ‘stop words’ that the search script would ignore (e.g. prepositions). I will also look into situations where hyphens appear after a slash to see if there is a way to automate how these words are handled. It looks like at least some manual editing of words will be required at some point, however.
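As a rough illustration of the extraction logic described above (this is a hypothetical sketch, not the actual script), the straightforward cases can be handled by splitting on the slash and then expanding each bracketed group as optional. Forms like ‘weather-gleam/glim’, where the full word is not repeated after the slash, would come out wrong from this approach, which is exactly why they need special handling:

```python
import itertools
import re

def bracket_variants(form):
    """Expand optional bracketed letters: 'hono(u)r' -> ['honour', 'honor'].

    Each '(...)' group is treated as optional, so a form with n groups
    yields up to 2**n variants. (A sketch of the rule, not the HT script.)
    """
    parts = re.split(r'\(([^)]*)\)', form)
    # re.split with a capturing group alternates: literal, bracketed, literal, ...
    options = []
    for i, part in enumerate(parts):
        if i % 2 == 1:          # bracketed content: include it or omit it
            options.append([part, ''])
        else:                   # fixed text: always included
            options.append([part])
    return [''.join(combo) for combo in itertools.product(*options)]

def extract_variants(entry):
    """Split slash-separated alternatives, then expand brackets in each.

    Only correct when each alternative is a full form; partial forms
    such as 'weather-gleam/glim' need manual treatment.
    """
    variants = []
    for alt in entry.split('/'):
        variants.extend(bracket_variants(alt.strip()))
    return variants

print(extract_variants('hono(u)r'))   # ['honour', 'honor']
print(extract_variants('grey/gray'))  # ['grey', 'gray']
```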
During the week I ran my script to generate search terms, resulting in 855,810 forms. The majority of these will have been extracted successfully, and I estimate that perhaps 3,000–4,000 words might need to be manually fixed at some point. However, even for these words it is likely that a wildcard search would still successfully retrieve the word in question.
I spent most of my remaining time on HT matters working on the category selection page and the quick search. I have now managed to get a quick search up and running that searches words and category headings and uses asterisks for wildcards at the beginning and end of a search term. The quick search leads to the category selection page, which pulls out all matching categories and lexemes. It creates a ‘recommended’ section, which includes lexemes where the search term appears in both the lexeme and the category heading, and a long list of all other returned hits underneath. I have also added pagination for the results. Marc and Christian would like the results list to be split into sections: hits where the search term appears in the lexeme, and then hits where it appears only in the category. I will do this next week. The search is still a bit slow at the moment and I’ll need to look into optimising it soon, either by adding more indexes or by generating cached versions of search results.
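The wildcard handling described above amounts to translating leading and trailing asterisks into SQL LIKE wildcards before querying. A minimal sketch of that translation step (the function name is my own, and it assumes the term is then passed to a parameterised LIKE query, not the actual HT search code):

```python
def wildcard_to_like(term):
    """Translate quick-search syntax (leading/trailing '*') into a SQL
    LIKE pattern, escaping LIKE's own metacharacters in the term itself.

    Hypothetical sketch: the resulting pattern would be bound as a
    parameter to a query such as
    "SELECT ... FROM lexemes WHERE word LIKE ? ESCAPE '\\\\'".
    """
    prefix = term.startswith('*')
    suffix = term.endswith('*')
    core = term.strip('*')
    # Escape % and _ so user input cannot act as unintended wildcards.
    core = core.replace('\\', '\\\\').replace('%', r'\%').replace('_', r'\_')
    return ('%' if prefix else '') + core + ('%' if suffix else '')

print(wildcard_to_like('*ness'))  # '%ness'
print(wildcard_to_like('hon*'))   # 'hon%'
print(wildcard_to_like('honor'))  # 'honor'
```

A pattern with a leading `%` cannot use an ordinary B-tree index on the word column, which is one reason a search like this can be slow and why cached results or additional indexing strategies are worth considering.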
In addition to this I responded to a query about developing a project website that was sent to me by Charlotte Methuen in Theology, and I provided some advice to someone in another part of the university who wanted to develop a Google Maps style interface similar to the one I made for Archive Services. I also made some further updates to the ICOS 2014 website, adding in the banner and logo images and making a few other visual tweaks. My input into this website is now pretty much complete. I also arranged to meet Jean to discuss finalising the Digital Humanities Network website, and I signed up as a ‘five minute speaker’ for the Digital Humanities website launch. I’ll be talking about the redevelopment of the STELLA Teaching resources.