Monday was a holiday this week, so I plunged back into the redevelopment of the Historical Thesaurus website on Tuesday, after being ill and then working on other things last week. The big thing I tackled this week was the implementation of the Advanced Search. This is now fully operational, although it is a bit slow when category information is added to the search criteria – I’ll need to look into this further. But it does now work – you can search for words, parts of speech, labels, categories and dates (and any combination of these). I have updated the layout of the search page, adding jQuery UI tabs to split up the quick, advanced and ‘jump to category’ searches and adding help text as hover-overs. I may have to look at alternatives to this, though, as hover-overs don’t work on touchscreens. I also tweaked the ‘jump to category’ page so that the ‘t’ boxes automatically focus the next field when two characters are entered in a box, which vastly speeds up the entering of information. I’ve also made the search form ‘remember’ what a user has searched for, enabling search term refinement, and I made sure that the correct tab loads when following links from the ‘category selection’ page, so, for example, if you’ve done an advanced search you don’t end up back at the quick search.
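The auto-advance behaviour for the ‘t’ boxes could be sketched roughly as below. This is illustrative only – the class names and markup are assumptions, not the actual HT source.

```javascript
// Decide whether a tier box is full and focus should move on.
// The HT 't' boxes hold two characters each.
function isBoxFull(value, maxLen) {
  return value.length >= (maxLen || 2);
}

// jQuery wiring (assumed markup: a row of <input class="t-box" maxlength="2">):
// $('.t-box').on('input', function () {
//   if (isBoxFull(this.value, 2)) {
//     $(this).nextAll('.t-box').first().trigger('focus');
//   }
// });
```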
I spent quite a bit of time this week working on improvements to the user interface of the website, which has been enjoyable. I’ve updated the homepage so there are now three blocks of content – introductory text, the quick search and a random category. The random category feature pulls back a category from the database that has at least one word in it each time the page loads, displaying up to 10 words from the category. It’s quite a nice feature, and a good way to jump straight into the category browse pages. Also this week I created a new script that generates the full HT category hierarchy from a given point in the system. Simply pass a category ID to the function and it returns an array of all parent categories. This is a very useful piece of code, and I’ve added it to both the random category feature and the category page, allowing users to jump straight to any point higher up in the HT hierarchy.
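The hierarchy function works along these lines – given a category ID, walk the parent links upwards and collect the ancestors. A minimal sketch, assuming a flat id-to-record map with a `parentId` field (the field names are invented, not the HT schema):

```javascript
// Return the ancestors of a category, ordered from the top of the
// hierarchy down to the immediate parent.
function getAncestors(categories, id) {
  var chain = [];
  var current = categories[id];
  var parentId = current ? current.parentId : null;
  while (parentId !== null && categories[parentId]) {
    chain.unshift(categories[parentId]); // prepend so the root ends up first
    parentId = categories[parentId].parentId;
  }
  return chain;
}
```

The category page can then render the returned array as a breadcrumb trail, letting users jump to any point higher up.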
I also implemented some major updates to the category pages. I completely overhauled the way subcategories are displayed. Previously they were just displayed in a long line with no indication as to which subcategories were direct children of the main category and which were actually subcategories of other subcategories. This has now been rectified using indentation and changes in shade. Top-level subcategories have no indent and a white background, level two subcategories are indented and have a slightly darker background, level three more so, and so on. I think it works pretty well. It can take up a lot of space for categories that have many subcategories, though, and for this reason I’ve used a bit of jQuery to hide the list until a button is clicked. I’ve also updated the links back to the main category from a subcategory so that the user is taken back to the open list of subcategories when following this navigation path, rather than being taken back to the top of the main category page and then having to scroll down to the list of subcategories and open it again.
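The indent-and-shade scheme boils down to deriving a style from the subcategory’s depth. A sketch of the idea – the pixel and colour values here are invented for illustration, not the actual stylesheet:

```javascript
// Compute indentation and background shade for a subcategory at a
// given depth: each level is nudged right and made slightly darker.
function subcatStyle(level) {
  var indent = level * 20;                      // assumed 20px per level
  var grey = Math.max(255 - level * 10, 215);   // darken per level, with a floor
  return {
    paddingLeft: indent + 'px',
    background: 'rgb(' + grey + ',' + grey + ',' + grey + ')'
  };
}

// jQuery toggle for the collapsible list (assumed element ids):
// $('#show-subcats').on('click', function () { $('#subcat-list').toggle(); });
```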
I further updated the category page to improve the layout of the hierarchy traversal options and the options to view different parts of speech at the level currently being viewed. I think these navigation options work pretty well, but will await feedback from others.
I still need to do some further work with subcategories. Although the subcategory list is now hierarchical, subcategory pages are still actually all at one level. No matter how ‘deep’ the subcategory the only link back is to the main category, rather than to a parent subcategory. I will need to tackle this next week.
I met with Marc and Christian on Thursday and we spent a very useful couple of hours going through the site and tackling some of the questions that had accumulated since the last meeting. One outcome of the meeting is that I will need to update the way dates are searched for in the advanced search. Currently dates such as 1400/50 are recorded with 1400 in one column and 50 in another. I will need to update the database so that (for last cited dates) the later date is used. I will also need to update the search boxes to incorporate OE and Current options in both the first and the last cited date lines. There is also still a massive amount to do with the refresh of data from Access. That is going to be a rather large and somewhat daunting task.
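The date fix amounts to resolving the two columns into a single later year – 1400 and 50 become 1450. A hedged sketch of that resolution (the real work will be a one-off update against the MySQL columns; the function and handling of edge cases here are assumptions):

```javascript
// Resolve a split citation date such as "1400/50" into the later year,
// by replacing the trailing digits of the base year with the suffix.
function resolveLaterDate(baseYear, suffix) {
  if (suffix === null || suffix === undefined) return baseYear;
  var digits = String(suffix).length;       // how many trailing digits to replace
  var magnitude = Math.pow(10, digits);
  return Math.floor(baseYear / magnitude) * magnitude + suffix;
}
```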
Other than HT stuff this week I met with Stevie Barratt for a catch-up regarding the Corpus server. He had posted a question about redeveloping the user interface on the CQPweb mailing list and Andrew Hardie replied stating that separating out the user interface from the rest of the code to allow different layouts to be plugged in is not something that they are planning to tackle. They are hoping to develop an API, which would be hugely useful, but there is no timetable for this at the moment. It looks like Stevie is going to have to delve into the code and make changes directly to it, and he’s going to keep me posted on his progress, as eventually SCS will be wanting to use the same infrastructure. I also attended the HATII developers meeting on Thursday, which has now grown to encompass developers across the College of Arts, which is great. It is really useful to keep up to speed with projects and technical staff and know what people are working on.
Not a massive amount to report this week due to illness. I was struck down with a nasty feverish throat infection on Sunday and only came back to work on Thursday. Even then I was still feeling a bit wobbly. I spent Thursday morning catching up with the backlog of emails from the week and dealing with issues relating to them. In the afternoon I met with Johanna Green, who will be updating the content of the Digital Humanities Network website in preparation for the official launch in June. I went through the system with her and showed her how everything worked and she should be able to use the system without any problems. I also had a chat with Wendy about possible Scottish Corpus redevelopment and helped get access to usage stats for the site too, all in preparation for the REF.
I spent the remainder of the week working through the outstanding redevelopment tasks for the Digital Humanities Network website, including adding more filter options to the project page, ensuring the site works sufficiently well in old versions of IE and implementing some other visual tweaks and improvements.
The new and hopefully completed Digital Humanities at Glasgow website can now be found here:
A very late post for this week as I was working 8-4 on the Friday and ran out of time, and then I was ill from Monday to Wednesday of the following week. I continued with the redevelopment of the Historical Thesaurus website for the majority of this week. The main achievement was the creation of the ‘category’ page, reached via the search results page. This page displays the words found within a particular category, and also provides extensive browse options to related material, for example any categories that have the same number but a different part of speech, any subcategories of the category in question, plus any parent and child categories in the overall HT hierarchy. It took quite a long time to implement all these browse options, but I think it’s working rather nicely. There are some issues related to traversing the hierarchy due to issues with the data, but hopefully the bulk of these will be resolved.
I also added search term highlighting to the category page – with the user’s original search term (minus wildcards) highlighted wherever the term appears (including within longer words). This works throughout the hierarchy – so if the user searches for ‘*sausage*’ and accesses the subcategory ‘types of sausage’ then the term is highlighted wherever it appears within the words, and if the user then browses up the hierarchy to the main category ‘Sausage’ any occurrences of the term will be highlighted here too.
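The highlighting logic is essentially: strip the wildcards from the search term and wrap every occurrence, even inside longer words. A minimal sketch, assuming a `highlight` class (the class name and escaping details are assumptions, not the actual HT code):

```javascript
// Wrap every occurrence of the user's search term (minus asterisk
// wildcards) in a highlight span, matching inside longer words too.
function highlightTerm(text, searchTerm) {
  var term = searchTerm.replace(/\*/g, '');   // drop wildcards
  if (!term) return text;
  // Escape regex metacharacters so user input is matched literally.
  var escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return text.replace(new RegExp('(' + escaped + ')', 'gi'),
                      '<span class="highlight">$1</span>');
}
```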
Also this week I reworked the search results (category selection) page with the aim of speeding up the queries. Previously queries were taking far too long to run – sometimes as much as 30 seconds, which is completely unacceptable. Thankfully after reworking things the search is significantly faster, generally loading the search results in less than a second for non-wildcard searches and only taking a little longer than this for wildcard searches. I think the search is now as fast as it needs to be.
Also this week I began work on the ‘advanced search’ page. I now have all of the required search options within a form on the search page, although for now none of the options actually work – that is still to be tackled next week. I have also added in the option of jumping straight to a specific category of you know the category number, and this option is fully operational.
I had a further meeting with Marc and Christian on Thursday this week, which was another useful opportunity to go through some of the outstanding tasks and make some decisions. We are still dealing with a number of issues with the HT data, some having come from the Access database, some introduced during the migration process and others resulting from the original format of the data. For example, there are some problems with empty categories. Categories that have no words were not part of the Access database, but are needed to properly enable the traversal of the hierarchy. Previously Marc gave me a spreadsheet containing a lot of the empty categories, but it turns out this list wasn’t up to date and there are some problems with category numbers having been changed. Marc is going to get the up-to-date categories to me from the XML file that was submitted to the OED people, which will help greatly with this.
Other than HT work I met with Jean this week to discuss finalising the redevelopment of the Digital Humanities Network website. We had a very useful meeting and came up with a list of outstanding tasks that needed tackling. I spent most of Friday working through the list and managed to get most items implemented. The website is looking much better now, and I also made it live this week, replacing the older version. You can access it here:
As with last week, I spent most of this week working on the Historical Thesaurus redevelopment. The focus this week was on the search options, firstly generating scripts that would be able to extract all individual word variants and store these as separate entries in a database for search purposes, and secondly working on the search front end.
In addition to extracting forms separated by a slash the script also looks for brackets and generates versions of words with and without brackets – so for example hono(u)r results in two variants – honour and honor. This would then allow exact words to be matched as well as allow for wildcard searches. The script works well in most instances, but there are some situations where the way in which the information has been stored makes automated extraction difficult, for example ‘weather-gleam/glim’, ‘champian/-ion’, ‘(ge)hawian (on/to)’. In these cases the full version of the word / phrase is not repeated after the slash, and it would be very difficult to establish rules to determine what the script would do with the part after the slash. Christian, Marc and I met on Thursday to discuss what might be done about this, including using a list of ‘stop words’ that the search script would ignore (e.g. prepositions). I will also look into situations where hyphens appear after a slash to see if there is a way to automate what happens to these words. It is looking like at least some manual editing of words will be required at some point, however.
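The two straightforward rules – a slash separates full alternatives, and brackets mark optional letters – could be sketched like this. This is an illustration only; the awkward cases noted above (e.g. ‘champian/-ion’) are deliberately not handled here, and the real extraction script is more involved.

```javascript
// Expand a headword into its searchable variants: split on slashes,
// then for bracketed letters emit a form with and without them,
// so hono(u)r yields honour and honor.
function expandVariants(word) {
  var forms = [];
  word.split('/').forEach(function (part) {
    var m = part.match(/^(.*)\(([^)]*)\)(.*)$/);
    if (m) {
      forms.push(m[1] + m[2] + m[3]);  // with the bracketed letters
      forms.push(m[1] + m[3]);         // without them
    } else {
      forms.push(part);
    }
  });
  return forms;
}
```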
During the week I ran my script to generate search terms, resulting in 855,810 forms. The majority of these will have been extracted successfully, and I estimate that there are maybe 3-4000 words that might need to be manually fixed at some point. However, even with these words it is likely that a wildcard search would still successfully retrieve the word in question.
I spent most of my remaining time on HT matters working on the category selection page and the quick search. I have now managed to get a quick search up and running that searches words and category headings and uses asterisks for wildcards at the beginning and end of a search term. The quick search leads to the category selection page which pulls out all matching categories and lexemes. It creates a ‘recommended’ section which includes lexemes where the search term appears in both the lexeme and the category heading, and a big long list of all other returned hits underneath. I have also added in pagination for results too. Marc and Christian are wanting the results list to be split into sections where the search term appears in the lexeme and then where it appears only in the category, which I will do next week. The search is still a bit slow at the moment and I’ll need to look into optimising it soon, either by using more indexes or by generating cached versions of search results.
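The asterisk wildcards map naturally onto SQL LIKE patterns. A hedged sketch of that mapping (the actual search code is server-side; this just illustrates the idea, including escaping LIKE’s own metacharacters so user input can’t inject unintended wildcards):

```javascript
// Convert a user-facing search term with * wildcards into a SQL
// LIKE pattern: escape % and _, then turn each * into %.
function wildcardToLike(term) {
  return term
    .replace(/[%_]/g, '\\$&')  // escape LIKE metacharacters in the raw term
    .replace(/\*/g, '%');      // user-facing * becomes SQL %
}
```

The resulting pattern would then be bound as a parameter to a prepared statement rather than concatenated into the query.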
In addition to this I responded to a query about developing a project website that was sent to me by Charlotte Methuen in Theology, and I provided some advice to someone in another part of the university who wanted to develop a Google Maps style interface similar to the one I made for Archive Services. I also made some further updates to the ICOS 2014 website, adding in the banner and logo images and making a few other visual tweaks. My input into this website is now pretty much complete. I also arranged to meet Jean to discuss finalising the Digital Humanities Network website, and I signed up as a ‘five minute speaker’ for the Digital Humanities website launch. I’ll be talking about the redevelopment of the STELLA Teaching resources.
Another predominantly Historical Thesaurus based week this week, with lots of progress being made on both the front end and the database. After a detailed email discussion with Marc I got a much clearer picture about what is required from the front end in terms of colour schemes, logos and fonts, and I have now completed a fifth (and hopefully for the most part final) design. After finalising that I also set up a skeleton structure for the new site, creating a PHP template script that generates all of the interface, page headings, navigation bars and breadcrumbs, and individual pages where the actual content will reside. This site structure will allow the site to be maintained very easily as all structure / design elements are contained in one single template script. Completely changing the site design, overhauling the navigation options or adding lots of new pages will not be a problem. The empty site, awaiting content, can be found here:
This week Christian and Marc emailed their search and browse requirements document, and I spent quite a bit of time going through this and creating a big long list of questions and comments (three pages of questions and comments for a two page document). On Friday we had a two hour meeting to discuss my questions, which was hugely useful. I think we all now agree exactly how the search and browse options should operate, and I will be able to begin working on this straight away.
Also this week I noticed some errors with the MySQL database structure that I had set up to hold the data exported from Access. Somehow I had managed to miss out three Parts of Speech, and when these occurred in the data they were being converted to blank fields. I updated the structure and reimported the data, which was actually a very worthwhile process as this time I documented the whole procedure required, including which upload scripts to run and what order to run them in. This will be a very useful document to reference in future. Another strange thing with the data is that yogh characters (ȝ) weren’t appearing in the MySQL data but had been converted to question marks, even though the database was set to use UTF-8 and other unusual characters like ashes and thorns had been uploaded fine. Thankfully Flora was able to supply me with a list of word IDs that included yoghs and I was able to create another little script that fixed these errors.
On Friday afternoon I created another new database table that will hold the word forms that will be used for search purposes. As was discussed with Marc and Christian, some words have multiple forms, split with a slash, and other forms using brackets. For example ‘Palæarctic/Palearctic’ and ‘brin(e)y’. A non-wildcard search won’t find such forms, and even using a wildcard it would be difficult to find the latter. Instead we decided that I would create another table for holding all the variations of each word, with each row having a foreign key linking in to one single word ID. The search will then use this table, allowing a user to (for example) search for ‘briney’ or ‘briny’ and find the same word in each case. I am still working on the script to populate this table at the moment as there are (perhaps inevitably) some inconsistencies with the data. I will continue with this on Tuesday next week, as Monday is a holiday (woo!).