I got stuck into the re-importing of the Historical Thesaurus data this week, a massive undertaking involving lots of different steps. Thankfully, the last time I performed this task I documented the steps involved, which made things easier, although a number of them needed to be updated. We’d previously noticed a strange situation whereby Old English words with initial ashes and thorns were losing these characters somewhere between the export from Access and the import into MySQL. I managed to track down the cause, which turned out to be a bug in PHP itself (see https://bugs.php.net/bug.php?id=55507): the function used to process CSV files doesn’t handle unusual characters at the beginning of fields, and they simply vanish. This was a bit of a problem as I was relying on this function to get access to the data. I fixed things with a simple hack: I added some extra characters to the beginning of the affected fields, then stripped them out again after PHP had done its processing.
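The shape of that workaround is easy to sketch. The real scripts were PHP, but here's a minimal Python illustration of the same prepend-then-strip idea; the sentinel string and the sample data are my inventions, and the naive comma split assumes no quoted fields:

```python
import csv
import io

SENTINEL = "@@"  # hypothetical guard prefix; any ASCII string the data never starts with

raw = "æsc,Old English ash\nþorn,Old English thorn\n"

# Prepend the sentinel to every field before parsing, so the parser never
# sees an unusual character in field-initial position. (Assumes no quoted
# commas in the data -- fine for a sketch.)
guarded = "\n".join(
    ",".join(SENTINEL + field for field in line.split(","))
    for line in raw.splitlines()
)

rows = []
for row in csv.reader(io.StringIO(guarded)):
    # Strip the sentinel back off after parsing.
    rows.append([field[len(SENTINEL):] for field in row])

print(rows)  # [['æsc', 'Old English ash'], ['þorn', 'Old English thorn']]
```

Python's own CSV parser doesn't suffer from the PHP bug, of course; this just shows how guarding the field-initial position sidesteps a parser that does.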
I also tackled and completed most of the other data-related tasks this week, such as stripping the initial full stops from subcategory names, ensuring search term variants were generated both with and without apostrophes, and ensuring variants were generated for hyphenated words (additional forms were created with a space instead of the hyphen and with no character at all – e.g. fire-place, fire place, fireplace).
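The variant generation described above can be sketched in a few lines of Python (the actual work was done in PHP, and the function name here is my own):

```python
def variants(word):
    """Generate search-term variants for apostrophes and hyphens."""
    forms = {word}
    if "'" in word:
        forms.add(word.replace("'", ""))   # form without the apostrophe
    if "-" in word:
        forms.add(word.replace("-", " "))  # hyphen replaced with a space
        forms.add(word.replace("-", ""))   # hyphen removed entirely
    return sorted(forms)

print(variants("fire-place"))  # ['fire place', 'fire-place', 'fireplace']
```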
I also managed to get a bit of XSLT working with the XML representation of the HT data in order to extract an up-to-date list of categories that have no words. These are not present in the Access database but are needed on the website to allow users to browse up and down the hierarchy. The XSLT worked well and gave me the more than 10,000 empty categories that exist; running these through a little PHP script added them to the database.
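The extraction was done with XSLT, but the underlying query – find every category element with no word children – is simple to illustrate. Here's a Python sketch against an invented XML structure (the real HT schema will differ):

```python
import xml.etree.ElementTree as ET

# Hypothetical structure -- element and attribute names are illustrative only.
xml = """
<thesaurus>
  <category id="01.01"><word>tree</word></category>
  <category id="01.02"></category>
  <category id="01.03"><word>shrub</word></category>
</thesaurus>
"""

root = ET.fromstring(xml)
# Keep categories that contain no <word> children: these are the empty
# categories needed so users can browse the full hierarchy.
empty = [c.get("id") for c in root.iter("category") if c.find("word") is None]
print(empty)  # ['01.02']
```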
I also did quite a bit of further work with search term variants. I ensured the full word appears in the search term table, in addition to any forms split off from it. I also completely reworked the scripts that deal with brackets so that every permutation of bracketed letters gets saved as a search term – not just ‘all bracketed letters’ and ‘no bracketed letters’. It was pretty tricky to develop this script but it appears to work well. I also reworked the script that deals with hyphens in words and managed to automatically build up full versions of words where only partial sections were recorded – e.g. wood-sear/-seer/-sere is split up to give three variants: wood-sear, wood-seer and wood-sere. There are still just over 300 hyphenated words that will need to be manually fixed, but that’s not too bad considering there are almost 100,000 hyphenated words in the system.
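Both of those reworked scripts lend themselves to short sketches. Again these are illustrative Python versions, not the PHP scripts themselves, and the function names and example forms (other than wood-sear/-seer/-sere) are my own:

```python
import itertools
import re

def bracket_variants(form):
    """Every combination of including/excluding each bracketed group,
    e.g. 'colo(u)r' -> ['color', 'colour']."""
    parts = re.split(r"\(([^)]*)\)", form)
    # re.split with a capture group alternates plain text (even indices)
    # with bracketed groups (odd indices); each group may be kept or dropped.
    choices = [[p] if i % 2 == 0 else ["", p] for i, p in enumerate(parts)]
    return sorted({"".join(combo) for combo in itertools.product(*choices)})

def expand_partials(form):
    """Build full words from partial hyphenated alternatives,
    e.g. 'wood-sear/-seer/-sere' -> wood-sear, wood-seer, wood-sere."""
    first, *rest = form.split("/")
    stem = first.rsplit("-", 1)[0]  # everything before the final hyphen
    out = [first]
    for alt in rest:
        out.append(stem + alt if alt.startswith("-") else alt)
    return out

print(bracket_variants("colo(u)r"))              # ['color', 'colour']
print(expand_partials("wood-sear/-seer/-sere"))  # ['wood-sear', 'wood-seer', 'wood-sere']
```

Using `itertools.product` over the keep/drop choices is what gives every permutation rather than just the all-or-nothing pair.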
I also began looking into the stop words and I created a script that can process these – removing words such as ‘the’ and storing a new variant without them. However, I need feedback from Marc and Christian as to how this should work before I execute it. I also looked into updating the date fields, but again I need to speak to Marc about these before I proceed. I did make a start on updating the layout of the date search boxes, but I haven’t finished this yet.
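The stop-word processing is straightforward to sketch. A minimal Python version, assuming a small invented stop-word list (the real list and the exact rules are precisely what needs sign-off):

```python
STOP_WORDS = {"the", "a", "an"}  # assumed list; the real one awaits feedback

def stopword_variant(term):
    """Return the term with a leading stop word removed,
    or None if there is nothing to strip."""
    words = term.split()
    if len(words) > 1 and words[0].lower() in STOP_WORDS:
        return " ".join(words[1:])
    return None

print(stopword_variant("the morning star"))  # 'morning star'
print(stopword_variant("morning star"))      # None
```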
Some non-HT things I did this week included attending a Burns seminar, which was hugely interesting, very enjoyable and a good opportunity to catch up with the Burns people. I also had a meeting with Charlotte Methuen about a possible project she is putting together, and an email conversation with Nigel Leask about a project he is developing. I made a couple of minor tweaks to the ICOS 2014 website for Daria too.