I had expected to return to SCOSYA duties this week as we had a team meeting scheduled for Monday morning, but the meeting was pushed back to next Wednesday, so in the meantime I decided to focus on other projects instead. I did tick off one item on my SCOSYA ‘to do’ list this week, though: I managed to track down a bug that was causing the CSV view of the map data to include an extra column for some data, offsetting some of the rows and thus messing up the structure. The problem was caused by the ‘data that doesn’t meet your criteria’ being included. This data includes an extra field so that the system knows its points on the map should be coloured grey. In the CSV view this extra field was being added to the middle of each spreadsheet row, pushing everything else along one cell. I decided to remove this data from the CSV as it’s not really necessary to include it: the CSV should focus on the relevant data, not the stuff that doesn’t match.
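The fix above could be sketched roughly as follows. This is purely illustrative: the field names and row structure are invented, not the actual SCOSYA schema. The point is that rows flagged as not meeting the criteria carry an extra field, so excluding them (and writing only a fixed set of columns) keeps every CSV row the same width.

```python
import csv
import io

# Toy rows standing in for the map data; 'grey_marker' is the
# hypothetical extra field that only non-matching data carries.
rows = [
    {"location": "Ayr", "rating": 4, "matches_criteria": True},
    {"location": "Oban", "rating": 2, "matches_criteria": False,
     "grey_marker": True},  # extra field used only for map colouring
    {"location": "Skye", "rating": 5, "matches_criteria": True},
]

out = io.StringIO()
# A fixed fieldnames list keeps every row the same width;
# extrasaction="ignore" drops any unexpected extra fields.
writer = csv.DictWriter(out, fieldnames=["location", "rating"],
                        extrasaction="ignore")
writer.writeheader()
for row in rows:
    if row["matches_criteria"]:  # skip data that doesn't meet the criteria
        writer.writerow(row)

csv_text = out.getvalue()
```
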
I had to undertake a few more Apple developer account related tasks this week, which took up a bit of time. I also spent some time doing further AHRC review duties and responded to a request from Sarah Jones of the DCC regarding a workshop about Technical Plans. I also fixed some issues in the rather ancient ‘Learning with the online Thesaurus of Old English’ website. I spent the best part of Wednesday working on the REELS project. One of the project team would like export and import facilities to be added to the CMS to enable the data to be exported to an Excel-compatible format. As the structure of the data is rather complex (13 inter-related tables), converting it to a flat spreadsheet is going to be rather tricky and will take a lot of planning. I spent several hours thinking through the implications and replying to the team member’s original email. Once I receive clarification on a number of issues I raised I will set about implementing the required feature.
On Tuesday I met with Marc and Fraser to discuss updates to the Historical Thesaurus based on the new data supplied to us by the OED people. We discussed the scripts I had already created to analyse the data and considered our next steps, which for me would involve the creation of further scripts that Fraser would then be able to access in order to check the data. It is going to be rather tricky to ensure all of the correct changes are implemented. For example, the OED has revised dates for words and we generally want to use these instead of the dates we currently have in the HT, but in some particular cases our dates will still be the more accurate. There are also issues relating to matching up words in the HT with words in the OED data, as sometimes words have changed category, or are formatted differently (hyphens or spaces, etc.) and therefore don’t match up.
My first task following the meeting was to update the scripts I had created that show which categories in the two datasets don’t match up. The OED data includes lots of empty top-level categories that don’t have a part of speech, and as these would never match up with any HT category I removed them from my script. I also updated the script I created that lists words within matching HT / OED categories so that words are converted to lower case before comparison happens as there were some mismatches between the datasets (e.g. ‘Arctic Circle’ and ‘Arctic circle’ weren’t being matched).
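The case-insensitive comparison described above might look something like the following sketch. The word lists here are toy data, not the real HT or OED tables; the point is simply that lower-casing both sides before comparison stops casing differences producing false mismatches.

```python
# Hypothetical word lists standing in for the two datasets.
ht_words = {"Arctic Circle", "polar bear"}
oed_words = {"Arctic circle", "Polar bear", "tundra"}

def normalise(word):
    """Lower-case a word before comparison so that casing
    differences (e.g. 'Arctic Circle' vs 'Arctic circle')
    do not produce false mismatches."""
    return word.lower()

ht_norm = {normalise(w) for w in ht_words}
oed_norm = {normalise(w) for w in oed_words}

matched = ht_norm & oed_norm        # words present in both after normalisation
unmatched_oed = oed_norm - ht_norm  # OED words still unmatched
```
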
I then thought it might be useful to be able to view all of the words within categories that didn’t match across the two datasets, as a means of identifying where problems might be and which category should match which. Marc emailed me on Thursday with an urgent request for some figures relating to the updates we’re doing, so I got those to him as soon as I could. I then set about creating a new script that looks at the distinct words in the HT database and compares every ‘sense’ of each word across the two datasets: basically counting the number of categories each word appears in within each dataset and seeing how these differ. The script outputs three tables of data. The first lists all those words that appear more often in HT categories than OED categories (with counts of categories and details of each category included). The second lists words that appear more often in OED categories than HT categories. The third lists those words that appear in the same number of categories, but only where the listed categories are not exactly the same in each set (i.e. pinpointing where words have moved category).
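The three-way split described above could be sketched along these lines. The data here is entirely made up (toy words and category IDs); the real script works against the HT and OED database tables, but the classification logic is the same idea.

```python
# word -> set of category IDs the word appears in (illustrative data)
ht = {
    "bank":  {"01.02", "02.07"},
    "mouse": {"03.01"},
    "run":   {"04.05", "04.06"},
}
oed = {
    "bank":  {"01.02"},
    "mouse": {"03.01", "03.09"},
    "run":   {"04.05", "04.07"},
}

more_in_ht, more_in_oed, moved = {}, {}, {}
for word in set(ht) | set(oed):
    ht_cats = ht.get(word, set())
    oed_cats = oed.get(word, set())
    if len(ht_cats) > len(oed_cats):
        # appears in more HT categories than OED categories
        more_in_ht[word] = (ht_cats, oed_cats)
    elif len(oed_cats) > len(ht_cats):
        # appears in more OED categories (includes words absent from HT)
        more_in_oed[word] = (ht_cats, oed_cats)
    elif ht_cats != oed_cats:
        # same number of categories but not the same ones:
        # the word has moved category
        moved[word] = (ht_cats, oed_cats)
```
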
In the set for ‘appears more in OED categories than HT categories’ I also included any distinct words found in the OED dataset that were not present at all in the HT dataset. Rather surprisingly there were 45804 words in the OED data that were not found in the HT data. This is surprising because there are 369888 distinct non-OE words in the HT data and only 322635 distinct words in the new OED data. It took a bit more investigation to figure out what was going on, but it looks like a lot of the remaining ‘new’ words are phrases, or contain brackets or hyphens, and I suspect a lot of these are just variants of words we already have in the HT rather than being new words per se. I think we’re going to have to be careful when adding ‘new’ words.
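One rough way to flag such likely variants is to strip bracketed material and hyphens from each ‘new’ OED word and see whether the simplified form already exists in the HT word list. The sketch below is entirely illustrative, with invented words; the real check would need to be far more careful about punctuation conventions in the actual data.

```python
import re

# Toy word lists; these are not real HT/OED entries.
ht_words = {"ice cream", "frostwork"}
new_oed_words = ["ice-cream", "frost(-)work", "snowmobile"]

def simplify(word):
    """Crude normalisation: drop bracketed material, treat
    hyphens as spaces, collapse whitespace, lower-case."""
    word = re.sub(r"\(.*?\)", "", word)  # drop bracketed material
    word = word.replace("-", " ")        # treat hyphens as spaces
    return re.sub(r"\s+", " ", word).strip().lower()

ht_simple = {simplify(w) for w in ht_words}
likely_variants = [w for w in new_oed_words if simplify(w) in ht_simple]
genuinely_new = [w for w in new_oed_words if simplify(w) not in ht_simple]
```
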
On Friday I managed to complete work on these new scripts and save the output as static HTML tables for further analysis by Fraser. There are 57420 words that appear more often in the HT data than the OED data, 59085 words that appear more often in the OED data (including ‘new’ words), and 11959 words that appear the same number of times in each dataset but appear in different categories. That’s a lot of analysis to carry out, but hopefully we can come up with a series of rules to properly align the majority of these.