Week Beginning 29th April 2019

I worked on several different projects this week. First of all I completed work on the new Medical Humanities Network website for Gavin Miller. I spent most of last week working on this but didn’t quite manage to get everything finished off, so I wrapped it up this week. This involved completing the front-end pages for browsing through the teaching materials, collections and keywords. I still need to add in a carousel showing images for the project, and a ‘spotlight on…’ feature, as found on the homepage of the UoG Medical Humanities site, but I’ll do this later once we are getting ready to actually launch the site. Gavin was hoping that the project administrator would be able to start work on the content of the website over the summer, so everything is now in place and ready for them when they start.

With that out of the way I decided to return to some of the remaining tasks in the Historical Thesaurus / OED data linking.  It had been a while since I last worked on this, but thankfully the list of things to do I’d previously created was easy to follow and I could get back into the work, which is all about comparing dates for lexemes between the two datasets.  We really need to get further information from the OED before we can properly update the dates, but for now I can at least display some rows where the dates should be updated, based on the criteria we agreed on at our last HT meeting.

To begin with I completed a ‘post dating’ script. This goes through each matched lexeme (split into different outputs for ‘01’, ‘02’ and ‘03’ due to the size of the output) and for each one it first temporarily changes any OED dates that are less than 1100 to 1100 and any OED dates that are greater than 1999 to 2100, so that they match up with the HT’s newly updated Apps and Appe fields. The script then compares the HT Appe and OED Enddate fields (the ‘post’ dates), ignoring any lexemes where these are the same. Where they differ, the script outputs the data in colour-coded tables.
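As a rough illustration, the clamping step works along these lines (a minimal sketch in PHP; the function name and example values are my own rather than anything taken from the actual script):

```php
<?php
// Sketch of the temporary date-clamping step, so that the OED dates sit
// within the same range as the HT's updated Apps / Appe fields.
function clampOedDate(?int $date): ?int
{
    if ($date === null) {
        return null;  // no date recorded
    }
    if ($date < 1100) {
        return 1100;  // anything earlier than 1100 becomes 1100
    }
    if ($date > 1999) {
        return 2100;  // anything later than 1999 becomes 2100
    }
    return $date;
}

// Example: an OED lexeme dated 1056–2005 is treated as 1100–2100 for comparison.
$firstdate = clampOedDate(1056); // 1100
$enddate   = clampOedDate(2005); // 2100
```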

The green table contains lexemes where Appe is between 1150 and 1850 (inclusive), Enddate is greater than Appe and the difference between Appe and Enddate is no more than 100 years, or where Appe is greater than 1850 and Enddate is greater than Appe. The yellow table contains lexemes (other than the above) where Enddate is greater than Appe and the difference between Appe and Enddate is between 101 and 200 years. The orange table contains lexemes where Enddate is greater than Appe and the difference is between 201 and 250 years, while the red table contains lexemes where Enddate is greater than Appe and the difference is more than 250 years. It’s a lot of data, fairly evenly spread between the tables, but hopefully it will help us to ‘tick off’ dates that should be updated with figures from the OED data.
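Expressed as code, the banding logic is roughly as follows. This is only a sketch, assuming the Appe value and the clamped Enddate are passed in as plain integers; the band names and the handling of rows that fall outside the agreed criteria are my own assumptions.

```php
<?php
// Sketch of the colour-banding comparison for the 'post dating' output.
function postDatingBand(int $appe, int $enddate): ?string
{
    if ($enddate <= $appe) {
        return null; // only rows where the OED end date is later are of interest
    }
    $diff = $enddate - $appe;

    // Green: Appe between 1150 and 1850 with a gap of no more than 100 years,
    // or Appe after 1850 with any later Enddate.
    if (($appe >= 1150 && $appe <= 1850 && $diff <= 100) || $appe > 1850) {
        return 'green';
    }
    if ($diff >= 101 && $diff <= 200) {
        return 'yellow';
    }
    if ($diff >= 201 && $diff <= 250) {
        return 'orange';
    }
    if ($diff > 250) {
        return 'red';
    }
    return null; // e.g. Appe earlier than 1150 with a gap of 100 years or less
}
```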

I then created an ‘ante dating’ script that looks at the ‘before’ dates, based on the OED Firstdate (or ‘Sortdate’ as they call it) and the HT Apps field. This looks at rows where Firstdate is earlier than Apps and splits the data up into colour-coded chunks in a similar manner to the above script. I then created a further script that identifies lexemes where the OED data has a later first date or an earlier end date, for manual checking, as such dates are likely to need investigation.
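The manual-checking script essentially just flags any row where the OED range sits inside the HT range at either end. Something along these lines (the field names here are illustrative, not the real column names):

```php
<?php
// Sketch of the check for rows needing manual investigation: a later OED
// first date or an earlier OED end date than the HT has.
function needsManualCheck(array $row): bool
{
    $laterFirstDate = $row['oed_firstdate'] > $row['ht_apps'];
    $earlierEndDate = $row['oed_enddate'] < $row['ht_appe'];
    return $laterFirstDate || $earlierEndDate;
}

// Example usage with an illustrative row.
$row = ['oed_firstdate' => 1450, 'oed_enddate' => 1800, 'ht_apps' => 1400, 'ht_appe' => 1900];
var_dump(needsManualCheck($row)); // bool(true): later first date and earlier end date
```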

Finally, I created a script that brings back a list of all of the unique date patterns in the HT. This goes through each lexeme, replaces individual dates with ‘nnnn’ and then strings all of the various (and there are a lot of) date fields together to create a date ‘fingerprint’. Individual date fields are separated with a bar (|) so it’s possible to extract specific parts. The script also counts the number of times each pattern is applied to a lexeme. So we have things like ‘|||nnnn||||||||||||||_’, which is applied to 341,308 lexemes (a first date and still in current use), and ‘|||nnnn|nnnn||-|||nnnn|nnnn||+||nnnn|nnnn||’, which is only used for a single lexeme. I’m not sure exactly what we’re going to use this information for, but it’s interesting to see the frequency of the patterns.
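The fingerprinting itself is fairly simple. Here is a rough sketch of the idea; the field names and row shapes below are illustrative only, as the real script works across the HT’s full set of date columns.

```php
<?php
// Sketch of building date 'fingerprints': replace each year with 'nnnn',
// join the date fields with bars, and count how often each pattern occurs.
$dateFields = ['oe', 'firstd', 'firstdb', 'lastd', 'lastdb', 'current']; // illustrative subset
$lexemes = [
    ['oe' => '', 'firstd' => '1400', 'firstdb' => '', 'lastd' => '', 'lastdb' => '', 'current' => '_'],
    ['oe' => '', 'firstd' => '1400', 'firstdb' => '', 'lastd' => '1600', 'lastdb' => '', 'current' => ''],
];

$patternCounts = [];
foreach ($lexemes as $lexeme) {
    $parts = [];
    foreach ($dateFields as $field) {
        // Replace runs of digits with 'nnnn', leaving markers such as '-', '+' and '_' alone.
        $parts[] = preg_replace('/\d+/', 'nnnn', (string) $lexeme[$field]);
    }
    $fingerprint = implode('|', $parts);
    $patternCounts[$fingerprint] = ($patternCounts[$fingerprint] ?? 0) + 1;
}

arsort($patternCounts); // most common patterns first
print_r($patternCounts);
```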

I spent most of the rest of the week working on the DSL. This included making some further tweaks to the WordPress version of the front-end, which is getting very close to being ready to launch: I updated the way the homepage boxes work so that staff can more easily control the colours used, and I updated the wording for search results. I also investigated an issue in the front end whereby slightly different data was being returned for an entry depending on the way in which it was requested. Using the dictionary ID (e.g. https://dsl.ac.uk/entry/dost44593) brings back some additional reference text that is not returned when using the dictionary and href method (e.g. https://dsl.ac.uk/entry/dost/proces_n). It looks like the DSL API processes things differently depending on the type of call, which isn’t good. I also checked the full dataset I’d previously exported from the API for future use and discovered that it is the version that doesn’t contain the full reference text, so I will need to regenerate this data next week.

My main DSL task was to work on a new version of the API that just uses PHP and MySQL, rather than technologies that Arts IT Support are not so keen on having on their servers. As I mentioned, I had previously run a script that got the existing API to output its fully generated data for every single dictionary entry, and it’s this version of the data that I am currently building the new API around. My initial aim is to replicate the functionality of the existing API and plug a version of the DSL website into it, so that we can compare the output and performance of the new API with that of the existing one. Once I have the updated data I will create a further version of the API that uses it, but that’s a little way off yet.
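To give an idea of the approach, an entry lookup in the new API is essentially a prepared statement against a table of the pre-generated data, returned as JSON. The sketch below is only indicative: the table, column and connection details are placeholders rather than the real schema.

```php
<?php
// Sketch of a single-entry endpoint backed by MySQL via PDO.
$pdo = new PDO('mysql:host=localhost;dbname=dsl;charset=utf8mb4', 'user', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$entryId = $_GET['id'] ?? '';

$stmt = $pdo->prepare('SELECT entry_id, headword, xml FROM entries WHERE entry_id = :id');
$stmt->execute([':id' => $entryId]);
$entry = $stmt->fetch(PDO::FETCH_ASSOC);

header('Content-Type: application/json; charset=utf-8');
echo json_encode($entry ?: ['error' => 'entry not found']);
```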

So far I have completed the parts of the API for getting the data for a single entry and the data required by the ‘browse’ feature. Information on how to access the data, along with some examples that you can follow, is included in the API definition page. Data is available as JSON (the default, as used by the website) and CSV (which can be opened in Excel). However, while the CSV data can be opened directly in Excel, any Unicode characters will be garbled, and long fields (e.g. the XML content of long entries) will likely exceed the maximum cell size in Excel and break onto new lines.
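The JSON/CSV switch is handled when the results are sent back. Roughly speaking it works like this (a sketch only; the ‘format’ parameter and function name are assumptions for illustration):

```php
<?php
// Sketch of returning the same result set as either JSON (the default) or CSV.
function sendResults(array $rows, string $format = 'json'): void
{
    if ($format === 'csv') {
        header('Content-Type: text/csv; charset=utf-8');
        $out = fopen('php://output', 'w');
        if (!empty($rows)) {
            fputcsv($out, array_keys($rows[0])); // header row from the field names
            foreach ($rows as $row) {
                fputcsv($out, $row);
            }
        }
        fclose($out);
        return;
    }
    header('Content-Type: application/json; charset=utf-8');
    echo json_encode($rows);
}
```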

I also replicated the WordPress version of the DSL front-end here and set it up to work with my new API. As yet the searches don’t work, as I haven’t developed the search parts of the API, but it is possible to view individual entries and to use the ‘browse’ facility on the entry page. These features use the new API and the new ‘fully generated’ data, which will allow staff to compare the display of entries to see if anything looks different.

I still need to work on the search facilities of the API, and this might prove to be tricky. The existing API uses Apache Solr for full-text searching, a piece of indexing software that is very efficient for large volumes of text and brings back nice snippets showing where results are located within the texts. Arts IT Support don’t really want Solr on their servers as it’s an extra thing for them to maintain. I’m hoping to be able to develop comparable full-text searches using just the database, but it’s possible that this approach will not be fast enough, or will not pinpoint the results as well as Solr does. I’ll just need to see how I get on in the coming weeks.
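The plan is to try MySQL’s built-in full-text indexing first and see how it compares. A sketch of the sort of query involved, assuming a FULLTEXT index on a plain-text rendering of each entry (the table and column names are placeholders):

```php
<?php
// Sketch of a database-only full-text search as a possible Solr replacement.
// Assumes an index created with something like:
//   ALTER TABLE entries ADD FULLTEXT(fulltext_content);
$pdo = new PDO('mysql:host=localhost;dbname=dsl;charset=utf8mb4', 'user', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare(
    'SELECT entry_id, headword,
            MATCH(fulltext_content) AGAINST(:q IN NATURAL LANGUAGE MODE) AS score
     FROM entries
     HAVING score > 0
     ORDER BY score DESC
     LIMIT 50'
);
$stmt->execute([':q' => $_GET['q'] ?? '']);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Note: unlike Solr, MySQL does not generate highlighted snippets, so any
// 'result in context' display would need to be built in PHP afterwards.
```

Whether this is fast enough on the full dataset, and whether the relevance ranking is good enough, is exactly what I’ll need to test.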

I also worked a little bit on the RNSN project this week, adding in some of the concert performances to the existing song stories.  Next week I’m intending to start on the development of the front end for the SCOSYA project, and hopefully find some time to continue with the DSL API development.