I had two Zoom calls this week, the first on Wednesday with Kirsteen McCue to discuss a new, small project to publish a selection of musical settings to Burns poems and the second on Friday with Joanna Kopaczyk and her RA on the Scots Language Policy project to give a tutorial on how to use WordPress.
The majority of my week was divided between the Anglo-Norman Dictionary, the Dictionary of the Scots Language and the Place-names of Iona projects. For the AND I made a few tweaks to the static content of the site and migrated some more blog posts across to the new site (these are not live yet). I also added commentaries to more than 260 entries, which took some time to test. I also worked on the DTD file that the editors reference from their XML editing software to ensure that all of the elements and attributes found within commentaries are ‘allowed’ in the XML. Without doing this it was possible to add the tags in, but this would give errors in the editing software. I also batch updated all of the entries on the site to reference the new DTD and exported all of the files, zipped them up and sent them to the editors so they can work on them as required. I also began to think about migrating the TextBase from the old site to the new one, and managed to source the XML files that comprise this system. It looks like it may be quite tricky to work with these as there are more than 70 book-length XML files to deal with and so far I have not managed to locate the XSLT that was originally used to process these files.
For the DSL I completed work on the new bibliography search pages that use the new ‘V4’ data. These pages allow the authors and titles of bibliographical items to be searched, results to be viewed and individual items to be displayed. I also made some minor tweaks to the live site and had a discussion with Ann Fergusson about transferring the project’s data to the people who have set up a new editing interface for them, something I’m hoping to be able to tackle next week.
For the Place-names of Iona project I had a discussion about implementing a new ‘work of the month’ feature and spent quite a bit of time investigating using 10-digit OS grid references in the project’s CMS. The team need to use up to 10-digit grid references to get 1m accuracy for individual monuments, but the library I use in the CMS to automatically generate latitude and longitude from the supplied grid reference will only work with a 6-digit NGR. The automatically generated latitude and longitude are then automatically passed to Google Maps to ascertain the altitude of the location and all of this information is stored in the database whenever a new place-name record is created or an existing record is edited.
As the library currently in use will only accept 6-digit NGRs I had to do a bit of research into alternative libraries, and I managed to find one that can accept NGRs of 2, 4, 6, 8 or 10 digits. Information about the library, including text boxes where you can enter an NGR and see the results, can be found here: http://www.movable-type.co.uk/scripts/latlong-os-gridref.html along with an awful lot of description about the calculations and some pretty scary looking formulae.
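To show why the number of digits matters, here is a minimal Python sketch of the first step only: turning a grid reference into a numeric easting and northing, plus the precision the reference implies. It is not the library's code (the library is JavaScript and also handles the conversion to latitude and longitude); the function name is my own, and 'NM 286 245' is just an illustrative reference in square NM.

```python
import re


def parse_grid_ref(grid_ref: str) -> tuple[int, int, int]:
    """Return (easting, northing, precision) in metres for an OS grid reference."""
    compact = re.sub(r"\s+", "", grid_ref).upper()
    match = re.fullmatch(r"([A-HJ-Z]{2})(\d{0,10})", compact)
    if match is None or len(match.group(2)) % 2 != 0:
        raise ValueError(f"Invalid grid reference: {grid_ref!r}")
    letters, digits = match.groups()

    # Map each letter to 0-24, skipping 'I', which the grid does not use.
    l1, l2 = (ord(ch) - ord("A") - (1 if ch > "I" else 0) for ch in letters)

    # Indexes of the 100 km square, counted from the false origin (square SV).
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)
    if not (0 <= e100km <= 6 and 0 <= n100km <= 12):
        raise ValueError(f"Grid square outside Great Britain: {letters}")

    # Split the digits in half and pad each half to 5 digits (metres).
    half = len(digits) // 2
    easting = e100km * 100_000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100_000 + int(digits[half:].ljust(5, "0"))
    precision_m = 10 ** (5 - half)  # 6 digits -> 100 m, 10 digits -> 1 m
    return easting, northing, precision_m


print(parse_grid_ref("NM 286 245"))      # 6-digit reference, 100 m precision
print(parse_grid_ref("NM 28650 24500"))  # 10-digit reference, 1 m precision
```

A 6-digit reference only identifies a 100m square, which is why it is not enough for pinpointing individual monuments.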
This does mean the person filling out the form can see the generated latitude and longitude and also tweak it if required before submitting the form, which is a potentially useful thing. I may even be able to add a Google Map to the form so you can see (and possibly tweak) the point before submitting the form, but I’ll need to look into this further. I also still need to work on the format of the latitude and longitude as the new library generates them with a compass point (e.g. 6.420848° W) and we need to store them as a purely decimal value (e.g. -6.420848) with ‘W’ and ‘S’ figures being negatives.
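The format change described above is simple enough to sketch. This is a minimal Python illustration rather than the code that will go into the CMS, and the function name is hypothetical: it strips the degree sign and compass point and makes 'W' and 'S' values negative.

```python
import re


def to_signed_decimal(coordinate: str) -> float:
    """Convert a value such as '6.420848° W' into a signed decimal (-6.420848)."""
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*°?\s*([NSEW])\s*", coordinate, re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {coordinate!r}")
    value = float(match.group(1))
    # West and South are stored as negative numbers.
    return -value if match.group(2).upper() in {"W", "S"} else value


print(to_signed_decimal("6.420848° W"))
print(to_signed_decimal("56.325744° N"))
```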
However, whilst researching this I discovered a potentially worrying thing that needs discussion with the wider team. The way the Ordnance Survey generates latitude and longitude from their grid references was changed in 2014. Information about this can be found in the page linked to above in the ‘Latitude/longitudes require a datum’ section. Previously the OS used ‘OSGB-36’ to generate latitude and longitude, but in 2014 this was changed to ‘WGS84’, which is used by GPS systems. The difference in the latitude / longitude figures generated by the two systems is about 100 metres, which is quite a lot if you’re intending to pinpoint individual monuments.
The new library has facilities to generate latitude and longitude using either the new or old systems, but defaults to the new system. I’ve checked the output of the library we currently use and it uses the old ‘OSGB-36’ system. This means all of the place-names in the system so far (and all those for the previous projects) have latitudes and longitudes generated using the now obsolete (since 2014) system. To give an example of the difference, the place-name A’ Mhachair in the CMS has this location: https://www.google.com/maps/place/56%C2%B019’33.2%22N+6%C2%B025’11.4%22Wfirstname.lastname@example.org,-6.422022,582m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d56.325885!4d-6.419828 and with the newer ‘WGS84’ system it would have this location: https://www.google.com/maps/place/56%C2%B019’32.7%22N+6%C2%B025’15.1%22Wemail@example.com,-6.4230367,582m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d56.325744!4d-6.420848
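To put a number on the difference for this one example, the distance between the two points can be estimated with the haversine formula. The sketch below is Python written purely for illustration; it uses the coordinates that appear at the end of the two map links and assumes a spherical Earth with a mean radius of 6,371km. For this particular pair it comes out at roughly 65 metres.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Coordinates taken from the two map links for A' Mhachair.
current = (56.325885, -6.419828)  # as currently stored in the CMS
updated = (56.325744, -6.420848)  # as generated by the newer system
print(round(haversine_m(*current, *updated)), "metres apart")
```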
So what we need to decide before I replace the old library with the new one in the CMS is whether we switch to using ‘WGS84’ or we keep using ‘OSGB-36’. As I say, this will need further discussion before I implement any changes.
Also this week I responded to a query from Cris Sarg of the Medical Humanities Network project, spoke to Fraser Dallachy about future updates to the HT’s data from the OED, made some tweaks to the structure of the SCOSYA website for Jennifer Smith, added a plugin to the Editing Burns site for Craig Lamont and had a chat with the Books and Borrowing people about cleaning the authors data, importing the Craigston data and how to deal with a lot of borrowers that were excluded from the Selkirk data that I previously imported.
Next week I’ll be on holiday from Monday to Wednesday to cover the school half term.
After a wonderful three weeks’ holiday I returned to work on Monday this week. I’d been keeping track of my emails whilst I’d been away so although I had a number of things waiting for me to tackle on my return at least I knew what they were, so returning to work wasn’t as much of a shock as it might otherwise have been. The biggest item waiting for me to get started on was a request from Gerry Carruthers to write a Data Management Plan for an AHRC proposal he’s putting together. He’d sent me all of the bid documentation so I read through that and began to think about the technical aspects of the project, which would mainly revolve around the creation of TEI-XML digital editions. I had an email conversation with Gerry over the course of the week where I asked him questions and he got back to me with answers. I’d also arranged to meet with my fellow developer Luca Guariento on Wednesday as he has been tasked with writing a DMP and wanted some advice. This was a good opportunity for me to ask him some details about the technology behind the digital editions that had been created for the Curious Travellers project, as it seemed like a good idea to reuse a lot of this for Gerry’s project. I finished a first version of the plan on Wednesday and sent it to Gerry, and after a few further tweaks based on feedback a final version was submitted on Thursday.
Also this week I met with Head of School Alice Jenkins to discuss my role in the School, a couple of projects that have cropped up that need my input and the state of my office. It was a really useful meeting, and it was good to discuss the work I’ve done for staff in the School and to think about how my role might be developed in future. I spent a bit of time after the meeting investigating some technology that Alice was hoping might exist, and I also compiled a list of all of the current Critical Studies Research and Teaching staff that I’ve worked with over the years. Out of 104 members of staff I have worked with 50 of them, which I think is pretty good going, considering not every member of staff is engaged in research, or if they are may not be involved with anything digital.
I spent some more time this week working on the pilot website for 18th Century Borrowers for Matthew Sangster. We met on Wednesday morning and had a useful meeting, discussing the new version of the data that Matt is working on, how my import script might be updated to incorporate some changes and investigating why some of the error rows that were output during my last data import were generated and how these could be addressed. We also went through the website I’d created, as Matt had uncovered a couple of bugs, such as the order of the records in the tabular view of the page not matching up with the order on the scanned image. This turned out to have been caused by the tabular order depending on an imported column that was set to hold general character data rather than numbers, meaning the database arranged all of the ones (1, 10, 11 etc) then all of the twos (2, 21, 22 etc) rather than arranging things in proper numerical order. I also realised that I hadn’t created indexes for a lot of the columns in the database that were used in the queries, which was making the queries rather slow and inefficient. After generating these indexes the various browses are now much speedier.
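The ordering problem is easy to reproduce. The following is a small Python sketch using an in-memory SQLite database rather than the project's actual database; the table and column names are invented for the example. It shows a text column sorting '10' before '2', the same column sorted as a number, and an index being added to a column used for ordering.

```python
import sqlite3


def demo_ordering() -> tuple[list[str], list[str]]:
    """Return the row order for a text sort and for a numeric sort."""
    conn = sqlite3.connect(":memory:")
    # 'page_order' is deliberately stored as text, as in the imported data.
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, page_order TEXT)")
    conn.executemany(
        "INSERT INTO records (page_order) VALUES (?)",
        [(str(n),) for n in (1, 2, 10, 11, 21, 22)],
    )
    # An index on a column used in queries avoids a full table scan.
    conn.execute("CREATE INDEX idx_records_page_order ON records (page_order)")

    as_text = [row[0] for row in conn.execute(
        "SELECT page_order FROM records ORDER BY page_order")]
    as_number = [row[0] for row in conn.execute(
        "SELECT page_order FROM records ORDER BY CAST(page_order AS INTEGER)")]
    conn.close()
    return as_text, as_number


text_sorted, number_sorted = demo_ordering()
print(text_sorted)    # character order: ones first, then twos
print(number_sorted)  # proper numerical order
```

Casting at query time works, but changing the column to a numeric type is the cleaner fix as it also lets an index on the column be used for sorting.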
I also added authors under book titles in the various browse lists, which helps to identify specific books and created a new section of the website for frequency lists. There are now three ‘top 20’ lists, which show the most frequently borrowed books and authors, and the student borrowers who borrowed the most books. Finally, I created the search facility for the site, allowing any combination of book title, author, student, professor and date of lending to be combined and for the results of the search to be displayed. This took a fair amount of time to implement, but I managed to get the URL for the page to Matt before the end of the week.
Also this week I investigated and fixed a bug that the Glasgow Medical Humanities Network RA Cris Sarg was encountering when creating new people records and adding these to the site, I responded to a query from Bryony Randall about the digital edition we had made for the New Modernist Editing project, spoke to Corey Gibson about a new project he’s set up that will be starting soon and that I’ll be creating the website for, had a chat with Eleanor Capaldi about a project website I’ll be setting up for her, responded to a query from Fraser about access to data from the Thesaurus of Old English and attended the Historical Thesaurus birthday drinks. I also read through the REF digital guidelines that Jennifer Smith had sent on to me and spoke to her about the implications for the SCOSYA project, helped the SCOSYA RA Frankie MacLeod with some issues she was encountering with map stories and read through some feedback on the SCOSYA interfaces that had been sent back from the wider project team. Next week I intend to focus on the SCOSYA project, acting on the feedback and possibly creating some non-map based ways of accessing the data.
I spent some time this week investigating the final part of the SCOSYA online resource that I needed to implement: A system whereby researchers could request access to the full audio dataset and a member of the team could approve the request and grant the person access to a facility where the required data could be downloaded. Downloads would be a series of large ZIP files containing WAV files and accompanying textual data. As we wanted to restrict access to legitimate users only I needed to ensure that the ZIP files were not directly web accessible, but were passed through to a web accessible location on request by a PHP script.
I created a test version using a 7.5Gb ZIP file that had been created a couple of months ago for the project’s ‘data hack’ event. This version can be set up to store the ZIP files in a non-web accessible directory and then grab a file and pass it through to the browser on request. It will be possible to add user authentication to the script to ensure that it can only be executed by a registered user. The actual location of the ZIP files is never divulged so neither registered nor unregistered users will ever be able to directly link to or download the files (other than via the authenticated script).
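The test version was written in PHP, but the pass-through idea can be sketched in a few lines of Python. Everything here is illustrative: the function names are hypothetical and the 'approved' flag stands in for whatever user authentication would eventually be added. The file is looked up inside a directory that the web server does not expose and is then handed back in chunks so that a multi-gigabyte ZIP never has to be held in memory.

```python
from pathlib import Path
from typing import Iterator


def resolve_download(storage_dir: Path, file_name: str, user_is_approved: bool) -> Path:
    """Find a requested file inside the private storage directory."""
    if not user_is_approved:
        raise PermissionError("Access to the dataset has not been approved")
    base = storage_dir.resolve()
    target = (base / file_name).resolve()
    # Refuse anything that resolves outside the storage directory (e.g. '../').
    if base not in target.parents or not target.is_file():
        raise FileNotFoundError(file_name)
    return target


def stream_file(path: Path, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield the file in fixed-size chunks to pass through to the browser."""
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk
```

A web framework would send each chunk from `stream_file` to the browser in turn; the real path of the file is never included in the response.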
This all sounds promising but I realised that there are some serious issues with this approach. HTTP as used by web pages to transfer files is not really intended for downloading huge files and using this web-based method to download massive zip files is just not going to work very well. The test ZIP file I used was about 7.5Gb in size (roughly the size of a DVD), but the actual ZIP files are likely to be much larger than this – with the full dataset taking up about 180Gb. Even using my desktop PC on the University network it’s taken roughly 30 minutes to download the 7.5Gb file. Using an external network would likely take a lot longer and bigger files are likely to be pretty unmanageable for people to download.
It’s also likely that a pretty small number of researchers will be requesting the data, and if this is the case then perhaps it’s not such a good idea to take up 180Gb of web server space (plus the overheads of backups) to store data that is seldom going to be accessed, especially if this is simply replicating data that is already taking up a considerable amount of space on the shared network drive. 180Gb is probably more web space than is used by most other Critical Studies websites combined. After discussing this issue with the team, we decided that we would not set up such a web-based resource to access the data, but would instead send ZIP files on request to researchers using the University’s transfer service, which allows files of up to 20Gb to be sent to both internal and external email addresses. We’ll need to see how this approach works out, but I think it’s a better starting point than setting up our own online system.
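As a back-of-the-envelope check on these figures, here is a short Python sketch. It assumes the download rate seen in the test (7.5Gb in about 30 minutes) would hold for the whole dataset, which is optimistic for anyone off campus, and it treats the 180Gb and 20Gb figures as exact.

```python
import math


def estimate_hours(dataset_gb: float, sample_gb: float, sample_minutes: float) -> float:
    """Estimate the download time for the full dataset from a timed sample."""
    gb_per_minute = sample_gb / sample_minutes
    return dataset_gb / gb_per_minute / 60


def parts_needed(dataset_gb: float, limit_gb: float) -> int:
    """Minimum number of files needed to stay under a per-file size limit."""
    return math.ceil(dataset_gb / limit_gb)


print(estimate_hours(180, 7.5, 30))  # hours to download everything at the test rate
print(parts_needed(180, 20))         # files needed for the transfer service
```

At the rate seen in the test the full dataset would take about 12 hours to download, and it would need to be split into at least 9 files to fit within the transfer service's limit.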
I also spent some further time on the SCOSYA project this week implementing some changes to both the experts and the public atlases based on feedback from the team. This included changing the default map position and zoom level, replacing some of the colours used for map markers and menu items, tweaking the layout of the transcriptions, ensuring certain words in story titles can appear in bold (as opposed to the whole title being bold as was previously the case) and removing descriptions from the list of features found in the ‘Explore’ menu in the public atlas. I also added a bit of code to ensure that internal links from story pages to other parts of the public atlas would work (previously they weren’t doing anything because only the part after the hash was changing). I also ensured that the experts atlas side panel resizes to fit the content whenever an additional attribute is added or removed.
Also this week I finally found a bit of time to fix the map on the advanced search page of the SCOTS Corpus website. This map was previously powered by Google Maps, but they have now removed free access to the Google Maps service (you now need to provide a credit card and get billed if your usage goes over a certain number of free hits a month). As we hadn’t updated the map or provided such details Google broke the map, covering it with warning messages and removing our custom map styles. I have now replaced the Google Maps version with a map created using the free to use Leaflet.js mapping library (as I’m using for SCOSYA) and a free map tileset from OpenStreetMap. Other than that it works in exactly the same way as the old Google Map. The new version is now live here: https://www.scottishcorpus.ac.uk/advanced-search/.
Also this week I upgraded all of the WordPress sites I manage, engaged in some App Store duties and had a further email conversation with Marc Alexander about how dates may be handled in the Historical Thesaurus. I also engaged in a long email conversation with Heather Pagan of the Anglo-Norman Dictionary about accessing the dictionary data. Heather has now managed to access one of the servers that the dictionary website runs on and we’re now trying to figure out exactly where the ‘live’ data is located so that I can work with it. I also fixed a couple of issues with the widgets I’d created last week for the GlasgowMedHums project (some test data was getting pulled into them) and tweaked a couple of pages. The project website is launching tomorrow so if anyone wants to access it they can do so here: https://glasgowmedhums.ac.uk/
Finally, I continued to work on the new API for the Dictionary of the Scots Language, implementing the bibliography search for the ‘v2’ API. This version of the API uses data extracted from the original API, and the test website I’ve set up to connect to it should be identical to the live site, but connects to the ‘v2’ API to get all of its data and in no way connects to the old, undocumented API. API calls to search the bibliographies (both a predictive search used for displaying the auto-complete results and to populate a full search results page), and to display an individual bibliography are now available and I’ve connected the test site to these API calls, so staff can search for bibliographies here.
Whilst investigating how to replicate the original API I realised that the bibliography search on the live site is actually a bit broken. The ‘Full Text’ search simply doesn’t work, but instead just does the same as a search for authors and titles (in fact the original API doesn’t even include a ‘full text’ option). Also, results only display authors, so for records with no author you get some pretty unhelpful results. I did consider adding in a full-text search, but as bibliographies contain little other than authors and titles there didn’t seem much point, so instead I’ve removed the option. As the search is primarily set up as an auto-complete, which is set up to match words in authors or titles that begin with the characters that are being typed (i.e. a wildcard search such as ‘wild*’) and the full search results page only gets displayed if someone ignores the auto-complete list of results and manually presses the ‘Search’ button, I’ve made the full search results page always work as a ‘wild*’ search too. So typing ‘aber’ into the search box and pressing ‘Search’ will bring up a list of all bibliographies with titles / authors featuring a word beginning with these characters. With the previous version this wasn’t the case – you had to add a ‘*’ after ‘aber’ otherwise the full search results page would match ‘aber’ exactly and find nothing. I’ve updated the help text on the bibliography search page to explain this a bit.
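The matching rule is perhaps easiest to see in code. This is a minimal Python sketch and not the API's actual implementation (which queries a database); the function name and the sample records are invented placeholders. A search term is always treated as a prefix, with or without a trailing ‘*’, and is compared against each word in the author and title.

```python
import re


def prefix_search(term: str, bibliographies: list[dict]) -> list[dict]:
    """Return records where any word in the author or title starts with the term."""
    prefix = term.strip().rstrip("*").lower()
    if not prefix:
        return []
    results = []
    for bib in bibliographies:
        text = f"{bib.get('author') or ''} {bib.get('title') or ''}".lower()
        words = re.findall(r"\w+", text)
        if any(word.startswith(prefix) for word in words):
            results.append(bib)
    return results


# Invented placeholder records, including one with no author.
SAMPLE = [
    {"author": "", "title": "Aberdeen example records"},
    {"author": "Abercrombie, A.", "title": "Sample title"},
    {"author": "Smith, J.", "title": "Another example"},
]

for record in prefix_search("aber", SAMPLE):
    print(record["author"] or "[no author]", "|", record["title"])
```

Because the term only has to match the start of a word, ‘aber’ and ‘aber*’ give the same results, while ‘berdeen’ finds nothing.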
The full search results (and the results side panel) in the new version now include titles as well as authors, which makes things clearer and I’ve also made the search results numbering appear at the top of the corresponding result text rather than on the last line. This is also the case for entry searches too. Once the test site has been fully tested and approved we should be able to replace the live site with the new site (ensuring all WordPress content from the live site is carried over, of course). Doing so will mean the old server containing the original API can (once we’re confident all is well) be switched off. There is still the matter of implementing the bibliography search for the V3 data, but as mentioned previously this will probably be best tackled once we have sorted out the issues with the data and we are getting ready to launch the new version.
I spent most of this week continuing to work on the Experts Atlas for the SCOSYA project, focussing primarily on implementing an alternative means of selecting attributes that allows attributes to be nested in two levels rather than just one. Implementing this has been a major undertaking as basically the entire attribute search was built around the use of drop-down lists (HTML <select> lists with <optgroup> items for parents and <option> items for attributes). Replacing this meant replacing how the code figures out what is selected, how the markers and legend are generated, how the history feature works and how the reloading of an atlas based on variables passed in the URL works. It took the best part of a day and a half to implement and it’s been pretty hellish.
The two levels of nesting rely on how the code parent is recorded in the CMS. Basically if you have two parts of a code parent separated by a space, a dash and a space ‘ – ‘ then the Experts Atlas will treat the part before the dash as the highest level in the hierarchy and the part after as another level down. So ‘AFTER-PREFECT’ is just one level whereas ‘AGREEMENT – MEASURES’ is two levels – ‘AGREEMENT’ as the top level and ‘MEASURES’ as the second.
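The splitting rule can be sketched as follows. This is Python for illustration only (the atlas itself does this in JavaScript), the function names are my own and the attribute codes ‘X1’ to ‘X3’ are made up. The sketch accepts either a hyphen or a longer dash as the separator, as long as there is a space on both sides.

```python
import re
from typing import Optional


def split_code_parent(code_parent: str) -> tuple[str, Optional[str]]:
    """Split a code parent into (top level, second level or None)."""
    # Only a dash with a space on each side counts as a level separator.
    parts = re.split(r"\s+[-–]\s+", code_parent.strip(), maxsplit=1)
    return (parts[0], parts[1]) if len(parts) == 2 else (parts[0], None)


def build_hierarchy(attributes: list[tuple[str, str]]) -> dict:
    """Group (code parent, attribute code) pairs into a two-level structure."""
    tree: dict = {}
    for code_parent, attribute in attributes:
        top, second = split_code_parent(code_parent)
        tree.setdefault(top, {}).setdefault(second, []).append(attribute)
    return tree


print(split_code_parent("AGREEMENT – MEASURES"))
print(split_code_parent("AFTER-PREFECT"))
print(build_hierarchy([
    ("AGREEMENT – MEASURES", "X1"),
    ("AGREEMENT – MEASURES", "X2"),
    ("AFTER-PREFECT", "X3"),
]))
```

Attributes with a single-level parent end up under a `None` key, so the interface can list them directly beneath the top-level heading.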
In the Atlas I’ve replaced the drop-down list with a scrollable area that contains all of the top level code parents, each with a ‘+’ icon. Press on one of these and an area will scroll down containing either the attributes or the second-level parent headings. If it’s the latter, press on one of these to open it and see the attributes. You can press on a heading a second time to hide the contents.
Each attribute is listed with its code, title and description (if available). Clicking on an attribute will select it, and the details will appear in a ‘Selected Attribute’ section underneath the scrollable area. In this initial version everything else about the attribute search was the same as before, including the option to add another and supply limit options. I also decided against replacing the ‘Age’ and ‘Rated by’ drop-down lists with buttons as I’m a bit worried about having too many similar looking buttons. E.g. if the ‘rated by’ drop-down list was replaced by buttons 1-4 it would look very similar to the ‘score’ option’s 1-5 buttons which could get confusing. Also, although the new attribute selection allows nesting of attributes, allows descriptions to be more easily read and allows us to control the look of the section (as opposed to an HTML select element which has its behaviour defined by the web browser) it does take up more space and might actually be less easy to use. The fixed height of it feels quite claustrophobic when using it, it’s more difficult to just see all of the attributes, and once the desired attribute has been selected it feels like the scrollable area is taking up a lot of unnecessary space. We’ll maybe need to see what the feedback is like when people use the atlas.
After a bit of further playing around with the interface I decided to make the limit options always visible, which I think works better and doesn’t take up much more vertical space. I’ve moved the ‘Show’ button to be in line with the ‘add another’ button and have relabelled both ‘Show on map’ (with magnifying glass icon) and ‘Add another attribute’. I’ve also moved the ‘remove’ button when multiple attributes are selected to the same line as the Boolean choices, which saves space and works better. Here’s a screenshot of the new attribute select feature:
I then started working on the final Experts Atlas feature: the group statistics. I added in an option for loading in the set of default groups that E had created previously. If you perform an attribute search and then go to the Group Statistics menu item you can now press one or more ‘Highlight’ buttons to highlight the corresponding groups. There are 5 different highlight colours (I can change these or add more as required) and they cycle through, so if you don’t like one you can keep clicking to find one you like. The group row is given the highlight colour as a background to help keep track of which group is which on the map. Highlighting works for all attribute search types but doesn’t work for the ‘Home’ map as it depends on contents of the pop-ups to function and the ‘Home’ map has no popups. Pressing on the ‘Statistics’ button opens up a further sliding panel and functions in the same way as the stats in the CMS Atlas, listing the attributes, giving stats for each and displaying a graph of individual score frequency. The following screenshot shows a map with highlighting and the stat panel open:
I noticed there were some issues with the stats panel staying open and displaying data that was no longer relevant when other menus were opened. I therefore fixed things to ensure that statistics and highlighting are removed when navigating between menus (but highlighting remains when changing an attribute search so you can more easily see the effect on your highlighted locations). I also fixed a bug in the statistics when ‘young’ or ‘old’ were selected that meant the count of ratings was not entirely accurate. In doing so I uncovered the issue with 11 questionnaires being of people that were too young to be ‘old’, which was causing problems for the display of data. Changing the year criteria in the API has fixed that.
I spent the rest of my time on the project working on the facilities to allow users to make their own groups. This should now be fully operational. This uses HTML5 LocalStorage, which is a way of storing data in the user’s browser. No data relating to the user’s chosen locations or group name is stored on the server and no logging in or anything is required to use the feature. However, data is stored in the user’s browser so if they use a different browser they won’t see their groups. Also, they need to have HTML5 LocalStorage enabled and supported in their browser. It is supported by default in all modern browsers so this shouldn’t be an issue for most people. If their browser doesn’t support it they simply won’t see the options to save their own groups.
If you open the group statistics menu, above the default groups is a section for ‘My saved groups’. If you click on the ‘create a group’ button you can create a group in the same way as you could in the CMS atlas – entering a name for your group and clicking on map markers to add a location (or clicking a second time to remove one). Once the ‘Save’ button is pressed your group will appear in the list of saved groups, and you will be able to highlight the markers and view statistics for it as you can with the default groups. There are also options to delete your group (there is currently no confirmation for this – if you press the button your group is immediately deleted) and edit your group, which returns you to the ‘create’ view but pre-populated with your data. You can rename your group or change the selected markers via this feature. I think that’s pretty much everything I needed to implement for the Experts Atlas, so next week I’ll press on with the facility to allow certain users to download the full dataset. Here’s a screenshot of the ‘My Groups’ feature:
Also this week I spent a bit of time on the Glasgow Medical Humanities website for Gavin Miller. This is due to go live in the next few weeks so I focussed on the last things that needed implemented. This included migrating the blog data from the old MHRC blog, which I managed to do via WordPress’s own import / export facilities. This worked pretty well, importing all of the text, the author details and the categories. The only thing it didn’t do was migrate the media, but Gavin has said this can be done manually as required so it’s not such a big deal. I also met with the project administrator on Friday to talk through the site and the CMS and discuss some of the issues she’d been encountering in accessing the site. I also found out a way of exporting the blog subscribers from the MHRC site so Gavin can let them know about the new site. I also created new versions of the ‘Spotlight on…’ and ‘Featured images’ from the old Medical Humanities site. I created these as WordPress widgets, meaning they can be added to any WordPress page simply by adding in a shortcode. I based the featured image carousel on Bootstrap, which includes a very nice carousel. Currently I’ve added both features to the bottom of the homepage, but these can easily be moved elsewhere. Here’s an example of how they look:
Other than the above I got into a discussion with various people across the University about a policy for publishing apps, responded to a request for help with audio files from Rob Maslen, had a chat with Gerry McKeever about the interactive map he wants to create for his project, spoke to Heather Pagan about the Anglo-Norman Dictionary data, helped Craig Lamont make some visual changes to the Editing Burns blog, replied to a query from Marc Alexander about how dates might be handled differently in the Historical Thesaurus, and had an email conversation with Ann Fergusson about the DSL bibliographical data.
I had my PDR session this week, so I needed to spend some time preparing for it, attending it, and reworking some of my PDR sections after it. I think it all went pretty well, though, and it’s good to get it over with for another year. I had one other meeting this week, with Sophie Vlacos from English Literature. She is putting a proposal together and I gave her some advice on setting up a website and other technical matters.
My main project of the week once again was SCOSYA, and this week I was able to really get stuck into the Experts Atlas interface, which I began work on last week. I’ve set up the Experts Atlas to use the same grey map as the Public Atlas, but it currently retains the red to yellow markers of the CMS Atlas. The side panel is slightly wider than the Public Atlas and uses different colours, taken from the logo. The fractional zoom from the Public Atlas is also included, as is the left-hand menu style (i.e. not taking the full height of the Atlas). The ‘Home’ map shows the interview locations, with each appearing as a red circle. There are no pop-ups on this map, but the location name appears as a tooltip when hovered over.
The ‘Search Attributes’ option is mostly the same as the ‘Attribute Search’ option in the CMS Atlas. I’ve not yet updated the display of the attributes to allow grouping at three as opposed to two levels, probably using a tree-based approach. This is something I’ll need to tackle next week. I have removed the ‘interviewed by’ option, but as of yet I haven’t changed the Boolean display. At a team meeting we had discussed making the joining of multiple attributes default to ‘And’ and to hide ‘Or’ and ‘Not’ but I just can’t think of a way of doing this without ending up with more clutter and complexity. ‘And’ is already the default option and I personally don’t think it’s too bad to just see the other options, even if they’re not used.
The searches all work in the same way as in the CMS Atlas, but I did need to change the API a little, as when multiple attributes were selected these weren’t being ordered by location (e.g. all the D3 data would display then all the D4 data rather than all the data for both attributes for Aberdeen etc). This was meaning the full information was not getting displayed in the pop-ups. I’ve also completely changed the content of the pop-ups so as to present the data in a tabular format. The average rating appears in a circle to the right of the pop-up, with a background colour reflecting the average rating. The individual ratings also appear in coloured circles, which I personally think works rather well. Changing the layout of the popup was a fairly major undertaking as I had to change the way in which the data was processed, but I’d say it’s a marked improvement on the popups in the CMS Atlas. I removed the descriptions from the popups as these were taking up a lot of space and they can be viewed in the left-hand menu anyway. Currently if a location doesn’t meet the search criteria and is given a grey marker the popup still lists all of the data that is found for the selected attributes at that location. I did try removing this and just displaying the ‘did not meet criteria’ message, but figured it would be more interesting for users to see what data there is and how it doesn’t meet the criteria. Below is a screenshot of the Experts Atlas and an ‘AND’ search selected:
Popups for ‘Or’ and ‘Not’ searches are identical, but for an ‘Or’ search I’ve updated the legend to try and make it more obvious what the different colours and shapes refer to. In the CMS Atlas the combinations appear as ‘Y/N’ values. E.g. if you have selected ‘D3 ratings 3-5’ OR ‘Q14 ratings 3-5’ then locations where neither are found were identified as ‘NN’, locations where D3 was present at these ratings but Q14 wasn’t were identified as ‘YN’, locations without D3 but with Q14 were ‘NY’ and locations with both were ‘YY’. This wasn’t very easy to understand, so now the legend includes the codes, as the following screenshot demonstrates:
I think this works a lot better, but there is a slight issue in that if someone chooses the same code but with different criteria (e.g. ‘D3 rated 4-5 by Older speakers’ OR ‘D3 rated 4-5 by Younger speakers’) the legend doesn’t differentiate between the different ‘D3’s, but hopefully anyone doing such a search would realise the first ‘D3’ relates to their first search selection while the second refers to their second selection.
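To make the legend idea a bit more concrete, here is a minimal Python sketch (not the atlas’s actual code, which isn’t shown in this post) of how the ‘YY’ / ‘YN’ / ‘NY’ / ‘NN’ combinations could be turned into labels that include the attribute codes. The function name and the ‘(selection 1)’ suffix are assumptions of mine; the suffix is just one possible way of telling two ‘D3’s apart and is not something the Experts Atlas currently does.

```python
from itertools import product

def legend_entries(codes):
    """Return (flags, label) pairs for every present/absent combination of the selected codes."""
    # Hypothetical tweak: number the selections only when the same code is chosen twice.
    duplicated = len(set(codes)) < len(codes)
    names = [
        f"{code} (selection {i})" if duplicated else code
        for i, code in enumerate(codes, start=1)
    ]
    entries = []
    for flags in product("YN", repeat=len(names)):
        label = ", ".join(
            f"{name}: {'present' if flag == 'Y' else 'absent'}"
            for name, flag in zip(names, flags)
        )
        entries.append(("".join(flags), label))
    return entries

for flags, label in legend_entries(["D3", "Q14"]):
    print(flags, "=", label)
```

With two different codes the labels are simply ‘D3: present, Q14: absent’ and so on; the numbering only kicks in when both selections use the same code.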
I have omitted the ‘spurious’ tags from the ratings in the popups and also the comments. I wasn’t sure whether these should be included, and if so how best to incorporate them. I’ve also not included the animated dropping down of markers in the Experts Atlas as firstly it’s supposed to be more serious and secondly the drop down effect won’t work with the types of markers used for the ‘Or’ search. I have also not currently incorporated the areas. We had originally decided to include these, but they’ve fallen out of favour somewhat, plus they won’t work with ‘Or’ searches, which rely on differently shaped markers as well as colour, and they don’t work so well with group highlighting either.
The next menu item is ‘My search log’, which is what I’ve renamed the ‘History’ feature from the CMS Atlas. This now appears in the general menu structure rather than replacing the left-hand menu contents. Previously the rating levels just ran together (e.g. 1234), which wasn’t very clear so I’ve split these up so the description reads something like:
“D3: I’m just after, age: all, rated by 1 or more people giving it at score of 3, 4 or 5 Or Q14: Baremeasurepint, age: all, rated by 1 or more people giving it at score of 3, 4 or 5 viewed at 15:41:00”
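Splitting up the run-together rating levels is a small string-formatting job. A rough Python sketch of the idea follows (the real search log isn’t written in Python and the function name is made up for illustration):

```python
def describe_ratings(levels):
    """Turn a run-together string such as '345' into '3, 4 or 5'."""
    parts = list(levels)
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    # Commas between all but the last two levels, then 'or' before the final one.
    return ", ".join(parts[:-1]) + " or " + parts[-1]

print(describe_ratings("345"))
```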
As with the CMS Atlas, pressing on a ‘Load’ button loads the search back into the map. The data download option has also been given its own menu item, and pressing on this downloads the CSV version of the data that’s displayed on the map. And that’s as far as I’ve got. The main things still to do are replacing the attribute drop-down list with a three-level tree-based approach and adding in the group statistics feature. Plus I still need to create the facility for managing users who have been authorised to download the full dataset and creating the login / download options for this.
Also this week I made some changes to the still-to-launch Glasgow Medical Humanities Network website for Gavin Miller. I made some minor tweaks, such as adding in the Twitter feed and links to subscribe to the blog, and updated the site text on pages that are not part of the WordPress interface. Gavin also wanted me to grab a copy of all the blogs on another of his sites (http://mhrc.academicblogs.co.uk/) and migrate this to the new site. However, getting access to this site has proved to be tricky. Gavin reckoned the domain was set up by UoG, but I submitted a Helpdesk query about it and no-one in IT knows anything about the site. Eventually someone in the Web Team got back to me to say that the site had been set up by someone in Research Strategy and Innovation and they’d try to get me access, but despite the best efforts of a number of people I spoke to I haven’t managed to get access yet. Hopefully next week, though.
Also this week I continued to work on the 18th Century Borrowing site for Matthew Sangster. I have now fixed the issue with the zoomable images that were landscape being displayed on their side, as demonstrated in last week’s post. All zoomable images should now display properly, although there are a few missing images at the start or end of the registers. I also developed all of the ‘browse’ options for the site. It’s now possible to browse a list of all student borrower names. This page displays a list of all initial letters of the surnames, with a count of the number of students with surnames beginning with the letter. Clicking on a letter displays a list of all students with surnames beginning with the letter, and a count of the number of records associated with each student. Clicking on a student brings up the results page, which lists all of the associated records in a tabular format. This is pretty much identical to the tabular view offered when looking at a page, only the records can come from any page. As such there is an additional column displaying the register and page number of each record, and clicking on this takes you to the page view, so you can see the record in context and view the record in the zoomable image if you so wish. There are links back to the results page, and also links back from the results page to the student page. Here’s an example of the list of students with surnames beginning with ‘C’:
The ‘browse professors’ page does something similar, only all professors are listed on one page rather than being split into different pages for each initial letter of the surname. This is due to there being a more limited number of professors. Note that there are some issues with the data, which is why we have professors listed with names like ‘, &’. There are also what look like duplicates listed as separate professors (e.g. ‘Traill, Dr’) because the surname and / or title fields must have contained additional spaces or carriage returns so the scripts considered the contents to be different. Clicking on a professor loads the results page in the same way as the students page. Note that currently there is no pagination of results, so for example clicking on ‘Mr Anderson’ will display all 1034 associated records in one long table. I might split this up, although in these days of touchscreens people tend to prefer scrolling through long pages rather than clicking links to browse through multiple smaller pages.
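The apparent duplicates caused by stray spaces and carriage returns could probably be collapsed by normalising whitespace before the values are compared. Here is a minimal Python sketch of that idea, using invented sample rows (the real scripts and data aren’t shown here, so treat the names as placeholders):

```python
import re
from collections import Counter

def normalise(value):
    """Collapse runs of whitespace (spaces, tabs, carriage returns) and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()

# Invented (surname, title) pairs; three of them differ only by invisible whitespace.
raw = [("Traill", "Dr"), ("Traill ", "Dr"), ("Traill", "Dr\r\n"), ("Anderson", "Mr")]

counts = Counter((normalise(surname), normalise(title)) for surname, title in raw)
for (surname, title), total in counts.items():
    print(f"{surname}, {title}: {total}")
```

This only deals with whitespace differences; genuinely odd values such as ‘, &’ would still need to be fixed in the data itself.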
‘Browse Classes’ does the same for classes. I also created two new related tables to hold details of the classes, which enables me to pass a numerical ‘class ID’ in the URL rather than the full class text, which is tidier and easier to control. Again, there are issues with the data that result in multiple entries for what is presumably the same class – e.g. ‘Anat, Anat., Anat:, Anato., Anatom, Anatomy’. Matthew is still working on the data and it might be that creating a ‘normalised’ text field for class is something that we should do.
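As a rough illustration of both points, the Python sketch below assigns a numeric ID to each distinct class string (the kind of value that could be passed in a URL) and then applies a very light normalisation. The function names are hypothetical, and the normalisation is deliberately naive: it only strips trailing punctuation and case, so it reduces the variants without resolving them all.

```python
def build_lookup(values):
    """Assign a numeric ID to each distinct value, in order of first appearance."""
    lookup = {}
    for value in values:
        lookup.setdefault(value, len(lookup) + 1)
    return lookup

def light_normalise(value):
    """Trim whitespace, drop trailing punctuation and lower-case the text."""
    return value.strip().rstrip(".:,;").lower()

classes = ["Anat", "Anat.", "Anat:", "Anato.", "Anatom", "Anatomy", "Anat."]

class_ids = build_lookup(classes)
normalised = {light_normalise(c) for c in classes}

print(class_ids)
print(sorted(normalised))
```

Six raw variants come down to four here, which suggests a proper ‘normalised’ field would still need some manual mapping of abbreviations to full class names.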
‘Book Names’ does the same thing for book names. Again, I’ve written a script that extracts all of the unique book names and stores them once, allowing me to pass a ‘book name ID’ in the URL rather than the full text. As with ‘students’ an alphabetical list of book names is presented initially due to the number of different books. And as with other data types, a normalised book name should ideally be recorded as there are countless duplicates with slight variations here, making the browse feature pretty unhelpful as it currently stands. I’ve taken the same approach with book titles, although surprisingly there is less variation here, even though the titles are considerably longer. One thing to note is that any book with a title that doesn’t start with an a-z character is currently not included. There are several that start with ‘….’ and some with ‘[’ that are therefore omitted. This is because the initial letter is passed in the URL and for security reasons there are checks in place to stop characters other than a-z being passed. ‘Browse Authors’ works in the same way, and generally there don’t appear to be too many duplicate variants, although there are some (e.g. ‘Aeschylus’ and ‘Aeschylus.’), and finally, there is browse by lending date, which groups records by month of lending.
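The a-z check on the initial letter is the reason titles starting with ‘….’ or ‘[’ drop out of the browse. A minimal Python sketch of that kind of check is below; it is an assumption about how such a check might look rather than the site’s actual code, and the function names and sample titles are made up.

```python
import re

LETTER = re.compile(r"[a-z]")

def safe_letter(param):
    """Return the parameter if it is exactly one a-z character, otherwise None."""
    return param if LETTER.fullmatch(param) else None

def browse_initial(title):
    """Return the browse letter for a title, or None if it has no a-z initial."""
    return safe_letter(title.strip()[:1].lower())

for title in ["Hume's History", "….Essays", "[Sermons]"]:
    print(repr(title), "->", browse_initial(title))
```

Anything returning None here would be left out of the alphabetical list, which matches the behaviour described above.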
Also this week I added a new section to Bryony Randall’s New Modernist Editing site for her AHRC follow-on funding project: https://newmodernistediting.glasgow.ac.uk/the-imprints-of-the-new-modernist-editing/ and I spent a bit of time on DSL duties too. I responded to a long email from Rhona Alcorn about the data and scripts that Thomas Widmann had been working on before he left, and I looked at some bibliographical data that Ann Fergusson had sent me last week, investigating what the files contained and how the data might be used.
Next week I will continue to focus on the SCOSYA project and try to get the Experts Atlas finished.
I focussed on the SCOSYA project for the first few days of this week. I need to get everything ready to launch by the end of September and there is an awful lot still left to do, so this is really my priority at the moment. I’d noticed over the weekend that the story pane wasn’t scrolling properly on my iPad when the length of the slide was longer than the height of the atlas. In such cases the content was just getting cut off and you couldn’t scroll down to view the rest or press the navigation buttons. This was weird as I thought I’d fixed this issue before. I spent quite a bit of time on Monday investigating the issue, which has resulted in me having to rewrite a lot of the slide code. After much investigation I reckoned that this was an intermittent fault caused by the code returning a negative value for the height of the story pane instead of its real height. When the user presses the button to load a new slide the code pulls the HTML content of the slide in and immediately displays it. After that another part of the code then checks the height of the slide to see if the new contents make the area taller than the atlas, and if so the story area is then resized. The loading of the HTML using jQuery’s html() function should be ‘synchronous’ – i.e. the following parts of code should not execute before the loading of the HTML is completed. But sometimes this wasn’t the case – the new slide contents weren’t being displayed before the check for the new slide height was being run, meaning the slide height check was giving a negative value (no contents minus the padding round the slide). The slide contents then displayed but as the code thought the slide height was less than the atlas it was not resizing the slide, even when it needed to. It is a bit of a weird situation as according to the documentation it shouldn’t ever happen. 
I’ve had to put a short ‘timeout’ into the script as a work-around – after the slide loads the code waits for half a second before checking for the slide height and resizing, if necessary. This seems to be working but it’s still annoying to have to do this. I tested this out on my Android phone and on my desktop Windows PC with the browser set to a narrow height and all seemed to be working. However, when I got home I tested the updated site out on my iPad and it still wasn’t working, which was infuriating as it was working perfectly on other touchscreens.
In order to fix the issue I needed to entirely change how the story pane works. Previously the story pane was just an HTML area that I’d added to the page and then styled to position within the map, but there were clearly some conflicts with the mapping library Leaflet when using this approach. The story pane was positioned within the map area and mouse actions that Leaflet picks up (scrolling and clicking for zoom and pan) were interfering with regular mouse actions in the HTML story area (clicking on links, scrolling HTML areas). I realised that scrolling within the menu on the left of the map was working fine on the iPad so I investigated how this differed from the story pane on the right. It turned out that the menu wasn’t just a plain HTML area but was instead created by a plugin for Leaflet that extends Leaflet’s ‘Control’ options (used for buttons like ‘+/-’ and the legend). Leaflet automatically prevents the map’s mouse actions from working within its control areas, which is why scrolling in the left-hand menu worked. I therefore created my own Leaflet plugin for the story pane, based on the menu plugin. Using this method to create the story area thankfully worked on my iPad, but it did unfortunately take several hours to get things working, which was time I should ideally have been spending on the Experts interface. It needed to be done, though, as we could hardly launch an interface that didn’t work on iPads.
I also had to spend some further time this week making some more tweaks to the story interface that the team had suggested such as changing the marker colour for the ‘Home’ maps, updating some of the explanatory text and changing the pop-up text on the ‘Home’ map to add in buttons linking through to the stories. The team also wanted to be able to have blank maps in the stories, to make users focus on the text in the story pane rather than getting confused by all of the markers. Having blank maps for a story slide wasn’t something the script was set up to expect, and although it was sort of working, if you navigated from a map with markers to a blank map and then back again the script would break, so I spent some time fixing this. I also managed to find a bit of time to start on the experts interface, although less time than I had hoped. For this I’ve needed to take elements from the atlas I’d created for staff use, but adapt them to incorporate changes that I’d introduced for the public atlas. This has basically meant starting from scratch and introducing new features one by one. So far I have the basic ‘Home’ map showing locations and the menu working. There is still a lot left to do.
I spent the best part of two days this week working on the front-end for the 18th Century Borrowing pilot project for Matthew Sangster. I wrote a little document that detailed all of the features I was intending to develop and sent this to Matt so he could check to see if what I’m doing met his expectations. I spent the rest of the time working on the interface, and made some pretty good progress. So far I’ve made an initial interface for the website (which is just temporary and any aspect of which can be changed as required), I’ve written scripts to generate the student forename / surname and professor title / surname columns to enable searching by surname, and I’ve created thumbnails of the images. The latter was a bit of a nightmare as previously I’d batch rotated the images 90 degrees clockwise as the manuscripts (as far as I could tell) were written in landscape format but the digitised images were portrait, meaning everything was on its side.
However, I did this using the Windows image viewer, which gives the option of applying the rotation to all images in a folder. What I didn’t realise is that the image viewer doesn’t update the metadata embedded in the images, and this information is used by browsers to decide which way round to display the images. I ended up in a rather strange situation where the images looked perfect on my Windows PC, and also when opened directly within the browser, but when embedded in an HTML page they appeared on their side. It took a while to figure out why this was happening, but once I did I regenerated the thumbnails using the command-line ImageMagick tool instead, which I set to wipe the image metadata as well as rotating the images, which seemed to work. That is until I realised that Manuscript 6 was written in portrait not landscape so I had to repeat the process again but miss out Manuscript 6. I have since realised that all the batch processing of images I did to generate tiles for the zooming and panning interface is also now going to be wrong for all landscape images and I’m going to have to redo all of this too.
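For anyone facing the same rotation problem, here is a rough Python sketch of how the ImageMagick commands might be assembled, skipping the portrait manuscript. The folder names and the helper function are invented for illustration. `mogrify`, `-rotate` and `-strip` are real ImageMagick features (`-strip` removes embedded profiles such as EXIF orientation data), but note that `mogrify` overwrites files in place, so it should be run on copies. The sketch only builds the commands; running them would need ImageMagick installed and something like `subprocess.run(cmd, check=True)`.

```python
from pathlib import PurePosixPath

# Hypothetical folder name for the manuscript that is written in portrait.
PORTRAIT_MANUSCRIPTS = {"manuscript6"}

def rotate_command(image_path):
    """Build a mogrify command for one image, or None if its manuscript is portrait."""
    path = PurePosixPath(image_path)
    if path.parent.name in PORTRAIT_MANUSCRIPTS:
        return None  # already the right way up, so leave it alone
    # Rotate 90 degrees clockwise and wipe the embedded metadata in one pass.
    return ["mogrify", "-rotate", "90", "-strip", str(path)]

for image in ["images/manuscript3/page001.jpg", "images/manuscript6/page001.jpg"]:
    print(image, "->", rotate_command(image))
```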
Anyway, I also made the facility where a user can browse the pages of the manuscripts, enabling them to select a register, view the thumbnails of each page contained therein and then click through to view all of the records on the page. This ‘view records’ page has both a text and image view. The former displays all of the information about each record on the page in a tabular manner, including links through to the GUL catalogue and the ESTC. The latter presents the image in a zoomable / pannable manner, but as mentioned earlier, the bloody image is on its side for any manuscript written in a landscape way and I still need to fix this, as the following screenshot demonstrates:
Also this week I spent a further bit of time preparing for my PDR session that I will be having next week, spoke to Wendy Anderson about updates to the SCOTS Corpus advanced search map that I need to fix, fixed an issue with the Medical Humanities Network website, made some further tweaks to the RNSN song stories and spoke to Ann Fergusson at the DSL about the bibliographical data that needs to be incorporated into the new APIs. Another pretty busy week, all things considered.
I worked on several different projects this week. First of all I completed work on the new Medical Humanities Network website for Gavin Miller. I spent most of last week working on this but didn’t quite manage to get everything finished off, but I did this week. This involved completing the front-end pages for browsing through the teaching materials, collections and keywords. I still need to add in a carousel showing images for the project, and a ‘spotlight on…’ feature, as are found on the homepage of the UoG Medical Humanities site, but I’ll do this later once we are getting ready to actually launch the site. Gavin was hoping that the project administrator would be able to start work on the content of the website over the summer, so everything is in place and ready for them when they start.
With that out of the way I decided to return to some of the remaining tasks in the Historical Thesaurus / OED data linking. It had been a while since I last worked on this, but thankfully the list of things to do I’d previously created was easy to follow and I could get back into the work, which is all about comparing dates for lexemes between the two datasets. We really need to get further information from the OED before we can properly update the dates, but for now I can at least display some rows where the dates should be updated, based on the criteria we agreed on at our last HT meeting.
To begin with I completed a ‘post dating’ script. This goes through each matched lexeme (split into different outputs for ‘01’, ‘02’ and ‘03’ due to the size of the output) and for each it firstly changes (temporarily) any OED dates that are less than 1100 to 1100 and any OED dates that are greater than 1999 to 2100. This is so as to match things up with the HT’s newly updated Apps and Appe fields. The script then compares the HT Appe and OED Enddate fields (the ‘Post’ dates). It ignores any lexemes where these are the same. If they’re not the same the script outputs data in colour-coded tables.
The green table contains lexemes where Appe is greater than or equal to 1150, Appe is less than or equal to 1850 and Enddate is greater than Appe and the difference between Appe and Enddate is no more than 100 years OR Appe is greater than 1850 and Enddate is greater than Appe. The yellow table contains lexemes (other than the above) where Enddate is greater than Appe and the difference between Appe and Enddate is between 101 and 200. In the orange table there are lexemes where Enddate is greater than Appe and the difference between Appe and Enddate is between 201 and 250, while the red table contains lexemes where Enddate is greater than Appe and the difference between Appe and Enddate is more than 250. It’s a lot of data, and fairly evenly spread between tables, but hopefully it will help us to ‘tick off’ dates that should be updated with figures from the OED data.
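The banding rules are easier to follow as code. Below is a minimal Python sketch of the ‘post dating’ classification as I’ve described it; it is an illustration rather than the script itself, and the function names are made up. Two assumptions are worth flagging: the red band is taken to start above 250 so that it doesn’t overlap with orange, and any lexeme that fits none of the bands (for example where Enddate is earlier than Appe) simply returns None.

```python
def clamp_oed(year):
    """Temporarily map OED dates before 1100 to 1100 and dates after 1999 to 2100."""
    if year < 1100:
        return 1100
    if year > 1999:
        return 2100
    return year

def post_dating_band(appe, oed_enddate):
    """Return the colour band for a matched lexeme, or None if it is not tabulated."""
    enddate = clamp_oed(oed_enddate)
    if enddate <= appe:
        return None  # identical dates are ignored; earlier OED dates are not covered here
    diff = enddate - appe
    if appe > 1850 or (1150 <= appe <= 1850 and diff <= 100):
        return "green"
    if 101 <= diff <= 200:
        return "yellow"
    if 201 <= diff <= 250:
        return "orange"
    if diff > 250:  # assumption: red begins where orange ends
        return "red"
    return None

for appe, enddate in [(1700, 1750), (1900, 2005), (1600, 1750), (1500, 1725), (1300, 1800)]:
    print(appe, enddate, post_dating_band(appe, enddate))
```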
I then created an ‘ante dating’ script that looks at the ‘before’ dates, based on OED Firstdate (or ‘Sortdate’ as they call it) and HT Apps. This looks at rows where Firstdate is earlier than Apps and splits the data up into colour coded chunks in a similar manner to the above script. I then created a further script that identifies lexemes where there is a later first date or an earlier end date in the OED data for manual checking, as such dates are likely to need investigation.
Finally, I created a script that brings back a list of all of the unique date forms in the HT. This goes through each lexeme and replaces individual dates with ‘nnnn’, then strings all of the various (and there are a lot) date fields together to create a date ‘fingerprint’. Individual date fields are separated with a bar (|) so it’s possible to extract specific parts. The script also made a count of the number of times each pattern was applied to a lexeme. So we have things like ‘|||nnnn||||||||||||||_’ which is applied to 341,308 lexemes (this is a first date and still in current use) and ‘|||nnnn|nnnn||-|||nnnn|nnnn||+||nnnn|nnnn||’ which is only used for a single lexeme. I’m not sure exactly what we’re going to use this information for, but it’s interesting to see the frequency of the patterns.
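The fingerprinting approach can be sketched in a few lines of Python. This is only an illustration: the four-field layout and the sample values below are invented and much shorter than the HT’s real set of date fields.

```python
import re
from collections import Counter

def fingerprint(date_fields):
    """Replace every run of digits with 'nnnn' and join the fields with a bar."""
    return "|".join(re.sub(r"\d+", "nnnn", field) for field in date_fields)

# Invented lexemes, each with four made-up date fields.
lexemes = [
    ["", "1386", "", "_"],
    ["", "1500", "", "_"],
    ["a", "1300", "1450", ""],
]

pattern_counts = Counter(fingerprint(fields) for fields in lexemes)
for pattern, total in pattern_counts.most_common():
    print(pattern, total)
```

Because the bars are kept even for empty fields, a specific part of any fingerprint can still be pulled out by splitting on ‘|’.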
I spent most of the rest of the week working on the DSL. This included making some further tweaks to the WordPress version of the front-end, which is getting very close to being ready to launch. This included updating the way the homepage boxes work to enable staff to more easily control the colours used and updating the wording for search results. I also investigated an issue in the front end whereby slightly different data was being returned for entries depending on the way in which the data was requested. Using dictionary ID (e.g. https://dsl.ac.uk/entry/dost44593) brings back some additional reference text that is not returned when using the dictionary and href method (e.g. https://dsl.ac.uk/entry/dost/proces_n). It looks like the DSL API processes things differently depending on the type of call, which isn’t good. I also checked the full dataset I’d previously exported from the API for future use and discovered it is the version that doesn’t contain the full reference text, so I will need to regenerate this data next week.
My main DSL task was to work on a new version of the API that just uses PHP and MySQL, rather than technologies that Arts IT Support are not so keen on having on their servers. As I mentioned, I had previously run a script that got the existing API to spit out its fully generated data for every single dictionary entry and it’s this version of the data that I am currently building the new API around. My initial aim is to replicate the functionality of the existing API and plug a version of the DSL website into it so we can compare the output and performance of the new API to that of the existing API. Once I have the updated data I will create a further version of the API that uses this data, but that’s a little way off yet.
So far I have completed the parts of the API for getting data for a single entry and the data required by the ‘browse’ feature. Information on how to access the data, and some examples that you can follow, are included in the API definition page. Data is available as JSON (the default as used by the website) and CSV (which can be opened in Excel). However, while the CSV data can be opened directly in Excel any Unicode characters will be garbled, and long fields (e.g. the XML content of long entries) will likely be longer than the maximum cell size in Excel and will break onto new lines.
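One common workaround for the garbled Unicode, which the new API does not currently do, is to prefix the CSV with a UTF-8 byte order mark so that Excel detects the encoding when the file is opened directly. The Python sketch below shows the idea and also flags fields that exceed Excel’s limit of 32,767 characters per cell. The function names and sample row are hypothetical, and the API itself is written in PHP, so this is only an illustration of the technique.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # maximum number of characters Excel holds in one cell

def to_excel_friendly_csv(rows, fieldnames):
    """Serialise rows as CSV bytes with a UTF-8 BOM at the start."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    # 'utf-8-sig' prepends the BOM that Excel uses to recognise UTF-8.
    return buffer.getvalue().encode("utf-8-sig")

def oversized_fields(row):
    """Return the names of fields that would not fit in a single Excel cell."""
    return [name for name, value in row.items() if len(str(value)) > EXCEL_CELL_LIMIT]

rows = [{"id": "dost44593", "headword": "ȝere", "xml": "<entry>" + "x" * 40000 + "</entry>"}]
data = to_excel_friendly_csv(rows, ["id", "headword", "xml"])
print(data[:3], oversized_fields(rows[0]))
```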
I also replicated the WordPress version of the DSL front-end here and set it up to work with my new API. As of yet the searches don’t work as I haven’t developed the search parts of the API, but it is possible to view individual entries and use the ‘browse’ facility on the entry page. These features use the new API and the new ‘fully generated’ data. This will allow staff to compare the display of entries to see if anything looks different.
I still need to work on the search facilities of the API, and this might prove to be tricky. The existing API uses Apache Solr for fulltext searching, which is a piece of indexing software that is very efficient for large volumes of text. It also brings back nice snippets showing where results are located within texts. Arts IT Support don’t really want Solr on their servers as it’s an extra thing for them to maintain. I am hoping to be able to develop comparable full text searches just using the database, but it’s possible that this approach will not be fast enough, or pinpoint the results as well as Solr does. I’ll just need to see how I get on in the coming weeks.
I also worked a little bit on the RNSN project this week, adding in some of the concert performances to the existing song stories. Next week I’m intending to start on the development of the front end for the SCOSYA project, and hopefully find some time to continue with the DSL API development.
As Monday was Easter Monday this was a four-day week for me. I spent almost the entire time working on Gavin Miller’s new Glasgow Medical Humanities project. This is a Wellcome Trust funded project that is going to take the existing University of Glasgow Medical Humanities Network resource (https://medical-humanities.glasgow.ac.uk/) that I helped set up for Megan Coyer a number of years ago and broaden it out to cover all institutions in Glasgow. The project will have a new website and interface, with facilities to enable an administrator to manage all of the existing data, plus add new data from both UoG and other institutions. I met with Gavin a few weeks ago to discuss how the new resource should function. He had said he wanted the functionality of a blog with additional facilities to manage the data about Medical Humanities projects, people, teaching materials, collections and keywords. The old site enabled any UoG based person to register and then log in to add data, but this feature was never really used – in reality all of the content was managed by the project administrators. As the new site would no longer be restricted to UoG staff we decided that to keep things simple and less prone to spamming we would not allow people to register with the site, and that all content would be directly managed by the project team. Anyone who wants to add or edit content would have to contact the project team and ask them to do so.
I wasn’t sure how best to implement the management of data. The old site had a different view of certain pages when an admin user was signed in, enabling them to manage the data, but as we’re no longer going to let regular users sign in I’d rather keep the admin interface completely separate. As a blog is required the main site will be WordPress powered, and there were two possible ways of implementing the admin interface for managing the project’s data. The first approach would be to write a plug-in for WordPress that would enable the data to be managed directly through the WordPress Admin interface. I took this approach with Gavin’s earlier SciFiMedHums project (https://scifimedhums.glasgow.ac.uk/). However, this does mean the admin interface is completely tied in to WordPress and if we ever wanted to keep the database going but drop the WordPress parts the process would be complicated. Also, embedding the data management pages within the WordPress Admin interface limits the layout options and can make the user interface more difficult to navigate. This brings me to the second option, which is to develop a separate content management system for the data, that connects to WordPress to supply user authentication, but is not connected to WordPress in any other way. I’ve taken this approach with several other projects, such as The People’s Voice (https://thepeoplesvoice.glasgow.ac.uk/). This approach allows greater flexibility in the creation of the interface, allows the Admin user to log in with their WordPress details, but as the system and WordPress are very loosely coupled any future separation will be straightforward to manage. The second option is the one I decided to adopt for the new project.
I spent the week installing WordPress, setting up a theme and some default pages, designing an initial banner image based on images from the old site, migrating the database to the new domain and tweaking it to make it applicable for data beyond the University of Glasgow and then developing the CMS for the project. This allows an Admin user to add, edit and delete information about Medical Humanities projects, people, teaching materials, collections and keywords. Thankfully I could adapt most of the code from the old site, although a number of tweaks had to be made along the way.
With the CMS in place I then began to create the front-end pages to access the data. As with projects such as The People’s Voice, these pages connect to WordPress in order to pull in theme information, and are embedded within WordPress by means of menu items, but are otherwise separate entities with no connection to WordPress. If in future the pages need to function independently of WordPress the only updates required will be to delete a couple of lines of code that reference WordPress from the scripts, and everything else will continue to function. I created new pages to allow projects and people to be browsed, results to be displayed and individual records to be presented. Again, much of the code was adapted from the old website, and some new stuff was adapted from other projects I’ve worked on. I didn’t quite manage to get all of the front-end functionality working by the end of the week, and I still have the pages for teaching materials, collections and keywords to complete next week. The site is mostly all in place, though. Here’s a screenshot of one of the pages, but note that the interface, banner and colour scheme might change before the site goes live:
In addition to working on this project I also got the DSL website working via HTTPS (https://dsl.ac.uk/), which took a bit of sorting out with Arts IT Support but is fully working now. I also engaged in a pretty long email conversation about a new job role relating to the REF, and provided feedback on a new job description. Next week I hope to complete the work on the Glasgow Medical Humanities site, do some work for the DSL, maybe find some time to get back into Historical Thesaurus issues and also begin work on the front-end features for the SCOSYA project. Quite a lot to do, then.
This week I spent a lot of time continuing with the HT/OED linking task, tackling the outstanding items on my ‘to do’ list before I met with Marc and Fraser on Friday. This included the following:
Re-running category pattern matching scripts on the new OED categories: The bulk of the category matching scripts rely on matching the HT’s oedmaincat field against the OED’s path field (and then doing other things like comparing category contents). However, these scripts aren’t really very helpful with the new OED category table as the path has changed for a lot of the categories. The script that seemed the most promising was number 17 in our workflow document, which compares first dates of all lexemes in all unmatched OED and HT categories and doesn’t check anything else. I’ve created an updated version of this that uses the new OED data, and the script only brings back unmatched categories that have at least one word that has a GHT date, and interestingly the new data has fewer unmatched categories featuring GHT dates than the old data (591 as opposed to 794). I’m not really sure why this is, or what might have happened to the GHT dates. The script brings back five 100% matches (only 3 more than the old data, all but one containing just one word) and 52 matches that don’t meet our criteria (down from 56 with the old data) so was not massively successful.
Ticking off all matching HT/OED lexemes rather than just those within completely matched categories: 627863 lexemes are now matched. There are 731307 non-OE words in the HT, so about 86% of these are ticked off. There are 751156 lexemes in the new OED data, so about 84% of these are ticked off. Whilst doing this task I noticed another unexpected thing about the new OED data: the number of categories in ’01’ and ‘02’ have decreased while the number in ‘03’ has increased. In the old OED data we have the following number of matched categories:
In the new OED data we have the following number of matched categories:
The totals match up, other than the 42 matched categories that have been deleted in the new data, so (presumably) some categories have changed their top level. Matching up the HT and OED lexemes has introduced a few additional duplicates, caused when a ‘stripped’ form means multiple words within a category match. There aren’t too many, but they will need to be fixed manually.
Identifying all words in matched categories that have no GHT dates and seeing which of these can be matched on stripped form alone: I created a script to do this, which lists every unmatched OED word that doesn’t have a GHT date in every matched OED category and then tries to find a matching HT word from the remaining unmatched words within the matched HT category. Perhaps I misunderstood what was being requested because there are no matches returned in any of the top-level categories. But then maybe OED words that don’t have a GHT date are likely to be new words that aren’t in the HT data anyway?
Create a monosemous script that finds all unmatched HT words that are monosemous and sees whether there are any matching OED words that are also monosemous: Again, I think the script I created will need more work. It is currently set to only look at lexemes within matched categories. It finds all the unmatched HT words that are in matched categories, then checks how many times each word appears amongst the unmatched HT words in matched categories of the same POS. If the word only appears once then the script looks within the matched OED category to find a currently unmatched word that matches. At the moment the script does not check to see if this word is monosemous as I figured that if the word matches and is in a matched category it’s probably a correct match. Of the 108212 unmatched HT words in matched categories, 70916 are monosemous within their POS and of these 14474 can be matched to an OED lexeme in the corresponding OED category.
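The monosemous check described above can be sketched roughly as follows. This is only an illustrative Python outline, not the script itself: the function name, the dictionary keys ('word', 'pos', 'catid') and the sample data are all assumptions made for the example.

```python
from collections import Counter


def match_monosemous(ht_words, oed_words_by_cat, cat_map):
    """Match HT words that occur only once within their POS.

    ht_words: unmatched HT words in matched categories, as dicts with
              hypothetical keys 'word', 'pos' and 'catid'.
    oed_words_by_cat: OED catid -> set of currently unmatched OED words.
    cat_map: HT catid -> matched OED catid.
    """
    # Count how often each word appears among unmatched HT words of the same POS.
    counts = Counter((w["word"], w["pos"]) for w in ht_words)
    matches = []
    for w in ht_words:
        if counts[(w["word"], w["pos"])] != 1:
            continue  # not monosemous within its POS
        oed_cat = cat_map[w["catid"]]
        # Look only inside the matched OED category for an unmatched word.
        if w["word"] in oed_words_by_cat.get(oed_cat, set()):
            matches.append((w["catid"], oed_cat, w["word"]))
    return matches


# Invented sample data for illustration only.
ht_sample = [
    {"word": "alpha", "pos": "n", "catid": 1},
    {"word": "beta", "pos": "n", "catid": 1},
    {"word": "beta", "pos": "n", "catid": 2},
    {"word": "gamma", "pos": "v", "catid": 2},
]
oed_sample = {10: {"alpha", "beta"}, 20: {"beta"}}
result = match_monosemous(ht_sample, oed_sample, {1: 10, 2: 20})
```

As in the description, the sketch does not check whether the OED word is itself monosemous; it relies on the category match to make the pairing plausible.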
Deciding which OED dates to use: I created a script that gets all of the matched HT and OED lexemes in one of the top-level categories (e.g. 01) and then for each matched lexeme works out the largest difference between OED sortdate and HT firstd (if sortdate is later then sortdate-firstd, otherwise firstd-sortdate); works out the largest difference between OED enddate and HT lastd in the same way; adds these two differences together to work out the largest overall difference. It then sorts the data on the largest difference and then displays all lexemes in a table ordered by largest difference, with additional fields containing the start difference, end difference and total difference for info. I did, however, encounter a potential issue: Not all HT lexemes have a firstd and lastd. E.g. words that are ‘OE-’ have nothing in firstd and lastd but instead have ‘OE’ in the ‘oe’ column and ‘_’ in the ‘current’ column. In such cases the differences between HT and OED dates are massive, but not accurate. I wonder whether using HT’s apps and appe columns might work better.
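The date comparison above amounts to two absolute differences added together. A minimal Python sketch of the idea follows; the keys reuse the column names mentioned above (sortdate, enddate, firstd, lastd), but the function names and the sample rows are invented, and sorting largest-first is an assumption.

```python
def date_differences(lexeme):
    """Return (start_diff, end_diff, total_diff) for one matched lexeme."""
    # abs() covers both cases: sortdate later than firstd, or the reverse.
    start_diff = abs(lexeme["sortdate"] - lexeme["firstd"])
    end_diff = abs(lexeme["enddate"] - lexeme["lastd"])
    return start_diff, end_diff, start_diff + end_diff


def rank_by_difference(lexemes):
    """Add the three difference fields to each row and sort largest first."""
    rows = []
    for lex in lexemes:
        start_diff, end_diff, total = date_differences(lex)
        rows.append({**lex, "start_diff": start_diff,
                     "end_diff": end_diff, "total_diff": total})
    return sorted(rows, key=lambda r: r["total_diff"], reverse=True)


# Invented rows for illustration only.
sample = [
    {"word": "alpha", "sortdate": 1300, "enddate": 1900, "firstd": 1350, "lastd": 1890},
    {"word": "beta", "sortdate": 1500, "enddate": 1700, "firstd": 1500, "lastd": 1700},
    {"word": "gamma", "sortdate": 1200, "enddate": 1600, "firstd": 1000, "lastd": 1900},
]
ranked = rank_by_difference(sample)
```

Rows with nothing in firstd or lastd (the ‘OE-’ case) would need to be handled separately before being passed in, since the subtraction assumes numeric years on both sides.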
Looking at lexemes that have an OED citation after 1945, which should be marked as ‘current’: I created a script that goes through all of the matched lexemes and lists all of the ones that either have an OED sortdate greater than 1945 or an OED enddate greater than 1945 where the matched HT lexeme does not have the ‘current’ flag set to ‘_’. There are 73919 such lexemes.
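The test applied to each matched lexeme is small enough to show in full. This is a hedged Python sketch of the condition as described, with hypothetical names; it assumes the dates are already integers and that ‘_’ in the ‘current’ column marks a current word.

```python
CURRENT_FLAG = "_"   # value of the HT 'current' column for current words
CUTOFF_YEAR = 1945


def needs_current_flag(oed_sortdate, oed_enddate, ht_current):
    """True if the OED has a date after 1945 but the HT word is not flagged current."""
    cited_after_cutoff = oed_sortdate > CUTOFF_YEAR or oed_enddate > CUTOFF_YEAR
    return cited_after_cutoff and ht_current != CURRENT_FLAG
```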
On Friday afternoon I had a meeting with Marc and Fraser where we discussed the above and our next steps. I now have a further long ‘to do’ list, which I will no doubt give more information about next week.
Other than HT duties I helped out with some research proposals this week. Jane Stuart-Smith and Eleanor Lawson are currently putting a new proposal together and I helped to write the data management plan for this. I also met with Ophira Gamliel in Theology to discuss a proposal she’s putting together. This involved reading through a lot of materials and considering all the various aspects of the project and the data requirements of each, as it is a highly multifaceted project. I’ll need to spend some further time next week writing a plan for the project.
I also had a chat to Wendy Anderson about updating the Mapping Metaphor database, and also the possibility of moving the site to a different domain. I also met with Gavin Miller to discuss the new website I’ll be setting up for his new Glasgow-wide Medical Humanities Network, and I ran some queries on the DSL database in order to extract entries that reference the OED for some work Fraser is doing.
Finally, I had to make some changes to the links from the Bilingual Thesaurus to the Middle English dictionary website. The site has had a makeover, and is looking great, but unfortunately when they redeveloped the site they didn’t put redirects from the old URLs to the new ones. This is pretty bad as it means anyone who has cited or bookmarked a page will end up with broken links, not just BTh. I would imagine entries have been cited in countless academic papers and all these citations will now be broken, which is not good. Anyway, I’ve fixed the MED links in BTh now. Unfortunately there are two forms of link in the database, for example: http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466 and http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact. I’m not sure why this is the case and I’ve no idea what the ‘byte’ number refers to in the second link type. The first type includes the entry ID, which is still used in the new MED URLs. This means I can get my script to extract the ID from the URL in the database and then replace the rest with the new URL, so the above becomes https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466 as the target for our MED button and links directly through to the relevant entry page on their new site.
Unfortunately there doesn’t seem to be any way to identify an individual entry page for the second type of link. This means there is no way to link directly to the relevant entry page. However, I can link to the search results page by passing the headword, and this works pretty well. So, for example the three words on this page: https://thesaurus.ac.uk/bth/category/?type=search&hw=2&qsearch=catourer&page=1#id=1393 have the second type of link, but if you press on one of the buttons you’ll find yourself at the search results page for that word on the MED website, e.g. https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer.
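The two cases can be illustrated with a short Python sketch. It is not the script used on the site: the function name is hypothetical, and the headword is assumed to be available alongside the stored URL. The URL patterns themselves are the ones quoted above.

```python
from urllib.parse import urlparse, parse_qs, urlencode

NEW_BASE = "https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary"


def new_med_url(old_url, headword):
    """Map an old MED link to the new site.

    'type=id' links carry the entry ID, so they map straight to the entry page.
    Any other link type (e.g. 'type=byte') falls back to a headword search.
    """
    query = parse_qs(urlparse(old_url).query)
    if query.get("type") == ["id"] and "id" in query:
        return f"{NEW_BASE}/{query['id'][0]}"
    # Search parameters copied from the example search URL above.
    params = {"utf8": "\u2713", "search_field": "hnf", "q": headword}
    return f"{NEW_BASE}?{urlencode(params)}"
```

Falling back to a search rather than dropping the link keeps the MED button useful for the second link type, at the cost of one extra click for the user.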
I returned to work on Monday after being off last week. As usual there were a bunch of things waiting for me to sort out when I got back, so most of Monday was spent catching up with things. This included replying to Scott Spurlock about his Crowdsourcing project, responding to a couple of DSL related issues, updating access restrictions on the SPADE website, reading through the final versions of the DMP and other documentation for Matt Sangster and Katie Halsey’s project, updating some details on the Medical Humanities Network website, responding to a query about the use of the Thesaurus of Old English and speaking to Thomas Clancy about his Iona proposal.
With all that out of the way I returned to the OED / HT data linking issues for the Historical Thesaurus. In my absence last week Marc and Fraser had made some further progress with the linking, and had made further suggestions as to what strategies I should attempt to implement next. Before I left I was very much in the middle of working on a script that matched words and dates, and I hadn’t had time to figure out why this script was bringing back no matches. It turns out the HT ‘fulldate’ field was using long dashes, whereas I was joining the OED GHT dates with a short dash. So all matches failed. I replaced the long dashes with short ones and the script then displayed 2733 ‘full matches’ (where every stripped lexeme and its dates match) and 99 ‘partial matches’ (where more than 6 and 80% match both dates and stripped lexeme text). I also added in a new column that counts the number of matches not including dates.
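The dash mismatch is easy to reproduce. The Python sketch below assumes the ‘long dash’ was an en dash or em dash (the exact character isn’t recorded above) and uses invented function names and dates.

```python
# Map en dash (U+2013) and em dash (U+2014) to a plain hyphen.
LONG_DASHES = str.maketrans({"\u2013": "-", "\u2014": "-"})


def normalise_dashes(text):
    """Replace long dashes with short ones so date strings compare equal."""
    return text.translate(LONG_DASHES)


def oed_date_string(first, last):
    """Join OED GHT dates with a short dash, as the matching script did."""
    return f"{first}-{last}"


ht_fulldate = "1205\u20131400"             # HT-style value with a long dash
oed_fulldate = oed_date_string(1205, 1400)  # OED-style value with a short dash
```

Before normalisation the two strings differ even though they look almost identical on screen, which is why every comparison failed.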
Marc had alerted me to an issue where the number of OED matches was coming back as more than 100% so I then spent some time trying to figure out what was going on here. I updated both the ‘with dates’ and ‘no date check’ versions of the lexeme pattern matching scripts to add in the text ‘perc error’ to any percentage that’s greater than 100, to more easily search for all occurrences. There are none to be found in the script with dates, as matches are only added to the percentage score if their dates match too. On the ‘no date check’ script there are several of these ‘perc error’ rows and they’re caused for the most part by a stripped form of the word being identical to an existing non-stripped form. E.g. there are separate lexemes ‘she’ and ‘she-‘ in the HT data, and the dash gets stripped, so ‘she’ in the OED data ends up matching two HT words. There are some other cases that look like errors in the original data, though. E.g. in OED catid 91505 severity there’s the HT word ‘hard (OE-)’ and ‘hard (c1205-)’ and we surely shouldn’t have this word twice. Finally there are some forms where stripping out words results in duplicates – e.g. ‘pro and con’ and ‘pro or con’ both end up as ‘pro con’ in both OED and HT lexemes, leading to 4 matches where there should only be 2. There are no doubt situations where the total percentage is pushed over the 80% threshold or to 100% by a duplicate match – any duplicate matches where the percentage doesn’t get over 100 are not currently noted in the output. This might need some further work. Or, as I previously said, with the date check incorporated the duplicates are already filtered out, so it might not be so much of an issue.
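To show how a percentage can exceed 100, here is a small Python sketch. The stripping rule is a stand-in (drop hyphens and the words ‘and’ / ‘or’), chosen only so the two examples above behave as described; the real stripping rules are not given here.

```python
def strip_form(word):
    """Hypothetical stripping rule: drop hyphens and the words 'and'/'or'."""
    tokens = word.replace("-", " ").split()
    return " ".join(t for t in tokens if t not in {"and", "or"})


def match_percentage(oed_words, ht_words):
    """Count stripped-form matches with no date check and flag impossible percentages."""
    ht_stripped = [strip_form(w) for w in ht_words]
    # Each OED word is counted once for every HT word with the same stripped form.
    matches = sum(ht_stripped.count(strip_form(w)) for w in oed_words)
    pct = 100 * matches / len(oed_words)
    label = f"{pct:.0f}%"
    if pct > 100:
        label += " perc error"
    return matches, label
```

With one OED word ‘she’ and the HT words ‘she’ and ‘she-’, both HT forms strip to ‘she’, so one OED word yields two matches. Likewise ‘pro and con’ and ‘pro or con’ on both sides yield four matches from two words.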
I also then moved on to a new script that looks at monosemous forms. This script gets all of the unmatched OED categories that have a POS and at least one word and for each of these categories it retrieves all of the OED words. For each word the script queries the OED lexeme table to get a count of the number of times the word appears. Note that this is the full word, not the ‘stripped’ form, as the latter might end up with erroneous duplicates, as mentioned above. Each word, together with its OED date and GHT dates (in square brackets) and a count of the number of times it appears in the OED lexeme table is then listed. If an OED word only appears once (i.e. is monosemous) it appears in bold text. For each of these monosemous words the script then queries the HT data to find out where and how many times each of these words appears in the unmatched HT categories. All queries keep to the same POS but otherwise look at all unmatched categories, including those without an OEDmaincat. Four different checks are done, with results appearing in different columns: HT words where full word (not the stripped variety) matches and the GHT start date matches the HT start date; failing that, HT words where the full word matches but the dates don’t; failing either of these, HT words where the stripped forms of the words match and the dates match; failing all these, HT words where the stripped forms match but the dates don’t. For each of these the HT catid, OEDmaincat (or the text ‘No Maincat’ if there isn’t one), subcat, POS, heading, lexeme and fulldate are displayed. There are lots of monosemous words that just don’t appear in the HT data. These might be new additions or we might need to try pattern matching. Also, sometimes words that are monosemous in the OED data are polysemous in the HT data. These are marked with a red background in the data (as opposed to green for unique matches). Examples of these are ‘sedimental’, ‘meteorologically’, ‘of age’. 
Any category that has a monosemous OED word that is polysemous in the HT has a red border. I also added in some stats below the table. In our unmatched OED categories there are 24184 monosemous forms. There are 8086 OED categories that have at least one monosemous form that matches exactly one HT form. There are 220 OED monosemous forms that are polysemous in the HT. Now we just need to decide how to use this data.
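The four checks run in order of strictness, and the first one that finds anything wins. A rough Python sketch of that fallback logic follows; the tier labels, dictionary keys and the strip function passed in are all hypothetical, and the POS filtering is assumed to have happened before this point.

```python
def tiered_matches(oed, ht_candidates, strip):
    """Return (tier_name, hits) for the strictest tier that finds any HT word.

    oed and each HT candidate are dicts with hypothetical keys 'word' and 'start'.
    strip is whatever function produces the 'stripped' form of a word.
    """
    tiers = [
        ("full word, dates match",
         lambda h: h["word"] == oed["word"] and h["start"] == oed["start"]),
        ("full word, dates differ",
         lambda h: h["word"] == oed["word"]),
        ("stripped form, dates match",
         lambda h: strip(h["word"]) == strip(oed["word"]) and h["start"] == oed["start"]),
        ("stripped form, dates differ",
         lambda h: strip(h["word"]) == strip(oed["word"])),
    ]
    for name, test in tiers:
        hits = [h for h in ht_candidates if test(h)]
        if hits:
            return name, hits
    return None, []


def no_hyphens(word):
    """Stand-in stripping rule for the example."""
    return word.replace("-", "")


# Invented candidates for illustration only.
candidates = [
    {"word": "alpha", "start": 1300},
    {"word": "al-pha", "start": 1250},
]
```

More than one hit in the winning tier corresponds to the red-background case, where an OED word that is monosemous turns out to be polysemous in the HT.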
Also this week I looked into an issue one of the REELS team was having when accessing the content management system (it turns out that some anti-virus software was mislabelling the site as having some kind of phishing software in it), and responded to a query about the Decadence and Translation Network website I’d set up. I also started to look at sourcing some Data Management Plans for an Arts Lab workshop that Dauvit Broun has asked me to help with next week. I also started to prepare my presentation for the Digital Editions workshop next week, which took a fair amount of time. I also met with Jennifer Smith and a new member of the SCOSYA project team on Friday morning to discuss the project and to show the new member of staff how the content management system works. It looks like my involvement with this project might be starting up again fairly soon.
On Tuesday Jeremy Smith contacted me to ask me to help out with a very last minute proposal that he is putting together. I can’t say much about the proposal, but it had a very tight deadline and required rather a lot of my time from the middle of the week onwards (and even into the weekend). This involved lots of email exchanges, time spent reading documentation, meeting with Luca, who might be doing the technical work for the project if it gets funded, and writing a Data Management Plan for the project. This all meant that I was unable to spend time working on other projects I’d hoped to work on this week, such as the Bilingual Thesaurus. Hopefully I’ll have time to get back into this next week, once the workshops are out of the way.