I seem to be heading through a somewhat busy patch at the moment, and had to focus my efforts on five major projects and several other smaller bits of work this week. The major projects were SCOSYA, Books and Borrowing, DSL, HT and Bess of Hardwick’s Account books. For SCOSYA I continued to implement the public atlas, this week focussing on the highlighting of groups. I had hoped that this would be a relatively straightforward feature to implement, as I had already created facilities to create and view groups in the atlas I’d made for the content management system. However, it proved to be much trickier than I’d anticipated as I’d rewritten much of the atlas code in order to incorporate the GeoJSON areas as well as purely point-based data, plus I needed to integrate the selection of groups and the loading of group locations with the API. My existing code for finding the markers for a specified group and adding a coloured border was just not working, and I spent a frustratingly long amount of time debugging the code to find out what had changed to stop the selection from finding anything. It turned out that in my new code I was reinstantiating the variable I was using to hold all of the point data within a function, meaning that the scope of the variable containing the data was limited to that function rather than being available to other functions. Once I figured this out it was a simple fix to make the data available to the parts of the code that needed to find and highlight relevant markers and I then managed to make groups of markers highlight or ‘unhighlight’ at the press of a button, as the following screenshot demonstrates:
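The bug boils down to accidentally shadowing a variable inside a function, which is easy to sketch (the names here are illustrative, not the actual SCOSYA code):

```javascript
// Illustrative sketch of the scope bug described above (not the actual
// SCOSYA code). Redeclaring `pointData` with `var` inside the load
// function creates a new local variable, so the outer one stays empty.
var pointData = [];

function loadDataBuggy(dataFromApi) {
  var pointData = dataFromApi; // BUG: shadows the outer variable
}

function loadDataFixed(dataFromApi) {
  pointData = dataFromApi; // assigns to the outer variable instead
}

function findMarkersForGroup(locationIds) {
  return pointData.filter(function (p) {
    return locationIds.indexOf(p.location) !== -1;
  });
}

loadDataBuggy([{ location: 'L1' }]);
console.log(findMarkersForGroup(['L1']).length); // 0 – nothing to highlight

loadDataFixed([{ location: 'L1' }]);
console.log(findMarkersForGroup(['L1']).length); // 1 – marker found
```

With the redeclaration removed, any function in the file can see the loaded data, which is all the highlighting code needed.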
You can now select one or more groups and the markers in the group are highlighted in green. Press a group button a second time to remove the highlighting. However, there is still a lot to be done. For one thing, only the markers highlight, not the areas. It’s proving to be rather complicated to get the areas highlighted as these GeoJSON shapes are handled quite differently to markers. I spent a long time trying to get the areas to highlight without success and will need to return to this another week. I also need to implement highlighting in different colours, so each group you choose to highlight is given a different colour to the last. Also, I need to find a way to make the selected groups be remembered as you change from points to areas to both, and change speaker type, and also possibly as you change between examples. Currently the group selection resets but the selected group buttons remain highlighted, which is not ideal.
I also spent time this week on the pilot project for Matthew Sangster’s Books and Borrowing project, which is looking at University student (and possibly staff) borrowing records from the 18th century. Matthew has compiled a spreadsheet that he wants me to turn into a searchable / browsable online resource, and my first task was to extract the data from the spreadsheet, create an online database and write a script to migrate the data to this database. I’ve done this sort of task many times before, but unfortunately things are rather more complicated this time because Matthew has included formatting within the spreadsheet that needs to be retained in the online version. This includes superscript text throughout the more than 8000 records, and simply saving the spreadsheet as a CSV file and writing a script to go through each cell and upload the data won’t work as the superscript style will be lost in the conversion to CSV. PHPMyAdmin also includes a facility to import a spreadsheet in the OpenDocument format, but unfortunately this removes not only the superscript formatting but also the text that was marked as superscript.
Therefore I had to investigate other ways of getting the data out of the spreadsheet while somehow retaining the superscript formatting. The only means of doing so that I could think of was to save the spreadsheet as an HTML document, which would convert Excel’s superscript formatting into HTML superscript tags, which is what we’d need for displaying the data on a website anyway. Unfortunately the HTML generated by Excel is absolutely awful and filled with lots of unnecessary junk that I then needed to strip out manually. I managed to write a script that extracted the data (including the formatting for superscript) and imported it into the online database for about 8000 of the 8200 rows, but the remainder had problems that prevented the insertion from taking place. I’ll need to think about creating multiple passes for the data when I return to it next week.
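The stripping stage can be sketched roughly like this, assuming the cell markup has already been pulled out of Excel’s HTML export (the function and sample markup are illustrative, not the real script):

```javascript
// Sketch of the clean-up step (hypothetical, simplified): keep <sup> tags
// from Excel's HTML export and strip everything else Excel adds to a cell.
function cleanCell(excelHtml) {
  return excelHtml
    // normalise <sup> tags that carry Excel's styling attributes
    .replace(/<sup[^>]*>/gi, '<sup>')
    // drop every other tag (td, spans, font junk and so on)
    .replace(/<(?!\/?sup\b)[^>]+>/gi, '')
    // collapse the whitespace Excel scatters through the markup
    .replace(/\s+/g, ' ')
    .trim();
}

var raw = '<td class=xl65><span style="mso-spacerun:yes">Mr</span>' +
          '<sup style="font-size:8pt">r</sup> Smith</td>';
console.log(cleanCell(raw)); // "Mr<sup>r</sup> Smith"
```

The output keeps exactly the markup the website needs, so the cleaned cells can be inserted into the database as-is.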
For the DSL this week I spent rather a lot of time engaged in email conversations with Rhona Alcorn about the tasks required to sort out the data that the team have been working on for several years and which now needs to be extracted from older systems and migrated to a new system, plus the API that I am working on. It looked like there would be a lot of work for me to do with this, but thankfully midway through the week it became apparent that the company who are supplying the new system for managing the DSL’s data have a member of staff who is expecting to do a lot of the tasks that had previously been assigned to me. This is really good news as I was beginning to worry about the amount of work I would have to do for the DSL and how I would fit this in around other work commitments. We’ll just need to see how this all pans out.
I also spent some time implementing a Boolean search for the new DSL API. I now have this in place and working for headword searches, which can be performed via the ‘quick search’ box on the test sites I’ve created. It’s possible to use Boolean AND, OR and NOT (all must be entered upper case to be picked up) and a search can be used in combination with wildcards, and speech-marks can now be used to specify an exact search. So, for example, if you want to find all the headwords beginning with ‘chang’ but wish to exclude results for ‘change’ and ‘chang’ you can enter ‘chang* NOT “change” NOT “chang”’.
OR searches are likely to bring back lots of results and at the moment I’ve not put a limit on the results, but I will do so before things go live. Also, while there are no limits on the number of Booleans that can be added to a query, results when using multiple Booleans are likely to get a little weird due to there being multiple ways a query could be interpreted. E.g. ‘Ran* OR run* NOT rancet’ still brings back ‘rancet’ because the query is interpreted as ‘get all the ‘ran*’ results OR all the ‘run*’ results so long as they don’t include ‘rancet’ – so ran* OR (run* NOT rancet). But without complicating things horribly with brackets or something similar there’s no way of preventing such ambiguity when multiple different Booleans are used.
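The behaviour described above – AND and NOT binding more tightly than OR, as in a SQL WHERE clause – can be sketched with a toy in-memory search (the real API presumably builds database queries; everything here, including the quoting-free term matcher, is illustrative):

```javascript
// Toy sketch of the flat Boolean evaluation described above. OR splits the
// query into clauses; AND / NOT apply within a clause, left to right.
function matches(headword, term) {
  if (term.endsWith('*')) return headword.startsWith(term.slice(0, -1));
  return headword === term;
}

function searchWithSqlPrecedence(headwords, query) {
  var clauses = query.split(/\s+OR\s+/);
  var results = [];
  clauses.forEach(function (clause) {
    var parts = clause.split(/\s+/);
    var hits = headwords.filter(function (h) { return matches(h, parts[0]); });
    for (var i = 1; i < parts.length; i += 2) {
      var termHits = headwords.filter(function (h) { return matches(h, parts[i + 1]); });
      hits = hits.filter(function (h) {
        var inTerm = termHits.indexOf(h) !== -1;
        return parts[i] === 'NOT' ? !inTerm : inTerm;
      });
    }
    hits.forEach(function (h) { if (results.indexOf(h) === -1) results.push(h); });
  });
  return results;
}

var words = ['rancet', 'random', 'run', 'runt'];
console.log(searchWithSqlPrecedence(words, 'ran* OR run* NOT rancet'));
// → ['rancet', 'random', 'run', 'runt'] – 'rancet' survives, because the
// query is read as ran* OR (run* NOT rancet)
```

This makes the ambiguity concrete: the NOT only restricts the clause it sits in, never the clauses on the other side of an OR.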
For the Historical Thesaurus I met with Marc and Fraser on Monday to discuss our progress with the HT / OED linking and afterwards continued with a number of tasks that were either ongoing or had been suggested at the meeting. This included ticking off some matches from a monosemous script, creating a new script that brings back up to 1000 random unmatched lexemes at a time for spot-checking and creating an updated Levenshtein script for lexemes, which is potentially going to match a further 5000 lexemes. I also wrote a document detailing how I think that full dates should be handled in the HT, to replace the rather messy way dates are currently recorded. We will need to decide on a method in order to get the updated dates from the OED into a comparable format.
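For reference, the edit-distance comparison such a script relies on is standard dynamic-programming Levenshtein (this is a textbook implementation, not the actual HT script):

```javascript
// Standard Levenshtein edit distance: the minimum number of insertions,
// deletions and substitutions needed to turn string a into string b.
function levenshtein(a, b) {
  var d = [];
  for (var i = 0; i <= a.length; i++) d[i] = [i];
  for (var j = 0; j <= b.length; j++) d[0][j] = j;
  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,         // deletion
                         d[i][j - 1] + 1,         // insertion
                         d[i - 1][j - 1] + cost); // substitution
    }
  }
  return d[a.length][b.length];
}

console.log(levenshtein('colour', 'color'));   // 1
console.log(levenshtein('kitten', 'sitting')); // 3
```

A low distance between two stripped lexeme forms is what flags a pair as a likely match for checking.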
Also this week I returned to Alison Wiggins’s Account Books project, or rather a related output about the letters of Mary, Queen of Scots. Alison had sent me a database containing a catalogue of letters and I need to create a content management system to allow her and other team members to work on this together. I’ve requested a new subdomain for this system and have begun to look at the data and will get properly stuck into this next week, all being well.
Other than these main projects I also gave feedback on Thomas Clancy’s Iona project proposal, including making some changes to the Data Management Plan, helped sort out access to logo files for the Seeing Speech project, sorted out an issue with the Editing Burns blog that was displaying no content since the server upgrade (it turns out it was using a very old plugin that was not compatible with the newer version of PHP on the server) and helped sort out some app issues. All in all a very busy week.
On Monday this week I attended the ‘Data Hack’ event organised by the SCOSYA project. This was a two-day event, with day one being primarily lectures while on the second day the participants could get their hands on some data and create things themselves. I only attended the first day and enjoyed hearing the speakers talk. It was especially useful to hear the geospatial visualisation speaker, and also to get a little bit of hands-on experience with R. Unfortunately during a brief and unscheduled demonstration of the SCOSYA ‘expert atlas’ interface the search for multiple attributes failed to work. I spent some time frantically trying to figure out why, as I hadn’t changed any of the code. It turned out that (unbeknownst to me) the version of PHP on the server had recently been updated and one tiny and seemingly insignificant bit of code was no longer supported in the newer version and instead caused a fatal error. The issue is that it’s no longer possible to set a variable as an empty string and then use it as an array later on – for example, $varname = “”; and then later on $varname[] = “value”;. Doing this in the newer version causes the script to stop running. Once I figured this out it was very easy to fix, but going through the code to identify what was causing the problem took quite a while.
Once I’d discovered the issue I checked with Arts IT support and they confirmed that they had upgraded the server. It would have been great if they’d let me know. I then had to go through all of the other sites that are hosted on the server to check if the error appeared anywhere else, which unfortunately it did. I think I’d managed to fix everything by the end of the week, though.
Also for SCOSYA I continued to work on the public atlas interface, this time focussing on the ‘stories’ (now called ‘Learn more’). Previously these appeared in a select box, and once you selected a story from the drop-down list and pressed a ‘show’ button the story would load. This was all a bit clunky, so I’ve now replaced it with a more nicely formatted list of stories, as with the ‘examples’. Clicking on one of these automatically loads the relevant story, as the screenshot below demonstrates:
I initially thought that, due to the asynchronous nature of AJAX calls, it wasn’t possible to just set up the click event once: it seemed it needed to be set up as a result of the data finishing loading, because if it was set up independently the next slide would load before data was pulled in from the server and would therefore display nothing. However, after further thought I realised that the issue wasn’t occurring when the initial slide was loaded, only when the user presses the buttons. As this will only ever happen once the AJAX data has loaded (because otherwise the button isn’t there for the user to press) it should be ok to have the click event initiated outside of the ‘load data’ function. Thankfully I managed to get this issue sorted by Friday, when the project team was demonstrating the feature at another event. I also managed to sort the issue with the side panel not scrolling on mobile screens, which was being caused by ‘pointer events’ being set to none on the element that was to be scrolled. On regular screens this worked fine, as the scrollbar gets added in, but on touchscreens this caused issues.
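In outline, the resulting pattern looks something like this (a skeletal sketch with hypothetical names, not the actual atlas code, and a synchronous fake standing in for the AJAX request):

```javascript
// Skeletal sketch of the pattern described above (hypothetical names).
// The click handler can be bound outside the load function because the
// button it responds to only exists once the data has arrived.
var slides = null;
var currentSlide = 0;

function loadStory(fetchSlides, render) {
  fetchSlides(function (data) {
    slides = data;     // data is in place before any button exists
    render(slides[0]); // rendering the first slide creates the 'next' button
  });
}

// Defined once, independently of the AJAX call: safe, because it can only
// fire after render() has put the button on screen.
function onNextClick(render) {
  currentSlide += 1;
  render(slides[currentSlide]);
}

// Simulated usage: a fake fetch in place of the real AJAX request.
var shown = [];
loadStory(function (cb) { cb(['intro', 'map', 'summary']); },
          function (slide) { shown.push(slide); });
onNextClick(function (slide) { shown.push(slide); });
console.log(shown); // ['intro', 'map']
```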
For the rest of the week I worked on several different projects. I continued with the endless task of linking up the HT and OED datasets. This involved ticking off lexemes in matched categories based on comparing the stripped forms of the lexeme on their own. This resulted in around 32,000 lexeme matches being ticked off. It also uncovered an instance where an OED category has been linked to two different HT categories, which is clearly an error. I wrote a script to look into the issue of duplicate matches for both categories and lexemes, which shows 28 (so 14 incorrect) category matches and 136 (so 68 incorrect) lexeme matches. I also created a new stats page that displays statistics about both matched and unmatched categories and lexemes. I also ‘deticked’ a few category and lexeme matches that Fraser had sent to me in spreadsheets.
I continued to work with the new DSL data this week too. This included checking through some of the supplemental entry data from the server that didn’t seem to be exactly what the DSL people were expecting. I also set up a new subdomain where I’m replicating the functionality of the main DSL website, but using the new data exported from the server. This means it is now possible to compare the live data (using Peter’s original API) with the V2 data (extracted and saved fully assembled from Peter’s API, rather than having bits injected into it every time an entry is requested) and the V3 data (from the DSL people’s editor server), which should hopefully be helpful in checking the new data. I also continued to work on the new API, for both V2 and V3 versions of the data, getting the search results working with the new API. Next for me to do is add Boolean searching to the headword search, remove headword match type as discussed and then develop the full text searches (full / without quotes / quotes only). After that comes the bibliography entries.
Also this week I made another few tweaks to the RNSN song stories, gave access to the web stats for Seeing Speech and Dynamic Dialects to Fraser Rowan, who is going to do some work on them and met with Matthew Sangster to discuss a pilot website I’m going to put together for him about books and borrowing records in the 18th century at Glasgow. I also attended a meeting on Friday afternoon with the Anglo-Norman dictionary people, who were speaking to various people including Marc and Fraser, about redeveloping their online resource.
I continued to develop the public interface for the SCOSYA project this week, and also helped out with the preparations for next week’s Data Hack event that the project is organising, which involved sorting out hosting for a lot of sample data. On Monday I had a meeting with Jennifer and E, at which we went through the interface I had so far created and discussed things that needed to be updated or changed in some way. It was a useful meeting and I came away with a long list of things to do, which I then spent quite some time during the remainder of the week implementing. This included changing the font used throughout the site and drastically changing the base layer we use for the maps. I had previously created a very simple ‘green land, blue sea’ base map, which is what the team had requested, but they wanted to try something a bit simpler still – white sea and light grey land – in order to emphasise the data points more than anything else. I also removed all place-names from the map and in fact everything other than borders and water. I also updated the colour range used for ratings, from a yellow to red scheme to a more grey / purple scheme that had been suggested by E. This is now used both for the markers and for the areas. Regarding areas, I removed the white border from the areas to make areas with the same rating blend into one another and make the whole thing look more like a heatmap, as the following screenshot demonstrates:
I also completely changed the way the pop-ups look, as it was felt that the previous version was just a bit too garish and comic book like. The screenshot below shows markers with a pop-up open:
I also figured out how to add sound clips to story slides and I’ve changed how the selection of ‘examples’ works. Rather than having a drop-down list and then all of the information about a selected feature displayed underneath I have split things up. Now when you open the ‘Examples’ section you will see the examples listed as a series of buttons. Pressing on one of these then loads the feature, automatically loading the data for it into the map. There’s a button for returning to the list of examples, then the feature’s title and description, followed by sound clips if there are any. Underneath this are the buttons for changing ‘speakers’ and ‘locations’. Pressing on one of these options now automatically refreshes the map so there’s no longer any need for a ‘Show’ button. I think this works much better. Note that your choice of speaker and location is remembered when using the map – e.g. if you have selected ‘Young’ and ‘Areas’ then go back and select a different example then the map will default to ‘Young’ and ‘Areas’ when this new feature is displayed.
I’ve also added a check for screen size that fires every time a side panel section is opened. This ensures that if someone has resized their browser or changed the orientation of their screen the side panel should still fit. I still haven’t had time to get the ‘groups’ feature working yet, or to fix the display of stories on smaller screens. I also need to update the ‘Learn more’ section so it uses a list rather than a drop-down box, all tasks I hope to continue with next week.
I also spent a bit of time on the Seeing Speech and Dynamic Dialects projects, helping to add in a new survey for each. In addition I participated in the monthly College of Arts developers coffee catch-up, advised a couple of members of staff on blog-related issues and spoke to Kirsteen McCue about the proposal she’s putting together.
Other than these tasks I spent about a day working on DSL issues. This included getting some data to Ann about which existing DSL entries were not present in the dataset that had been newly extracted from the server. This appears to have been caused by some entries being merged with existing entries. I also managed to get the new dataset uploaded to our temporary web-server and created a new API that outputs this new data. I still need to create an alternative version of the DSL front-end that connects to this new version of the data, which I hope to be able to at least get started on next week. I also did some investigation into scripts that Thomas Widmann had discussed in some hand-over documentation that did not seem to be available anywhere and discussed some issues relating to the server the DSL people host in their offices.
I also spent some time working on HT duties, making some tweaks to existing scripts based on feedback from Fraser, investigating why one of our categories is not accessible via the website (the answer being it was a subcategory that didn’t have a main category in the same part of speech so had no category to ‘hang’ off). I also had a further meeting with Marc and Fraser on Friday to discuss our progress with the HT OED linking.
Monday was a holiday this week, so Tuesday was the start of my working week. I spent about half the day completing work on the Data Management Plan that I had been asked to write by the College of Arts research people, and the remainder of the day continuing to write scripts to help in the linkup of HT and OED lexeme data. The latest script gets all unmatched HT words that are monosemous within part of speech in the unmatched dataset. For each of these the script then retrieves all OED words where the stripped form matches, as does POS, but the words are already matched to a different HT lexeme. If there is more than one OED lexeme matched to the HT lexeme I’ve added the information on subsequent rows in the table, so that full OED category information can more easily be read. I’m not entirely sure what this script will be used for, but Fraser seems to think it will be useful in automatically pinpointing certain words that the OED are currently trying to manually track down.
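The monosemy test boils down to grouping the unmatched words by stripped form and part of speech and keeping the singletons. A sketch of that step (the real version is a database query; the field names here are made up):

```javascript
// Sketch of the monosemy check described above (hypothetical field names):
// a word is monosemous within part of speech if its stripped form occurs
// exactly once among the unmatched words with that POS.
function monosemousWithinPos(words) {
  var counts = {};
  words.forEach(function (w) {
    var key = w.stripped + '|' + w.pos;
    counts[key] = (counts[key] || 0) + 1;
  });
  return words.filter(function (w) {
    return counts[w.stripped + '|' + w.pos] === 1;
  });
}

var unmatched = [
  { stripped: 'bank', pos: 'n' },
  { stripped: 'bank', pos: 'n' }, // polysemous as a noun – excluded
  { stripped: 'bank', pos: 'v' }, // only verb sense – kept
  { stripped: 'burn', pos: 'n' }  // kept
];
console.log(monosemousWithinPos(unmatched).length); // 2
```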
During the week I also made some further updates to a couple of song stories for the RNSN project and had a meeting over the phone with Kirsteen McCue about a new project she’s currently planning, which I will be helping with on the technical side. I also had a meeting with PhD student Ewa Wanat about a website she’s putting together and gave her some advice about it.
The rest of my week was split between DSL and SCOSYA. For DSL I spent time answering a number of emails. I then went through the SND data that had been outputted by Thomas Widmann’s scripts on the DSL server. I had tried running this data through the script I’d written to take the outputted XML data and insert it into our online MySQL database, but my script was giving errors, stating that the input file wasn’t valid XML. I loaded the file (an almost 90Mb text file) into Oxygen and asked it to validate the XML. It took a while, but managed to find one easily identifiable error, and one error that was trickier to track down.
In the entry for snd22907 there was a closing </sense> tag in the ‘History’ section that had no corresponding opening tag. This was easy to track down and manually fix. Entry snd12737 had two opening tags (<entry id="snd12737">), one directly below the <meta> tag. This was trickier to find as I needed to track it down manually by chopping the file in half, checking which half the error was in, chopping this bit in half and so on, until I ended up with a very small file in which it was easy to locate the problem.
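The chop-in-half hunt generalises to a simple bisection over any ‘is this chunk broken?’ check. Sketched here with a plain predicate standing in for a real XML validator (and assuming a single bad record):

```javascript
// Generic sketch of the halving technique used to find the bad entry:
// repeatedly keep whichever half of the record list fails a validity
// check, until one offending record remains. `isValid` stands in for a
// real XML validator; this assumes exactly one bad record.
function findInvalidRecord(records, isValid) {
  var lo = 0, hi = records.length;
  while (hi - lo > 1) {
    var mid = Math.floor((lo + hi) / 2);
    if (!isValid(records.slice(lo, mid))) hi = mid; // error in first half
    else lo = mid;                                  // must be in second half
  }
  return records[lo];
}

// Toy validity check: a record is "valid" unless it contains a stray tag.
var entries = ['<entry>a</entry>', '<entry>b</entry>',
               '</sense><entry>c</entry>', '<entry>d</entry>'];
var bad = findInvalidRecord(entries, function (chunk) {
  return chunk.join('').indexOf('</sense>') === -1;
});
console.log(bad); // '</sense><entry>c</entry>'
```

Each halving discards half of the remaining candidates, so even a 90Mb file only takes a couple of dozen checks to narrow down.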
With the SND data fixed I could then run it through my script. However, I wanted to change the way the script worked based on feedback from Ann last week. Previously I had added new fields to a test version of the main database, and the script found the matching row and inserted new data. I decided instead to create an entirely new table for the new data, to keep things more cleanly divided, and to handle the possibility of there being new entries in the data that were not present in the existing database. I also needed to update the way in which the URL tag was handled, as Ann had explained that there could be any number of URL tags, with them referencing other entries that had been merged with the current one. After updating my test version of the database to make new tables and fields, and updating my script to take these changes into consideration, I ran both the DOST and the SND data through the script, resulting in 50,373 entries for DOST and 34,184 entries for SND. This is actually fewer entries than in the old database. There are 3023 missing SND entries and 1994 missing DOST entries. They are all supplemental entries (with IDs starting ‘snds’ in SND and ‘adds’ in DOST). This leaves just 24 DOST ‘adds’ entries in the Sienna data and 2730 SND ‘snds’ entries. I’m not sure what’s going on with the output – whether the omission of these entries is intentional (because the entries have been merged with regular entries) or whether this is an error, but I have exported information about the missing rows and have sent these on to Ann for further investigation.
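Working out which entries went missing is essentially a set difference on the entry IDs, grouped by the supplemental prefixes. Something like this (illustrative, not the actual export script):

```javascript
// Illustrative sketch: find IDs present in the old database but absent
// from the newly imported data, and group them by prefix ('snds' marks
// SND supplemental entries, 'adds' DOST ones).
function missingByPrefix(oldIds, newIds, prefixes) {
  var newSet = {};
  newIds.forEach(function (id) { newSet[id] = true; });
  var missing = oldIds.filter(function (id) { return !newSet[id]; });
  var grouped = {};
  prefixes.forEach(function (p) {
    grouped[p] = missing.filter(function (id) { return id.indexOf(p) === 0; });
  });
  return grouped;
}

var oldIds = ['snd1', 'snds1', 'snds2', 'adds1'];
var newIds = ['snd1', 'snds2'];
console.log(missingByPrefix(oldIds, newIds, ['snds', 'adds']));
// { snds: ['snds1'], adds: ['adds1'] }
```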
For SCOSYA I focussed on adding in the sample sound clips and groupings for location markers. I also engaged with some preparations for the project’s ‘data hack’ that will be taking place in mid June. Adding in sound clips took quite a bit of time, as I needed to update both the Content Management System to allow sound clips to be uploaded and managed, and the API to incorporate links to the uploaded sound clips. This is in addition to incorporating the feature into the front-end.
Now if a member of staff logs into the CMS and goes to the ‘Browse codes’ page they will see a new column that lists the number of sound clips associated with a code. I’ve currently uploaded the four for E1 ‘This needs washed’ for test purposes. From the table, clicking on a code loads its page, which now includes a new section for sound clips. Any previously uploaded ones are listed and can be played or deleted. New clips in MP3 format can also be uploaded here, with files being renamed upon upload, based on the code and the next free auto-incrementing number in the database.
In the API all soundfiles are included in the ‘attributes’ endpoint, which is used by the drop-down list in the atlas. The public atlas has also been updated to include buttons to play any sound clips that are available, as the screenshot towards the end of this post demonstrates.
There is now a new section labelled ‘Listen’ with four ‘Play’ icons. Pressing on one of these plays a different sound clip. Getting these icons to work has been more tricky than you might expect. HTML5 has a tag called <audio> that browsers can interpret in order to create their own audio player which is then embedded in the page. This is what happens in the CMS. Unfortunately an interface designer has no control over the display of the player – it’s different in every browser and generally takes up a lot of room, which we don’t really have. I initially just used the HTML5 audio player but each sound clip then had to appear on a new row and the player in Chrome was too wide for the side panel.
I then moved on to looking at the groups for locations. This will be a fixed list of groups, taken from ones that the team has already created. I copied these groups to a new location in the database and updated the API to create new endpoints for listing groups and bringing back the IDs of all locations that are contained within a specified group. I haven’t managed to get the ‘groups’ feature fully working yet, but the selection options are now in place. There’s a ‘groups’ button in the ‘Examples’ section, and when you click on this a section appears listing the groups. Each group appears as a button. When you click on one a ‘tick’ is added to the button and currently the background turns a highlighted green colour. I’m going to include several different highlighted colours so the buttons light up differently. Although it doesn’t work yet, these colours will then be applied to the appropriate group of points / areas on the map. You can see an example of the buttons below:
The only slight reservation I have is that this option does make the public atlas more complicated to use and more cluttered. I guess it’s ok as the groups are hidden by default, though. There also may be an issue with the sidebar getting too long for narrow screens that I’ll need to investigate.