I completed an initial version of the Chambers Library map for the Books and Borrowing project this week. It took quite a lot of time and effort to implement the subscription period range slider. Searching for a range when the data also has a range of dates rather than a single date means we needed to make a decision about what data gets returned and what doesn’t. This is because the two ranges (the one chosen as a filter by the user and the one denoting the start and end periods of subscription for each borrower) can overlap in many different ways. For example, the period chosen by the user is 05 1828 to 06 1829. Which of the following borrowers should therefore be returned?
- Borrower’s range is 06 1828 to 02 1829: Borrower’s range is fully within the period so should definitely be included.
- Borrower’s range is 01 1828 to 07 1828: Borrower’s range begins before the selected period and ends within it. Presumably should be included.
- Borrower’s range is 01 1828 to 09 1829: Borrower’s range extends beyond the selected period in both directions. Presumably should be included.
- Borrower’s range is 05 1829 to 09 1829: Borrower’s range begins during the selected period and ends beyond it. Presumably should be included.
- Borrower’s range is 01 1828 to 04 1828: Borrower’s range is entirely before the selected period. Should not be included.
- Borrower’s range is 07 1829 to 10 1829: Borrower’s range is entirely after the selected period. Should not be included.
Basically, if there is any overlap between the selected period and the borrower’s subscription period the borrower will be returned. But this means most borrowers will be returned much of the time. It’s a very different sort of filter to one that purely focuses on a single date – e.g. filtering the data to only those borrowers whose subscription period *begins* between 05 1828 and 06 1829.
Based on the above assumptions I began to write the logic that would decide which borrowers to include when the range slider is altered. It was further complicated by having to deal with months as well as years. Here’s the logic in full if you fancy getting a headache:
if(((mapData[i].sYear>startYear || (mapData[i].sYear==startYear && mapData[i].sMonth>=startMonth)) && ((mapData[i].eYear==endYear && mapData[i].eMonth <=endMonth) || mapData[i].eYear<endYear)) || ((mapData[i].sYear<startYear ||(mapData[i].sYear==startYear && mapData[i].sMonth<=startMonth)) && ((mapData[i].eYear==endYear && mapData[i].eMonth >=endMonth) || mapData[i].eYear>endYear)) || ((mapData[i].sYear==startYear && mapData[i].sMonth<=startMonth || mapData[i].sYear>startYear) && ((mapData[i].eYear==endYear && mapData[i].eMonth <=endMonth) || mapData[i].eYear<endYear) && ((mapData[i].eYear==startYear && mapData[i].eMonth >=startMonth) || mapData[i].eYear>startYear)) || (((mapData[i].sYear==startYear && mapData[i].sMonth>=startMonth) || mapData[i].sYear>startYear) && ((mapData[i].sYear==endYear && mapData[i].sMonth <=endMonth) || mapData[i].sYear<endYear) && ((mapData[i].eYear==endYear && mapData[i].eMonth >=endMonth) || mapData[i].eYear>endYear)) || ((mapData[i].sYear<startYear ||(mapData[i].sYear==startYear && mapData[i].sMonth<=startMonth)) && ((mapData[i].eYear==startYear && mapData[i].eMonth >=startMonth) || mapData[i].eYear>startYear)))
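For what it’s worth, the ‘any overlap’ rule can probably be expressed much more compactly by converting each month / year pair into a single number and comparing the two ranges directly. The following is only a minimal sketch: it assumes the same sYear, sMonth, eYear and eMonth fields as the map data, and the helper names toMonthIndex and overlaps are made up for illustration rather than taken from the live map code. It is intended to give the results listed for the six example borrowers above, not to be a line-by-line equivalent of the condition I actually wrote.

```javascript
// Convert a year / month pair into a single comparable count of months.
function toMonthIndex(year, month) {
  return year * 12 + (month - 1);
}

// Two inclusive ranges overlap when each one starts on or before the other ends.
// "borrower" is assumed to carry sYear, sMonth, eYear and eMonth like mapData[i].
function overlaps(borrower, startYear, startMonth, endYear, endMonth) {
  const borrowerStart = toMonthIndex(borrower.sYear, borrower.sMonth);
  const borrowerEnd = toMonthIndex(borrower.eYear, borrower.eMonth);
  const filterStart = toMonthIndex(startYear, startMonth);
  const filterEnd = toMonthIndex(endYear, endMonth);
  return borrowerStart <= filterEnd && borrowerEnd >= filterStart;
}
```

With the slider set to 05 1828 to 06 1829, the first four example borrowers pass this test and the last two do not.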
I also added the subscription period to the popups. The only downside to the range slider is that the occupation marker colours change depending on how many occupations are present during a period, so you can’t always tell an occupation by its colour. I might see if I can fix the colours in place, but it might not be possible.
I also noticed that the jQuery UI sliders weren’t working very well on touchscreens so installed the jQuery TouchPunch library to fix that (https://github.com/furf/jquery-ui-touch-punch). I also made the library marker bigger and gave it a white border to more easily differentiate it from the borrower markers.
I then moved onto incorporating page images in the resource too. Where a borrower has borrowing records the relevant pages where these borrowing records are found now appear as thumbnails in the borrower popup. These are generated by the IIIF server based on dimensions passed to it, which is much nicer than having to generate and store thumbnails directly. I also updated the popup to make it wider when required to give more space for the thumbnails. Here’s a screenshot of the new thumbnails in action:
Clicking on a thumbnail opens a further popup containing a zoomable / pannable image of the page. This proved to be rather tricky to implement. Initially I was going to open a popup in the page (outside of the map container) using a jQuery UI Dialog. However, I realised that this wouldn’t work when the map was being viewed in full-screen mode, as nothing beyond the map container is visible in such circumstances. I then considered opening the image in the borrower popup but this wasn’t really big enough. I then wondered about extending the ‘Map options’ section and replacing the contents of this with the image, but this then caused issues for the contents of the ‘Map options’ section, which didn’t reinitialise properly when the contents were reinstated. I then found a plugin for the Leaflet mapping library that provides a popup within the map interface (https://github.com/w8r/Leaflet.Modal) and decided to use this. However, it’s all a little complex as the popup then has to include another mapping library called OpenLayers that enables the zooming and panning of the page image, all within the framework of the overall interactive map. It is all working and I think it works pretty well, although I guess the map interface is a little cluttered, what with the ‘Map Options’ section, the map legend, the borrower popup and then the page image popup as well. Here’s a screenshot with the page image open:
All that’s left to do now is add in the introductory text once Alex has prepared it and then make the map live. We might need to rearrange the site’s menu to add in a link to the Chambers Map as it’s already a bit cluttered.
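For anyone curious about the IIIF thumbnails mentioned above, requesting one is just a matter of building a URL that follows the IIIF Image API pattern of identifier / region / size / rotation / quality.format. The sketch below is an illustration only: the function name iiifThumbnailUrl, the server address and the image identifier are all made up, and the project’s real URLs will differ.

```javascript
// Build an IIIF Image API URL for a thumbnail that fits inside a bounding box.
// baseUrl and identifier are placeholders, not the project's real values.
function iiifThumbnailUrl(baseUrl, identifier, maxWidth, maxHeight) {
  // Identifiers containing slashes must be URL-encoded.
  const id = encodeURIComponent(identifier);
  // "full" region, "!w,h" best-fit size, no rotation, default quality, JPEG output.
  return `${baseUrl}/${id}/full/!${maxWidth},${maxHeight}/0/default.jpg`;
}

// Hypothetical example values.
const thumbUrl = iiifThumbnailUrl(
  "https://iiif.example.org/iiif/2",
  "register-1/page-12.jp2",
  150,
  150
);
```

The ‘!150,150’ size means the server scales the image to fit within a 150 by 150 box while keeping its aspect ratio, so no thumbnail files need to be stored.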
Also for the project I downloaded images for two further library registers for St Andrews that had previously been missed. However, there are already records for the registers and pages in the CMS so we’re going to have to figure out a way to work out which image corresponds to which page in the CMS. One register has a different number of pages in the CMS compared to the image files so we need to work out how to align the start and end and if there are any gaps or issues in the middle. The other register is more complicated because the images are double pages whereas it looks like the page records in the CMS are for individual pages. I’m not sure how best to handle this. I could either try and batch process the images to chop them up or batch process the page records to join them together. I’ll need to discuss this further with Gerry, who is dealing with the data for St Andrews.
Also this week I prepared for and gave a talk to a group of students from Michigan State University who were learning about digital humanities. I talked to them for about an hour about a number of projects, such as the Burns Supper map (https://burnsc21.glasgow.ac.uk/supper-map/), the digital edition I’d created for New Modernist Editing (https://nme-digital-ode.glasgow.ac.uk/), the Historical Thesaurus (https://ht.ac.uk/), Books and Borrowing (https://borrowing.stir.ac.uk/) and TheGlasgowStory (https://theglasgowstory.com/). It went pretty well and it was nice to be able to talk about some of the projects I’ve been involved with for a change.
I also made some further tweaks to the Gentle Shepherd Performances page, which is now ready to launch, and helped Geert out with a few changes to the WordPress pages of the Anglo-Norman Dictionary. I also made a few tweaks to the WordPress pages of the DSL website and finally managed to get a hotel room booked for the DHC conference in Sheffield in September. I also made a couple of changes to the new Gaelic Tongues section of the Seeing Speech website and had a discussion with Eleanor about the filters for Speech Star. Fraser had been in touch about roughly 500 Historical Thesaurus categories that had been newly matched to OED categories so I created a little script to add these connections to the online database.
I also had a Zoom call with the Speak For Yersel team. They had been testing out the resource at secondary schools in the North East and have come away with lots of suggested changes to the content and structure of the resource. We discussed all of these and agreed that I would work on implementing the changes the week after next.
Next week I’m going to be on holiday, which I have to say I’m quite looking forward to.
This week I finished off all of the outstanding work for the Speak For Yersel project. The other members of the team (Jennifer and Mary) are both on holiday so I finished off all of the tasks I had on my ‘to do’ list, although there will certainly be more to do once they are both back at work again. The tasks I completed were a mixture of small tweaks and larger implementations. I made tweaks to the ‘About’ page text and changed the intro text to the ‘more give your word’ exercise. I then updated the age maps for this exercise, which proved to be pretty tricky and time-consuming to implement as I needed to pull apart a lot of the existing code. Previously these maps showed ‘60+’ and ‘under 19’ data for a question, with different colour markers for each age group showing those who would say a term (e.g. ‘Scunnered’) and grey markers for each age group showing those who didn’t say the term. We have completely changed the approach now. The maps now default to showing ‘under 19’ data only, with different colours for each different term. There is now an option in the map legend to switch to viewing the ‘60+’ data instead. I added in the text ‘press to view’ to try and make it clearer that you can change the map. Here’s a screenshot:
I also updated the ‘give your word’ follow-on questions so that they are now rated in a new final page that works the same way as the main quiz. In the main ‘give your word’ exercise I updated the quiz intro text and I ensured that the ‘darker dots’ explanatory text has now been removed for all maps. I tweaked a few questions to change their text or the number of answers that are selectable and I changed the ‘sounds about right’ follow-on ‘rule’ text and made all of the ‘rule’ words lower case. I also made it so that when the user presses ‘check answers’ for this exercise a score is displayed to the right and the user is able to proceed directly to the next section without having to correct their answers. They still can correct their answers if they want.
I then made some changes to the ‘She sounds really clever’ follow-on. The index for this is now split into two sections, one for ‘stereotype’ data and one for ‘rating speaker’ data and you can view the speaker and speaker/listener results for both types of data. I added in the option of having different explanatory text for each of the four perception pages (or maybe just two – one for stereotype data, one for speaker ratings) and when viewing the speaker rating data the speaker sound clips now appear beneath the map. When viewing the speaker rating data the titles above the sliders are slightly different. Currently when selecting the ‘speaker’ view the title is “This speaker from X sounds…” as opposed to “People from X sound…”. When selecting the ‘speaker/listener’ view the title is “People from Y think this speaker from X sounds…” as opposed to “People from Y think people from X sound…”. I also added a ‘back’ button to these perception follow-on pages so it’s easier to choose a different page. Finally, I added some missing HTML <title> tags to pages (e.g. ‘Register’ and ‘Privacy’) and fixed a bug whereby the ‘explore more’ map sound clips weren’t working.
With my ‘Speak For Yersel’ tasks out of the way I could spend some time looking at other projects that I’d put on hold for a while. A while back Eleanor Lawson contacted me about adding a new section to the Seeing Speech website where Gaelic speaker videos and data will be accessible, and I completed a first version this week. I replicated the Speech Star layout rather than the /r/ & /l/ page layout as it seemed more suitable: the latter only really works for a limited number of records while the former works well with lots more (there are about 150 Gaelic records). What this means is the data has a tabular layout and filter options. As with Speech Star you can apply multiple filters and you can order the table by a column by clicking on its header (clicking a second time reverses the order). I’ve also included the option to open multiple videos in the same window. I haven’t included the playback speed options as the videos already include the clip at different speeds. Here’s a screenshot of how the feature looks:
On Thursday I had a Zoom call with Laura Rattray and Ailsa Boyd to discuss a new digital edition project they are in the process of planning. We had a really great meeting and their project has a lot of potential. I’ve offered to give technical advice and write any technical aspects of the proposal as and when required, and their plan is to submit the proposal in the autumn.
My final major task for the week was to continue to work on the Ramsay ‘Gentle Shepherd’ data. I overhauled the filter options that I implemented last week so they work in a less confusing way when multiple types are selected now. I’ve also imported the updated spreadsheet, taking the opportunity to trim whitespace to cut down on strange duplicates in the filter options. There are some typos you’ll need to fix in the spreadsheet, though (e.g. we have ‘Glagsgow’ and ‘Glagsow’) plus some dates still need to be fixed.
I then created an interactive map for the project and have incorporated the data for which there are latitude and longitude values. As with the Edinburgh Gazetteer map of reform societies (https://edinburghgazetteer.glasgow.ac.uk/map-of-reform-societies/) the number of performances at a venue is displayed in the map marker. Hover over a marker to see info about the venue. Click on it to open a list of performances. Note that when zoomed out it can be difficult to make out individual markers but we can’t really use clustering as on the Burns Supper map (https://burnsc21.glasgow.ac.uk/supper-map/) because this would get confusing: we’d have clustered numbers representing the number of markers in a cluster and then individual markers with a number representing the number of performances. I guess we could remove the number of performances from the marker and just have this in the tooltip and / or popup, but it is quite useful to see all the numbers on the map. Here’s a screenshot of how the map currently looks:
I still need to migrate all of this to the University’s T4 system, which I aim to tackle next week.
Also this week I had discussions about migrating an externally hosted project website to Glasgow for Thomas Clancy. I received a copy of the files and database for the website and have checked over things and all is looking good. I also submitted a request for a temporary domain and I should be able to get a version of the site up and running next week. I also regenerated a list of possible duplicate authors in the Books and Borrowing system after the team had carried out some work to remove duplicates. I will be able to use the spreadsheet I have now to amalgamate duplicate authors, a task which I will tackle next week.
I seem to be heading through a somewhat busy patch at the moment, and had to focus my efforts on five major projects and several other smaller bits of work this week. The major projects were SCOSYA, Books and Borrowing, DSL, HT and Bess of Hardwick’s Account books. For SCOSYA I continued to implement the public atlas, this week focussing on the highlighting of groups. I had hoped that this would be a relatively straightforward feature to implement, as I had already created facilities to create and view groups in the atlas I’d made for the content management system. However, it proved to be much trickier than I’d anticipated as I’d rewritten much of the atlas code in order to incorporate the GeoJSON areas as well as purely point-based data, plus I needed to integrate the selection of groups and the loading of group locations with the API. My existing code for finding the markers for a specified group and adding a coloured border was just not working, and I spent a frustratingly long amount of time debugging the code to find out what had changed to stop the selection from finding anything. It turned out that in my new code I was reinstantiating the variable I was using to hold all of the point data within a function, meaning that the scope of the variable containing the data was limited to that function rather than being available to other functions. Once I figured this out it was a simple fix to make the data available to the parts of the code that needed to find and highlight relevant markers and I then managed to make groups of markers highlight or ‘unhighlight’ at the press of a button, as the following screenshot demonstrates:
You can now select one or more groups and the markers in the group are highlighted in green. Press a group button a second time to remove the highlighting. However, there is still a lot to be done. For one thing, only the markers highlight, not the areas. It’s proving to be rather complicated to get the areas highlighted as these GeoJSON shapes are handled quite differently to markers. I spent a long time trying to get the areas to highlight without success and will need to return to this another week. I also need to implement highlighting in different colours, so each group you choose to highlight is given a different colour to the last. Also, I need to find a way to make the selected groups be remembered as you change from points to areas to both, and change speaker type, and also possibly as you change between examples. Currently the group selection resets but the selected group buttons remain highlighted, which is not ideal.
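The scoping problem described above, where the variable holding the point data was being re-declared inside a function, is easy to reproduce in miniature. This is just an illustrative sketch with made-up names (pointData, loadPointsBuggy, loadPointsFixed, findGroupMarkers) and made-up sample data; it is not the atlas code itself.

```javascript
// Shared store for the point data, declared once in the outer scope.
let pointData = [];

// Buggy version: "let" declares a NEW local variable that shadows the outer one,
// so the outer pointData is still empty once this function returns.
function loadPointsBuggy(points) {
  let pointData = points;
  return pointData.length;
}

// Fixed version: assign to the outer variable instead of re-declaring it.
function loadPointsFixed(points) {
  pointData = points;
  return pointData.length;
}

// Find the loaded points whose location belongs to the selected group.
function findGroupMarkers(groupLocations) {
  return pointData.filter((point) => groupLocations.includes(point.location));
}

// Hypothetical sample data.
const samplePoints = [{ location: "Ayr" }, { location: "Wick" }];
```

After calling loadPointsBuggy(samplePoints) the group lookup finds nothing, which matches the symptom I was seeing; after loadPointsFixed(samplePoints) it finds the expected marker.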
I also spent time this week on the pilot project for Matthew Sangster’s Books and Borrowing project, which is looking at University student (and possibly staff) borrowing records from the 18th century. Matthew has compiled a spreadsheet that he wants me to create a searchable / browsable online resource for and my first task was to extract the data from the spreadsheet, create an online database and write a script to migrate the data to this database. I’ve done this sort of task many times before, but unfortunately things are rather more complicated this time because Matthew has included formatting within the spreadsheet that needs to be retained in the online version. This includes superscript text throughout the more than 8000 records and simply saving the spreadsheet as a CSV file and writing a script to go through each cell and upload the data won’t work as the superscript style will be lost in the conversion to CSV. phpMyAdmin also includes a facility to import a spreadsheet in the OpenDocument format, but unfortunately this not only removes the superscript format but also the text that is specified as superscript as well.
Therefore I had to investigate other ways of getting the data out of the spreadsheet while somehow retaining the superscript formatting. The only means of doing so that I could think of was to save the spreadsheet as an HTML document, which would convert Excel’s superscript formatting into HTML superscript tags, which is what we’d need for displaying the data on a website anyway. Unfortunately the HTML generated by Excel is absolutely awful and filled with lots of unnecessary junk that I then needed to strip out manually. I managed to write a script that extracted the data (including the formatting for superscript) and import this into the online database for about 8000 of the 8200 rows, but the remainder had problems that prevented the insertion from taking place. I’ll need to think about creating multiple passes for the data when I return to it next week.
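To give a flavour of the sort of clean-up involved, here is a rough sketch of stripping Excel’s junk from a single table cell while keeping the superscript tags. It is an illustration in JavaScript rather than my actual migration script, the function name cleanCell and the sample markup are made up, and a naive regular-expression approach like this will trip over unusual markup (for example attribute values containing ‘>’).

```javascript
// Reduce one cell of exported HTML to plain text plus <sup> tags.
function cleanCell(html) {
  return html
    // Normalise any <sup ...> or </SUP> variant to a bare <sup> or </sup>.
    .replace(/<(\/?)sup\b[^>]*>/gi, "<$1sup>")
    // Remove every other tag (td, font, span and so on).
    .replace(/<(?!\/?sup>)[^>]*>/gi, "")
    // Turn non-breaking space entities into ordinary spaces.
    .replace(/&nbsp;/g, " ")
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical example of the kind of markup an export might contain.
const rawCell =
  '<td class=xl65 style="height:15pt">M<sup><font class="font5">r</font></sup>&nbsp;Smith</td>';
const cleaned = cleanCell(rawCell);
```

Running cleanCell on the sample cell leaves ‘M<sup>r</sup> Smith’, which is the form needed for display on a web page.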
For the DSL this week I spent rather a lot of time engaged in email conversations with Rhona Alcorn about the tasks required to sort out the data that the team have been working on for several years and which now needs to be extracted from older systems and migrated to a new system, plus the API that I am working on. It looked like there would be a lot of work for me to do with this, but thankfully midway through the week it became apparent that the company who are supplying the new system for managing the DSL’s data have a member of staff who is expecting to do a lot of the tasks that had previously been assigned to me. This is really good news as I was beginning to worry about the amount of work I would have to do for the DSL and how I would fit this in around other work commitments. We’ll just need to see how this all pans out.
I also spent some time implementing a Boolean search for the new DSL API. I now have this in place and working for headword searches, which can be performed via the ‘quick search’ box on the test sites I’ve created. It’s possible to use Boolean AND, OR and NOT (all must be entered upper case to be picked up) and a search can be used in combination with wildcards, and speech-marks can now be used to specify an exact search. So, for example, if you want to find all the headwords beginning with ‘chang’ but wish to exclude results for ‘change’ and ‘chang’ you can enter ‘chang* NOT “change” NOT “chang”’.
OR searches are likely to bring back lots of results and at the moment I’ve not put a limit on the results, but I will do so before things go live. Also, while there are no limits on the number of Booleans that can be added to a query, results when using multiple Booleans are likely to get a little weird due to there being multiple ways a query could be interpreted. E.g. ‘Ran* OR run* NOT rancet’ still brings back ‘rancet’ because the query is interpreted as “get all the ‘ran*’ results OR all the ‘run*’ results so long as they don’t include ‘rancet’” – so ran* OR (run* NOT rancet). But without complicating things horribly with brackets or something similar there’s no way of preventing such ambiguity when multiple different Booleans are used.
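The ambiguity in the ‘Ran* OR run* NOT rancet’ example is easier to see when the two possible readings are written out side by side. The sketch below is purely illustrative: it filters a tiny made-up list of headwords in JavaScript, whereas the real API runs its searches against the dictionary data, and the helper name matches is hypothetical.

```javascript
// Test a headword against a term; only a trailing * wildcard is supported here.
function matches(term, word) {
  const t = term.toLowerCase();
  const w = word.toLowerCase();
  return t.endsWith("*") ? w.startsWith(t.slice(0, -1)) : w === t;
}

// A made-up sample of headwords.
const headwords = ["ran", "rancet", "run", "runt"];

// Reading 1: ran* OR (run* NOT rancet), the interpretation described above.
const notBindsToLastTerm = headwords.filter(
  (w) => matches("ran*", w) || (matches("run*", w) && !matches("rancet", w))
);

// Reading 2: (ran* OR run*) NOT rancet, which is probably what the searcher meant.
const notAppliesToWholeQuery = headwords.filter(
  (w) => (matches("ran*", w) || matches("run*", w)) && !matches("rancet", w)
);
```

Under the first reading ‘rancet’ is still returned because it satisfies ‘ran*’ before the NOT is ever considered; under the second it is excluded.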
For the Historical Thesaurus I met with Marc and Fraser on Monday to discuss our progress with the HT / OED linking and afterwards continued with a number of tasks that were either ongoing or had been suggested at the meeting. This included ticking off some matches from a monosemous script, creating a new script that brings back up to 1000 random unmatched lexemes at a time for spot-checking and creating an updated Levenshtein script for lexemes, which is potentially going to match a further 5000 lexemes. I also wrote a document detailing how I think that full dates should be handled in the HT, to replace the rather messy way dates are currently recorded. We will need to decide on a method in order to get the updated dates from the OED into a comparable format.
Also this week I returned to Alison Wiggins’s Account Books project, or rather a related output about the letters of Mary, Queen of Scots. Alison had sent me a database containing a catalogue of letters and I need to create a content management system to allow her and other team members to work on this together. I’ve requested a new subdomain for this system and have begun to look at the data and will get properly stuck into this next week, all being well.
Other than these main projects I also gave feedback on Thomas Clancy’s Iona project proposal, including making some changes to the Data Management Plan, helped sort out access to logo files for the Seeing Speech project, sorted out an issue with the Editing Burns blog that was displaying no content since the server upgrade (it turns out it was using a very old plugin that was not compatible with the newer version of PHP on the server) and helped sort out some app issues. All in all a very busy week.
I continued to develop the public interface for the SCOSYA project this week, and also helped out with the preparations for next week’s Data Hack event that the project is organising, which involved sorting out hosting for a lot of sample data. On Monday I had a meeting with Jennifer and E, at which we went through the interface I had so far created and discussed things that needed updated or changed in some way. It was a useful meeting and I came away with a long list of things to do, which I then spent quite some time during the remainder of the week implementing. This included changing the font used throughout the site and drastically changing the base layer we use for the maps. I had previously created a very simple ‘green land, blue sea’ base map, which is what the team had requested, but they wanted to try something a bit simpler still – white sea and light grey land – in order to emphasise the data points more than anything else. I also removed all place-names from the map and in fact everything other than borders and water. I also updated the colour range used for ratings, from a yellow to red scheme to a more grey / purple scheme that had been suggested by E. This is now used both for the markers and for the areas. Regarding areas, I removed the white border from the areas to make areas with the same rating blend into one another and make the whole thing look more like a heatmap, as the following screenshot demonstrates:
I also completely changed the way the pop-ups look, as it was felt that the previous version was just a bit too garish and comic book like. The screenshot below shows markers with a pop-up open:
I also figured out how to add sound clips to story slides and I’ve changed how the selection of ‘examples’ works. Rather than having a drop-down list and then all of the information about a selected feature displayed underneath I have split things up. Now when you open the ‘Examples’ section you will see the examples listed as a series of buttons. Pressing on one of these then loads the feature, automatically loading the data for it into the map. There’s a button for returning to the list of examples, then the feature’s title and description, followed by sound clips if there are any. Underneath this are the buttons for changing ‘speakers’ and ‘locations’. Pressing on one of these options now automatically refreshes the map so there’s no longer any need for a ‘Show’ button. I think this works much better. Note that your choice of speaker and location is remembered when using the map – e.g. if you have selected ‘Young’ and ‘Areas’ then go back and select a different example then the map will default to ‘Young’ and ‘Areas’ when this new feature is displayed.
I’ve also added a check for screen size that fires every time a side panel section is opened. This ensures that if someone has resized their browser or changed the orientation of their screen the side panel should still fit. I still haven’t had time to get the ‘groups’ feature working yet, or to fix the display of stories on smaller screens. I also need to update the ‘Learn more’ section so it uses a list rather than a drop-down box, all tasks I hope to continue with next week.
I also spent a bit of time on the Seeing Speech and Dynamic Dialects projects, helping to add in a new survey for each, participated in the monthly College of Arts developers coffee catch-up and advised a couple of members of staff on blog related issues and spoke to Kirsteen McCue about the proposal she’s putting together.
Other than these tasks I spent about a day working on DSL issues. This included getting some data to Ann about which existing DSL entries were not present in the dataset that had been newly extracted from the server. This appears to have been caused by some entries being merged with existing entries. I also managed to get the new dataset uploaded to our temporary web-server and created a new API that outputs this new data. I still need to create an alternative version of the DSL front-end that connects to this new version of the data, which I hope to be able to at least get started on next week. I also did some investigation into scripts that Thomas Widmann had discussed in some hand-over documentation that did not seem to be available anywhere and discussed some issues relating to the server the DSL people host in their offices.
I also spent some time working on HT duties, making some tweaks to existing scripts based on feedback from Fraser and investigating why one of our categories is not accessible via the website (the answer being it was a subcategory that didn’t have a main category in the same part of speech so had no category to ‘hang’ off). I also had a further meeting with Marc and Fraser on Friday to discuss our progress with the HT OED linking.
This week I was mainly working on three projects: The Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network. For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets. Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday. By Wednesday I had ticked off most of the items, which I’ll summarise here.
Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only advanced searches for words, so I investigated and fixed this. I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to ‘ and ‘the’). This worked well and has bumped matches up to better colour levels, but has resulted in some words getting matched multiple times. E.g. when removing ‘to’, ‘of’ etc this results in a form that then appears multiple times. For example, in one category the OED has ‘bless’ twice (presumably an error?) and HT has ‘bless’ and ‘bless to’. With ‘to’ removed there then appear to be more matches than there should be. However, this is not an issue when dates are also taken into consideration. I also updated the script so that categories where there are 3 matches and at least 66% of words match have been promoted from orange to yellow.
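As a rough illustration of the stoplist and the new yellow rule, here is a small sketch. Several things in it are assumptions rather than facts about the real script: the stoplist contains only the three example words mentioned above, ‘3 matches’ is read as ‘at least three’, the 66% is measured against the number of OED words in the category, and the function names normalise, countMatches and isYellow are made up.

```javascript
// Example stop words only; the real stoplist is assumed to be longer.
const STOPLIST = ["to", "the", "of"];

// Lower-case a lexeme and drop any stop words before comparison.
function normalise(lexeme) {
  return lexeme
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word !== "" && !STOPLIST.includes(word))
    .join(" ");
}

// Count the OED words that have a normalised counterpart among the HT words.
function countMatches(oedWords, htWords) {
  const htForms = new Set(htWords.map(normalise));
  return oedWords.filter((word) => htForms.has(normalise(word))).length;
}

// Assumed reading of the rule: at least 3 matches and at least 66% of OED words matched.
function isYellow(matchCount, oedWordCount) {
  return matchCount >= 3 && matchCount / oedWordCount >= 0.66;
}
```

So ‘bless to’ is compared as ‘bless’, and a category with three of its four words matched (75%) would be promoted to yellow, while three out of five (60%) would stay orange.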
When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off and wondered whether category heading pattern matching had worked properly. After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is they contain too few words to have reached the criteria for ticking off.
Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but different numbers. I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria). There are currently 19 green, 17 lime green and 25 yellow matches that could be ticked off. I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ I ticked off the few matches that there were. Most were empty categories and there were fewer than 15 in total.
Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms. We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories. It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms. This has helped to more accurately identify matched categories. I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.
I also worked on the date fingerprinting script. This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data. I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified. I ticked off all green (1556), lime green (22) and yellow (123) matches.
Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year. The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process. For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier). There are 7148 OED categories that are currently unmatched but were matched in V1. Almost 4000 of these are empty categories. There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match. But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words. It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2. For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02. There are possibly patterns in the 1504 orange matches that could be exploited too.
Finally, I updated the stats page to include information about main and subcats. Here are the current unmatched figures:
Unmatched (with POS): 8629
Unmatched (with POS and not empty): 3414
Unmatched Main Categories (with POS): 5036
Unmatched Main Categories (with POS and not empty): 1661
Unmatched Subcategories (with POS): 3573
Unmatched Subcategories (with POS and not empty): 1753
So we are getting there!
For the Bilingual Thesaurus I completed an initial version of the website this week. I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise. This might be changed again, but for now here is an example of how the resource looks:
The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters. As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’ but ‘bread’ won’t match anything because there are no words or categories with this exact text. You need to use an asterisk wildcard to find text within word or category text: ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.
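The wildcard behaviour can be sketched in a few lines of Python. This is an illustration only, not the site’s code: the function names and sample items are made up, and accent and case handling are left out.

```python
import re


def quick_search_pattern(term):
    """Build a regex in which '*' matches any run of characters
    and everything else in the term is treated literally."""
    literal_parts = [re.escape(part) for part in term.split("*")]
    return re.compile(".*".join(literal_parts))


def quick_search(term, items):
    """Return the items that match the whole term (not just part of it)."""
    pattern = quick_search_pattern(term)
    return [item for item in items if pattern.fullmatch(item)]


# Made-up headings and words, purely for illustration.
items = ["ale", "bread-maker", "white bread", "breadcrumbs in ale"]
```

Here `quick_search("bread", items)` returns nothing because no item is exactly ‘bread’, while `quick_search("*bread*", items)` returns the three items that contain it.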
The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates and languages of origin and citation. Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period. E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.
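The ‘active in the period’ test boils down to checking whether two year ranges overlap. A minimal sketch in Python, with a hypothetical function name and inclusive year boundaries assumed:

```python
def active_in_period(word_start, word_end, search_start, search_end):
    """True if the word's date range overlaps the searched range at all."""
    return word_start <= search_end and word_end >= search_start


# 'Edifier' (1100-1350) against a search for 1330-1360.
edifier_found = active_in_period(1100, 1350, 1330, 1360)
```

A word dated 1100-1300 would not be returned for the same search, as its range ends before the searched period begins.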
As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned. Where search types allow multiple options to be selected (i.e. part of speech and languages of origin and citation) if multiple options in each list are selected these are joined by ‘OR’. E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages of origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
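As a rough illustration of how the ‘AND’ and ‘OR’ joins combine, here is a small Python sketch. The record layout, field names and sample words are invented for the example and are not the project’s actual data structure.

```python
def matches_all(word, selections):
    """AND across search boxes, OR within the options ticked in each box."""
    for field, ticked in selections.items():
        # An empty box places no restriction on the results.
        if ticked and not (word[field] & ticked):
            return False
    return True


# Invented records: each field holds a set of values.
words = [
    {"headword": "w1", "pos": {"noun"}, "origin": {"Dutch"}},
    {"headword": "w2", "pos": {"verb"}, "origin": {"Italian", "Latin"}},
    {"headword": "w3", "pos": {"adjective"}, "origin": {"Flemish"}},
    {"headword": "w4", "pos": {"noun"}, "origin": {"Latin"}},
]
selections = {"pos": {"noun", "verb"}, "origin": {"Dutch", "Flemish", "Italian"}}
found = [w["headword"] for w in words if matches_all(w, selections)]
```

Only `w1` and `w2` come back: `w3` has a selected language but is an adjective, and `w4` is a noun but has none of the selected languages.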
For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on powerpoint presentations that had been sent to me. This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations, formatting the text as HTML, reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site), uploading the audio files, adding in the HTML5 audio tags to get the audio files to play, creating the individual pages for each timeline entry / storymap entry. It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 Powerpoint slides. Still, the end result works really well, so I think it’s worth putting the effort in.
In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app. I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create. The app would be interactive and highly dependent on logging user interactions as accurately as possible. I looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, that the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough. In fact on touchscreens more often than not a ‘tap’ wasn’t even being registered. D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are made in this way, so unfortunately it looks like I won’t be able to help out with this project. Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in this sort of timescale.
Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server. I think that’s just about everything to report.
I spent most of my time this week split between three projects: The HT / OED category linking, the REELS project and the Bilingual Thesaurus. For the HT I continued to work on scripts to try and match up the HT and OED categories. This week I updated all the currently in use scripts so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data. Doing this has significantly improved the matching on the first date lexeme matching script. Greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788. I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists and included two new columns. The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match while the second is this figure as a percentage of the total number of OED lexemes.
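The date normalisation might look something like the following Python sketch. It reads ‘first four numeric characters’ as the first run of four consecutive digits, which is an assumption, and the function name and sample date strings are made up.

```python
import re


def first_year(raw_date):
    """Convert 'OE' to 1000, then pull out the first four-digit run.
    Returns None when the text contains no such run."""
    text = raw_date.replace("OE", "1000")
    found = re.search(r"\d{4}", text)
    return int(found.group()) if found else None


# Made-up date strings in roughly the style of the two fields.
examples = {d: first_year(d) for d in ["OE", "c1325–1500", "a1400", "undated"]}
```

Reducing both the ‘GHT_date1’ and ‘fulldate’ values to a plain year in this way is what allows the two datasets to be compared directly.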
However, taking dates in isolation currently isn’t working very well, as if a date appears multiple times it generates multiple matches. So, for example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date. But lots of these have the same dates, meaning a total count of matched dates is 99, or 152% of the number of OED words. Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times each one appears in a category as its ‘date fingerprint’.
I created a new script to look at ‘date fingerprints’. The script generates arrays of categories for HT and OED unmatched categories. The dates of each word (or each word with a GHT date in the case of OED) in every category are extracted and a count of these is created (e.g. if the OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]). I ran this against the HT database to see what matches.
The script takes about half an hour to process. It grabs each unmatched OED category that contains words, picks out those that have GHT dates, gets the first four numerical figures of each and counts how many times this appears in the category. It does the same for all unmatched HT categories and their ‘fulldate’ column too. The script then goes through each OED category and for each goes through every HT category to find any that have not just the same dates, but the same number of times each date appears too. If everything matches the information about the matched categories is displayed.
The output has the same layout as the other scripts but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories. This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique. For an example of this search for our old favourite ‘extra-terrestrial’ and you’ll see that as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches. Nothing other than the dates are used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy. But I think this script is going to be very useful.
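The fingerprint idea can be sketched in Python with a `Counter`. This is not the script itself: the category ids and dates below are invented, and instead of comparing every OED category against every HT category the sketch indexes the HT fingerprints in a dictionary, which is a swapped-in shortcut that should give the same matches.

```python
from collections import Counter, defaultdict


def fingerprint(dates):
    """Each date plus the number of times it appears, in hashable form."""
    return frozenset(Counter(dates).items())


def match_fingerprints(oed_cats, ht_cats):
    """Map each OED category id to the HT ids sharing its exact fingerprint."""
    ht_by_print = defaultdict(list)
    for ht_id, dates in ht_cats.items():
        ht_by_print[fingerprint(dates)].append(ht_id)
    return {
        oed_id: ht_by_print.get(fingerprint(dates), [])
        for oed_id, dates in oed_cats.items()
        if dates  # skip categories with no dated words
    }


# Invented data: category id => start dates of its words.
oed_cats = {5678: [1000, 1000, 1000, 1234], 9999: [1963]}
ht_cats = {
    "A": [1234, 1000, 1000, 1000],  # same dates and counts as 5678
    "B": [1000, 1234],              # same dates, different counts
    "C": [1963],
    "D": [1963],
}
matches = match_fingerprints(oed_cats, ht_cats)
```

Category 5678 matches only `A`, while the single-word category 9999 matches both `C` and `D`, which is the non-unique situation described above.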
I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success. It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are being complicated by polysemous HT forms in different parts of speech.
With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off. Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off. There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off. That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones. I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words. I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong. 1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388. On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.
For the REELS project I continued going through my list of things to do before the project launch. This included reworking the Advanced Search layout, adding in tooltip text, updating the start date browse (which was including ‘inactive’ data in its count), creating some further icons for combinations of classification codes, adding in Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that allows you to specify if your element search looks at current forms, historical forms or both, adding in Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live. You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names).
I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a Historical form from 1797. The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’. The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source. Similarly if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though the only historical form before 1500 is ‘(lands of) Westmaynis’. This is because again the search is ‘get me all of the place-names with a historical form before 1500 that have any historical form including the text ‘N. mains’’. We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website. Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.
For the Bilingual Thesaurus I continued to implement the search facilities for the resource. This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work. After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things. I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working. I then made a start on the advanced search form.
Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.
After a rather hectic couple of weeks this was a return to a more regular sort of week, which was a relief. I still had more work to do than there was time to complete, but it feels like the backlog is getting smaller at least. As with previous weeks, I continued with the HT / OED linking of categories processes this week, following on from the meeting Marc, Fraser and I had the Friday before. For the lexeme / data matching script I separated out categories with zero matches that have words from the orange list into a new list with a purple background. So orange now only contains categories where at least one word and its start date match. The ones now listed in purple are almost certainly incorrect matches. I also changed the ordering of results so that categories are listed by the largest number of matches, to make it easier to spot matches that are likely ok.
I also updated the ‘monosemous’ script, so that the output only contains OED categories that feature a monosemous word and is split into three tables (with links to each at the top of the page). The first table features 4455 OED categories that include a monosemous word that has a comparable form in the HT data. Where there are multiple monosemous forms they each correspond to the same category in the HT data. The second table features 158 OED categories where the linked HT forms appear in more than one category. This might be because the word is not monosemous in the HT data and appears in two different categories (these are marked with the text ‘red|’, which can be searched for in the page). An OED category can also appear in this table even if there are no red forms if (for example) one of the matched HT words is in a different category to all of the others (see OED catid 45524, where the word ‘Puncican’ is found in a different HT category to the other words). The final table contains those OED categories that feature monosemous words that have no match in the HT data. There are 1232 of these. I also created a QA script for the 4455 matched monosemous categories, which applies the same colour coding and lexeme matching as other QA scripts I’ve created. On Friday we had another meeting to discuss the findings and plan our next steps, which I will continue with next week.
Also this week I wrote an initial version of a Data Management Plan for Thomas Clancy’s Iona project, and commented on the DMP assessment guidelines that someone from the University’s Data Management people had put together. I can’t really say much more about these activities, but it took at least a day to get all of this done. I also did some app management duties, setting up an account for a new developer, and made the new Seeing Speech and Dynamic Dialects websites live. These can now be viewed here: https://www.seeingspeech.ac.uk/ and here: https://www.dynamicdialects.ac.uk/. I also had an email conversation with Rhona Alcorn about Google Analytics for the DSL site.
With the REELS project’s official launch approaching, I spent a bit of time this week going through the 23 point ‘to do’ list I’d created last week. In fact, I added another three items to it. I’m going to tackle the majority of the outstanding issues next week, but this week I investigated and fixed an issue with the ‘export’ script in the Content Management System. The script is very memory intensive and it was exceeding the server’s memory limits, so asking Chris to increase this limit sorted the issue. I also updated the ‘browse place-names’ feature of the CMS, adding a new column and ordering facility to make it clearer which place-names actually appear on the website. I also updated the front-end so that it ‘remembers’ whether you prefer the map or the text view of the data using HTML5 local storage and added in information about the Creative Commons license to the site and the API. I investigated the issue of parish boundary labels appearing on top of icons, but as of yet I’ve not found a way to address this. I might return to it before the launch if there’s time, but it’s not a massive issue. I moved all of the place-name information on the record page above the map, other than purely map-based data such as grid reference. I also removed the option to search the ‘analysis’ field from the advanced search and updated the element ‘auto-complete’ feature so that it only now matches the starting letters of an element rather than any letters. I also noticed that the combination of ‘relief’ and ‘water’ classifications didn’t have an icon on the map, so I created one for it.
I also continued to work on the Bilingual Thesaurus website this week. I updated the way in which source links work. Links to dictionary sources now appear as buttons in the page, rather than in a separate pop-up. They feature the abbreviation (AND / MED / OED) and the magnifying glass icon and if you hover over a button the non-abbreviated form appears. For OED links I’ve also added the text ‘subscription required’ to the hover-over text. I also updated the word record so that where language of origin is ‘unknown’ the language of origin no longer gets displayed, and I made the headword text a bit bigger so it stands out more. I also added the full hierarchy above the category heading in the category section of the browse page, to make it easier to see exactly where you are. This will be especially useful for people using the site on narrow screens as the tree appears beneath the category section so is not immediately visible. You can click on any of the parts of the hierarchy here to jump to that point.
I then began to work on the search facility, and realised I needed to implement a ‘search words’ list that features variants. I did this for the Historical Thesaurus and it’s really useful. What I’ve done so far is generate alternatives for words that have brackets and dashes. For example, the headword ‘Bond(e)-man’ has the following search terms: Bond(e)-man, Bond-man, Bonde-man, Bond(e) man, Bond man, Bonde man, Bond(e)man, Bondman, Bondeman. None of these varieties will ever appear on the website, but instead will be used to find the word when people search. I’ll need some feedback as to whether these options will suffice, but for now I’ve uploaded variants to a table and begun to get the quick search working. It’s not entirely there yet, but I should get this working next week. I also need to know what should be done about accented characters for search purposes. The simplest way to handle them would be to just treat them as non-accented characters – e.g. searching for ‘alue’ will find ‘alué’. However, this does mean you won’t be able to specifically search for words that include accented characters – e.g. a search for all the words featuring an ‘é’ will just bring back all words with an ‘e’ in them.
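Here is one way the variant generation and the accent folding could be sketched in Python. The function names are made up, and for simplicity all bracketed sections in a word are treated together rather than in every combination, which may differ from what the real script does.

```python
import re
import unicodedata


def bracket_forms(word):
    """Brackets kept, bracketed letters dropped, brackets removed."""
    dropped = re.sub(r"\([^)]*\)", "", word)
    unwrapped = word.replace("(", "").replace(")", "")
    return [word, dropped, unwrapped]


def dash_forms(word):
    """Dash kept, dash replaced by a space, dash removed."""
    return [word, word.replace("-", " "), word.replace("-", "")]


def search_variants(headword):
    """Every bracket form of every dash form, without duplicates."""
    variants = []
    for dashed in dash_forms(headword):
        for form in bracket_forms(dashed):
            if form not in variants:
                variants.append(form)
    return variants


def strip_accents(text):
    """Treat accented characters as their unaccented equivalents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


bondman_variants = search_variants("Bond(e)-man")
```

For ‘Bond(e)-man’ this produces the same nine search terms listed above, and `strip_accents("alué")` gives ‘alue’. A headword with no brackets or dashes just yields itself.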
I was intending to add a count of the number of words in each hierarchical level to the browse, or at least to make hierarchical levels that include words bold in the browse, so as to let users know whether it’s worthwhile clicking on a category to view the words at this level. However, I’ve realised that this will just confuse users as levels that have no words in them but include child categories that do have words in them would be listed with a zero or not in bold, giving the impression that there is no content lower down the hierarchy.
My last task for the week was to create a new timeline for the RNSN project based on data that had been given to me. I think this is looking pretty good, but unfortunately making these timelines and related storymaps is very time-intensive, as I need to extract and edit the images, upload them to WordPress, extract the text and convert it into HTML and fill out the template with all of the necessary fields. It took about 2 and a half hours to make this timeline. However, hopefully the end result will be worth it.
I continued to work on the HT / OED data alignment for a lot of this week. I updated the matching scripts I had previously created so that all matches based on last lexeme were removed and instead replaced by a ‘6 matches or more and 80% of words in total match’ check. This was a lot more effective than purely comparing the last word in each category and helped match up a lot more categories. I also created a QA script to check the manual matches that were made during our first phase of matching. There are 1407 manual matches in the system. The script also listed all the words in each potential matched category to make it easier to tell where any potential difficulties were. I also updated the ‘pattern matching’ script I’d created last week to list all words and include the ‘6 matches and 80%’ check and changed the layout so that separate groupings now appear in different tables rather than being all mixed up in one table. It took quite a long time to sort this out, but it’s going to be much more useful for manual checking.
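The ‘6 matches and 80%’ check is simple to express in code. In this Python sketch the function and parameter names are hypothetical, and the percentage is assumed to be taken against the total number of words in the category being checked, which the description above doesn’t spell out.

```python
def passes_match_check(matched, total_words, min_matches=6, min_share=0.8):
    """True when enough words match, both in number and as a proportion."""
    if total_words == 0:
        return False  # an empty category cannot pass a word-based check
    return matched >= min_matches and matched / total_words >= min_share


seven_word_category = passes_match_check(6, 7)  # 6 of 7 words match
```

Both conditions have to hold: 5 matches out of 5 fails on the count, and 8 matches out of 11 fails on the proportion.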
I then moved on to writing a new ‘sibling matching’ script. This script goes through all unmatched OED categories (this includes all that appear in other scripts such as the pattern matching one) and retrieves all sibling categories of the same POS. E.g. if the category is ‘01.01.01|03 (n)’ then the script brings back all HT noun subcats of ’01.01.01’ that are ‘level 1’ subcats and compares their headings. It then looks to see if there is a sibling category that has the same heading – i.e. looking for when a category has been renumbered within the same level of the thesaurus. This has uncovered several hundred such potential matches, which will hopefully be very helpful. I also then created a further script that compares non-noun headings to noun headings at the same level, as it looked like a number of times the OED kept the noun heading for other parts of speech while the HT renamed them. This identified a further 65 possible matches, which isn’t too bad.
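A simplified Python sketch of the sibling check follows. The `Category` fields, the sample headings and the exact-heading comparison are assumptions for illustration, and the restriction to ‘level 1’ subcats is left out.

```python
from typing import NamedTuple


class Category(NamedTuple):
    path: str     # main category number, e.g. '01.01.01'
    sub: str      # subcategory number, e.g. '03'
    pos: str      # part of speech, e.g. 'n'
    heading: str


def sibling_matches(oed_cat, ht_cats):
    """HT siblings with the same parent, POS and heading but a new number."""
    return [
        ht for ht in ht_cats
        if ht.path == oed_cat.path
        and ht.pos == oed_cat.pos
        and ht.sub != oed_cat.sub
        and ht.heading == oed_cat.heading
    ]


# Invented categories, purely for illustration.
oed_cat = Category("01.01.01", "03", "n", "sample heading")
ht_cats = [
    Category("01.01.01", "04", "n", "sample heading"),  # renumbered sibling
    Category("01.01.01", "05", "n", "another heading"),
    Category("01.01.01", "04", "vt", "sample heading"),  # wrong POS
]
renumbered = sibling_matches(oed_cat, ht_cats)
```

Only the first HT category is returned, as it is the one that keeps the heading and part of speech but has moved from ‘03’ to ‘04’.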
I met with Marc and Fraser on Wednesday to discuss the recent updates I’d made, after which I managed to tick off 2614 matched categories, taking our total of unmatched OED categories that have a part of speech and are not empty down to 10,854. I then made a start on a new script that looks at pattern matching for category contents (i.e. words), but I didn’t have enough time to make a huge amount of progress with this.
to try and get things working but the callbacks were never being initiated – i.e. data wasn’t getting through to Google. Thankfully Stack Overflow had an answer that worked (After trying several that didn’t):
I’ve updated this so that pageviews rather than events are sent and now everything seems to be working again.
I spent a bit more time this week working on the Bilingual Thesaurus project, focussing on getting the front end for the thesaurus working. I’ve reworked the code for the HT’s browse facility to work with the project’s data. This required quite a lot of work as structurally the datasets are quite different – the HT relies on its ‘tier’ numbers for parent / child / sibling category relationships, and also has different categories for parts of speech and nested subcategories. The BTH data is much simpler (which is great) as it just has parent and child categories, with things like part of speech handled at word level. This meant I had to strip a lot of stuff out of the code and rework things. I’m also taking the opportunity to move to a new interface library (Bootstrap) so had to rework the page layout to take this into consideration too. I managed to get an initial version of the browse facility working now, which works in much the same way as the main HT site: clicking on a heading allows you to view its words and clicking on a ‘plus’ sign allows you to view the child categories. As with the HT you can link directly to a category too. I do still need to work on the formatting of the category contents, though. Currently words are just listed all together, with their type (AN or ME) listed first, then the word, then the POS in brackets, then dates (if available). I haven’t included data about languages of source or citation yet, or URLs. I’m also going to try and get the timeline visualisations working as well. I’ll probably split the AN and ME words into separate tabs, and maybe split the list up by POS too. I’m also wondering whether the full category hierarchy should be represented above the selected category (the right pane), as unlike the HT there’s no category number to show your position in the thesaurus.
Also, as a lot of the categories are empty I’m thinking of making the ones with words in them bold in the tree, or even possibly adding a count of words in brackets after the category heading. I’ve also updated the project’s homepage to include the ‘sample category’ feature, allowing you to press the ‘reload’ icon to load a new random category.
On Friday I spent most of the day working on the RNSN project, adding direct links to the ‘nation’ introductions to the main navigation menu and creating new ‘storymap’ stories based on Powerpoint presentations that had been sent to me. This is actually quite a time-consuming process as it involves grabbing images from the PPT, reformatting them, uploading them to WordPress, linking to them from the Storymap pages, creating Zoomified versions of the image or images that will be used as the ‘map’ for the story, extracting audio files from the PPT and uploading them, grabbing all of the text and formatting it for display and other such tasks. However, despite being a long process the end result is definitely worth it as the storymaps work very nicely. I managed to get two such stories completed today, and now I’ve re-familiarised myself with the process it should be quicker when the next set get sent to me.
I’m going to be on holiday next week so there won’t be another report from me until the week after that.
Having left Rob Maslen’s Fantasy blog in a somewhat unfinished state last Friday due to server access issues, I jumped straight into completing this work on Monday morning. Thankfully I could access the server again and after spending an hour or so tweaking header images, choosing colour schemes and fonts, reinstating widgets, menus and such things I managed to get the site fully working again, with a fully responsive theme: http://fantasy.glasgow.ac.uk/. I also updated some content on the Burns Paper Database website for Ronnie Young, completed my PDR, responded to a query about TheGlasgowStory and met with Matt Barr in Computing Science to discuss some possible future developments. I also made some further tweaks to the Seeing Speech and Dynamic Dialect website upgrades that are still ongoing. Eleanor had created new versions of some of the videos, so I uploaded them, and also updated all of the images in the image carousels for both sites.
I spent a fair amount of time this week updating the maps on the ‘Saints in Scottish Place-Names’ website. As mentioned in a previous post, the maps on this site all use Google maps, and Google now blocks access to their maps API unless you connect via an account that has a credit card associated with it. This is not very good for legacy research projects such as this one, so the plan was that I’d migrate the maps from Google to the free and open source Leaflet.js mapping library. Another advantage of Leaflet is that the scripts are all stored on the same server as the rest of the resource – we’re no longer reliant on a third-party server so there should be less risk of the maps becoming unavailable in future. Of course the map layers themselves are all stored on other third-party servers, but the ones I’ve chosen (based on the ones I selected for the REELS project) are all free to use, and another benefit of Leaflet is that it’s very simple to switch out one map layer for another – so if one tileset becomes unavailable I can replace it very quickly with another.
I created a new Leaflet powered version of the website in a subdirectory so I could test things out without messing up the live site. As far as I could tell there were four pages that featured maps, each using them in different ways. I migrated all of them over to the Leaflet mapping library and incorporated base maps and other features from the REELS and KCB map interface, namely:
- A map ‘display options’ button in the top left of the map that opens a panel through which you can change the base map.
- A choice of 6 base maps, as with REELS and KCB:
  - A default topographical map
  - A satellite map
  - A satellite map with things like roads, rivers and settlements marked on it
  - A modern OS map
  - A historical OS map from 1840-1888
  - A historical OS map from 1920-1933
- An ‘Attribution and copyright’ popup linked to from the bottom right of the map, which I adapted from REELS.
- A ‘full screen’ button in the bottom right of the map that allows you to view any map full screen. I’ve removed the ‘view larger map’ option on the Saints page as I didn’t think this was really necessary when the ‘full screen’ option is available anyway.
- A map scale (metric and imperial) appears in the bottom left of the map.
Here’s some information about the four map types that I updated:
- Place map
This is the simplest map and displays a marker showing the location of the place. Hover over the marker to view the place-name.
- Saint map
This map colour codes the markers based on ‘certainty’. I used the same coloured markers as found on the original map. I also added a map legend to the top right that shows you what the colours represent. You can turn any of the layers on or off to make it easier to see the markers you’re interested in (e.g. hide all ‘certain’ markers). I removed the legend section that appeared underneath the original map as this is no longer needed due to the in-map version.
- Search map
As with the original version, when you zoom in on an area any place-names found in the vicinity appear as red dots. I updated the functionality slightly so that as you pan round the map at one zoom level new markers continue to load (with the previous version you had to change the zoom level to initiate the loading of new markers). Now as you pan around new red spots appear all over the place like measles.
- Search results map
I couldn’t get the original version of this map to work at all, so I think there must have been some problem with it in addition to the Google Maps issue. Anyway, the new version displays the search results on a map, and if the search included a saint then the results are categorised by ‘certainty’ as with the saint map. You can turn certainty levels on or off. You can also open the marker pop-ups to link through to the place-name record and the saint record too.
There will no doubt be a few further tweaks required before I replace the live site with the new version I’ve been working on, but I reckon that the bulk of the work is now done.
I also continued with the Bilingual Thesaurus project, although I didn’t have as much time as I had hoped to work on this. However, I updated the ‘language of origin’ data for the 1829 headwords that had no language of origin, assigning ‘uncertain’ to all of them. I also noticed that 15 headwords have no ‘date of citation’ and I asked Louise whether this was ok. I also updated the way I’m storing dates. Previously I had set up a separate table where any number of date fields could be associated with a headword. Instead I have now added two new columns to the main ‘lexeme’ table: startdate and enddate. I then wrote a script that went through the originally supplied dates (e.g. [1230,1450]), adding the first date to the startdate column and the second date to the enddate column. Where an enddate is not supplied or is ‘0’ I’ve added the startdate to this column, just to make it clearer that this is a single year. Louise had mentioned that some dates would have ‘1450+’ as a second date but I’ve checked the original JSON file I was given and no dates have the plus sign, so I’ve checked with her in case this data has somehow been lost. I also discovered that there are 16 headwords that have an enddate but no startdate (e.g. the date in the original JSON file is something like [0,1436]) and have asked what should happen to these. Finally, I made a start on the front-end for the resource. There is very little in place yet, but I’ve started to create a ‘Bootstrap’ based interface using elements from the other thesaurus websites (e.g. logo, fonts). Once a basic structure is in place I’ll get the required search and browse facilities up and running and we can then think about things such as colour schemes and site text.
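The date conversion described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the script I actually ran: the function name `split_date_range` is made up for the example, and it assumes each supplied date is a list of one or two integers where ‘0’ means ‘not supplied’. Entries with an enddate but no startdate are left with an empty startdate here, since what should happen to those is still an open question.

```python
from typing import Optional, Sequence, Tuple


def split_date_range(dates: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Turn a supplied date list such as [1230, 1450] into (startdate, enddate).

    Hypothetical helper: a value of 0, or a missing value, counts as 'not supplied'.
    """
    start = dates[0] if len(dates) > 0 and dates[0] else None
    end = dates[1] if len(dates) > 1 and dates[1] else None
    # A missing or zero enddate means a single year, so reuse the startdate.
    if end is None:
        end = start
    return start, end
```

So `[1230, 1450]` becomes a startdate of 1230 and an enddate of 1450, while `[1230]` and `[1230, 0]` both end up with 1230 in both columns.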
I spent the rest of the week on the somewhat Sisyphean task of matching up the HT and OED category data. This is a task that Fraser and I have been working on, on and off, for over a year now, and it seemed like the end was in sight, as we were down to just a few thousand OED categories that were unmatched. However, last week I noticed some errors in the category matching, and some further occasions where an OED category has been connected to multiple HT categories. Marc, Fraser and I met on Monday to discuss the process and Marc suggested we start the matching process from scratch again. I was rather taken aback by this as there appeared to only be a few thousand erroneous matches out of more than 220,000 and it seemed like a shame to abandon all our previous work. However, I’ve since realised this is something that needed to be done, mainly because the previous process wasn’t very well documented and could not be easily replicated. It’s a process that Fraser and I could only focus on between other commitments and progress was generally tracked via email conversations and a few Word documents. It was all very experimental, and we often ran a script, which matched a group of categories, then altered the script and ran it again, often several times in succession. We also approached the matching from what I realise now is the wrong angle: starting with the HT categories and trying to match these to the OED categories. However, it’s the OED categories that need to be matched and it doesn’t really matter if HT categories are left unmatched (as plenty will be, since they are more recent additions or are empty place-holder categories). We’ve also learned a lot from the initial process and have identified certain scripts and processes that we know are the most likely to result in matches.
It was a bit of a wrench, but we have now abandoned our first stab at category matching and are starting over again. Of course, I haven’t deleted the previous matches so no data has been lost. Instead I’ve created new ‘v2’ matching fields and I’m being much more rigorous in documenting the processes that we’re putting the data through and ensuring every script is retained exactly as it was when it performed a specific task rather than tweaking and reusing scripts.
I then ran an initial matching script that looked for identical matches, where the maincat, subcat, part of speech and ‘stripped’ heading were all identical. This matched 202,030 OED categories, leaving just 27,295 unmatched. However, it is possible that not all of these 202,030 matches are actually correct. This is because quite often a category heading is reused (e.g. there are lots of subcats that have the heading ‘pertaining to’), so it’s possible that a category might look identical but in actual fact be something completely different. To check for this I ran a script that finds matches where the combination of the stripped heading and the part of speech appears in more than one category. There are 166,096 matched categories where this happens. For these the script then compares the total number of words and the last word in each match to see whether the match looks valid. There were 12,640 where the number of words or the last word were not the same and I created a further script that then checked whether these had identical parent category headings. This then identified 2,414 that didn’t. These will need further checking.
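To make the sequence of checks a bit more concrete, here is a rough sketch of the approach in Python. It is an illustration only, not the scripts that were actually run against the database: the field names (`maincat`, `subcat`, `pos`, `stripped_heading`, `word_count`, `last_word`, `parent_heading`) and the function names are assumptions made for the example, and each category is assumed to be a simple dictionary.

```python
from collections import Counter
from typing import Dict, List, Tuple

Category = Dict[str, object]
Match = Tuple[Category, Category]  # (OED category, HT category)


def match_key(cat: Category) -> tuple:
    """The four fields that must all be identical for an exact match."""
    return (cat["maincat"], cat["subcat"], cat["pos"], cat["stripped_heading"])


def match_identical(oed_cats: List[Category], ht_cats: List[Category]) -> List[Match]:
    """Pair each OED category with the HT category that has the same key."""
    ht_by_key = {match_key(ht): ht for ht in ht_cats}
    return [
        (oed, ht_by_key[match_key(oed)])
        for oed in oed_cats
        if match_key(oed) in ht_by_key
    ]


def flag_for_checking(matches: List[Match]) -> List[Match]:
    """Return matches that look identical but may be different categories."""
    # Count how often each heading + part of speech is reused among the matches.
    reuse = Counter((oed["stripped_heading"], oed["pos"]) for oed, _ in matches)
    flagged = []
    for oed, ht in matches:
        if reuse[(oed["stripped_heading"], oed["pos"])] < 2:
            continue  # heading is not reused, so the match is taken as safe
        same_words = (
            oed["word_count"] == ht["word_count"]
            and oed["last_word"] == ht["last_word"]
        )
        same_parent = oed["parent_heading"] == ht["parent_heading"]
        if not same_words and not same_parent:
            flagged.append((oed, ht))
    return flagged
```

Each check only narrows down the set left over from the previous one, which is why the numbers drop at every stage: only matches that fail the word comparison and also have different parent headings are left for manual checking.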
I also noticed that a small number of HT categories had a parent whose combination of ‘oedmaincat’, ‘subcat’ and ‘pos’ information was not unique. This is an error and I created a further script to list all such categories. Thankfully there are only 98 and Fraser is going to look at these. I also created a new stats page for our V2 matching process, which I will hopefully continue to make good progress with next week.
I met with Matthew Creasey from English Literature this week to discuss a project website for his recently funded ‘Decadence and Translation Network’ project. The project website is going to be a fairly straightforward WordPress site, but there will also be a digital edition hosted through it, which will be sort of similar to what I did for the Woolf short story for the New Modernist Editing project (https://nme-digital-ode.glasgow.ac.uk/). I set up an initial site for Matthew and will work on it further once he receives the images he’d like to use in the site design.
I also gave some further help to Craig Lamont in getting access to Google Analytics for the Ramsay project, and spoke to Quintin Cutts in Computing Science about publishing an iOS app they have created. I also met with Graeme Cannon to discuss AHRC Data Management Plans, as he’s been asked to contribute to one and hasn’t worked with a plan yet. I also made a couple of minor fixes to the RNSN timeline and storymap pages and updated the ‘attribution’ text on the REELS map. There’s quite a lot of text relating to map attribution and copyright so instead of cluttering up the bottom of the maps I’ve moved everything into a new pop-up window. In addition to the statements about the map tilesets I’ve also added in a statement about our place-name data, the copyright statement that’s required for the parish boundaries, a note about Leaflet and attribution for the map icons too. I think it works a lot better.
Other than these issues I mainly focussed on three projects this week. For the SCOSYA project I tackled an issue with the ‘or’ search that was causing the search to not display results in a properly categorised manner when the ‘rated by’ option was set to more than one. It took a while to work through the code, and my brain hurt a bit by the end of it, but thankfully I managed to figure out what the problem was. Basically when ‘rated by’ was set to 1 the code only needed to match a single result for a location. If it found one that matched then the code stopped looking any further. However, when multiple results need to be found that match, the code didn’t stop looking, but instead had to cycle through all other results for the location, including those for other codes. So if it found two matches that met the criteria for ‘A1’ it would still go on looking through the ‘A2’ results as well, would realise these didn’t match and set the flag to ‘N’. I was keeping a count of the number of matches but this part of the code was never reached if the ‘N’ flag was set. I’ve now updated how the checking for matches works and thankfully the ‘Or’ search now works when you set ‘rated by’ to be more than 1.
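The general shape of the fix can be sketched like this. It’s a simplified Python illustration rather than the actual SCOSYA code, and it makes some assumptions: the results for one location are taken to be a list of (code, score) pairs, and `meets_criteria` is a stand-in for whatever test the search applies. The important point is that matches are counted separately for each code, so a non-matching ‘A2’ result can never cancel out the ‘A1’ matches that have already been found.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def or_search_matches(
    results: Iterable[Tuple[str, int]],
    meets_criteria: Callable[[str, int], bool],
    rated_by: int,
) -> List[str]:
    """Return the codes with at least `rated_by` matching results at a location.

    Hypothetical sketch: `results` holds (code, score) pairs for one location.
    """
    counts = defaultdict(int)
    for code, score in results:
        # Count per code; a failed result never resets another code's count.
        if meets_criteria(code, score):
            counts[code] += 1
    return sorted(code for code, n in counts.items() if n >= rated_by)
```

A location is then included in the ‘or’ search results if the returned list isn’t empty, and the codes in the list can be used to categorise the location.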
For the reworking of the Seeing Speech and Dynamic Dialects websites I decided to focus on the accent map and accent chart features of Dynamic Dialects. For the map I switched to using the Leaflet.js mapping library rather than Google Maps. This is mainly because I prefer Leaflet: you can use it with lots of different map tilesets, data doesn’t have to be posted to Google for the map to work, and you can zoom in and out with the scrollwheel of a mouse without having to also press ‘ctrl’, which gets really annoying with the existing map. I’ve also removed the option to switch from map to satellite and streetview, as these didn’t really seem to serve much purpose. The new base map is a free map supplied by Esri (a big GIS company). It isn’t cluttered up with commercial map markers when zoomed in, unlike Google.
You can now hover over a map marker to view the location and area details. Clicking on a marker opens up a pop-up containing all of the information about the speaker and links to the videos as ‘play’ buttons. Note that unlike the existing map, buttons for sounds only appear if there are actually videos for them. E.g. on the existing map for Oregon there are links for every video type, but only one (spontaneous) actually works.
Clicking on a ‘play’ button brings down the video overlay, as with the other pages I’ve redeveloped. As with other pages, the URL is updated to allow direct linking to the video. Note that any map pop-up you have open does not remain open when you follow such a link, but as the location appears in the video overlay header it should be easy for a user to figure out where the relevant marker is when they close the overlay.
For the Accent Chart page I’ve added in some filter options, allowing you to limit the display of data to a particular area, age range and / or gender. These options can be combined, and also bookmarked / shared / cited (e.g. so you can follow a link to view only those rows where the area is ‘Scotland’ the age range is ’18-24’ and the gender is ‘F’). I’ve also added a row hover-over colour to help you keep your eye on a row. As with other pages, click on the ‘play’ button and a video overlay drops down. You can also cite / bookmark specific videos.
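As a rough idea of how combinable, bookmarkable filters like these can work, here’s a small Python sketch that reads the filters from a URL’s query string and applies whichever ones are present. This isn’t the site’s actual code: the parameter names (`area`, `age`, `gender`), the example URL and the function name are all assumptions made for illustration.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlsplit

# Hypothetical query-string parameters, one per filter option.
FILTER_FIELDS = ("area", "age", "gender")


def filter_rows(rows: List[Dict[str, str]], url: str) -> List[Dict[str, str]]:
    """Keep only the rows that match every filter present in the URL."""
    query = parse_qs(urlsplit(url).query)
    # Filters that are absent (or blank) in the URL are simply not applied.
    active = {f: query[f][0] for f in FILTER_FIELDS if f in query}
    return [row for row in rows if all(row[f] == v for f, v in active.items())]
```

Because the filters live in the URL, a link such as `https://example.org/accent-chart?area=Scotland&age=18-24&gender=F` (a made-up address) would always show the same rows, which is what makes the view shareable and citable.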
I’ve made the table columns on this page as narrow as possible, but it’s still a lot of columns and unless you have a very wide monitor you’re going to have to scroll to see everything. There are two ways I can set this up. Firstly the table area of the page itself can be set to scroll horizontally. This keeps the table within the boundaries of the page structure and looks more tidy, but it means you have to vertically scroll to the bottom of the table before you see the scrollbar, which is probably going to get annoying and may be confusing. The alternative is to allow the table to break out of the boundaries of the page. This looks messier, but the advantage is the horizontal scrollbar then appears at the bottom of your browser window and is always visible, even if you’re looking at the top section of the table. I’ve asked Jane and Eleanor how they would prefer the page to work.
My final project of the week was the Historical Thesaurus. I spent some time working on the new domain names we’re setting up for the thesaurus, and on Thursday I attended the lectures for the new lectureship post for the Thesaurus. It was very interesting to hear the speakers and their potential plans for the Thesaurus in future, but obviously I can’t say much more about the lectures here. I also attended the retirement do for Flora Edmonds on Thursday afternoon. Flora has been a huge part of the thesaurus team since the early days of its switch to digital and I think she had a wonderful send-off from the people in Critical Studies she’s worked closely with over the years.
On Friday I spent some time adding the mini timelines to the search results page. I haven’t updated the ‘live’ page yet but here’s an image showing how they will look:
It’s been a little tricky to add the mini-timelines in as the search results page is structured rather differently to the ‘browse’ page. However, they’re in place now, both for general ‘word’ results and for words within the ‘Recommended Categories’ section. Note that if you’ve turned mini-timelines off in the ‘browse’ page they stay off on this page too.
We will probably want to add a few more things in before we make this page live. We could add in the full timeline visualisation pop-up, which I could set up to feature all search results, or at least the results for the current page of search results. If I did this I would need to redevelop the visualisation to try and squeeze in at least some of the category information and the pos, otherwise the listed words might all be the same. I will probably try to add in each word’s category and pos, which should provide just enough context, although subcat names like ‘pertaining to’ aren’t going to be very helpful.
We will also need to consider adding in some sorting options. Currently the results are ordered by ‘Tier’ number, but I could add in options to order results by ‘first attested date’, ‘alphabetically’ and ‘length of attestation’. ‘Alphabetically’ isn’t going to be hugely useful if you’re looking at a page of ‘sausage’ results, but will be useful for wildcard searches (e.g. ‘*sage’) and other searches like dates. I would imagine ordering results by ‘length of attestation’ is going to be rather useful in picking out ‘important’ words. I’ll hopefully have some time to look into these options next week.
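The sorting options could work along these lines. This is only a sketch in Python to show the idea, not something that’s in place yet: the field names (`word`, `tier`, `first_date`, `last_date`) are assumptions, as is the decision to put the longest-attested words first when ordering by ‘length of attestation’.

```python
from typing import Callable, Dict, List

Word = Dict[str, object]

# Hypothetical sort options, each mapped to the key used for ordering.
SORT_KEYS: Dict[str, Callable[[Word], object]] = {
    "tier": lambda w: w["tier"],
    "first_date": lambda w: w["first_date"],
    "alphabetical": lambda w: str(w["word"]).lower(),
    # Negated so that the longest-attested words come first (an assumption).
    "length": lambda w: -(w["last_date"] - w["first_date"]),
}


def sort_results(words: List[Word], order: str = "tier") -> List[Word]:
    """Return the search results in the requested order without altering the input."""
    return sorted(words, key=SORT_KEYS[order])
```

Keeping the options in a single lookup like this means that adding a further ordering later would just be a case of adding another entry.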