I spent quite a bit of time finishing things off for the Speak For Yersel project. I created a stats page for the project team to access. The page allows you to specify a ‘from’ and ‘to’ date (it defaults to showing stats from the end of May to the end of the current day). If you want a specific day you can enter the same date in ‘from’ and ‘to’ (e.g. ‘2022-10-04’ will display stats for everyone who registered on the Tuesday after the launch).
The stats relate to users registered in the selected period rather than answers submitted in the selected period. If a person registered in the selected period then all of their answers are included in the figures, whether they were submitted in the period or not. If a person registered outside of the selected period but submitted answers during the selected period these are not included.
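As a rough illustration of that rule, here is a minimal sketch of counting by registration date. The data shapes and field names (`registered`, `inScotland`, `userId`) are made up for the example and the real stats page will differ; dates are assumed to be date-only ISO strings.

```javascript
// Hypothetical data shapes: the real tables and field names are not given here.
function statsForPeriod(users, answers, from, to) {
  // ISO dates (YYYY-MM-DD) compare correctly as plain strings.
  const inPeriod = users.filter(u => u.registered >= from && u.registered <= to);
  const ids = new Set(inPeriod.map(u => u.id));
  const scotland = inPeriod.filter(u => u.inScotland).length;
  // Answers are counted by who submitted them, not by when they were submitted.
  const answerCount = answers.filter(a => ids.has(a.userId)).length;
  return {
    users: inPeriod.length,
    scotland: scotland,
    elsewhere: inPeriod.length - scotland,
    answers: answerCount
  };
}
```

Note that the submission date of an answer never enters the calculation, which is why answers from people who registered outside the period are left out even if they were submitted within it.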
The stats display the total number of users registered in the selected period, split into the number who chose a location in Scotland and those who selected elsewhere. Then the total number of survey answers submitted by these two groups is shown, divided into separate sections for the five surveys. I might need to update the page to add more in at a later date. For example, one thing that isn’t shown is the number of people who completed each survey as opposed to only answering a few questions. Also, I haven’t included stats about the quizzes or activities yet, but these could be added.
I also worked on an abstract about the project for the Digital Humanities 2023 conference. In preparation for this I extracted all of the text relating to the project from this blog as a record of the development of the project. It’s more than 21,000 words long and covers everything from our first team discussions about potential approaches in September last year through to the launch of the site last week. I then went through this and pulled out some of the more interesting sections relating to the generation of the maps, the handling of user submissions and the automatic generation of quiz answers based on submitted data. I sent this to Jennifer for feedback and then wrote a second version. Hopefully it will be accepted for the conference, but even if it’s not I’ll hopefully be able to go as the DH conference is always useful to attend.
Also this week I attended a talk about a lemmatiser for Anglo-Norman that some researchers in France have developed using the Anglo-Norman dictionary. It was a very interesting talk and included a demonstration of the corpus that had been constructed using the tool. I’m probably going to be working with the team at some point later on, sending them some data from the underlying XML files of the Anglo-Norman Dictionary.
I also replaced the Seeing Speech videos with a new set that Eleanor Lawson had generated, which were mirrored to match the videos we’re producing for the Speech Star project, and investigated how I will get to Zurich for a thesaurus-related workshop in January.
I spent the rest of the week working on the Books and Borrowing project, working on the ‘books’ tab in the library page. I’d started on the API endpoint for this last week, which returned all books for a library and then processed them. This was required as books have two title fields (standardised and original title), either one of which may be blank, so in order to sort the books by title the records first need to be returned to see which ‘title’ field to use. Also, ordering by number of borrowings and by author requires all books to be returned and processed. This works fine for smaller libraries (e.g. Chambers has 961 books) but returning all books for a large library like St Andrews that has more than 8,500 books was taking a long time, and resulting in a JSON file that was over 6MB in size.
I created an initial version of the ‘books’ page using this full dataset, with tabs across the top for each initial letter of the title (browsing by author and number of borrowings is still to do) and a count of the number of books in each tab also displayed. Book records are then displayed in a similar manner to how they appear in the ‘page’ view, but with some additional data, namely total counts of the number of borrowings for the book holding record and counts of borrowings of individual items (if applicable). These will eventually be linked to the search.
The page looked pretty good and worked pretty well, but was very inefficient as the full JSON file needed to be generated and passed to the browser every time a new letter was selected. Instead I updated the underlying database to add two new fields to the book holding table. The first stores the initial letter of the title (standardised if present, original if not) and the second stores a count of the total number of borrowings for the holding record. I wrote a couple of scripts to add this data in, and these will need to be run periodically to refresh these cached fields as they do not otherwise get updated when changes are made in the CMS. Having these fields in place means the scripts will be able to pinpoint and return subsets of the books in the library at the database query level rather than returning all data and then subsequently processing it. This makes things much more efficient as less data is being processed at any one time.
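The idea behind the two cached fields can be sketched roughly as follows. This is only an illustration: the property names (`standardisedTitle`, `originalTitle`, `initial`, `borrowings`) are invented, and taking the first character of the trimmed title is an assumption about how the initial letter is picked.

```javascript
// Hypothetical field names; the real schema is not shown in this post.
function refreshCachedFields(holding, borrowingCount) {
  // Prefer the standardised title and fall back to the original when it is blank.
  const title = (holding.standardisedTitle || holding.originalTitle || '').trim();
  return {
    ...holding,
    initial: title.charAt(0).toUpperCase(), // empty string if there is no title at all
    borrowings: borrowingCount
  };
}

// With the initial stored, a letter tab only needs the matching subset.
function booksForLetter(holdings, letter) {
  return holdings.filter(h => h.initial === letter);
}
```

In the database the second function would be a `WHERE` clause on the cached column rather than a filter in code, which is where the saving comes from.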
I still need to add in facilities to browse the books by initial letter of the author’s surname and also facilities to list books by the number of borrowings, but for now you can at least browse books alphabetically by title. Unfortunately for large libraries there is still a lot of data to process even when only dealing with specific initial letters. For example, there are 1063 books beginning with ‘T’ in St Andrews so the returned data still takes quite a few seconds to load in.
That’s all for this week. I’ll be on holiday next week so there won’t be a further report until the week after that.
I spent most of my time this week getting back into the development of the front-end for the Books and Borrowing project. It’s been a long time since I was able to work on this due to commitments to other projects and also due to there being a lot more for me to do than I was expecting regarding processing images and generating associated data in the project’s content management system over the summer. However, I have been able to get back into the development of the front-end this week and managed to make some pretty good progress. The first thing I did was to make some changes to the ‘libraries’ page based on feedback I received ages ago from the project’s Co-I Matt Sangster. The map of libraries used clustering to group libraries that are close together when the map is zoomed out, but Matt didn’t like this. I therefore removed the clusters and turned the library locations back into regular individual markers. However, it is now rather difficult to distinguish the markers for a number of libraries. For example, the markers for Glasgow and the Hunterian libraries (back when the University was still on the High Street) are on top of each other and you have to zoom in a very long way before you can even tell there are two markers there.
I also updated the tabular view of libraries. Previously the library name was a button that when clicked on opened the library’s page. Now the name is text and there are two buttons underneath. The first one opens the library page while the second pans and zooms the map to the selected library, whilst also scrolling the page to the top of the map. This uses Leaflet’s ‘flyTo’ function which works pretty well, although the map tiles don’t quite load in fast enough for the automatic ‘zoom out, pan and zoom in’ to proceed as smoothly as it ought to.
After that I moved onto the library page, which previously just displayed the map and the library name. I updated the tabs for the various sections to display the number of registers, books and borrowers that are associated with the library. The Introduction page also now features the information recorded about the library that has been entered into the CMS. This includes location information, dates, links to the library etc. Beneath the summary info there is the map, and beneath this is a bar chart showing the number of borrowings per year at the library. Beneath the bar chart you can find the longer textual fields about the library such as descriptions and sources. Here’s a screenshot of the page for St Andrews:
I also worked on the ‘Registers’ tab, which now displays a tabular list of the selected library’s registers, and I also ensured that when you select one of the tabs other than ‘Introduction’ the page automatically scrolls down to the top of the tabs to avoid the need to manually scroll past the header image (but we still may make this narrower eventually). The tabular list of registers can be ordered by any of the columns and includes data on the number of pages, borrowers, books and borrowing records featured in each.
When you open a register the information about it is displayed (e.g. descriptions, dates, stats about the number of books etc referenced in the register) and large thumbnails of each page together with page numbers and the number of records on each page are displayed. The thumbnails are rather large and I could make them smaller, but doing so would mean that all the pages end up looking the same – beige rectangles. The thumbnails are generated on the fly by the IIIF server and the first time a register is loaded it can take a while for the thumbnails to load in. However, generated thumbnails are then cached on the server so subsequent page loads are a lot quicker. Here’s a screenshot of a register page for St Andrews:
One thing I also did was write a script to add in a new ‘pageorder’ field to the ‘page’ database table. I then wrote a script that generated the page order for every page in every register in the system. This picks out the page that has no preceding page and iterates through pages based on the ‘next page’ ID. Previously pages in lists were ordered by their auto-incrementing ID, but this meant that if new pages needed to be inserted for a register they ended up stuck at the end of the list, even though the ‘next’ and ‘previous’ links worked successfully. This new ‘pageorder’ field ensures lists of pages are displayed in the proper order. I’ve updated the CMS to ensure this new field is used when viewing a register, although I haven’t as of yet updated the CMS to regenerate the ‘pageorder’ for a register if new pages are added out of sequence. For now if this happens I’ll need to manually run my script again to update things.
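For anyone curious, the approach of walking the ‘next page’ links can be sketched like this. It is a minimal illustration rather than the actual script: the `id` and `nextId` property names are assumptions, and the first page is found here by looking for the page that no other page points to.

```javascript
// Assumed shape: each page has an id and a nextId (null for the last page).
function generatePageOrder(pages) {
  const byId = new Map(pages.map(p => [p.id, p]));
  // Every id that appears as someone's nextId has a preceding page.
  const hasPrevious = new Set(
    pages.filter(p => p.nextId != null).map(p => p.nextId)
  );
  let current = pages.find(p => !hasPrevious.has(p.id));
  const order = new Map(); // page id -> pageorder
  let position = 1;
  // The order.has() check guards against a loop in bad data.
  while (current && !order.has(current.id)) {
    order.set(current.id, position++);
    current = current.nextId != null ? byId.get(current.nextId) : undefined;
  }
  return order;
}
```

A page inserted later gets a high auto-incrementing ID but still lands in the right place, because only the links are followed.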
Anyway, back to the front-end: The new ‘pageorder’ is used in the list of pages mentioned above so the thumbnails get displayed in the correct order. I may add pagination to this page, as all of the thumbnails are currently on one page and it can take a while to load, although these days people seem to prefer having long pages rather than having data split over multiple pages.
The final section I worked on was the page for viewing an actual page of the register, and this is still very much in progress. You can open a register page by pressing on its thumbnail and currently you can navigate through the register using the ‘next’ and ‘previous’ buttons or return to the list of pages. I still need to add in a ‘jump to page’ feature here too. As discussed in the requirements document, there will be three views of the page: Text, Image and Text and Image side-by-side. Currently I have implemented the image view only. Pressing on the ‘Image view’ tab opens a zoomable / pannable interface through which the image of the register page can be viewed. You can also make this interface full screen by pressing on the button in the top right. Also, if you’re viewing the image and you use the ‘next’ and ‘previous’ navigation links you will stay on the ‘image’ tab when other pages load. Here’s a screenshot of the ‘image view’ of the page:
Also this week I wrote a three-page requirements document for the redevelopment of the front-ends for the various place-names projects I’ve created using the system originally developed for the Berwickshire place-names project which launched back in 2018. The requirements document proposes some major changes to the front-end, moving to an interface that operates almost entirely within the map and enabling users to search and browse all data from within the map view rather than having to navigate to other pages. I sent the document off to Thomas Clancy, for whom I’m currently developing the systems for two place-names projects (Ayr and Iona) and I’ll just need to wait to hear back from him before I take things further.
I also responded to a query from Marc Alexander about the number of categories in the Thesaurus of Old English, investigated a couple of server issues that were affecting the Glasgow Medical Humanities site, removed all existing place-name elements from the Iona place-names CMS so that the team can start afresh and responded to a query from Eleanor Lawson about the filenames of video files on the Seeing Speech site. I also made some further tweaks to the Speak For Yersel resource ahead of its launch next week. This included adding survey numbers to the survey page and updating the navigation links and writing a script that purges a user and all related data from the system. I ran this to remove all of my test data from the system. If we do need to delete a user in future (either because their data is clearly spam or a malicious attempt to skew the results, or because a user has asked us to remove their data) I can run this script again. I also ran through every single activity on the site to check everything was working correctly. The only thing I noticed is that I hadn’t updated the script to remove the flags for completed surveys when a user logs out, meaning after logging out and creating a new user the ticks for completed surveys were still displaying. I fixed this.
I also fixed a few issues with the Burns mini-site about Kozeluch, including updating the table sort options which had stopped working correctly when I added a new column to the table last week and fixing some typos with the introductory text. I also had a chat with the editor of the Anglo-Norman Dictionary about future developments and responded to a query from Ann Ferguson about the DSL bibliographies. Next week I will continue with the B&B developments.
I completed an initial version of the Chambers Library map for the Books and Borrowing project this week. It took quite a lot of time and effort to implement the subscription period range slider. Searching for a range when the data also has a range of dates rather than a single date means we needed to make a decision about what data gets returned and what doesn’t. This is because the two ranges (the one chosen as a filter by the user and the one denoting the start and end periods of subscription for each borrower) can overlap in many different ways. For example, suppose the period chosen by the user is 05 1828 to 06 1829. Which of the following borrowers should therefore be returned?
- Borrower’s range is 06 1828 to 02 1829: Borrower’s range is fully within the period so should definitely be included.
- Borrower’s range is 01 1828 to 07 1828: Borrower’s range begins before the selected period and ends within it. Presumably should be included.
- Borrower’s range is 01 1828 to 09 1829: Borrower’s range extends beyond the selected period in both directions. Presumably should be included.
- Borrower’s range is 05 1829 to 09 1829: Borrower’s range begins during the selected period and ends beyond the selected period. Presumably should be included.
- Borrower’s range is 01 1828 to 04 1828: Borrower’s range is entirely before the selected period. Should not be included.
- Borrower’s range is 07 1829 to 10 1829: Borrower’s range is entirely after the selected period. Should not be included.
Basically if there is any overlap between the selected period and the borrower’s subscription period the borrower will be returned. But this means most borrowers will be returned most of the time. It’s a very different sort of filter to one that purely focuses on a single date – e.g. filtering the data to only those borrowers whose subscription periods *begin* between 05 1828 and 06 1829.
Based on the above assumptions I began to write the logic that would decide which borrowers to include when the range slider is altered. It was further complicated by having to deal with months as well as years. Here’s the logic in full if you fancy getting a headache:
if(((mapData[i].sYear>startYear || (mapData[i].sYear==startYear && mapData[i].sMonth>=startMonth)) && ((mapData[i].eYear==endYear && mapData[i].eMonth <=endMonth) || mapData[i].eYear<endYear)) || ((mapData[i].sYear<startYear ||(mapData[i].sYear==startYear && mapData[i].sMonth<=startMonth)) && ((mapData[i].eYear==endYear && mapData[i].eMonth >=endMonth) || mapData[i].eYear>endYear)) || ((mapData[i].sYear==startYear && mapData[i].sMonth<=startMonth || mapData[i].sYear>startYear) && ((mapData[i].eYear==endYear && mapData[i].eMonth <=endMonth) || mapData[i].eYear<endYear) && ((mapData[i].eYear==startYear && mapData[i].eMonth >=startMonth) || mapData[i].eYear>startYear)) || (((mapData[i].sYear==startYear && mapData[i].sMonth>=startMonth) || mapData[i].sYear>startYear) && ((mapData[i].sYear==endYear && mapData[i].sMonth <=endMonth) || mapData[i].sYear<endYear) && ((mapData[i].eYear==endYear && mapData[i].eMonth >=endMonth) || mapData[i].eYear>endYear)) || ((mapData[i].sYear<startYear ||(mapData[i].sYear==startYear && mapData[i].sMonth<=startMonth)) && ((mapData[i].eYear==startYear && mapData[i].eMonth >=startMonth) || mapData[i].eYear>startYear)))
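The ‘any overlap’ rule described above can also be written much more compactly by turning each year and month pair into a single number. This is only a sketch of the stated rule, not a drop-in replacement: it assumes a borrower’s start is never after their end, and it has not been checked against the long condition for every edge case. The `monthIndex` and `overlapsPeriod` names are made up; the `sYear`, `sMonth`, `eYear` and `eMonth` properties are the ones used in the condition above.

```javascript
// Convert a (year, month) pair into one comparable number.
function monthIndex(year, month) {
  return year * 12 + (month - 1);
}

// True when the borrower's subscription period overlaps the selected period.
function overlapsPeriod(borrower, startYear, startMonth, endYear, endMonth) {
  const subStart = monthIndex(borrower.sYear, borrower.sMonth);
  const subEnd = monthIndex(borrower.eYear, borrower.eMonth);
  // Two ranges overlap when each one starts no later than the other ends.
  return subStart <= monthIndex(endYear, endMonth) &&
         subEnd >= monthIndex(startYear, startMonth);
}
```

Run against the six example borrowers listed earlier with a selected period of 05 1828 to 06 1829, this returns true for the first four and false for the last two.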
I also added the subscription period to the popups. The only downside to the range slider is that the occupation marker colours change depending on how many occupations are present during a period, so you can’t always tell an occupation by its colour. I might see if I can fix the colours in place, but it might not be possible.
I also noticed that the jQuery UI sliders weren’t working very well on touchscreens so installed the jQuery TouchPunch library to fix that (https://github.com/furf/jquery-ui-touch-punch). I also made the library marker bigger and gave it a white border to more easily differentiate it from the borrower markers.
I then moved onto incorporating page images in the resource too. Where a borrower has borrower records the relevant pages where these borrowing records are found now appear as thumbnails in the borrower popup. These are generated by the IIIF server based on dimensions passed to it, which is much nicer than having to generate and store thumbnails directly. I also updated the popup to make it wider when required to give more space for the thumbnails. Here’s a screenshot of the new thumbnails in action:
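To give an idea of how a thumbnail is requested, here is a minimal sketch that builds an IIIF Image API URL for a given width. The function name, server address and image identifier are placeholders rather than the project’s real values; the `full/{width},/0/default.jpg` part follows the standard IIIF pattern of region, size, rotation and quality.

```javascript
// Build an IIIF Image API URL for a thumbnail of the given width.
// baseUrl and identifier are placeholders, not the project's real server or files.
function iiifThumbnailUrl(baseUrl, identifier, width) {
  // region = full image, size = "width," (height scales to keep the aspect ratio),
  // rotation = 0, quality = default, format = jpg
  return `${baseUrl}/${encodeURIComponent(identifier)}/full/${width},/0/default.jpg`;
}

const example = iiifThumbnailUrl('https://iiif.example.org/iiif', 'page-0001.jp2', 200);
```

Because the size is just part of the URL, changing the thumbnail dimensions later means changing one number rather than regenerating any files.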
Clicking on a thumbnail opens a further popup containing a zoomable / pannable image of the page. This proved to be rather tricky to implement. Initially I was going to open a popup in the page (outside of the map container) using a jQuery UI Dialog. However, I realised that this wouldn’t work when the map was being viewed in full-screen mode, as nothing beyond the map container is visible in such circumstances. I then considered opening the image in the borrower popup but this wasn’t really big enough. I then wondered about extending the ‘Map options’ section and replacing the contents of this with the image, but this then caused issues for the contents of the ‘Map options’ section, which didn’t reinitialise properly when the contents were reinstated. I then found a plugin for the Leaflet mapping library that provides a popup within the map interface (https://github.com/w8r/Leaflet.Modal) and decided to use this. However, it’s all a little complex as the popup then has to include another mapping library called OpenLayers that enables the zooming and panning of the page image, all within the framework of the overall interactive map. It is all working and I think it works pretty well, although I guess the map interface is a little cluttered, what with the ‘Map Options’ section, the map legend, the borrower popup and then the page image popup as well. Here’s a screenshot with the page image open:
All that’s left to do now is add in the introductory text once Alex has prepared it and then make the map live. We might need to rearrange the site’s menu to add in a link to the Chambers Map as it’s already a bit cluttered.
Also for the project I downloaded images for two further library registers for St Andrews that had previously been missed. However, there are already records for the registers and pages in the CMS so we’re going to have to figure out a way to work out which image corresponds to which page in the CMS. One register has a different number of pages in the CMS compared to the image files so we need to work out how to align the start and end and if there are any gaps or issues in the middle. The other register is more complicated because the images are double pages whereas it looks like the page records in the CMS are for individual pages. I’m not sure how best to handle this. I could either try and batch process the images to chop them up or batch process the page records to join them together. I’ll need to discuss this further with Gerry, who is dealing with the data for St Andrews.
Also this week I prepared for and gave a talk to a group of students from Michigan State University who were learning about digital humanities. I talked to them for about an hour about a number of projects, such as the Burns Supper map (https://burnsc21.glasgow.ac.uk/supper-map/), the digital edition I’d created for New Modernist Editing (https://nme-digital-ode.glasgow.ac.uk/), the Historical Thesaurus (https://ht.ac.uk/), Books and Borrowing (https://borrowing.stir.ac.uk/) and TheGlasgowStory (https://theglasgowstory.com/). It went pretty well and it was nice to be able to talk about some of the projects I’ve been involved with for a change.
I also made some further tweaks to the Gentle Shepherd Performances page which is now ready to launch, and helped Geert out with a few changes to the WordPress pages of the Anglo-Norman Dictionary. I also made a few tweaks to the WordPress pages of the DSL website and finally managed to get a hotel room booked for the DHC conference in Sheffield in September. I also made a couple of changes to the new Gaelic Tongues section of the Seeing Speech website and had a discussion with Eleanor about the filters for Speech Star. Fraser had been in touch regarding about 500 Historical Thesaurus categories that had been newly matched to OED categories so I created a little script to add these connections to the online database.
I also had a Zoom call with the Speak For Yersel team. They had been testing out the resource at secondary schools in the North East and have come away with lots of suggested changes to the content and structure of the resource. We discussed all of these and agreed that I would work on implementing the changes the week after next.
Next week I’m going to be on holiday, which I have to say I’m quite looking forward to.
This week I finished off all of the outstanding work for the Speak For Yersel project. The other members of the team (Jennifer and Mary) are both on holiday so I finished off all of the tasks I had on my ‘to do’ list, although there will certainly be more to do once they are both back at work again. The tasks I completed were a mixture of small tweaks and larger implementations. I made tweaks to the ‘About’ page text and changed the intro text to the ‘more give your word’ exercise. I then updated the age maps for this exercise, which proved to be pretty tricky and time-consuming to implement as I needed to pull apart a lot of the existing code. Previously these maps showed ‘60+’ and ‘under 19’ data for a question, with different colour markers for each age group showing those who would say a term (e.g. ‘Scunnered’) and grey markers for each age group showing those who didn’t say the term. We have completely changed the approach now. The maps now default to showing ‘under 19’ data only, with different colours for each different term. There is now an option in the map legend to switch to viewing the ‘60+’ data instead. I added in the text ‘press to view’ to try and make it clearer that you can change the map. Here’s a screenshot:
I also updated the ‘give your word’ follow-on questions so that they are now rated in a new final page that works the same way as the main quiz. In the main ‘give your word’ exercise I updated the quiz intro text and I ensured that the ‘darker dots’ explanatory text has now been removed for all maps. I tweaked a few questions to change their text or the number of answers that are selectable and I changed the ‘sounds about right’ follow-on ‘rule’ text and made all of the ‘rule’ words lower case. I also made it so that when the user presses ‘check answers’ for this exercise a score is displayed to the right and the user is able to proceed directly to the next section without having to correct their answers. They still can correct their answers if they want.
I then made some changes to the ‘She sounds really clever’ follow-on. The index for this is now split into two sections, one for ‘stereotype’ data and one for ‘rating speaker’ data and you can view the speaker and speaker/listener results for both types of data. I added in the option of having different explanatory text for each of the four perception pages (or maybe just two – one for stereotype data, one for speaker ratings) and when viewing the speaker rating data the speaker sound clips now appear beneath the map. When viewing the speaker rating data the titles above the sliders are slightly different. Currently when selecting the ‘speaker’ view the title is “This speaker from X sounds…” as opposed to “People from X sound…”. When selecting the ‘speaker/listener’ view the title is “People from Y think this speaker from X sounds…” as opposed to “People from Y think people from X sound…”. I also added a ‘back’ button to these perception follow-on pages so it’s easier to choose a different page. Finally, I added some missing HTML <title> tags to pages (e.g. ‘Register’ and ‘Privacy’) and fixed a bug whereby the ‘explore more’ map sound clips weren’t working.
With my ‘Speak For Yersel’ tasks out of the way I could spend some time looking at other projects that I’d put on hold for a while. A while back Eleanor Lawson contacted me about adding a new section to the Seeing Speech website where Gaelic speaker videos and data will be accessible, and I completed a first version this week. I replicated the Speech Star layout rather than the /r/ & /l/ page layout as it seemed more suitable: the latter only really works for a limited number of records while the former works well with lots more (there are about 150 Gaelic records). What this means is the data has a tabular layout and filter options. As with Speech Star you can apply multiple filters and you can order the table by a column by clicking on its header (clicking a second time reverses the order). I’ve also included the option to open multiple videos in the same window. I haven’t included the playback speed options as the videos already include the clip at different speeds. Here’s a screenshot of how the feature looks:
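The ‘click a header to sort, click again to reverse’ behaviour can be sketched along these lines. It is only an illustration: `makeSorter` and the row properties are invented names, all values are compared as text, and the real page also has to redraw the table afterwards.

```javascript
// Returns a sortBy function that remembers which column was clicked last.
function makeSorter() {
  let column = null;
  let ascending = true;
  return function sortBy(rows, clicked) {
    // A second click on the same column flips the direction;
    // a click on a different column starts again in ascending order.
    ascending = clicked === column ? !ascending : true;
    column = clicked;
    const sorted = [...rows].sort((a, b) =>
      String(a[clicked]).localeCompare(String(b[clicked]))
    );
    return ascending ? sorted : sorted.reverse();
  };
}
```

Each header’s click handler would call the same `sortBy` function with its own column name, so the direction state is shared across the whole table.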
On Thursday I had a Zoom call with Laura Rattray and Ailsa Boyd to discuss a new digital edition project they are in the process of planning. We had a really great meeting and their project has a lot of potential. I’ve offered to give technical advice and write any technical aspects of the proposal as and when required, and their plan is to submit the proposal in the autumn.
My final major task for the week was to continue to work on the Ramsay ‘Gentle Shepherd’ data. I overhauled the filter options that I implemented last week so they work in a less confusing way when multiple types are selected now. I’ve also imported the updated spreadsheet, taking the opportunity to trim whitespace to cut down on strange duplicates in the filter options. There are some typos you’ll need to fix in the spreadsheet, though (e.g. we have ‘Glagsgow’ and ‘Glagsow’) plus some dates still need to be fixed.
I then created an interactive map for the project and have incorporated the data for which there are latitude and longitude values. As with the Edinburgh Gazetteer map of reform societies (https://edinburghgazetteer.glasgow.ac.uk/map-of-reform-societies/) the number of performances at a venue is displayed in the map marker. Hover over a marker to see info about the venue. Click on it to open a list of performances. Note that when zoomed out it can be difficult to make out individual markers but we can’t really use clustering as on the Burns Supper map (https://burnsc21.glasgow.ac.uk/supper-map/) because this would get confusing: we’d have clustered numbers representing the number of markers in a cluster and then individual markers with a number representing the number of performances. I guess we could remove the number of performances from the marker and just have this in the tooltip and / or popup, but it is quite useful to see all the numbers on the map. Here’s a screenshot of how the map currently looks:
I still need to migrate all of this to the University’s T4 system, which I aim to tackle next week.
Also this week I had discussions about migrating an externally hosted project website to Glasgow for Thomas Clancy. I received a copy of the files and database for the website and have checked over things and all is looking good. I also submitted a request for a temporary domain and I should be able to get a version of the site up and running next week. I also regenerated a list of possible duplicate authors in the Books and Borrowing system after the team had carried out some work to remove duplicates. I will be able to use the spreadsheet I have now to amalgamate duplicate authors, a task which I will tackle next week.
I seem to be heading through a somewhat busy patch at the moment, and had to focus my efforts on five major projects and several other smaller bits of work this week. The major projects were SCOSYA, Books and Borrowing, DSL, HT and Bess of Hardwick’s Account books. For SCOSYA I continued to implement the public atlas, this week focussing on the highlighting of groups. I had hoped that this would be a relatively straightforward feature to implement, as I had already created facilities to create and view groups in the atlas I’d made for the content management system. However, it proved to be much trickier than I’d anticipated as I’d rewritten much of the atlas code in order to incorporate the GeoJSON areas as well as purely point-based data, plus I needed to integrate the selection of groups and the loading of group locations with the API. My existing code for finding the markers for a specified group and adding a coloured border was just not working, and I spent a frustratingly long amount of time debugging the code to find out what had changed to stop the selection from finding anything. It turned out that in my new code I was reinstantiating the variable I was using to hold all of the point data within a function, meaning that the scope of the variable containing the data was limited to that function rather than being available to other functions. Once I figured this out it was a simple fix to make the data available to the parts of the code that needed to find and highlight relevant markers and I then managed to make groups of markers highlight or ‘unhighlight’ at the press of a button, as the following screenshot demonstrates:
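The kind of scoping mistake described above looks roughly like this. The names are invented for the example and the actual atlas code is not shown here; the point is just that redeclaring a variable inside a function creates a separate local copy.

```javascript
let pointData = []; // intended to be shared by every function

function loadPointsBuggy(fetched) {
  let pointData = fetched; // redeclared: a new variable local to this function
  return pointData.length;
}

function loadPointsFixed(fetched) {
  pointData = fetched; // assigns to the shared variable instead
  return pointData.length;
}

// Finds the markers whose location belongs to the selected group.
function markersInGroup(groupLocations) {
  return pointData.filter(p => groupLocations.includes(p.location));
}
```

After `loadPointsBuggy` runs, `markersInGroup` still sees an empty array, which matches the symptom of the selection not finding anything.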
You can now select one or more groups and the markers in the group are highlighted in green. Press a group button a second time to remove the highlighting. However, there is still a lot to be done. For one thing, only the markers highlight, not the areas. It’s proving to be rather complicated to get the areas highlighted as these GeoJSON shapes are handled quite differently to markers. I spent a long time trying to get the areas to highlight without success and will need to return to this another week. I also need to implement highlighting in different colours, so each group you choose to highlight is given a different colour to the last. Also, I need to find a way to make the selected groups be remembered as you change from points to areas to both, and change speaker type, and also possibly as you change between examples. Currently the group selection resets but the selected group buttons remain highlighted, which is not ideal.
I also spent time this week on the pilot project for Matthew Sangster’s Books and Borrowing project, which is looking at University student (and possibly staff) borrowing records from the 18th century. Matthew has compiled a spreadsheet that he wants me to create a searchable / browsable online resource for and my first task was to extract the data from the spreadsheet, create an online database and write a script to migrate the data to this database. I’ve done this sort of task many times before, but unfortunately things are rather more complicated this time because Matthew has included formatting within the spreadsheet that needs to be retained in the online version. This includes superscript text throughout the more than 8000 records and simply saving the spreadsheet as a CSV file and writing a script to go through each cell and upload the data won’t work as the superscript style will be lost in the conversion to CSV. PHPMyAdmin also includes a facility to import a spreadsheet in the OpenDocument format, but unfortunately this not only removes the superscript format but also the text that is specified as superscript as well.
Therefore I had to investigate other ways of getting the data out of the spreadsheet while somehow retaining the superscript formatting. The only means of doing so that I could think of was to save the spreadsheet as an HTML document, which would convert Excel’s superscript formatting into HTML superscript tags, which is what we’d need for displaying the data on a website anyway. Unfortunately the HTML generated by Excel is absolutely awful and filled with lots of unnecessary junk that I then needed to strip out manually. I managed to write a script that extracted the data (including the formatting for superscript) and import this into the online database for about 8000 of the 8200 rows, but the remainder had problems that prevented the insertion from taking place. I’ll need to think about creating multiple passes for the data when I return to it next week.
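To give a flavour of the clean-up involved, here is a minimal Python sketch of stripping a cell exported as HTML down to its text plus superscript tags. It is not the script described above (which isn't shown here); the `CellCleaner` class, the `clean_cell` function and the sample markup are hypothetical, and real exported HTML will be messier than this.

```python
from html import escape
from html.parser import HTMLParser


class CellCleaner(HTMLParser):
    """Keep cell text and <sup> tags; drop every other tag and attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.parts.append("<sup>")

    def handle_endtag(self, tag):
        if tag == "sup":
            self.parts.append("</sup>")

    def handle_data(self, data):
        # Re-escape text so the output is still safe to use as HTML.
        self.parts.append(escape(data, quote=False))


def clean_cell(html_fragment):
    cleaner = CellCleaner()
    cleaner.feed(html_fragment)
    cleaner.close()
    # Collapse the runs of whitespace that exported HTML tends to contain.
    return " ".join("".join(cleaner.parts).split())


# Hypothetical example of a cell as it might appear in the exported file.
messy = '<td class=xl65 style="height:15pt">M<font class="font5"><sup>r</sup></font> Smith</td>'
cleaned = clean_cell(messy)
```

The cleaned value keeps only the superscript markup, which can then be stored in the database and output directly on a web page.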
For the DSL this week I spent rather a lot of time engaged in email conversations with Rhona Alcorn about the tasks required to sort out the data that the team have been working on for several years and which now needs to be extracted from older systems and migrated to a new system, plus the API that I am working on. It looked like there would be a lot of work for me to do with this, but thankfully midway through the week it became apparent that the company who are supplying the new system for managing the DSL’s data have a member of staff who is expecting to do a lot of the tasks that had previously been assigned to me. This is really good news as I was beginning to worry about the amount of work I would have to do for the DSL and how I would fit this in around other work commitments. We’ll just need to see how this all pans out.
I also spent some time implementing a Boolean search for the new DSL API. I now have this in place and working for headword searches, which can be performed via the ‘quick search’ box on the test sites I’ve created. It’s possible to use Boolean AND, OR and NOT (all must be entered upper case to be picked up) and a search can be used in combination with wildcards, and speech-marks can now be used to specify an exact search. So, for example, if you want to find all the headwords beginning with ‘chang’ but wish to exclude results for ‘change’ and ‘chang’ you can enter ‘chang* NOT “change” NOT “chang”’.
OR searches are likely to bring back lots of results and at the moment I’ve not put a limit on the results, but I will do so before things go live. Also, while there are no limits on the number of Booleans that can be added to a query, results when using multiple Booleans are likely to get a little weird due to there being multiple ways a query could be interpreted. E.g. ‘Ran* OR run* NOT rancet’ still brings back ‘rancet’ because the query is interpreted as ‘get all the ‘ran*’ results, OR all the ‘run*’ results so long as they don’t include ‘rancet’’ – so ran* OR (run* NOT rancet). But without complicating things horribly with brackets or something similar there’s no way of preventing such ambiguity when multiple different Booleans are used.
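The interpretation described above can be reproduced with a small Python sketch. This is only one plausible way of evaluating such a query and is not the API's actual code: it assumes OR has the lowest precedence, that terms between ORs are ANDed, and that NOT applies to the term that follows it. The function names and the list of headwords are hypothetical.

```python
import re


def term_matches(term, headword):
    # Quoted terms are exact; '*' acts as a wildcard. Matching ignores case.
    parts = term.strip('"').lower().split("*")
    pattern = ".*".join(re.escape(part) for part in parts)
    return re.fullmatch(pattern, headword.lower()) is not None


def group_matches(tokens, headword):
    # Within a group, terms are ANDed; a term preceded by NOT must not match.
    if not tokens:
        return False
    negate = False
    for token in tokens:
        if token == "NOT":
            negate = True
        elif token == "AND":
            negate = False
        else:
            if term_matches(token, headword) == negate:
                return False
            negate = False
    return True


def search(query, headwords):
    # OR has the lowest precedence, so split the query into OR-separated groups.
    groups, current = [], []
    for token in query.split():
        if token == "OR":
            groups.append(current)
            current = []
        else:
            current.append(token)
    groups.append(current)
    return [h for h in headwords if any(group_matches(g, h) for g in groups)]


# Made-up list of headwords for illustration.
headwords = ["ran", "rancet", "run", "runner", "chang", "change", "changeling"]

or_not_result = search("Ran* OR run* NOT rancet", headwords)
exclude_result = search('chang* NOT "change" NOT "chang"', headwords)
```

With these assumptions ‘rancet’ survives the first query because it satisfies the ‘ran*’ group on its own, while the second query leaves only ‘changeling’.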
For the Historical Thesaurus I met with Marc and Fraser on Monday to discuss our progress with the HT / OED linking and afterwards continued with a number of tasks that were either ongoing or had been suggested at the meeting. This included ticking off some matches from a monosemous script, creating a new script that brings back up to 1000 random unmatched lexemes at a time for spot-checking and creating an updated Levenshtein script for lexemes, which is potentially going to match a further 5000 lexemes. I also wrote a document detailing how I think that full dates should be handled in the HT, to replace the rather messy way dates are currently recorded. We will need to decide on a method in order to get the updated dates from the OED into a comparable format.
Also this week I returned to Alison Wiggins’s Account Books project, or rather a related output about the letters of Mary, Queen of Scots. Alison had sent me a database containing a catalogue of letters and I need to create a content management system to allow her and other team members to work on this together. I’ve requested a new subdomain for this system and have begun to look at the data and will get properly stuck into this next week, all being well.
Other than these main projects I also gave feedback on Thomas Clancy’s Iona project proposal, including making some changes to the Data Management Plan, helped sort out access to logo files for the Seeing Speech project, sorted out an issue with the Editing Burns blog that was displaying no content since the server upgrade (it turns out it was using a very old plugin that was not compatible with the newer version of PHP on the server) and helped sort out some app issues. All in all a very busy week.
I continued to develop the public interface for the SCOSYA project this week, and also helped out with the preparations for next week’s Data Hack event that the project is organising, which involved sorting out hosting for a lot of sample data. On Monday I had a meeting with Jennifer and E, at which we went through the interface I had so far created and discussed things that needed updated or changed in some way. It was a useful meeting and I came away with a long list of things to do, which I then spent quite some time during the remainder of the week implementing. This included changing the font used throughout the site and drastically changing the base layer we use for the maps. I had previously created a very simple ‘green land, blue sea’ base map, which is what the team had requested, but they wanted to try something a bit simpler still – white sea and light grey land – in order to emphasise the data points more than anything else. I also removed all place-names from the map and in fact everything other than borders and water. I also updated the colour range used for ratings, from a yellow to red scheme to a more grey / purple scheme that had been suggested by E. This is now used both for the markers and for the areas. Regarding areas, I removed the white border from the areas to make areas with the same rating blend into one another and make the whole thing look more like a heatmap, as the following screenshot demonstrates:
I also completely changed the way the pop-ups look, as it was felt that the previous version was just a bit too garish and comic book like. The screenshot below shows markers with a pop-up open:
I also figured out how to add sound clips to story slides and I’ve changed how the selection of ‘examples’ works. Rather than having a drop-down list and then all of the information about a selected feature displayed underneath I have split things up. Now when you open the ‘Examples’ section you will see the examples listed as a series of buttons. Pressing on one of these then loads the feature, automatically loading the data for it into the map. There’s a button for returning to the list of examples, then the feature’s title and description, followed by sound clips if there are any. Underneath this are the buttons for changing ‘speakers’ and ‘locations’. Pressing on one of these options now automatically refreshes the map so there’s no longer any need for a ‘Show’ button. I think this works much better. Note that your choice of speaker and location is remembered when using the map – e.g. if you have selected ‘Young’ and ‘Areas’ then go back and select a different example then the map will default to ‘Young’ and ‘Areas’ when this new feature is displayed.
I’ve also added a check for screen size that fires every time a side panel section is opened. This ensures that if someone has resized their browser or changed the orientation of their screen the side panel should still fit. I still haven’t had time to get the ‘groups’ feature working yet, or to fix the display of stories on smaller screens. I also need to update the ‘Learn more’ section so it uses a list rather than a drop-down box, all tasks I hope to continue with next week.
I also spent a bit of time on the Seeing Speech and Dynamic Dialects projects, helping to add in a new survey for each, participated in the monthly College of Arts developers coffee catch-up and advised a couple of members of staff on blog related issues and spoke to Kirsteen McCue about the proposal she’s putting together.
Other than these tasks I spent about a day working on DSL issues. This included getting some data to Ann about which existing DSL entries were not present in the dataset that had been newly extracted from the server. This appears to have been caused by some entries being merged with existing entries. I also managed to get the new dataset uploaded to our temporary web-server and created a new API that outputs this new data. I still need to create an alternative version of the DSL front-end that connects to this new version of the data, which I hope to be able to at least get started on next week. I also did some investigation into scripts that Thomas Widmann had discussed in some hand-over documentation that did not seem to be available anywhere and discussed some issues relating to the server the DSL people host in their offices.
I also spent some time working on HT duties, making some tweaks to existing scripts based on feedback from Fraser, investigating why one of our categories is not accessible via the website (the answer being it was a subcategory that didn’t have a main category in the same part of speech so had no category to ‘hang’ off). I also had a further meeting with Marc and Fraser on Friday to discuss our progress with the HT OED linking.
This week I was mainly working on three projects: The Historical Thesaurus, the Bilingual Thesaurus and the Romantic National Song Network. For the HT I continued with the ongoing and seemingly never-ending task of joining up the HT and OED datasets. Marc, Fraser and I had a meeting last Friday and I began to work through the action points from this meeting on Monday. By Wednesday I had ticked off most of the items, which I’ll summarise here.
Whilst developing the Bilingual Thesaurus I’d noticed that search term highlighting on the HT site wasn’t working for quick searches, only advanced searches for words, so I investigated and fixed this. I then updated the lexeme pattern matching / date matching script to incorporate the stoplist we’d created during last week’s meeting (words or characters that should be removed when comparing lexemes, such as ‘to ‘ and ‘the’). This worked well and has bumped matches up to better colour levels, but has resulted in some words getting matched multiple times. E.g. when removing ‘to’, ‘of’ etc this results in a form that then appears multiple times. For example, in one category the OED has ‘bless’ twice (presumably an error?) and HT has ‘bless’ and ‘bless to’. With ‘to’ removed there then appear to be more matches than there should be. However, this is not an issue when dates are also taken into consideration. I also updated the script so that categories where there are 3 matches and at least 66% of words match have been promoted from orange to yellow.
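The ‘bless’ example can be sketched in a few lines of Python. This is an illustration only and not the matching script itself: the stoplist here contains just the three words mentioned above, and the naive counting approach is an assumption made to show how removing stopwords can inflate the number of matches.

```python
# Assumed subset of the stoplist; the real list is not reproduced here.
STOPLIST = {"to", "the", "of"}


def normalise(lexeme):
    # Lower-case the lexeme and drop any stoplist words before comparing.
    words = [w for w in lexeme.lower().split() if w not in STOPLIST]
    return " ".join(words)


def count_matches(oed_lexemes, ht_lexemes):
    # Naive count: every OED form is compared against every HT form,
    # so duplicated forms are counted more than once.
    ht_forms = [normalise(lexeme) for lexeme in ht_lexemes]
    return sum(ht_forms.count(normalise(lexeme)) for lexeme in oed_lexemes)


oed = ["bless", "bless"]
ht = ["bless", "bless to"]
match_count = count_matches(oed, ht)  # more matches than there are OED words
```

Because both HT forms reduce to ‘bless’, each of the two OED forms matches twice, which is the over-counting described above. Comparing dates as well as forms would separate the duplicates.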
When looking at the outputs at the meeting Marc wondered why certain matches (e.g. 120202 ‘relating to doctrine or study’ / ‘pertaining to doctrine/study’ and 88114 ‘other spec.’ / ‘other specific’) hadn’t been ticked off and wondered whether category heading pattern matching had worked properly. After some investigation I’d say it has worked properly – the reason these haven’t been ticked off is they contain too few words to have reached the criteria for ticking off.
Another script we looked at during our meeting was the sibling matching script, which looks for matches at the same hierarchical level and part of speech, but different numbers. I completely overhauled the script to bring it into line with the other scripts (including recent updates such as the stoplist for lexeme matching and the new yellow criteria). There are currently 19, 17 and 25 green, lime green and yellow matches that could be ticked off. I also ticked off the empty category matches listed on the ‘thing heard’ script (so long as they have a match) and for the ‘Noun Matching’ I ticked off the few matches that there were. Most were empty categories and there were less than 15 in total.
Another script I worked on was the ‘monosemous’ script, which looks for monosemous forms in unmatched categories and tries to identify HT categories that also contain these forms. We weren’t sure at the meeting whether this script identified words that were fully monosemous in the entire dataset, or those that were monosemous in the unmatched categories. It turned out it was the former, so I updated the script to only look through the unchecked data, which has identified further monosemous forms. This has helped to more accurately identify matched categories. I also created a QA script that checks the full categories that have potentially been matched by the monosemous script.
I also worked on the date fingerprinting script. This gets all of the start dates associated with lexemes in a category, plus a count of the number of times each date appears, and uses these to try and find matches in the HT data. I updated this script to incorporate the stoplist and the ‘3 matches and 66% match’ yellow rule, and ticked off lots of matches that this script identified. I ticked off all green (1556), lime green (22) and yellow (123) matches.
Out of curiosity, I wrote a script that looked at our previous attempt at matching the categories, which Fraser and I worked on last year and earlier this year. The script looks at categories that were matched during this ‘v1’ process that had yet to be matched during our current ‘v2’ process. For each of these the script performs the usual checks based on content: comparing words and first dates and colour coding based on number of matches (this includes the stoplist and new yellow criteria mentioned earlier). There are 7148 OED categories that are currently unmatched but were matched in V1. Almost 4000 of these are empty categories. There are 1283 ‘purple’ matches, which means (generally) something is wrong with the match. But there are 421 in the green, lime green and yellow sections, which is about 12% of the remaining unmatched OED categories that have words. It might also be possible to spot some patterns to explain why they were matched during v1 but have yet to be matched in v2. For example, 2711 ‘moving water’ has 01.02.06.01.02 and its HT counterpart has 01.02.06.01.01.02. There are possibly patterns in the 1504 orange matches that could be exploited too.
Finally, I updated the stats page to include information about main and subcats. Here are the current unmatched figures:
Unmatched (with POS): 8629
Unmatched (with POS and not empty): 3414
Unmatched Main Categories (with POS): 5036
Unmatched Main Categories (with POS and not empty): 1661
Unmatched Subcategories (with POS): 3573
Unmatched Subcategories (with POS and not empty): 1753
So we are getting there!
For the Bilingual Thesaurus I completed an initial version of the website this week. I have replaced the original colour scheme with a ‘red, white and blue’ colour scheme as suggested by Louise. This might be changed again, but for now here is an example of how the resource looks:
The ‘quick’ and ‘advanced’ searches are also now complete, using the ‘search words’ mentioned in a previous post, and ignoring accents on characters. As with the HT, by default the quick search matches category headings and headwords exactly, so ‘ale’ will return results as there is a category ‘ale’ and also a word ‘ale’ but ‘bread’ won’t match anything because there are no words or categories with this exact text. You need to use an asterisk wildcard to find text within word or category text: ‘bread*’ would find all items starting with ‘bread’, ‘*bread’ would find all items ending in ‘bread’ and ‘*bread*’ would find all items with ‘bread’ occurring anywhere.
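The matching rules above can be illustrated with a minimal Python sketch. It is not the site's search code: the `strip_accents` and `quick_search` functions and the list of items are hypothetical, and the accent handling shown (Unicode decomposition) is just one way of ignoring accents.

```python
import re
import unicodedata


def strip_accents(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def quick_search(term, items):
    # Without an asterisk the whole item must equal the term;
    # each '*' stands for any run of characters (including none).
    parts = strip_accents(term).lower().split("*")
    regex = re.compile(".*".join(re.escape(part) for part in parts))
    return [item for item in items if regex.fullmatch(strip_accents(item).lower())]


# Made-up category headings / headwords for illustration.
items = ["ale", "breadmaker", "white bread", "pâté"]

exact_ale = quick_search("ale", items)
exact_bread = quick_search("bread", items)
starts_with = quick_search("bread*", items)
ends_with = quick_search("*bread", items)
anywhere = quick_search("*bread*", items)
no_accents = quick_search("pate", items)
```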
The ‘advanced search’ lets you search for any combination of headword, category, part of speech, section, dates and languages of origin and citation. Note that if you specify a range of years in the date search it brings back any word that was ‘active’ in your chosen period. E.g. a search for ‘1330-1360’ will bring back ‘Edifier’ with a date of 1100-1350 because it was still in use in this period.
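The ‘active in the period’ test amounts to checking whether two date ranges overlap. A minimal Python sketch (the function name is hypothetical, and this is not the site's actual query):

```python
def active_in_period(word_start, word_end, search_from, search_to):
    # Two ranges overlap when each one starts before the other ends.
    return word_start <= search_to and word_end >= search_from


# 'Edifier' (1100-1350) against a search for 1330-1360.
edifier_found = active_in_period(1100, 1350, 1330, 1360)

# A word that had fallen out of use before the period would be excluded.
earlier_word_found = active_in_period(1100, 1300, 1330, 1360)
```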
As with the HT, different search boxes are joined with ‘AND’ – e.g. if you tick ‘verb’ and select ‘Anglo Norman’ as the section then only words that are verbs AND Anglo Norman will be returned. Where search types allow multiple options to be selected (i.e. part of speech and languages of origin and citation) if multiple options in each list are selected these are joined by ‘OR’. E.g. if you select ‘noun’ and ‘verb’ and select ‘Dutch’, ‘Flemish’ and ‘Italian’ as languages of origin this will find all words that are either nouns OR verbs AND have a language of origin of Dutch OR Flemish OR Italian.
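The way the options combine can be sketched in Python as follows. The word records, field names and the `word_matches` function are hypothetical and only illustrate the logic: options within one list are ORed, and the different lists are ANDed.

```python
def word_matches(word, parts_of_speech=(), origins=()):
    # Within a list any selected option may match (OR); an empty list means
    # that search box was left blank. The lists themselves are ANDed.
    if parts_of_speech and word["pos"] not in parts_of_speech:
        return False
    if origins and not set(word["origins"]) & set(origins):
        return False
    return True


# Made-up records for illustration.
words = [
    {"word": "w1", "pos": "noun", "origins": ["Dutch"]},
    {"word": "w2", "pos": "verb", "origins": ["Italian", "Latin"]},
    {"word": "w3", "pos": "adjective", "origins": ["Flemish"]},
    {"word": "w4", "pos": "noun", "origins": ["Latin"]},
]

found = [
    w["word"]
    for w in words
    if word_matches(w, ("noun", "verb"), ("Dutch", "Flemish", "Italian"))
]
```

Only the first two records are returned: the third has a matching origin but the wrong part of speech, and the fourth has a matching part of speech but the wrong origin.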
For the Romantic National Song Network I continued to create timelines and ‘storymaps’ based on powerpoint presentations that had been sent to me. This is proving to be a very time-intensive process, as it involves extracting images, audio files and text from the presentations, formatting the text as HTML, reworking the images (resizing, sometimes joining multiple images together to form one image, changing colour levels, saving the images, uploading them to the WordPress site), uploading the audio files, adding in the HTML5 audio tags to get the audio files to play, creating the individual pages for each timeline entry / storymap entry. It took the best part of an afternoon to create one timeline for the project, which involved over 30 images, about 10 audio files and more than 20 Powerpoint slides. Still, the end result works really well, so I think it’s worth putting the effort in.
In addition to these projects I met with a PhD student, Ewa Wanat, who wanted help in creating an app. I spent about a day attempting to make a proof of concept for the app, but unfortunately the tools I work with are just not very well suited to the app she wants to create. The app would be interactive and highly dependent on logging user interactions as accurately as possible. I looked into using the d3.js library to create the sort of interface she wanted (a circle that rotates with smaller circles attached to it, that the user should tap on when a certain point in the rotation is reached), but although this worked, the ‘tap’ detection was not accurate enough. In fact on touchscreens more often than not a ‘tap’ wasn’t even being registered. D3.js just isn’t made to deal with time-sensitive user interaction on animated elements and I have no experience with any libraries that are made in this way, so unfortunately it looks like I won’t be able to help out with this project. Also, Ewa wanted the app to be launched in January and I’m just far too busy with other projects to be able to do the required work in this sort of timescale.
Also this week I helped extract some data about the Seeing Speech and Dynamic Dialects videos for Eleanor Lawson, I responded to queries from Meg MacDonald and Jennifer Nimmo about technical work on proposals they are involved with, I responded to a request for advice from David Wilson about online surveys, and another request from Rachel Macdonald about the use of Docker on the SPADE server. I think that’s just about everything to report.
I spent most of my time this week split between three projects: The HT / OED category linking, the REELS project and the Bilingual Thesaurus. For the HT I continued to work on scripts to try and match up the HT and OED categories. This week I updated all the currently in use scripts so that date checks now extract the first four numeric characters (OE is converted to 1000 before this happens) from the ‘GHT_date1’ field in the OED data and the ‘fulldate’ field in the HT data. Doing this has significantly improved the matching on the first date lexeme matching script. Greens have gone from 415 to 1527, lime greens from 2424 to 2253, yellows from 988 to 622 and oranges from 2363 to 1788. I also updated the word lists to make them alphabetical, so it’s easier to compare the two lists and included two new columns. The first is for matched dates (ignoring lexeme matching), which is a count of the number of dates in the HT and OED categories that match while the second is this figure as a percentage of the total number of OED lexemes.
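The date normalisation step might look something like the following Python sketch. It is an assumption-laden illustration rather than the actual script: the `first_year` function is hypothetical, the sample values are invented, and ‘first four numeric characters’ is read here as the first four digits left after everything else has been stripped out.

```python
import re


def first_year(raw_date):
    # Convert OE to 1000 first, then keep the first four digits that remain.
    text = raw_date.replace("OE", "1000")
    digits = re.sub(r"\D", "", text)
    return int(digits[:4]) if len(digits) >= 4 else None


# Invented examples of the sort of values a date field might hold.
old_english = first_year("OE")
circa = first_year("c1225")
open_ended = first_year("1963-")
missing = first_year("")
```

Reducing both the ‘GHT_date1’ and ‘fulldate’ values to a plain year like this is what makes the two datasets directly comparable.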
However, taking dates in isolation currently isn’t working very well, as if a date appears multiple times it generates multiple matches. So, for example, the first listed match for OED CID 94551 has 63 OED words, and all 63 match for both lexeme and date. But lots of these have the same dates, meaning a total count of matched dates is 99, or 152% of the number of OED words. Instead I think we need to do something more complicated with dates, making a note of each one AND the number of times each one appears in a category as its ‘date fingerprint’.
I created a new script to look at ‘date fingerprints’. The script generates arrays of categories for HT and OED unmatched categories. The dates of each word (or each word with a GHT date in the case of OED) in every category are extracted and a count of these is created (e.g. if the OED category 5678 has 3 words with 1000 as a date and 1 word with 1234 as a date then its ‘fingerprint’ is 5678[1000=>3,1234=>1]). I ran this against the HT database to see what matches.
The script takes about half an hour to process. It grabs each unmatched OED category that contains words, picks out those that have GHT dates, gets the first four numerical figures of each and counts how many times this appears in the category. It does the same for all unmatched HT categories and their ‘fulldate’ column too. The script then goes through each OED category and for each goes through every HT category to find any that have not just the same dates, but the same number of times each date appears too. If everything matches the information about the matched categories is displayed.
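The fingerprint idea can be sketched in Python as below. This is not the script described above: that script compares every OED category against every HT category, whereas this sketch swaps in a dictionary keyed on the fingerprint, which avoids the nested loop. The function names and all category IDs other than 5678 are hypothetical, and the years are assumed to have been normalised already.

```python
from collections import Counter, defaultdict


def fingerprint(years):
    # A category's fingerprint is each start year plus how often it occurs,
    # e.g. [1000, 1000, 1000, 1234] -> {(1000, 3), (1234, 1)}.
    return frozenset(Counter(years).items())


def match_fingerprints(oed_categories, ht_categories):
    # Index the HT categories by fingerprint, then look each OED category up.
    ht_index = defaultdict(list)
    for ht_id, years in ht_categories.items():
        ht_index[fingerprint(years)].append(ht_id)

    matches = {}
    for oed_id, years in oed_categories.items():
        key = fingerprint(years)
        if key in ht_index:
            matches[oed_id] = ht_index[key]
    return matches


oed_categories = {
    5678: [1000, 1000, 1000, 1234],
    9999: [1963],  # a single-word category
}
ht_categories = {
    101: [1234, 1000, 1000, 1000],  # same dates and counts as 5678
    102: [1000, 1234],              # same dates but different counts
    103: [1963],
    104: [1963],
}

matches = match_fingerprints(oed_categories, ht_categories)
```

The single-word category matches two HT categories, which shows why fingerprints are much less reliable for categories containing only one or two words.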
The output has the same layout as the other scripts but where a ‘fingerprint’ is not unique a category (OED or HT) may appear multiple times, linked to different categories. This is especially common for categories that only have one or two words, as the combination of dates is less likely to be unique. For an example of this search for our old favourite ‘extra-terrestrial’ and you’ll see that as this is the only word in its category, any HT categories that also have one word and the same start date (1963) are brought back as potential matches. Nothing other than the dates are used for matching purposes – so a category might have a different POS, or be in a vastly different part of the hierarchy. But I think this script is going to be very useful.
I also created a script that ignores POS when looking for monosemous forms, but this hasn’t really been a success. It finds 4421 matches as opposed to 4455, I guess because some matches that were 1:1 are being complicated by polysemous HT forms in different parts of speech.
With these updates in place, Marc and Fraser gave the go-ahead for connections to be ticked off. Greens, lime greens and yellows from the ‘lexeme first date matching’ script have now been ticked off. There were 1527, 2253 and 622 in these respective sections, so a total of 4402 ticked off. That takes us down to 6192 unmatched OED categories that have a POS and are not empty, or 11380 unmatched that have a POS if you include empty ones. I then ‘unticked’ the 350 purple rows from the script I’d created to QA the ‘erroneous zero’ rows that had been accidentally ticked off last week. This means we now have 6450 unmatched OED categories with words, or 11730 including those without words. I then ticked off all of the ‘thing heard’ matches other than some rows that Marc had spotted as being wrong. 1342 have been ticked off, bringing our unchecked but not empty total down to 5108 and our unchecked including empty total down to 10388. On Friday, Marc, Fraser and I had a further meeting to discuss our next steps, which I’ll continue with next week.
For the REELS project I continued going through my list of things to do before the project launch. This included reworking the Advanced Search layout, adding in tooltip text, updating the start date browse, which was including ‘inactive’ data in its count, creating some further icons for combinations of classification codes, adding in Creative Commons logos and information, adding an ‘add special character’ box to the search page, adding a ‘show more detail’ option to the record page that displays the full information about place-name elements, adding an option to the API and Advanced Search that allows you to specify if your element search looks at current forms, historical forms or both, adding in Google Analytics, updating the site text and page structure to make the place-name search and browse facilities publicly available, creating a bunch of screenshots for the launch, setting up the server on my laptop for the launch and making everything live. You can now access the place-names here: https://berwickshire-placenames.glasgow.ac.uk/ (e.g. by doing a quick search or choosing to browse place-names).
I also investigated a strange situation Carole had encountered with the Advanced Search, whereby a search for ‘pn’ and ‘<1500’ brings back ‘Hassington West Mains’, even though it only has a ‘pn’ associated with a Historical form from 1797. The search is really ‘give me all the place-names that have an associated ‘pn’ element and also have an earliest historical form before 1500’. The usage of elements in particular historical forms and their associated dates is not taken into consideration – we’re only looking at the earliest recorded date for each place-name. Any search involving historical form data is treated in the same way – e.g. if you search for ‘<1500’ and ‘Roy’ as a source you also get Hassington West Mains as a result, because its earliest recorded historical form is before 1500 and it includes a historical form that has ‘Roy’ as a source. Similarly if you search for ‘<1500’ and ‘N. mains’ as a historical form you’ll also get Hassington West Mains, even though the only historical form before 1500 is ‘(lands of) Westmaynis’. This is because again the search is ‘get me all of the place-names with a historical form before 1500 that have any historical form including the text ‘N. mains’’. We might need to make it clearer that ‘Earliest start date’ refers to the earliest historical form for a place-name record as a whole, not the earliest historical form in combination with ‘historical form’, ‘source’, ‘element language’ or ‘element’.
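The difference between the two readings of the search can be shown with a short Python sketch. It is illustrative only and not the actual search code: the record structure and function names are hypothetical, and the year given for the pre-1500 form is a placeholder (the source only says it is before 1500).

```python
def place_level_search(place, element, before):
    # Current behaviour: the element may be attached to ANY historical form,
    # and the date test looks only at the place-name's earliest form.
    has_element = any(element in form["elements"] for form in place["forms"])
    earliest = min(form["year"] for form in place["forms"])
    return has_element and earliest < before


def form_level_search(place, element, before):
    # Stricter alternative: one single form must satisfy both conditions.
    return any(
        element in form["elements"] and form["year"] < before
        for form in place["forms"]
    )


hassington_west_mains = {
    "forms": [
        # Placeholder year; other elements of this form are omitted.
        {"year": 1450, "elements": set()},
        # The only form with a 'pn' element dates from 1797.
        {"year": 1797, "elements": {"pn"}},
    ]
}

returned_now = place_level_search(hassington_west_mains, "pn", 1500)
returned_if_strict = form_level_search(hassington_west_mains, "pn", 1500)
```

The place-level reading returns the record and the form-level reading does not, which is the behaviour Carole noticed.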
On Saturday I attended the ‘Hence the Name’ conference run by the Scottish Place-name Society and the Scottish Records Association, where we launched the website. Thankfully everything went well and we didn’t need to use the screenshots or the local version of the site on my laptop, and the feedback we received about the resource was hugely positive.
For the Bilingual Thesaurus I continued to implement the search facilities for the resource. This involved stripping out a lot of code from the HT’s search scripts that would not be applicable to the BTH’s data, and getting the ‘quick search’ feature to work. After getting this search to actually bring back data I then had to format the results page to incorporate the fields that were appropriate for the project’s data, such as the full hierarchy, whether the word results are Anglo Norman or Middle English, dates, parts of speech and such things. I also had to update the category browse page to get search result highlighting to work and to get the links back to search results working. I then made a start on the advanced search form.
Other than these projects I also spoke to fellow developer David Wilson to give him some advice on Data Management Plans, I emailed Gillian Shaw with some feedback on the University’s Technician Commitment, I helped out Jane with some issues relating to web stats, I gave some advice to Rachel Macdonald on server specifications for the SPADE project, I replied to two PhD students who had asked me for advice on some technical matters, and I gave some feedback to Joanna Kopaczyk about hardware specifications for a project she’s putting together.
After a rather hectic couple of weeks this was a return to a more regular sort of week, which was a relief. I still had more work to do than there was time to complete, but it feels like the backlog is getting smaller at least. As with previous weeks, I continued with the HT / OED linking of categories processes this week, following on from the meeting Marc, Fraser and I had the Friday before. For the lexeme / date matching script I separated out categories with zero matches that have words from the orange list into a new list with a purple background. So orange now only contains categories where at least one word and its start date match. The ones now listed in purple are almost certainly incorrect matches. I also changed the ordering of results so that categories are listed by the largest number of matches, to make it easier to spot matches that are likely ok.
I also updated the ‘monosemous’ script, so that the output only contains OED categories that feature a monosemous word and is split into three tables (with links to each at the top of the page). The first table features 4455 OED categories that include a monosemous word that has a comparable form in the HT data. Where there are multiple monosemous forms they each correspond to the same category in the HT data. The second table features 158 OED categories where the linked HT forms appear in more than one category. This might be because the word is not monosemous in the HT data and appears in two different categories (these are marked with the text ‘red|’ so they can be searched for in the page). An OED category can also appear in this table even if there are no red forms if (for example) one of the matched HT words is in a different category to all of the others (see OED catid 45524, where the word ‘Puncican’ is found in a different HT category to the other words). The final table contains those OED categories that feature monosemous words that have no match in the HT data. There are 1232 of these. I also created a QA script for the 4455 matched monosemous categories, which applies the same colour coding and lexeme matching as other QA scripts I’ve created. On Friday we had another meeting to discuss the findings and plan our next steps, which I will continue with next week.
Also this week I wrote an initial version of a Data Management Plan for Thomas Clancy’s Iona project, and commented on the DMP assessment guidelines that someone from the University’s Data Management people had put together. I can’t really say much more about these activities, but it took at least a day to get all of this done. I also did some app management duties, setting up an account for a new developer, and made the new Seeing Speech and Dynamic Dialects websites live. These can now be viewed here: https://www.seeingspeech.ac.uk/ and here: https://www.dynamicdialects.ac.uk/. I also had an email conversation with Rhona Alcorn about Google Analytics for the DSL site.
With the REELS project’s official launch approaching, I spent a bit of time this week going through the 23-point ‘to do’ list I’d created last week. In fact, I added another three items to it. I’m going to tackle the majority of the outstanding issues next week, but this week I investigated and fixed an issue with the ‘export’ script in the Content Management System. The script is very memory intensive and it was exceeding the server’s memory limits, so asking Chris to increase this limit sorted the issue. I also updated the ‘browse place-names’ feature of the CMS, adding a new column and ordering facility to make it clearer which place-names actually appear on the website. I also updated the front-end so that it ‘remembers’ whether you prefer the map or the text view of the data using HTML5 local storage, and added in information about the Creative Commons license to the site and the API. I investigated the issue of parish boundary labels appearing on top of icons, but as yet I’ve not found a way to address this. I might return to it before the launch if there’s time, but it’s not a massive issue. I moved all of the place-name information on the record page above the map, other than purely map-based data such as grid reference. I also removed the option to search the ‘analysis’ field from the advanced search and updated the element ‘auto-complete’ feature so that it now only matches the starting letters of an element rather than any letters. I also noticed that the combination of ‘relief’ and ‘water’ classifications didn’t have an icon on the map, so I created one for it.
I also continued to work on the Bilingual Thesaurus website this week. I updated the way in which source links work. Links to dictionary sources now appear as buttons in the page, rather than in a separate pop-up. They feature the abbreviation (AND / MED / OED) and the magnifying glass icon, and if you hover over a button the non-abbreviated form appears. For OED links I’ve also added the text ‘subscription required’ to the hover-over text. I also updated the word record so that where language of origin is ‘unknown’ the language of origin no longer gets displayed, and I made the headword text a bit bigger so it stands out more. I also added the full hierarchy above the category heading in the category section of the browse page, to make it easier to see exactly where you are. This will be especially useful for people using the site on narrow screens as the tree appears beneath the category section so is not immediately visible. You can click on any of the parts of the hierarchy here to jump to that point.
I then began to work on the search facility, and realised I needed to implement a ‘search words’ list that features variants. I did this for the Historical Thesaurus and it’s really useful. What I’ve done so far is generate alternatives for words that have brackets and dashes. For example, the headword ‘Bond(e)-man’ has the following search terms: Bond(e)-man, Bond-man, Bonde-man, Bond(e) man, Bond man, Bonde man, Bond(e)man, Bondman, Bondeman. None of these variants will ever appear on the website, but instead will be used to find the word when people search. I’ll need some feedback as to whether these options will suffice, but for now I’ve uploaded the variants to a table and begun to get the quick search working. It’s not entirely there yet, but I should get this working next week. I also need to know what should be done about accented characters for search purposes. The simplest way to handle them would be to just treat them as non-accented characters – e.g. searching for ‘alue’ will find ‘alué’. However, this does mean you won’t be able to specifically search for words that include accented characters – e.g. a search for all the words featuring an ‘é’ will just bring back all words with an ‘e’ in them.
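To make the variant generation and the accent handling a bit more concrete, here is a minimal Python sketch of the idea. It is not the code running on the site: the function names are made up for illustration, and it assumes that every bracketed group in a headword is either kept or dropped together, and that accents can be stripped by Unicode decomposition.

```python
import re
import unicodedata


def bracket_forms(word):
    """Return the word as-is, with bracketed letters dropped, and with the
    bracketed letters kept but the brackets themselves removed."""
    if "(" not in word:
        return [word]
    dropped = re.sub(r"\([^)]*\)", "", word)
    kept = re.sub(r"\(([^)]*)\)", r"\1", word)
    return [word, dropped, kept]


def search_variants(headword):
    """Hypothetical helper: build the list of search terms for a headword.

    Dashes are kept, replaced with a space, or removed, and each of those
    is combined with the three bracket forms. Duplicates are skipped, so a
    headword with no brackets or dashes yields just itself.
    """
    variants = []
    for dash_replacement in ("-", " ", ""):
        for form in bracket_forms(headword):
            variant = form.replace("-", dash_replacement)
            if variant not in variants:
                variants.append(variant)
    return variants


def fold_accents(text):
    """Strip accents by decomposing characters (NFD) and dropping the
    combining marks, so that 'alué' and 'alue' compare as equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for term in search_variants("Bond(e)-man"):
        print(term)
    print(fold_accents("alué"))
```

If both the stored search terms and the user's query are passed through something like `fold_accents`, a search for ‘alue’ finds ‘alué’, with the trade-off described above that a search specifically for ‘é’ can no longer be told apart from a search for ‘e’.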
I was intending to add a count of the number of words in each hierarchical level to the browse, or at least to make hierarchical levels that include words bold in the browse, so as to let users know whether it’s worthwhile clicking on a category to view the words at this level. However, I’ve realised that this will just confuse users as levels that have no words in them but include child categories that do have words in them would be listed with a zero or not in bold, giving the impression that there is no content lower down the hierarchy.
My last task for the week was to create a new timeline for the RNSN project based on data that had been given to me. I think this is looking pretty good, but unfortunately making these timelines and related storymaps is very time-intensive, as I need to extract and edit the images, upload them to WordPress, extract the text and convert it into HTML and fill out the template with all of the necessary fields. It took about 2 and a half hours to make this timeline. However, hopefully the end result will be worth it.
I continued to work on the HT / OED data alignment for a lot of this week. I updated the matching scripts I had previously created so that all matches based on last lexeme were removed and instead replaced by a ‘6 matches or more and 80% of words in total match’ check. This was a lot more effective than purely comparing the last word in each category and helped match up a lot more categories. I also created a QA script to check the manual matches that were made during our first phase of matching. There are 1407 manual matches in the system. The script also listed all the words in each potential matched category to make it easier to tell where any potential difficulties were. I also updated the ‘pattern matching’ script I’d created last week to list all words and include the ‘6 matches and 80%’ check and changed the layout so that separate groupings now appear in different tables rather than being all mixed up in one table. It took quite a long time to sort this out, but it’s going to be much more useful for manual checking.
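As a rough illustration of the ‘6 matches and 80%’ check, a Python sketch might look like the following. This is only a sketch: the function name is hypothetical, and it assumes that the 80% is measured against the number of distinct words in the OED category and that words are compared case-insensitively as plain strings, neither of which is spelled out above.

```python
def passes_word_check(oed_words, ht_words, min_matches=6, min_ratio=0.8):
    """Return True if two categories share enough words to be a likely match.

    Assumption: the ratio is shared words / distinct OED words. The real
    scripts may well measure the percentage differently.
    """
    oed = {w.strip().lower() for w in oed_words}
    ht = {w.strip().lower() for w in ht_words}
    if not oed:
        return False
    shared = oed & ht
    return len(shared) >= min_matches and len(shared) / len(oed) >= min_ratio


if __name__ == "__main__":
    oed = ["a", "b", "c", "d", "e", "f", "g"]
    ht = ["A", "b", "c", "d", "e", "f", "x"]
    # Six shared words out of seven OED words passes both thresholds.
    print(passes_word_check(oed, ht))
```

Requiring both thresholds means a small category cannot pass on percentage alone, and a large category cannot pass on six chance overlaps alone.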
I then moved on to writing a new ‘sibling matching’ script. This script goes through all unmatched OED categories (this includes all that appear in other scripts such as the pattern matching one) and retrieves all sibling categories of the same POS. E.g. if the category is ‘01.01.01|03 (n)’ then the script brings back all HT noun subcats of ’01.01.01’ that are ‘level 1’ subcats and compares their headings. It then looks to see if there is a sibling category that has the same heading – i.e. looking for when a category has been renumbered within the same level of the thesaurus. This has uncovered several hundred such potential matches, which will hopefully be very helpful. I also then created a further script that compares non-noun headings to noun headings at the same level, as it looked like a number of times the OED kept the noun heading for other parts of speech while the HT renamed them. This identified a further 65 possible matches, which isn’t too bad.
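The sibling comparison could be sketched in Python along these lines. The `Category` structure and its field names are invented for the example, and it assumes that the subcategory level can be read off the number of dot-separated parts after the ‘|’ (so ‘03’ is level 1); the real data is held in database tables and may be organised differently.

```python
from collections import defaultdict
from typing import NamedTuple


class Category(NamedTuple):
    """Hypothetical record: '01.01.01|03 (n)' -> parent '01.01.01', sub '03', pos 'n'."""
    catid: int
    parent: str
    sub: str
    pos: str
    heading: str


def sub_level(sub):
    """Assumed convention: '03' is level 1, '03.01' is level 2, '' is a main category."""
    return len(sub.split(".")) if sub else 0


def find_sibling_matches(unmatched_oed, ht_categories):
    """Pair each unmatched OED category with any HT sibling (same parent,
    same POS, same subcat level) that has the same heading."""
    siblings = defaultdict(list)
    for ht in ht_categories:
        siblings[(ht.parent, ht.pos, sub_level(ht.sub))].append(ht)

    matches = []
    for oed in unmatched_oed:
        key = (oed.parent, oed.pos, sub_level(oed.sub))
        for ht in siblings.get(key, []):
            if ht.heading.strip().lower() == oed.heading.strip().lower():
                matches.append((oed, ht))
    return matches
```

The point of grouping by parent, POS and level first is that a renumbered category (say ‘|03’ in one dataset and ‘|05’ in the other) still ends up in the same group, so only the headings need to be compared.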
I met with Marc and Fraser on Wednesday to discuss the recent updates I’d made, after which I managed to tick off 2614 matched categories, taking our total of unmatched OED categories that have a part of speech and are not empty down to 10,854. I then made a start on a new script that looks at pattern matching for category contents (i.e. words), but I didn’t have enough time to make a huge amount of progress with this.
to try and get things working but the callbacks were never being initiated – i.e. data wasn’t getting through to Google. Thankfully Stack Overflow had an answer that worked (after trying several that didn’t):
I’ve updated this so that pageviews rather than events are sent and now everything seems to be working again.
I spent a bit more time this week working on the Bilingual Thesaurus project, focussing on getting the front end for the thesaurus working. I’ve reworked the code for the HT’s browse facility to work with the project’s data. This required quite a lot of work as structurally the datasets are quite different – the HT relies on its ‘tier’ numbers for parent / child / sibling category relationships, and also has different categories for parts of speech and nested subcategories. The BTH data is much simpler (which is great) as it just has parent and child categories, with things like part of speech handled at word level. This meant I had to strip a lot of stuff out of the code and rework things. I’m also taking the opportunity to move to a new interface library (Bootstrap) so had to rework the page layout to take this into consideration too. I’ve now managed to get an initial version of the browse facility working, which works in much the same way as the main HT site: clicking on a heading allows you to view its words and clicking on a ‘plus’ sign allows you to view the child categories. As with the HT you can link directly to a category too. I do still need to work on the formatting of the category contents, though. Currently words are just listed all together, with their type (AN or ME) listed first, then the word, then the POS in brackets, then dates (if available). I haven’t included data about languages of source or citation yet, or URLs. I’m also going to try and get the timeline visualisations working as well. I’ll probably split the AN and ME words into separate tabs, and maybe split the list up by POS too. I’m also wondering whether the full category hierarchy should be represented above the selected category (the right pane), as unlike the HT there’s no category number to show your position in the thesaurus.
Also, as a lot of the categories are empty I’m thinking of making the ones with words in them bold in the tree, or even possibly adding a count of words in brackets after the category heading. I’ve also updated the project’s homepage to include the ‘sample category’ feature, allowing you to press the ‘reload’ icon to load a new random category.
On Friday I spent most of the day working on the RNSN project, adding direct links to the ‘nation’ introductions to the main navigation menu and creating new ‘storymap’ stories based on PowerPoint presentations that had been sent to me. This is actually quite a time-consuming process as it involves grabbing images from the PPT, reformatting them, uploading them to WordPress, linking to them from the Storymap pages, creating Zoomified versions of the image or images that will be used as the ‘map’ for the story, extracting audio files from the PPT and uploading them, grabbing all of the text and formatting it for display and other such tasks. However, despite being a long process the end result is definitely worth it as the storymaps work very nicely. I managed to get two such stories completed today, and now I’ve re-familiarised myself with the process it should be quicker when the next set get sent to me.
I’m going to be on holiday next week so there won’t be another report from me until the week after that.