Week Beginning 14th May 2018

I again split my time mostly between REELS, Linguistic DNA and the Historical Thesaurus this week. For REELS, Carole had sent an email with lots of feedback and suggestions, so I spent some time addressing these. This included replacing the icon I’d chosen for settlements and updating the default map zoom level to be a bit further out, so that the entire county fits on screen initially. I also updated the elements glossary ordering so that Old English ‘æ’ and ‘þ’ appear as if they were ‘ae’ and ‘th’ rather than at the end of the lists, and set the ordering to ignore diacritics, which had been messing up the ordering a little. I also took the opportunity to update the display of the glossary so that the whole entry box for each item isn’t a link. This is because I realised that some entries (e.g. St Leonard) have their own ‘find out more’ link, and having a link within a link is never a good idea. Instead, there is now a ‘Search’ button at the bottom right of each entry, and if the ‘find out more’ button is present it appears next to it. I’ve also changed the styling of the counts of place-names and historical forms in the top right to make them look less like buttons.
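The glossary reordering boils down to building a normalised sort key for each headword. Here’s a rough sketch of the approach in JavaScript (the live site does this server-side, and the names here are made up for illustration):

```javascript
// Build a sort key: lowercase, map Old English characters to their
// modern equivalents, then strip diacritics via Unicode decomposition.
function sortKey(headword) {
  return headword
    .toLowerCase()
    .replace(/æ/g, 'ae')   // treat 'æ' as 'ae'
    .replace(/þ/g, 'th')   // treat thorn 'þ' as 'th'
    .normalize('NFD')      // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining diacritics
}

// 'entries' is a hypothetical array of glossary items with an 'element' field.
entries.sort((a, b) => sortKey(a.element).localeCompare(sortKey(b.element)));
```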

I also updated the default view of the map so that the ‘unselected’ data doesn’t appear by default; you now have to tick the checkbox in the legend to add it in if you want it. When the grey dots are added in they now appear ‘behind’ the other map markers rather than on top of them, which is what previously happened if you turned the grey dots off and then on again.

Leaflet has a method called ‘bringToBack’, which can be used to change the ordering of markers. Unfortunately you can’t apply it to an entire layer group (i.e. apply it to all grey dots in my grey dots group with one call). It took me a bit of time to work out why this wasn’t working, but eventually I realised I needed to call the ‘eachLayer’ method on my layer group to iterate over its contents and apply ‘bringToBack’ to each individual grey dot.
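In case it’s useful to anyone else, the fix amounts to something like this (a sketch; ‘greyDotsGroup’ stands in for my actual layer group variable):

```javascript
// bringToBack() only exists on individual vector layers, so iterate
// over the layer group and push each grey dot behind the other markers.
greyDotsGroup.eachLayer(function (dot) {
  dot.bringToBack();
});
```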

In addition to this update, changing the marker categorisation in the ‘Display Options’ section now also keeps the ‘unselected’ dots off unless you choose to turn them on. I think this will be better for most users; I know that when I was testing the map, the first thing I did after changing the categorisation was always to turn off the grey dots to reduce the clutter.

Carole had also pointed out an issue with the browse for sources: one source was appearing out of alphabetical order and with more associated place-names than it should have had. It turned out that this was a bug introduced when I previously added a new field for the browse list that strips all tags (e.g. italics) from the title. This field gets populated when a source record is created or edited in the CMS. Unfortunately, I’d forgotten that sources can also be added and edited directly through the add / edit historical forms pages, and I hadn’t added the code to populate the field there. This meant the field was being left blank, resulting in strange ordering and place-name counts on the browse sources page.
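The fix was simply to make sure the same population step runs on every save path. Sketched in JavaScript with hypothetical names (the real code lives in the CMS’s server-side save routines):

```javascript
// Store a tag-free copy of the title alongside the original, for
// sorting and counting in the browse list.
function stripTags(title) {
  return title.replace(/<[^>]*>/g, '');
}

// This needs to run wherever a source is created or edited: the source
// pages in the CMS *and* the add / edit historical forms pages.
source.titleSort = stripTags(source.title);
```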

The biggest change Carole had suggested was to the way date searches work. Rather than having the search and browse options find place-names that have historical forms with a start / end date within the selected date or date range, Carole reckoned that identifying the earliest date for a place-name would be more useful. This was actually a pretty significant change, requiring a rewrite of large parts of the API, but I managed to get it all working. End dates have now been removed from the search and browse. The ‘browse start date’ now looks for the earliest recorded start date rather than bringing back a count of place-names that have any historical form with the specified year, which I agree is much more useful. The advanced search now allows you to specify a single year or a range of years, or you can use ‘<’ and ‘>’ to search for place-names whose earliest historical form has a start date before or after a particular year.
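Roughly speaking, the search input is interpreted along these lines (a sketch with hypothetical names; the real parsing lives in the API, and I’ve assumed a hyphen as the range separator):

```javascript
function parseYearQuery(q) {
  q = q.trim();
  if (q.startsWith('<')) return { op: 'before', year: parseInt(q.slice(1), 10) };
  if (q.startsWith('>')) return { op: 'after', year: parseInt(q.slice(1), 10) };
  if (q.includes('-')) {
    const [from, to] = q.split('-').map(s => parseInt(s, 10));
    return { op: 'between', from: from, to: to };
  }
  return { op: 'exact', year: parseInt(q, 10) };
}
// The result is then matched against each place-name's earliest
// recorded historical-form start date.
```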

I also finally got round to replacing the base maps with free alternatives this week. I was previously using MapBox maps for all but one of our base maps, but as MapBox only allows 50,000 map views a month, and I’d managed almost 10,000 myself, we agreed that we couldn’t rely so heavily on the service, as the project has no ongoing funds. Thanks to some very useful advice from Chris Fleet at the NLS, I managed to switch to some free alternatives, including three that are hosted by the NLS Maps people themselves. The default view is now Esri Topomap and the satellite view is now Esri WorldImagery (both free), while ‘satellite with labels’ is the only layer still served by MapBox. I’ve also included modern OS maps courtesy of the NLS, OS maps 1840-1880 from the NLS, and OS maps 1920-1933 as before. We now have six base maps to choose from, and I think the resource is looking pretty good.
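For anyone curious, swapping base maps in Leaflet is straightforward; a sketch (the Esri URLs are the standard public tile endpoints, while the NLS layers come from their own service so I’ve left them as comments):

```javascript
var baseMaps = {
  'Default (Esri Topomap)': L.tileLayer(
    'https://server.arcgisonline.com/ArcGIS/rest/services/World_Topo_Map/MapServer/tile/{z}/{y}/{x}',
    { attribution: 'Tiles © Esri' }),
  'Satellite (Esri WorldImagery)': L.tileLayer(
    'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}',
    { attribution: 'Tiles © Esri' })
  // plus 'satellite with labels' (MapBox) and the three NLS-hosted
  // OS layers, all of which follow the same L.tileLayer pattern.
};
// 'map' is the Leaflet map instance.
L.control.layers(baseMaps).addTo(map);
```

Here’s an example with the OS maps from the 1840s to 1880s selected: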

For Linguistic DNA this week I continued to monitor the script I’d set running last week to extract frequency data about the usage of Thematic Headings per decade in all of the EEBO data I have access to. I had hoped the process would be complete by Monday, and it probably would have been were it not for the script running out of memory when it tried to tackle the category ‘AP:04 Number’. This category is something of an outlier and contains significantly more data than the others: more than 2,600,000 rows, of which almost 200,000 are unique. My script stores all unique words in an associative array, with frequencies for each decade then added to it, so the more unique words there are, the larger the array and the more memory required. I skipped over the category and my script successfully dealt with the remaining categories, finishing the processing on Wednesday. I then temporarily updated the PHP settings to remove memory restrictions and set my script to deal with ‘AP:04’, which took a while but completed successfully, resulting in a horribly large spreadsheet containing almost 200,000 rows. I zipped the resulting 2,077 CSV files up and sent them on to the DHI people in Sheffield, who are going to incorporate this data into the LDNA resource.
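The counting step itself is simple enough; here’s the gist in JavaScript for illustration (the actual script is PHP, and the row structure here is made up):

```javascript
// One entry per unique word, each holding a per-decade tally. Memory
// grows with the number of unique words, which is what caught out 'AP:04'.
const freqs = {};                          // word -> { decade -> count }
for (const row of rows) {                  // rows: [{ word, decade }, ...]
  if (!freqs[row.word]) freqs[row.word] = {};
  freqs[row.word][row.decade] = (freqs[row.word][row.decade] || 0) + 1;
}
```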

For the Historical Thesaurus I continued to work on the new Timeline feature, this time adding in mini-timelines that will appear beside each word on the category page. Marc suggested using the ‘Bullet Chart’ option that’s available in the jQuery Sparkline library (https://omnipotent.net/jquery.sparkline/#s-about), and I’ve been looking into this.

Initially I ran into some difficulty with the limited number of options available. For example, you can’t specify a start value for the chart, only an end value (although I later discovered that there is an undocumented setting for this in the source code), and individual blocks don’t have start and end points either, but are instead single points that take their start value from the previous block. Also, data needs to be added in reverse order or things don’t display properly.

I must admit that trying to figure out how to hack about with our data to fit it in as the library required gave me a splitting headache, and I eventually abandoned the library and wondered whether I could just make a ‘mini’ version using the D3 timeline plugin I was already using. After all, there are lots of examples of single-bar timelines in the documentation: https://github.com/denisemauldin/d3-timeline. However, after more playing around with this library I realised that it just wasn’t very well suited to being shrunk to an inline size. Things started to break in weird ways when the dimensions were made very small, and I didn’t want to have to furtle about with the library’s source code too much, having already had to do so for the main timeline.

So, after taking some ibuprofen I returned to the ‘Bullet Chart’ and finally managed to figure out how to make our data work and get it all added in reverse order. As the start of the chart has to be zero, I made the end 1000 and subtracted 1000 years from all of the data; if I hadn’t done this then OE would have started midway through the chart. Individual years were not displaying as they were too narrow, so I added a 50-year range on to them, which I later reduced to 20 years after feedback from Fraser. I also managed to figure out how to reduce the thickness of the bar running along the middle of the visualisation. This wasn’t entirely straightforward, as the library uses HTML Canvas rather than SVG, meaning you can’t just view the source of the visualisation using the browser’s ‘select element’ feature and tinker with it. Instead I had to hack about with the library’s source code to change the coordinates of the rectangle that gets created.
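The eventual call looks something like this (a sketch with made-up values; some option names may differ from what I actually used, and the thin middle bar is the one whose rectangle coordinates I changed in the library source):

```javascript
// Bullet chart values run target, performance, then the ranges. Each
// range is drawn relative to the previous one, hence the reverse
// ordering and the white 'masking' blocks. All dates have 1000
// subtracted so the zero-based chart lines up, and each block gets
// at least 20 years of width so single years remain visible.
$('#mini-timeline').sparkline(
  [0, 0, 1000, 770, 750],   // hypothetical word attested 1750-1770
  {
    type: 'bullet',
    targetWidth: 0,                          // hide the target line
    rangeColors: ['#fff', '#2e6e9e', '#fff'] // white masks the unused spans
  }
);
```

Here’s an example of where I’d got to during the week: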

On Friday I returned to the mini-timelines, this time figuring out how to incorporate them into the test version of the Thesaurus’s category page, for both main categories and subcats (which are handled differently in the code). It took quite some time to figure out how to load the required data into the page so that the library could access it, but I got there in the end: the data is added to a hidden element in each word’s section, and my JavaScript file can then access it as required.
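The hand-off works along these lines (class and attribute names here are hypothetical):

```javascript
// Each word's section contains a hidden element carrying its dates, e.g.
// <span class="mini-timeline-data" style="display:none"
//       data-dates="1750,1770"></span>
$('.mini-timeline-data').each(function () {
  var dates = $(this).attr('data-dates').split(',').map(Number);
  // ...build this word's mini-timeline from 'dates'...
});
```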

I positioned the timelines to the right of each word’s section, next to the magnifying glass, with a tooltip that displays the fulldate field on hover. I figured out how to position the ‘line’ at the bottom of the timeline rather than in the middle, and I’ve disabled the highlighting of sections on mouseover and made the background look transparent. It isn’t actually transparent: I tried making it so, but the ‘white’ blocks are what cover up the unwanted sections of the other colour, so setting things to transparent messed up the timeline. Instead the code works out whether the row is odd or even and grabs the row colour based on this. I had to remove the shades of grey from the subcat backgrounds to make this work, but actually I think the page looks better without the subcats being in grey.
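The odd / even workaround is just a couple of lines (colour values here are hypothetical):

```javascript
// Paint the 'blank' blocks in the row's own background colour rather
// than making them truly transparent. 'row' is the word's container element.
var isOdd = $(row).index() % 2 === 1;
var maskColour = isOdd ? '#f7f7f7' : '#ffffff';
```

So, here is an example of the mini timelines in the category test page: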

I think it’s looking pretty good. The only downside is that these mini-timelines make my original full timeline seem a little obsolete.

I worked on a few other projects this week as well. I sorted out access to the ‘Editing Burns’ website for a new administrator who has started, and I investigated some strange errors with the ‘Seeing Speech’ website whereby the video files were being blocked. It turned out to be down to a new security patch that had been installed on the server; after Chris updated this, things started working again.

I also met with Megan Coyer to discuss her ‘Hogg in Fraser’s Magazine’ project. She had received XML files containing OCR text and metadata for all of the Fraser’s Magazine issues and wanted me to process the files to convert them into a format that she and her RA could use more easily. Basically she wanted the full OCR text, plus the Record ID, title, volume, issue, publication date and contributor information, to be added to one Word file.

There were 17,072 XML files, and initially I wrote a script that grabbed the required data and generated a single HTML file, which I was then going to convert into DOCX format. However, the resulting file was over 600MB in size, which was too big to work with. I therefore decided to generate individual documents for each volume in the data instead. This resulted in 81 files (including one for all of the XML files that don’t seem to include a volume). The files are a more manageable size, but are still thousands of pages long in Word. This seemed to suit Megan’s needs, and I moved the Word files to her shared folder for her to work with.
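For the record, the batching step boils down to something like this Node.js sketch (paths and the volume tag name are hypothetical):

```javascript
const fs = require('fs');
const path = require('path');

// Group each XML file's content by its volume, falling back to a
// catch-all bucket for files with no volume recorded.
const byVolume = {};
for (const file of fs.readdirSync('xml')) {
  const xml = fs.readFileSync(path.join('xml', file), 'utf8');
  const volume = (xml.match(/<volume>(.*?)<\/volume>/) || [])[1] || 'no-volume';
  (byVolume[volume] = byVolume[volume] || []).push(xml);
}

// One output document per volume: 81 manageable files rather than a
// single 600MB one.
for (const volume of Object.keys(byVolume)) {
  fs.writeFileSync('volume-' + volume + '.html', byVolume[volume].join('\n'));
}
```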