Week Beginning 16th September 2019

I spent some time this week investigating the final part of the SCOSYA online resource that I needed to implement: a system whereby researchers could request access to the full audio dataset and a member of the team could approve the request and grant the person access to a facility where the required data could be downloaded.  Downloads would be a series of large ZIP files containing WAV files and accompanying textual data.  As we wanted to restrict access to legitimate users only, I needed to ensure that the ZIP files were not directly web accessible, but were instead passed through to the browser on request by a PHP script.

I created a test version using a 7.5GB ZIP file that had been created a couple of months ago for the project’s ‘data hack’ event.  This version stores the ZIP files in a non-web-accessible directory and then grabs a file and passes it through to the browser on request.  It will be possible to add user authentication to the script to ensure that it can only be executed by a registered user.  The actual location of the ZIP files is never divulged, so neither registered nor unregistered users will ever be able to directly link to or download the files other than via the authenticated script.
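
To give a sense of how the pass-through works, here’s a minimal sketch of the sort of PHP script I have in mind – the directory path, filenames and the session-based login check are all illustrative placeholders rather than the actual implementation:

```php
<?php
// Minimal sketch of a pass-through download script.
// Assumes the ZIP files sit in a directory outside the web root and that
// registered users log in via a session (the check below is a placeholder).
session_start();

if (empty($_SESSION['user_id'])) {
    http_response_code(403);
    exit('You must be logged in to download this data.');
}

// Only allow known filenames – never build a path from raw user input.
$allowed = ['scosya-audio-part1.zip', 'scosya-audio-part2.zip'];
$file = $_GET['file'] ?? '';
if (!in_array($file, $allowed, true)) {
    http_response_code(404);
    exit('File not found.');
}

$path = '/data/scosya-zips/' . $file; // non-web-accessible location

header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="' . $file . '"');
header('Content-Length: ' . filesize($path));

// Stream the file in small chunks so a multi-gigabyte ZIP never has to fit in memory.
$handle = fopen($path, 'rb');
while (!feof($handle)) {
    echo fread($handle, 8192);
    flush();
}
fclose($handle);
```

Because the script (rather than the web server) reads the file and writes it out, the directory holding the ZIPs never needs to be web accessible, and any access check can sit at the top of the script.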

This all sounds promising, but I realised that there are some serious issues with this approach.  HTTP is not really intended for transferring huge files, and using a web-based method to download massive ZIP files is just not going to work very well.  The test ZIP file I used was about 7.5GB in size (roughly the size of a DVD), but the actual ZIP files are likely to be much larger than this – with the full dataset taking up about 180GB.  Even using my desktop PC on the University network it took roughly 30 minutes to download the 7.5GB file – that’s a little over 4MB per second, which at the same rate would put the full 180GB at around 12 hours.  Over an external network it would likely take a lot longer, and bigger files would be pretty unmanageable for people to download.

It’s also likely that only a small number of researchers will request the data, and if this is the case then perhaps it’s not such a good idea to take up 180GB of web server space (plus the overheads of backups) to store data that is seldom going to be accessed, especially if this simply replicates data that is already taking up a considerable amount of space on the shared network drive.  180GB is probably more web space than is used by most other Critical Studies websites combined.  After discussing this issue with the team, we decided that we would not set up such a web-based resource to access the data, but would instead send ZIP files to researchers on request using the University’s transfer service, which allows files of up to 20GB to be sent to both internal and external email addresses.  We’ll need to see how this approach works out, but I think it’s a better starting point than setting up our own online system.

I also spent some further time on the SCOSYA project this week implementing changes to both the experts and the public atlases based on feedback from the team.  These included changing the default map position and zoom level, replacing some of the colours used for map markers and menu items, tweaking the layout of the transcriptions, ensuring certain words in story titles can appear in bold (as opposed to the whole title being bold, as was previously the case) and removing descriptions from the list of features found in the ‘Explore’ menu in the public atlas.  I also added a bit of code to ensure that internal links from story pages to other parts of the public atlas work (previously they weren’t doing anything because only the part after the hash was changing), and ensured that the experts atlas side panel resizes to fit its content whenever an additional attribute is added or removed.

Also this week I finally found a bit of time to fix the map on the advanced search page of the SCOTS Corpus website.  This map was previously powered by Google Maps, but Google has now removed free access to its Maps service (you now need to provide a credit card and get billed if your usage goes over a certain number of free hits a month).  As we hadn’t updated the map or provided such details, Google broke the map, covering it with warning messages and removing our custom map styles.  I have now replaced the Google Maps version with a map created using the free-to-use Leaflet.js mapping library (which I’m also using for SCOSYA) and a free map tileset from OpenStreetMap.  Other than that it works in exactly the same way as the old Google map.  The new version is now live here: https://www.scottishcorpus.ac.uk/advanced-search/.

Also this week I upgraded all of the WordPress sites I manage, engaged in some App Store duties and had a further email conversation with Marc Alexander about how dates may be handled in the Historical Thesaurus.  I also engaged in a long email conversation with Heather Pagan of the Anglo-Norman Dictionary about accessing the dictionary data.  Heather has now managed to access one of the servers that the dictionary website runs on and we’re now trying to figure out exactly where the ‘live’ data is located so that I can work with it.  I also fixed a couple of issues with the widgets I’d created last week for the GlasgowMedHums project (some test data was getting pulled into them) and tweaked a couple of pages.  The project website is launching tomorrow so if anyone wants to access it they can do so here: https://glasgowmedhums.ac.uk/

Finally, I continued to work on the new API for the Dictionary of the Scots Language, implementing the bibliography search for the ‘v2’ API.  This version of the API uses data extracted from the original API, and the test website I’ve set up should be identical to the live site, except that it gets all of its data from the ‘v2’ API and in no way connects to the old, undocumented API.  API calls to search the bibliographies (both a predictive search used to display the auto-complete results and a search to populate a full search results page) and to display an individual bibliography are now available, and I’ve connected the test site to these calls, so staff can now search for bibliographies on the test site.
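
To give a rough idea of the shape of these calls (the base URL, endpoint paths and parameter names below are invented for illustration – the real ‘v2’ API may organise things differently), the test site essentially requests JSON along these lines and renders the results:

```php
<?php
// Hypothetical examples of the kinds of calls the test site makes to the 'v2' API.
// The base URL, paths and parameter names are placeholders, not the real endpoints.
$apiBase = 'https://v2-api.example.org';

// Predictive (auto-complete) bibliography search for the characters typed so far.
$predictive = json_decode(file_get_contents(
    $apiBase . '/bibliographies/search?mode=predictive&q=' . urlencode('aber')
), true);

// Full search results page for the same query.
$full = json_decode(file_get_contents(
    $apiBase . '/bibliographies/search?mode=full&q=' . urlencode('aber')
), true);

// An individual bibliography record by its ID.
$record = json_decode(file_get_contents(
    $apiBase . '/bibliographies/' . urlencode('bib1234')
), true);
```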

Whilst investigating how to replicate the original API I realised that the bibliography search on the live site is actually a bit broken.  The ‘Full Text’ search simply doesn’t work; it just does the same as a search for authors and titles (in fact the original API doesn’t even include a ‘full text’ option).  Also, results only display authors, so for records with no author you get some pretty unhelpful results.  I did consider adding in a full-text search, but as bibliographies contain little other than authors and titles there didn’t seem much point, so instead I’ve removed the option.  The search is primarily set up as an auto-complete that matches words in authors or titles beginning with the characters being typed (i.e. a wildcard search such as ‘wild*’), and the full search results page only gets displayed if someone ignores the auto-complete list of results and manually presses the ‘Search’ button, so I’ve made the full search results page always work as a ‘wild*’ search too.  Typing ‘aber’ into the search box and pressing ‘Search’ will therefore bring up a list of all bibliographies with titles or authors featuring a word beginning with these characters.  With the previous version this wasn’t the case – you had to add a ‘*’ after ‘aber’, otherwise the full search results page would match ‘aber’ exactly and find nothing.  I’ve updated the help text on the bibliography search page to explain this.
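
In case it helps to picture it, the prefix matching boils down to something like the query below (the table and column names are just assumptions for illustration, not the actual DSL schema):

```php
<?php
// Illustrative 'wild*' (prefix) match against bibliography authors and titles.
// $pdo is an existing PDO connection; the table and column names are assumptions.
function searchBibliographies(PDO $pdo, string $term): array
{
    $sql = 'SELECT id, author, title
              FROM bibliographies
             WHERE author LIKE :a_start OR author LIKE :a_word
                OR title  LIKE :t_start OR title  LIKE :t_word
             ORDER BY author, title';
    $stmt = $pdo->prepare($sql);
    $stmt->execute([
        ':a_start' => $term . '%',        // the author field begins with the typed characters
        ':a_word'  => '% ' . $term . '%', // or a later word in the author does
        ':t_start' => $term . '%',        // the title field begins with the typed characters
        ':t_word'  => '% ' . $term . '%', // or a later word in the title does
    ]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The same matching can then serve both the auto-complete and the full results page, which is what keeps the two behaving consistently.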

The full search results (and the results side panel) in the new version now include titles as well as authors, which makes things clearer, and I’ve also made the search results numbering appear at the top of the corresponding result text rather than on the last line; the same applies to entry searches.  Once the test site has been fully tested and approved we should be able to replace the live site with the new site (ensuring all WordPress content from the live site is carried over, of course).  Doing so will mean the old server containing the original API can (once we’re confident all is well) be switched off.  There is still the matter of implementing the bibliography search for the V3 data, but as mentioned previously this will probably be best tackled once we have sorted out the issues with the data and are getting ready to launch the new version.