Week Beginning 27th January 2020

I divided most of my time between three projects this week.  For the Place-Names of Mull and Ulva my time was spent working with the GB1900 dataset.  On Friday last week I’d created a script that would go through the entire 2.5 million row CSV file and extract each entry, adding it to a database for easier querying.  This process had finished on Monday, but unfortunately things had gone wrong during the processing.  I was using the PHP function ‘fgetcsv’ to extract the data a line at a time.  This splits the CSV up based on a delimiting character (in this case a comma) and adds each part into an array, thus allowing the data to be inserted into my database.  Unfortunately some of the data contained commas.  In such cases the data was enclosed in double quotes, which is the standard way of handling such things, and I had thought the PHP function would automatically handle this, but alas it didn’t, meaning whenever a comma appeared in the data the row was split up into incorrect chunks and the data was inserted incorrectly into the database.  After realising this I added another option to the ‘fgetcsv’ command to specify a character to be identified as the ‘enclosure’ character and set the script off running again.  It had completed the insertion by Wednesday morning, but when I came to query the database again I realised that the process had still gone wrong.  Further investigation revealed the cause to be the GB1900 CSV file itself, which was encoded with UCS-2 character encoding rather than the more usual UTF-8.  I’m not sure why the data was encoded in this way, as it’s not a current standard and it results in a much larger file size than using UTF-8.  It also meant that my script was not properly identifying the double quote characters, which is why it failed a second time.  However, after identifying this issue I converted the CSV to UTF-8, picked out a section with commas in the data, tested my script, discovered things were working this time, and let the script loose on the full dataset yet again.  Thankfully it proved to be ‘third time lucky’ and all 2.5 million rows had been successfully inserted by Friday morning.
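For reference, here’s a minimal sketch of the kind of import loop described above.  The file is converted from UCS-2 to UTF-8 first, and the ‘enclosure’ argument is passed to ‘fgetcsv’.  The database details, table name and column names (‘gb1900’, ‘pin_id’ and so on) are placeholders rather than the project’s actual schema.

```php
<?php
// Minimal sketch of the GB1900 import loop. Table and column names are
// placeholders, not the project's actual schema.

// Convert the source file from UCS-2 to UTF-8 first, otherwise fgetcsv()
// fails to recognise the double-quote enclosure characters. For a file of
// this size a one-off command-line conversion (e.g. with iconv) may be
// preferable to loading the whole thing into memory.
$raw = file_get_contents('gb1900_gazetteer.csv');
file_put_contents('gb1900_utf8.csv', mb_convert_encoding($raw, 'UTF-8', 'UCS-2'));

$db = new PDO('mysql:host=localhost;dbname=placenames;charset=utf8', 'user', 'password');
$insert = $db->prepare('INSERT INTO gb1900 (pin_id, place_name, latitude, longitude) VALUES (?, ?, ?, ?)');

$handle = fopen('gb1900_utf8.csv', 'r');
fgetcsv($handle); // skip the header row

// The third argument is the delimiter and the fourth is the enclosure
// character: values containing commas are wrapped in double quotes and
// must not be split on those commas.
while (($row = fgetcsv($handle, 0, ',', '"')) !== false) {
    $insert->execute([$row[0], $row[1], $row[2], $row[3]]);
}
fclose($handle);
```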

After that I was able to extract all of the place-names for the three parishes we’re interested in, which is a rather more modest 3908 rows.  I then wrote another script that would take this data and insert it into the project’s place-name table.  The place-names are a mixture of Gaelic and English (e.g. ‘A’ Mhaol Mhòr’ is pretty clearly Gaelic while ‘Argyll Terrace’ is not) and for now I set the script to just add all place-names to the ‘English’ rather than the ‘Gaelic’ field.  The script also inserts the latitude and longitude values from the GB1900 data, and associates the appropriate parish.  I also found a bit of code that takes latitude and longitude figures and generates a six-figure OS grid reference from them.  I tested this out and it seemed pretty accurate, so I added it to my script, meaning all names have the grid reference field populated too.
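I haven’t reproduced the conversion code here, but the sketch below shows roughly how a six-figure grid reference can be built from an OS National Grid easting and northing.  The earlier step of converting latitude and longitude into an easting and northing (a datum shift from WGS84 to OSGB36 plus a Transverse Mercator projection) is assumed to have been done already and isn’t shown.

```php
<?php
// Rough sketch: format an OS National Grid easting/northing (in metres) as a
// six-figure grid reference. The conversion from latitude/longitude to
// easting/northing is assumed to have been handled elsewhere.

function eastNorthToGridRef(float $e, float $n): string
{
    // 100km grid square indices
    $e100k = (int) floor($e / 100000);
    $n100k = (int) floor($n / 100000);

    // Work out the two grid letters, skipping 'I' which isn't used
    $l1 = (19 - $n100k) - (19 - $n100k) % 5 + intdiv($e100k + 10, 5);
    $l2 = (19 - $n100k) * 5 % 25 + $e100k % 5;
    if ($l1 > 7) $l1++;
    if ($l2 > 7) $l2++;
    $letters = chr($l1 + ord('A')) . chr($l2 + ord('A'));

    // Three digits each of easting and northing give a 100m (six-figure) reference
    $eDigits = str_pad((string) (int) (fmod($e, 100000) / 100), 3, '0', STR_PAD_LEFT);
    $nDigits = str_pad((string) (int) (fmod($n, 100000) / 100), 3, '0', STR_PAD_LEFT);

    return $letters . ' ' . $eDigits . ' ' . $nDigits;
}

// e.g. eastNorthToGridRef(147300, 739800) returns 'NM 473 398'
```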

The other thing I tried to do was to grab the altitude for each name via the Google Maps service.  This proved to be a little tricky as the service blocks you if you make too many requests all at once.  Our server was also blacklisting my computer for making too many requests in a short space of time, meaning that for a while I was unable to access any page on the site or the database.  Thankfully Arts IT Support managed to stop me getting blocked, and I set the script to query Google Maps at a rate that was acceptable to it, so I was able to grab the altitudes for all 3908 place-names (although 16 of them are at 0m, so it may look as if the lookup hasn’t worked for these).  I also added in a facility to upload, edit and delete one or more sound files for each place-name, together with optional captions for them in English and Gaelic.  Sound files must be in the MP3 format.
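As a rough illustration, a throttled lookup along these lines might look like the sketch below.  The endpoint and JSON fields are those documented for Google’s Elevation API, but the API key, the half-second pause and the surrounding variable names are placeholders rather than what the actual script uses.

```php
<?php
// Sketch of a throttled altitude lookup via the Google Maps Elevation API.
// $placenames (id => ['lat' => ..., 'lng' => ...]) would come from the
// project database; the API key and the 0.5 second pause are placeholders.

$apiKey = 'YOUR_API_KEY';

foreach ($placenames as $id => $coords) {
    $url = 'https://maps.googleapis.com/maps/api/elevation/json'
         . '?locations=' . $coords['lat'] . ',' . $coords['lng']
         . '&key=' . $apiKey;

    $response = json_decode(file_get_contents($url), true);

    if ($response !== null && $response['status'] === 'OK') {
        // Elevation is returned in metres above sea level
        $altitude = round($response['results'][0]['elevation']);
        // ...update the place-name record with $altitude here...
    }

    // Pause between requests so neither Google nor the local server
    // blocks the script for making too many requests at once
    usleep(500000); // 0.5 seconds
}
```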

The second project I worked on this week was my redevelopment of the ‘Digital Humanities at Glasgow’ site.  I have now finished going through the database of DH projects, trimming away the irrelevant or broken ones and creating new banners, icons, screenshots, keywords and descriptions for the rest.  There are now 75 projects listed, including 15 that are currently set as ‘Showcase’ projects, meaning they appear in the banner slideshow and on the ‘showcase’ page.  I also changed the site header font and fixed an issue with the banner slideshow and images getting too small on narrow screens.  I’ve asked Marc Alexander and Lorna Hughes to give me some feedback on the new site and I hope to be able to launch it in two weeks or so.

My third major project of the week was the Historical Thesaurus.  Marc, Fraser and I met last Friday to discuss a new way of storing dates that I’ve been wanting to implement for a while, and this week I began sorting this out.  I managed to create a script that can process any date, including associating labels with the appropriate date.  Currently the script allows you to specify a category (or to load a random category) and the dates for all lexemes therein are then processed and displayed on screen.  As yet nothing is inserted into the database.  I have also updated the structure of the (currently empty) dates table to remove the ‘date order’ field, and changed all date fields to integers rather than varchars to ensure that ordering of the columns is handled correctly.  At last Friday’s meeting we discussed replacing ‘OE’ and ‘_’ with numerical values.  We had mentioned using ‘0000’ for OE, but I’ve realised this isn’t a good idea as ‘0’ can easily be confused with null.  Instead I’m using ‘1100’ for OE and ‘9999’ for ‘current’.  I’ve also updated the lexeme table to add new fields for ‘firstdate’ and ‘lastdate’ that will hold cached values of the first and last dates stored in the new dates table.
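As a small illustration of these conventions, the helper below (the name is mine, not the HT code’s) shows how the special values would be turned into integers before being stored in the new dates table or the cached ‘firstdate’ and ‘lastdate’ fields.

```php
<?php
// Illustrative helper (not the actual HT code): map the special date values
// discussed above onto the agreed integers before they are stored.

function toNumericDate(string $date): int
{
    if ($date === 'OE') {
        return 1100; // Old English dates
    }
    if ($date === '_') {
        return 9999; // 'current' - the word is still in use
    }
    return (int) $date; // an ordinary four-digit year
}
```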

The script displays each lexeme in a category with its ‘full date’ column.  It then displays what each individual entry in the new ‘date’ table would hold for the lexeme in boxes beneath this, and finishes off by displaying what the new ‘firstdate’ and ‘lastdate’ fields would contain.  Processing all of the date variations turned out to be somewhat easier than it was for generating timeline visualisations, as the former can be treated as individual dates (an OE, a first, a mid, a last, a current) while the latter needed to transform the dates into ranges, meaning the script had to check how each individual date connected to the next, possibly had to use ‘b’ dates, etc.

I’ve tested the script out and so far I have only encountered one issue: there are 10 rows that have first, mid and last dates, but instead of the ‘firmidcon’ field joining the first and mid dates together, the ‘firlastcon’ field is used.  The ‘midlascon’ field is then used to join the mid date to the last.  This is an error, as ‘firlastcon’ should not be used to join first and mid dates.  An example of this is htid 28903 in catid 8880, where the ‘full date’ is ‘1459–1642/7 + 1856’.  There may be other occasions where the wrong joining column has been used, but I haven’t checked for these so far.

After getting the script to sort out the dates I then began to look at labels.  I started off using the ‘label’ field in order to figure out where in the ‘full date’ the label appeared.  However, I noticed that where there are multiple labels these all appear joined together in the label field, meaning in such cases the contents of the label field will never be matched to any text in the ‘full date’ field.  E.g. htid 6463 has the full date ‘1611 Dict. + 1808 poet. + 1826 Dict.’ and its label field is ‘Dict. poet. Dict.’, which is no help at all.

Instead I abandoned matching on the ‘label’ field and just used the ‘full date’ field, although I still use the ‘label’ field to check whether the script needs to process labels at all.  Here’s a description of the logic for working out where a label should be added:

The dates are first split up into their individual boxes.  Then, if there is a label for the lexeme I go through each date in turn.  I split the full date field and look at the part after the date, going through each character of this in turn.  If the character is a ‘+’ then I stop.  If I have yet to find label text (they all start with an a-z character) and the character is a ‘-’ and the following character is a number then I stop.  Otherwise, if the character is a-z I note that I’ve started the label.  If I’ve started the label and the current character is a number then I stop.  Otherwise I add the current character to the label and proceed to the next character until all remaining characters are processed or a ‘stop’ criterion is reached.  After that, if there is any label text it’s added to the date.  This process seems to work.  I did, however, have to fix how labels applied to ‘current’ dates are processed.  For a current date my algorithm was adding the label to the final year rather than to the current date (stored as 9999), as the label is found after the final year and ‘9999’ isn’t found in the full date string.  I therefore added a further check for ‘current’ dates after the initial label processing that moves labels from the penultimate date to the current date in such cases.
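To make the logic above a little more concrete, here’s a simplified sketch of the character-by-character scan for a single date.  It’s an illustration of the approach rather than the production code, and the function name and arguments are my own.

```php
<?php
// Simplified sketch of the label scan described above: look at the text that
// follows a given year in the 'full date' field and pull out any label.

function extractLabel(string $fullDate, string $year): string
{
    $pos = strpos($fullDate, $year);
    if ($pos === false) {
        return '';
    }
    // Only the part of the full date after this year is of interest
    $rest = substr($fullDate, $pos + strlen($year));

    $label = '';
    $started = false;

    for ($i = 0; $i < strlen($rest); $i++) {
        $char = $rest[$i];

        // A '+' marks the start of the next date, so stop
        if ($char === '+') {
            break;
        }
        // A '-' followed by a number before any label text means a date range, so stop
        if (!$started && $char === '-' && isset($rest[$i + 1]) && ctype_digit($rest[$i + 1])) {
            break;
        }
        // A number once the label has started means the label has ended
        if ($started && ctype_digit($char)) {
            break;
        }
        // Label text starts with a letter
        if (ctype_alpha($char)) {
            $started = true;
        }
        if ($started) {
            $label .= $char;
        }
    }

    return trim($label);
}

// e.g. extractLabel('1611 Dict. + 1808 poet. + 1826 Dict.', '1808') returns 'poet.'
```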

In addition to these three big projects I also had an email conversation with Jane Roberts about some issues she’d been having with labels in the Thesaurus of Old English, I liaised with Arts IT Support to get some server space set up for Rachel Smith and Ewa Wanat’s project, I gave some feedback on a job description for an RA for the Books and Borrowing project, helped Carole Hough with an issue with a presentation of the Berwickshire Place-names resource, gave the PI response for Thomas’s Iona project a final once-over, gave a report on the cost of Mapbox to Jennifer Smith for the SCOSYA project and arranged to meet Matthew Creasey next week to discuss his Decadence and Translation project.