Week Beginning 10th February 2020

I met with Fraser Dallachy on Monday to discuss his ongoing pilot Scots Thesaurus project.  It’s been a while since I’ve been asked to do anything for this project and it was good to meet with Fraser and talk through some of the new automated processes he wanted me to try out.  One thing he wanted to try was to tag the DSL dictionary definitions for part of speech, to see if we could then automatically pick out word forms that we could query against the Historical Thesaurus to try and place the headword within a category.  I adapted a previous script I’d created that picked out random DSL entries.  This script targeted main entries (i.e. not supplements) that were nouns, were monosemous (i.e. had only one sense), had fewer than five variant spellings, had single-word headwords and had ‘short’ definitions, with the option to specify what counts as ‘short’ in terms of the number of characters.  I updated the script to bring back all DOST entries that met these criteria and had definitions of fewer than 1,000 characters, which resulted in just under 18,000 rows being returned (but I will rerun the script with a smaller character count if Fraser wants to focus on shorter entries).  The script also stripped out all citations and tags from the definition to prepare it for POS tagging.  With this dataset exported as a CSV I then began experimenting with a POS tagger.  I decided to use the Stanford POS Tagger (https://nlp.stanford.edu/software/tagger.html), which can be run at the command line, and I created a PHP script that went through each row of the CSV, passed the prepared definition text to the tagger, pulled in the output and stored it in a database.  I left the process running overnight and it had completed by the following morning.  I then outputted the rows as a spreadsheet and sent them on to Fraser for feedback.  Fraser also wanted to see about using the data from the Scots School Dictionary so I sent that on to him too.
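To give a sense of the process, here’s a minimal sketch of the sort of script involved (the file names, database details, column order and tagger model path are assumptions for illustration rather than the actual values used):

<?php
// Minimal sketch: tag prepared DSL definitions with the Stanford POS Tagger.
// Paths, credentials, column order and table name are illustrative assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=scots_thesaurus', 'user', 'password');
$insert = $pdo->prepare('INSERT INTO tagged_definitions (entry_id, headword, tagged) VALUES (?, ?, ?)');

$handle = fopen('dost_definitions.csv', 'r');
fgetcsv($handle); // skip the header row
while (($row = fgetcsv($handle)) !== false) {
    list($entryId, $headword, $definition) = $row;

    // Write the prepared definition to a temporary file for the tagger to read.
    $tmp = tempnam(sys_get_temp_dir(), 'def');
    file_put_contents($tmp, $definition);

    // Run the tagger at the command line and capture its output,
    // which looks like "the_DT other_JJ side_NN".
    $cmd = 'java -mx300m -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger'
         . ' -model models/english-left3words-distsim.tagger'
         . ' -textFile ' . escapeshellarg($tmp) . ' 2>/dev/null';
    $tagged = shell_exec($cmd);

    $insert->execute([$entryId, $headword, trim($tagged)]);
    unlink($tmp);
}
fclose($handle);

Starting a new Java process for every definition is slow, which is largely why a run over nearly 18,000 definitions takes all night.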

I also did a little bit of work for the DSL, investigating why some geographically tagged information was not being displayed in the citations, and replied to a few emails from Heather Pagan of the Anglo-Norman Dictionary as she began to look into uploading new data to their existing, no longer supported dictionary management system.  I also gave some feedback on a proposal written by Rachel Douglas, a lecturer in French.  As this is not within Critical Studies it would normally be something Luca Guariento looks at, but he is currently on holiday so I offered to help out.  I also set up an initial WordPress site for Matthew Creasey’s new project.  This still needs some further work, but I’ll need more information from Matthew before I can proceed.  On Wednesday I met with Jennifer Smith and E Jamieson to discuss a possible follow-on project for the Scots Syntax Atlas.  We talked through some of the possibilities and I think the project has huge potential.  I’ll be helping to write the Data Management Plan and other such technical things for the proposal in due course.

I met with Marc and Fraser on Friday to discuss our plans for updating the way dates are stored in the Historical Thesaurus, which will make it much easier to associate labels with specific dates and to update the dates in future as we align the data with revisions from the OED.  I’d previously written a script that generated the new dates and from these generated a new ‘full date’ field, which I then matched against the original ‘full date’ to spot errors.  The script identified 1,116 errors, but this week I updated it to change the way it handled ‘b’ dates.  These are the dates that appear after a slash; where the date after the slash falls in the same decade as the main date only one digit should be displayed (e.g. 1975/6), but this hasn’t been done consistently, and such dates sometimes appear as 1975/76.  Where this happened my script was noting the row as an error, but Marc wanted these to be ignored.  I updated my script to take this into consideration, which has greatly reduced the number of rows that will need to be manually checked, down to just 284.
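The change essentially amounts to normalising the ‘b’ portion before the generated and original full dates are compared.  A rough sketch of the idea (the function name and input format are assumptions for illustration):

<?php
// Sketch: treat '1975/6' and '1975/76' as equivalent when comparing dates.
// The function name and input format are assumptions for illustration.
function normaliseBDate(string $date): string {
    // Match a four-digit main year, a slash and a one- or two-digit 'b' date.
    if (preg_match('/^(\d{4})\/(\d{1,2})$/', $date, $m)) {
        list(, $year, $b) = $m;
        // If the 'b' date has two digits but falls in the same decade as the
        // main date, keep only its final digit (1975/76 becomes 1975/6).
        if (strlen($b) === 2 && $b[0] === $year[2]) {
            return $year . '/' . $b[1];
        }
    }
    return $date;
}

// A row is only flagged as an error if the dates still differ after normalisation.
var_dump(normaliseBDate('1975/76') === normaliseBDate('1975/6')); // bool(true)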

I spent the rest of my time this week working on the Books and Borrowers project.  Although this doesn’t officially begin until the summer I’m designing the data structure now (as time allows) so that when the project does start the RAs will have a system to work with sooner rather than later.  I mapped out all of the fields in the various sample datasets in order to create a set of ‘core’ fields, mapping the fields from the various locations onto these ‘core’ fields.  I also designed a system for storing additional fields that may only be found at one or two locations and are not ‘core’ but still need to be recorded.  I then created the database schema needed to store the data in this format and wrote a document detailing all of this, which I sent to Katie Halsey and Matt Sangster for feedback.
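In broad terms the design is a set of tables for the ‘core’ fields plus a generic table for location-specific extras.  A very rough sketch of the idea (the table and column names here are placeholders, not the names used in the actual schema document):

<?php
// Rough sketch only: a 'core' record table plus a generic additional-fields
// table for data found at only one or two locations. All names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=borrowing', 'user', 'password');

$pdo->exec("CREATE TABLE borrowing_records (
    id INT AUTO_INCREMENT PRIMARY KEY,
    library VARCHAR(255),
    borrower VARCHAR(255),
    book_title VARCHAR(255),
    borrowed_date VARCHAR(50),
    returned_date VARCHAR(50)
)");

// Non-core fields are stored as name/value pairs attached to a core record.
$pdo->exec("CREATE TABLE additional_fields (
    id INT AUTO_INCREMENT PRIMARY KEY,
    record_id INT NOT NULL,
    field_name VARCHAR(255),
    field_value TEXT,
    FOREIGN KEY (record_id) REFERENCES borrowing_records(id)
)");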

Matt also sent me a new version of the Glasgow Student borrowings spreadsheet he had been working on, and I spent several hours on Friday getting this uploaded to the pilot online resource I’m working on.  I experimented with a new method of extracting the data from Excel to try and minimise the number of rows that were getting garbled due to Excel’s horrible attempts to save files as HTML.  As previously documented, the spreadsheet uses formatting in a number of columns (e.g. superscript, strikethrough).  This formatting is lost if the contents of the spreadsheet are extracted as plain text (so saving as a CSV, opening the file in Google Docs or simply copying the contents are all out).  The only way to extract the formatting in a usable form is to save the file as HTML in Excel and then work with that.  But the HTML produced by Excel is awful, with hundreds of tags and attributes scattered across the file in an inconsistent and seemingly arbitrary way.

For example, this is the HTML for one row:

<tr height=23 style='height:17.25pt'>
<td height=23 width=64 style='height:17.25pt;width:48pt'></td>
<td width=143 style='width:107pt'>Charles Wilson</td>
<td width=187 style='width:140pt'>Charles Wilson</td>
<td width=86 style='width:65pt'>Charles</td>
<td width=158 style='width:119pt'>Wilson<span style='mso-spacerun:yes'> </span></td>
<td width=88 style='width:66pt'>Nat. Phil.</td>
<td width=129 style='width:97pt'>Natural Philosophy</td>
<td width=64 style='width:48pt'>B</td>
<td class=xl69 width=81 style='width:61pt'>10</td>
<td class=xl70 width=81 style='width:61pt'>3</td>
<td width=250 style='width:188pt'>Wells Xenophon vol. 3<font class="font6"><sup>d</sup></font></td>
<td width=125 style='width:94pt'>Mr Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'>Adam Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=89 style='width:67pt'>22 Mar 1757</td>
<td width=89 style='width:67pt'>10 May 1757</td>
<td align=right width=56 style='width:42pt'>2</td>
<td width=64 style='width:48pt'>4r</td>
<td class=xl71 width=64 style='width:48pt'>1</td>
<td class=xl70 width=64 style='width:48pt'>007</td>
<td class=xl65 width=325 style='width:244pt'><a href="http://eleanor.lib.gla.ac.uk/record=b1592956" target="_parent">https://eleanor.lib.gla.ac.uk/record=b1592956</a></td>
<td width=293 style='width:220pt'>Xenophon.</td>
<td width=392 style='width:294pt'>Opera quae extant omnia; unà cum chronologiâa Xenophonteâ <span style='display:none'>cl. Dodwelli, et quatuor tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</span></td>
<td width=110 style='width:83pt'>Wells, Edward, 16<span style='display:none'>67-1727.</span></td>
<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>
<td align=right width=64 style='width:48pt'>1</td>
<td align=right width=121 style='width:91pt'>1</td>
<td width=64 style='width:48pt'>T111427</td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
</tr>

Previously I tried to fix this by running through several ‘find and replace’ passes to try and strip out all of the rubbish, while retaining what I needed, which was <tr>, <td> and some formatting tags such as <sup> for superscript.

This time I found a regular expression that removes all attributes from HTML tags, so for example <td width=64 style='width:48pt'> becomes <td> (see it here: https://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag).  I could then pass the resulting contents of every <td> through PHP’s strip_tags function to remove any remaining tags that were not required (e.g. <span>) while specifying the tags to retain (e.g. <sup>).
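In practice this only needs a couple of passes in PHP.  Something along these lines (the regex is based on the one given in the linked Stack Overflow answer; the file name and the list of tags to retain are assumptions for illustration):

<?php
// Sketch: strip attributes from all tags, then remove any remaining unwanted
// tags from each cell while keeping formatting tags such as <sup>.
$html = file_get_contents('borrowings.htm');

// Remove all attributes, so e.g. <td width=64 style='width:48pt'> becomes <td>.
$html = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si", '<$1$2>', $html);

// Pass the contents of every <td> through strip_tags, keeping only the
// formatting tags we want to retain.
$html = preg_replace_callback('/<td>(.*?)<\/td>/si', function ($m) {
    return '<td>' . strip_tags($m[1], '<sup><sub><strike>') . '</td>';
}, $html);

file_put_contents('borrowings-clean.htm', $html);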

 

This approach seemed to work very well until I analysed the resulting rows and realised that the columns of many rows were out of synchronisation, meaning any attempt at programmatically extracting the data and inserting it into the correct fields in the database would fail.  After some further research I realised that Excel’s ‘save as HTML’ feature was to blame yet again.  For no clear reason, Excel sometimes expands a cell into the next cell or cells if those cells are empty.  An example of this can be found above, and I’ve extracted it here:

<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>

The ‘colspan’ attribute means that the cell will stretch over multiple columns, in this case 2 columns, but elsewhere in the output file it was 3 and sometimes 4 columns.  Where this happens the following cells simply don’t appear in the HTML.  As my regular expression removed all attributes this ‘colspan’ was lost and the row ended up with subsequent cells in the wrong place.

Once I’d identified this I could update my script to check for the existence of ‘colspan’ before removing attributes, adding in the required additional empty cells as needed (so in the above case an extra <td></td>).
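The fix boils down to looking for a ‘colspan’ before the attributes are stripped and padding the row out with the missing empty cells.  Roughly (again, just an illustrative sketch that would run before the attribute-stripping pass above):

<?php
// Sketch: handle colspan before attributes are removed, adding the empty
// cells that Excel swallowed so the columns stay aligned.
$html = file_get_contents('borrowings.htm');

$html = preg_replace_callback('/<td([^>]*)>(.*?)<\/td>/si', function ($m) {
    $cell = '<td' . $m[1] . '>' . $m[2] . '</td>';
    if (preg_match('/colspan=["\']?(\d+)/i', $m[1], $span)) {
        // A colspan=2 cell gets one extra empty <td>, colspan=3 gets two, and so on.
        $cell .= str_repeat('<td></td>', (int)$span[1] - 1);
    }
    return $cell;
}, $html);

// The attribute-stripping and strip_tags passes shown earlier then run on $html.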

With all of this in place the resulting HTML was much cleaner.  Here is the above row after my script had finished:

<tr>
<td></td>
<td>Charles Wilson</td>
<td>Charles Wilson</td>
<td>Charles</td>
<td>Wilson</td>
<td>Nat. Phil.</td>
<td>Natural Philosophy</td>
<td>B</td>
<td>10</td>
<td>3</td>
<td>Wells Xenophon vol. 3<sup>d</sup></td>
<td>Mr Smith</td>
<td></td>
<td></td>
<td>Adam Smith</td>
<td></td>
<td></td>
<td>22 Mar 1757</td>
<td>10 May 1757</td>
<td>2</td>
<td>4r</td>
<td>1</td>
<td>007</td>
<td>https://eleanor.lib.gla.ac.uk/record=b1592956</td>
<td>Xenophon.</td>
<td>Opera quae extant omnia; unà cum chronologiâa Xenophonteâ cl. Dodwelli, et quatuor tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</td>
<td>Wells, Edward, 1667-1727.</td>
<td>Sp Coll Bi2-g.19-23</td>
<td></td>
<td>1</td>
<td>1</td>
<td>T111427</td>
<td></td>
<td></td>
<td></td>
</tr>

I then updated my import script to pull in the new fields (e.g. normalised class and professor names), set it up so that it would not import any rows that had ‘Yes’ in the first column, and updated the database structure to accommodate the new fields too.  The upload process then ran pretty smoothly and there are now 8,145 records in the system.  After that I ran the further scripts that generate dates, students, professors, authors, book names, book titles and classes, and updated the front-end as previously discussed.  I still have the old data stored in separate database tables, just in case we need it, but I’ve tested the front-end and it all seems to be working fine.
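For reference, the row-skipping part of the import is straightforward once the HTML has been cleaned up.  A simplified sketch (the table, field names and column positions are placeholders, and unlike the real script this version discards the formatting tags by just taking the text content of each cell):

<?php
// Simplified sketch of the import: read the cleaned HTML, skip rows with
// 'Yes' in the first column and insert the rest. All names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=borrowing', 'user', 'password');
$insert = $pdo->prepare('INSERT INTO records (borrower, class, book_name) VALUES (?, ?, ?)');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the cleaned file is still not valid HTML
$doc->loadHTMLFile('borrowings-clean.htm');

foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = [];
    foreach ($row->getElementsByTagName('td') as $cell) {
        $cells[] = trim($cell->textContent);
    }
    // Rows marked 'Yes' in the first column are not to be imported.
    if (isset($cells[0]) && $cells[0] === 'Yes') {
        continue;
    }
    // The column positions used here are purely illustrative.
    $insert->execute([$cells[1] ?? '', $cells[6] ?? '', $cells[10] ?? '']);
}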