Week Beginning 17th February 2020

This was a three-day week for me as I was participating in the UCU strike action on Thursday and Friday.  I spent the majority of Monday to Wednesday tying up loose ends and finishing off items on my ‘to do’ list.  The biggest task I tackled was to relaunch the Digital Humanities at Glasgow site.  This involved removing all of the existing site and moving the new site from its test location to the main URL.  I had to write a little script that changed all of the image URLs so they would work in the new location, and I needed to update WordPress so it knew where to find all of the required files.  Most of the migration process went pretty smoothly, but there were some slightly tricky things, such as ensuring the banner slideshow continued to work.  I also needed to tweak some of the static pages (e.g. the ‘Team’ and ‘About’ pages), added in a ‘contact us’ form and put in redirects from all of the old pages so that any bookmarks or Google links will continue to work.  As yet I’ve had no feedback from Marc or Lorna about the redesign of the site, so I can only assume that they are happy with how things are looking.  The final task in setting up the new site was to migrate my blog over from its old standalone URL to become integrated in the new DH site.  I exported all of my blog posts and categories and imported them into the new site using WordPress’s easy-to-use tools, and that all went very smoothly.  The only thing that didn’t transfer over using this method was the media: all of the images embedded in my blog posts still pointed to image files located on the old server, so I had to manually copy these images over and then write a script that went through every blog post and found and replaced all image URLs.  All now appears to be working, and as this is the first blog post I’m making using the new site I’ll know for definite once I add it.
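
Both of the little migration scripts mentioned above boil down to the same sort of find-and-replace over the post content.  Here’s a minimal sketch only: the URLs, database name and credentials are placeholders, with wp_posts being WordPress’s standard post table.

<?php
// Minimal sketch of the image-URL rewrite run during the migration; the old
// and new upload paths and the connection details are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=dh_glasgow;charset=utf8mb4', 'user', 'pass');
$old = 'http://old-blog.example.ac.uk/wp-content/uploads/';
$new = 'https://new-site.example.ac.uk/wp-content/uploads/';

// Swap the old image path for the new one in every post body.
$stmt = $pdo->prepare('UPDATE wp_posts SET post_content = REPLACE(post_content, ?, ?)');
$stmt->execute([$old, $new]);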

Other than setting up this new resource I made a further tweak to the new data for Matthew Sangster’s 18th century student borrowing records that I was working on last week.  I had excluded rows from his spreadsheet that had ‘Yes’ in the ‘omit’ column, assuming that these were not to be processed by my upload script, but it turned out Matt wanted these to be in the system and displayed when browsing pages, just omitted from any searches.  I therefore updated the online database to include a new ‘omit’ column, updated my upload script so that these previously excluded rows are processed too, and then changed the search facilities to ignore any rows that have a ‘Y’ in the ‘omit’ column.
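
In practical terms the search-side change is just an extra condition in the search queries, while the browse queries are left alone.  A minimal sketch, with hypothetical table and column names rather than the real ones:

<?php
// Hypothetical names throughout: a 'borrowings' table with an 'omit' flag.
$pdo = new PDO('mysql:host=localhost;dbname=borrowings_pilot;charset=utf8', 'user', 'pass');
$pageId = 1;
$term = 'Wilson';

// Browse: return every record on a page, including the omitted ones.
$browse = $pdo->prepare('SELECT * FROM borrowings WHERE page_id = ? ORDER BY line_no');
$browse->execute([$pageId]);

// Search: ignore any row flagged with omit = "Y".
$search = $pdo->prepare(
    'SELECT * FROM borrowings
     WHERE (omit IS NULL OR omit != "Y") AND student_name LIKE ?'
);
$search->execute(['%' . $term . '%']);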

I also responded to a query from Alasdair Whyte regarding parish boundaries for his Place-names of Mull and Ulva project, and investigated an issue that Marc Alexander was experiencing with one of the statistics scripts for the HT / OED matching process (it turned out to be an issue with a bad internet connection rather than an issue with the script).  I’d had a request from Paul Malgrati in Scottish Literature about creating an online resource that maps Burns Suppers so I wrote a detailed reply discussing the various options.

I also spent some time fixing the issue of citation links not displaying exactly as the DSL people wanted on the DSL website.  This was surprisingly difficult to implement because the structure can vary quite a bit.  The ID needed for the link is associated with the ‘cref’ element that wraps the whole reference, but the link can’t be applied to the full contents of that element as only authors and titles should be links, not geo and date tags or other non-tagged text that appears in it.  There may be multiple authors or no author at all, so sometimes the link needs to start before the first (and only the first) author, whereas other times it needs to start before the title.  As there is often text after the last element that needs to be linked, the closing tag of the link can’t simply be appended to the end of the text; instead the script needs to find where this last linkable element ends.  However, I’ve figured out an approach that appears to work.
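
For illustration only, here is one way of expressing that logic using PHP’s DOM extension.  This isn’t the production DSL code: the element names (cref, author, title), the sample markup and the link URL pattern are all simplified assumptions.

<?php
// Sketch of the citation-linking logic, not the production DSL code.  The
// element names, sample markup and link URL pattern are simplified assumptions.
$xml = '<cit><cref id="ref1"><author>Barbour</author> <title>Bruce</title> <date>1375</date> ix 22.</cref></cit>';
$doc = new DOMDocument();
$doc->loadXML($xml);

foreach ($doc->getElementsByTagName('cref') as $cref) {
    // Note the positions of the linkable children (authors and titles only).
    $children = iterator_to_array($cref->childNodes);
    $linkable = [];
    foreach ($children as $i => $node) {
        if ($node instanceof DOMElement && in_array($node->nodeName, ['author', 'title'])) {
            $linkable[] = $i;
        }
    }
    if (empty($linkable)) {
        continue;
    }
    $first = $linkable[0];
    $last  = end($linkable);

    // Wrap everything from the first linkable element up to and including the
    // last one in a single link; dates and any trailing text stay outside it.
    $a = $doc->createElement('a');
    $a->setAttribute('href', '/citation/' . $cref->getAttribute('id')); // hypothetical URL
    $cref->insertBefore($a, $children[$first]);
    for ($i = $first; $i <= $last; $i++) {
        $a->appendChild($children[$i]);
    }
}
echo $doc->saveXML();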

I devoted a few spare hours on Wednesday to investigating the Anglo-Norman Dictionary.  Heather was trying to figure out where on the server the dataset that the website uses is located, and after reading through the documentation I worked out that the data is stored in a Berkeley DB, with the data the system uses appearing to be held in a file called ‘entry_hash’.  There is a file with this name in ‘/var/data’ and it’s just over 200MB in size, which suggests it contains a lot of data.  Software that can read this file can be downloaded from here: https://www.oracle.com/database/technologies/related/berkeleydb-downloads.html and the Java edition should work on a Windows PC if it has Java installed.  Unfortunately you have to register to access the files and I haven’t done so yet, but I’ve let Heather know about this.

I then experimented with setting up a simple AND website using the data from ‘all.xml’, which (more or less) contains all of the data for the dictionary.  My test website consists of a database, one script that goes through the XML file and inserts each entry into the database, and one script that displays a page allowing you to browse and view all entries.  My AND test is very simple and purely for test purposes – it replicates the ‘browse’ list from the live site (which can be viewed here: http://www.anglo-norman.net/gate/), only it currently displays in its entirety, which makes the page a little slow to load.  Cross references are in yellow and main entries in white; click on a cross reference and it loads the corresponding main entry, while clicking on a regular headword loads its entry.  Currently only some of the entry page is formatted, and some elements don’t always display correctly (e.g. the position of some notes).  The full XML is displayed below the formatted text.
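
The import side of this test is little more than a loop over the entries in ‘all.xml’.  Here’s a rough sketch of the sort of script involved, with the proviso that the element names (‘entry’, ‘lemma’, ‘xref’) and the table structure are stand-ins rather than the real AND markup:

<?php
// Rough sketch of the test importer; 'entry', 'lemma' and 'xref' are stand-ins
// for the real AND element names, and the table structure is hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=and_test;charset=utf8', 'user', 'pass');
$insert = $pdo->prepare('INSERT INTO entries (headword, is_xref, xml) VALUES (?, ?, ?)');

// XMLReader keeps memory use low on a large file such as all.xml.
$reader = new XMLReader();
$reader->open('all.xml');
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'entry') {
        $xml   = $reader->readOuterXml();
        $entry = simplexml_load_string($xml);
        $headword = (string)$entry->lemma;        // assumed element name
        $isXref   = isset($entry->xref) ? 1 : 0;  // assumed cross-reference marker
        $insert->execute([$headword, $isXref, $xml]);
    }
}
$reader->close();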

Clearly there would still be a lot to do to develop a fully functioning replacement for AND, but it wouldn’t take a huge amount of time (I’d say it could be measured in days not weeks).  It just depends whether the AND people want to replace their old system.

I’ll be on strike again next week from Monday to Wednesday.

Week Beginning 10th February 2020

I met with Fraser Dallachy on Monday to discuss his ongoing pilot Scots Thesaurus project.  It’s been a while since I’ve been asked to do anything for this project and it was good to meet with Fraser and talk through some of the new automated processes he wanted me to try out.  One thing he wanted to try was tagging the DSL dictionary definitions for part of speech to see if we could then automatically pick out word forms that we could query against the Historical Thesaurus to try and place the headword within a category.  I adapted a previous script I’d created that picked out random DSL entries.  This script targeted main entries (i.e. not supplements) that were nouns, were monosemous (i.e. had only one sense), had fewer than five variant spellings, single-word headwords and ‘short’ definitions, with the option to specify what is meant by ‘short’ in terms of the number of characters.  I updated the script to bring back all DOST entries that met these criteria and had definitions that were less than 1,000 characters in length, which resulted in just under 18,000 rows being returned (but I will rerun the script with a smaller character count if Fraser wants to focus on shorter entries).  The script also stripped out all citations and tags from the definition to prepare it for POS tagging.  With this dataset exported as a CSV I then began experimenting with a POS tagger.  I decided to use the Stanford POS Tagger (https://nlp.stanford.edu/software/tagger.html), which can be run at the command line, and I created a PHP script that went through each row of the CSV, passed the prepared definition text to the tagger, pulled in the output and stored it in a database (a sketch of this loop appears below).  I left the process running overnight and it had completed by the following morning.  I then outputted the rows as a spreadsheet and sent them on to Fraser for feedback.  Fraser also wanted to see about using the data from the Scots School Dictionary so I sent that on to him too.
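
Here’s a rough sketch of the tagging loop.  The jar and model paths are placeholders, the CSV is assumed to hold an entry ID and the prepared definition, and the exact command line depends on the tagger version installed:

<?php
// Sketch of the tagging loop: run the Stanford tagger over each prepared
// definition and store the output.  The jar/model paths, CSV layout and table
// names are placeholders rather than the real setup.
$pdo = new PDO('mysql:host=localhost;dbname=scots_thesaurus;charset=utf8', 'user', 'pass');
$store = $pdo->prepare('INSERT INTO tagged_definitions (entry_id, tagged) VALUES (?, ?)');

$handle = fopen('dost_definitions.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    list($entryId, $definition) = $row;

    // Write the prepared definition to a temporary file and run the tagger on it.
    $tmp = tempnam(sys_get_temp_dir(), 'def');
    file_put_contents($tmp, $definition);
    $cmd = 'java -mx300m -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger'
         . ' -model english-left3words-distsim.tagger -textFile ' . escapeshellarg($tmp);
    $tagged = shell_exec($cmd);
    unlink($tmp);

    $store->execute([$entryId, trim((string)$tagged)]);
}
fclose($handle);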

I also did a little bit of work for the DSL, investigating why some geographically tagged information was not being displayed in the citations, and replied to a few emails from Heather Pagan of the Anglo-Norman Dictionary as she began to look into uploading new data to their existing and no longer supported dictionary management system.  I also gave some feedback on a proposal written by Rachel Douglas, a lecturer in French.  Although this is not within Critical Studies and should be something Luca Guariento looks at, he is currently on holiday so I offered to help out.  I also set up an initial WordPress site for Matthew Creasey’s new project.  This still needs some further work, but I’ll need further information from Matthew before I can proceed.  On Wednesday I met with Jennifer Smith and E Jamieson to discuss a possible follow-on project for the Scots Syntax Atlas.  We talked through some of the possibilities and I think the project has huge potential.  I’ll be helping to write the Data Management Plan and other such technical things for the proposal in due course.

I met with Marc and Fraser on Friday to discuss our plans for updating the way dates are stored in the Historical Thesaurus, which will make it much easier to associate labels with specific dates and to update the dates in future as we align the data with revisions from the OED.  I’d previously written a script that generated the new dates and from these generated a new ‘full date’ field which I then matched against the original ‘full date’ to spot errors.  The script identified 1,116 errors, but this week I updated my script to change the way it handled ‘b’ dates.  These are the dates that appear after a slash: where the date after the slash is in the same decade as the main date only one digit should be displayed (e.g. 1975/6), but this has not been done consistently, with dates sometimes appearing as 1975/76.  Where this happened my script was noting the row as an error, but Marc wanted these to be ignored.  I updated my script to take this into consideration (essentially normalising such dates before comparing, as sketched below), and this has greatly reduced the number of rows that will need to be manually checked, reducing the output to just 284 rows.
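
The normalisation itself amounts to a single regular expression.  A sketch, assuming the dates are being compared as plain strings:

<?php
// Normalise 'b' dates such as 1975/76 to 1975/6 before comparing the original
// and regenerated full date strings, so this inconsistency alone no longer
// counts as a mismatch.  A sketch only; the real fields hold more than dates.
function normaliseBDates(string $fullDate): string {
    // Where the part after the slash repeats the decade of the main date,
    // keep only the final digit (e.g. 1430/31 becomes 1430/1).
    return preg_replace_callback('/(\d{4})\/(\d{2})\b/', function ($m) {
        return (substr($m[1], 2, 1) === substr($m[2], 0, 1))
            ? $m[1] . '/' . substr($m[2], 1)
            : $m[0];
    }, $fullDate);
}

// e.g. both of these come out as '1430/1–1630'
echo normaliseBDates('1430/31–1630'), "\n";
echo normaliseBDates('1430/1–1630'), "\n";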

I spent the rest of my time this week working on the Books and Borrowing project.  Although this doesn’t officially begin until the summer I’m designing the data structure at the moment (as time allows) so that when the project does start the RAs will have a system to work with sooner rather than later.  I mapped out all of the fields in the various sample datasets in order to create a set of ‘core’ fields, mapping the fields from the various locations to these ‘core’ fields.  I also designed a system for storing additional fields that may only be found at one or two locations, which are not ‘core’ but still need to be recorded.  I then created the database schema needed to store the data in this format and wrote a document detailing all of this, which I sent to Katie Halsey and Matt Sangster for feedback.
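
To give a flavour of the ‘core fields plus per-location additional fields’ pattern described above, here is a sketch (every table and column name below is hypothetical rather than the project’s actual schema):

<?php
// Hypothetical sketch of the pattern: a table of core borrowing records that
// every library shares, plus a key/value table for fields that only exist at
// one or two locations.  None of these names come from the real schema.
$pdo = new PDO('mysql:host=localhost;dbname=books_borrowing;charset=utf8', 'user', 'pass');

$core = <<<SQL
CREATE TABLE borrowing_records (
    id INT AUTO_INCREMENT PRIMARY KEY,
    library_id INT NOT NULL,
    borrower_id INT,
    book_id INT,
    lent_date DATE,
    returned_date DATE
)
SQL;
$pdo->exec($core);

$additional = <<<SQL
CREATE TABLE additional_fields (
    record_id INT NOT NULL,
    field_name VARCHAR(255) NOT NULL,
    field_value TEXT,
    PRIMARY KEY (record_id, field_name)
)
SQL;
$pdo->exec($additional);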

Matt also sent me a new version of the Glasgow Student borrowings spreadsheet he had been working on, and I spent several hours on Friday getting this uploaded to the pilot online resource.  I experimented with a new method of extracting the data from Excel to try and minimise the number of rows that were getting garbled due to Excel’s horrible attempts to save files as HTML.  As previously documented, the spreadsheet uses formatting in a number of columns (e.g. superscript, strikethrough).  This formatting is lost if the contents of the spreadsheet are copied in a plain text way (so no saving as a CSV, or opening the file in Google Docs or just copying the contents).  The only way to extract the formatting in a way that can be used is to save the file as HTML in Excel and then work with that.  But the resulting HTML produced by Excel is awful, with hundreds of tags and attributes scattered across the file in an inconsistent and seemingly arbitrary way.

For example, this is the HTML for one row:

<tr height=23 style='height:17.25pt'>
<td height=23 width=64 style='height:17.25pt;width:48pt'></td>
<td width=143 style='width:107pt'>Charles Wilson</td>
<td width=187 style='width:140pt'>Charles Wilson</td>
<td width=86 style='width:65pt'>Charles</td>
<td width=158 style='width:119pt'>Wilson<span style='mso-spacerun:yes'> </span></td>
<td width=88 style='width:66pt'>Nat. Phil.</td>
<td width=129 style='width:97pt'>Natural Philosophy</td>
<td width=64 style='width:48pt'>B</td>
<td class=xl69 width=81 style='width:61pt'>10</td>
<td class=xl70 width=81 style='width:61pt'>3</td>
<td width=250 style='width:188pt'>Wells Xenophon vol. 3<font class="font6"><sup>d</sup></font></td>
<td width=125 style='width:94pt'>Mr Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'>Adam Smith</td>
<td width=124 style='width:93pt'></td>
<td width=124 style='width:93pt'></td>
<td width=89 style='width:67pt'>22 Mar 1757</td>
<td width=89 style='width:67pt'>10 May 1757</td>
<td align=right width=56 style='width:42pt'>2</td>
<td width=64 style='width:48pt'>4r</td>
<td class=xl71 width=64 style='width:48pt'>1</td>
<td class=xl70 width=64 style='width:48pt'>007</td>
<td class=xl65 width=325 style='width:244pt'><a href="http://eleanor.lib.gla.ac.uk/record=b1592956" target="_parent">https://eleanor.lib.gla.ac.uk/record=b1592956</a></td>
<td width=293 style='width:220pt'>Xenophon.</td>
<td width=392 style='width:294pt'>Opera quae extant omnia; unà cum chronologiâa Xenophonteâ <span style='display:none'>cl. Dodwelli, et quatuor tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</span></td>
<td width=110 style='width:83pt'>Wells, Edward, 16<span style='display:none'>67-1727.</span></td>
<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>
<td align=right width=64 style='width:48pt'>1</td>
<td align=right width=121 style='width:91pt'>1</td>
<td width=64 style='width:48pt'>T111427</td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
<td width=64 style='width:48pt'></td>
</tr>

Previously I tried to fix this by running through several ‘find and replace’ passes to try and strip out all of the rubbish, while retaining what I needed, which was <tr>, <td> and some formatting tags such as <sup> for superscript.

This time I found a regular expression that removes all attributes from HTML tags, so for example <td width=64 style='width:48pt'> becomes <td> (see it here: https://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag).  I could then pass the resulting contents of every <td> through PHP’s strip_tags function to remove any remaining tags that were not required (e.g. <span>) while specifying the tags to retain (e.g. <sup>).
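
Putting those two steps together, the cleaning function looks roughly like this (a sketch rather than the exact production code; the list of tags to keep depends on the formatting used in the spreadsheet):

<?php
// Roughly the cleaning step described above: strip every attribute from every
// tag, then remove all remaining tags apart from the structural and formatting
// ones that are needed.
function cleanRowHtml(string $html): string {
    // Turn e.g. <td width=64 style='width:48pt'> into <td>; the pattern is the
    // one from the Stack Overflow answer linked above.  The /s flag matters
    // because Excel sometimes splits a tag across several lines.
    $html = preg_replace('/<\s*([a-z][a-z0-9]*)\s.*?>/si', '<$1>', $html);
    // Keep only the tags required; everything else (span, font, a, ...) goes.
    return strip_tags($html, '<tr><td><sup><strike>');
}

echo cleanRowHtml("<td width=250 style='width:188pt'>Wells Xenophon vol. 3<font class=\"font6\"><sup>d</sup></font></td>");
// Outputs: <td>Wells Xenophon vol. 3<sup>d</sup></td>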


This approach seemed to work very well until I analysed the resulting rows and realised that the columns of many rows were all out of synchronisation, meaning any attempt at programmatically extracting the data and inserting it into the correct field in the database would fail.  After some further research I realised that Excel’s save as HTML feature was to blame yet again.  Without there being any clear reason, Excel sometimes expands a cell into the next cell or cells if these cells are empty.  An example of this can be found above and I’ve extracted it here:

<td colspan=2 width=174 style='mso-ignore:colspan;width:131pt'>Sp Coll Bi2-g.19-23</td>

The ‘colspan’ attribute means that the cell will stretch over multiple columns, in this case 2 columns, but elsewhere in the output file it was 3 and sometimes 4 columns.  Where this happens the following cells simply don’t appear in the HTML.  As my regular expression removed all attributes this ‘colspan’ was lost and the row ended up with subsequent cells in the wrong place.

Once I’d identified this I could update my script to check for the existence of ‘colspan’ before removing attributes, adding in the required additional empty cells as needed (so in the above case an extra <td></td>).
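
A sketch of that check, run before the attribute-stripping pass so the colspan value is still available:

<?php
// Expand any cell carrying a colspan into the right number of cells before the
// attributes are stripped, so the columns stay aligned.  A sketch of the check
// described above rather than the exact production code.
function expandColspans(string $rowHtml): string {
    return preg_replace_callback('/<td([^>]*)>(.*?)<\/td>/si', function ($m) {
        $extra = '';
        if (preg_match('/colspan=["\']?(\d+)/i', $m[1], $span)) {
            // A colspan of n means n-1 following cells are missing from the HTML.
            $extra = str_repeat('<td></td>', (int)$span[1] - 1);
        }
        return '<td' . $m[1] . '>' . $m[2] . '</td>' . $extra;
    }, $rowHtml);
}

// e.g. the <td colspan=2 ...>Sp Coll Bi2-g.19-23</td> cell above gains one
// empty <td></td> after it, keeping the row at the full number of columns.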

With all of this in place the resulting HTML was much cleaner.  Here is the above row after my script had finished:

<tr>
<td></td>
<td>Charles Wilson</td>
<td>Charles Wilson</td>
<td>Charles</td>
<td>Wilson</td>
<td>Nat. Phil.</td>
<td>Natural Philosophy</td>
<td>B</td>
<td>10</td>
<td>3</td>
<td>Wells Xenophon vol. 3<sup>d</sup></td>
<td>Mr Smith</td>
<td></td>
<td></td>
<td>Adam Smith</td>
<td></td>
<td></td>
<td>22 Mar 1757</td>
<td>10 May 1757</td>
<td>2</td>
<td>4r</td>
<td>1</td>
<td>007</td>
<td>https://eleanor.lib.gla.ac.uk/record=b1592956</td>
<td>Xenophon.</td>
<td>Opera quae extant omnia; unà cum chronologiâa Xenophonteâ cl. Dodwelli, et quatuor tabulis geographicis. [Edidit Eduardus Wells] / [Xenophon].</td>
<td>Wells, Edward, 16<span>67-1727.</td>
<td>Sp Coll Bi2-g.19-23</td><td></td>
<td>1</td>
<td>1</td>
<td>T111427</td>
<td></td>
<td></td>
<td></td></tr>

I then updated my import script to pull in the new fields (e.g. normalised class and professor names), set it up so that it would not import any rows that had ‘Yes’ in the first column, and updated the database structure to accommodate the new fields too.  The upload process then ran pretty smoothly and there are now 8,145 records in the system.  After that I ran the further scripts to generate dates, students, professors, authors, book names, book titles and classes and updated the front-end as previously discussed.  I still have the old data stored in separate database tables as well, just in case we need it, but I’ve tested out the front-end and it all seems to be working fine.


Week Beginning 3rd February 2020

I worked on several different projects this week.  One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus.  Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels.  This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates.  I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database.  Where this happens I added the text ‘full date mismatch’ with a red background at the end of the date’s section in my script.

Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field.  Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
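
The checking script itself is simple; the work is all in the date-building logic.  A sketch of the pass, where generateFullDate() is a stand-in for that logic and the table and column names only approximate the real HT database:

<?php
// Sketch of the checking pass: regenerate the full date for every lexeme and
// log any row where it differs from the stored value.  Table and column names
// are approximations rather than the live HT schema.
$pdo = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');
$log = $pdo->prepare('INSERT INTO fulldate_mismatches (htid, original, generated) VALUES (?, ?, ?)');

// Stand-in for the real date-building logic, which rebuilds the display string
// from the individual date, connector and label columns.
function generateFullDate(array $lexeme): string {
    return '';
}

$lexemes = $pdo->query('SELECT * FROM lexemes'); // includes the individual date, connector and label columns
foreach ($lexemes as $lexeme) {
    $generated = generateFullDate($lexeme);
    if ($generated !== $lexeme['fulldate']) {
        $log->execute([$lexeme['htid'], $lexeme['fulldate'], $generated]);
    }
}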

The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say.  Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question.  I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration  (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels).  The remaining rows can mostly be split into the following groups:

  1. Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
  2. Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number; another lexeme has ‘1340c’ instead of ‘c1340’.
  3. Corrections made to the original fulldate that were not replicated in the actual date columns, e.g. ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-’ in the ‘con’ column.
  4. Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
  5. Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E.g. a lexeme with the fulldate ‘1865 + 1865 rare’.
  6. Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’.  This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date).  I tried fixing this but ended up breaking other things so this is something that will need manual intervention.  I don’t think it occurs very often, though.  It’s a shame the same symbol was used to mean two different things.

It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes.  Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields).  Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates.  After that we’ll be ready to generate the new date fields for real.

I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project, compiling a list of all of the fields found in each dataset.  This is a first step towards identifying a core set of fields and mapping the analogous fields across the different datasets.  I also included the GU students and professors from Matthew’s pilot project, but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing.  With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.

I also continued to work on the Place-Names of Mull and Ulva project.  I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text.  I also began to work on the project’s API and front end.  By the end of the week I managed to get an ‘in development’ version of the quick search working.  Markers appear with labels and popups and you can change base map or marker type.  Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates).  The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.

A couple of weeks ago E Jamieson, the RA on the SCOSYA project, noticed that the statistics pane in the Linguists’ Atlas didn’t scroll, meaning that if there was a lot of data in the pane it was getting cut off at the bottom.  I found a bit of time to investigate this and updated the JavaScript so that the height of the pane is checked and updated when necessary whenever the pane is opened.

Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network resource and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the SCOSYA video page.

Week Beginning 27th January 2020

I divided most of my time between three projects this week.  For the Place-Names of Mull and Ulva my time was spent working with the GB1900 dataset.  On Friday last week I’d created a script that would go through the entire 2.5 million row CSV file and extract each entry, adding it to a database for easier querying.  This process had finished on Monday, but unfortunately things had gone wrong during the processing.  I was using the PHP function ‘fgetcsv’ to extract the data a line at a time.  This splits the CSV up based on a delimiting character (in this case a comma) and adds each part into an array, thus allowing the data to be inserted into my database.  Unfortunately some of the data contained commas.  In such cases the data was enclosed in double quotes, which is the standard way of handling such things, and I had thought the PHP function would automatically handle this, but alas it didn’t, meaning whenever a comma appeared in the data the row was split up into incorrect chunks and the data was inserted incorrectly into the database.  After realising this I added another option to the ‘fgetcsv’ command to specify a character to be identified as the ‘enclosure’ character and set the script off running again.  It had completed the insertion by Wednesday morning, but when I came to query the database again I realised that the process had still gone wrong.  Further investigation revealed the cause to be the GB1900 CSV file itself, which was encoded with UCS-2 character encoding rather than the more usual UTF-8.  I’m not sure why the data was encoded in this way, as it’s not a current standard and it results in a much larger file size than UTF-8.  It also meant that my script was not properly identifying the double quote characters, which is why it failed a second time.  However, after identifying this issue I converted the CSV to UTF-8, picked out a section with commas in the data, tested my script, discovered things were working this time, and let the script loose on the full dataset yet again.  Thankfully it proved to be ‘third time lucky’ and all 2.5 million rows had been successfully inserted by Friday morning.
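
Here’s a sketch of what the import loop boils down to after the two fixes, with the file converted to UTF-8 up front and fgetcsv told explicitly about the delimiter and enclosure.  The file, table and column names are examples rather than the real ones, and the one-off conversion could just as easily be done outside PHP:

<?php
// Sketch of the GB1900 import after both fixes.  File, table and column names
// are examples; the real CSV has more fields than shown here.
$pdo = new PDO('mysql:host=localhost;dbname=gb1900;charset=utf8', 'user', 'pass');
$insert = $pdo->prepare('INSERT INTO gb1900_entries (pin_id, label, parish, lat, lng) VALUES (?, ?, ?, ?, ?)');

// One-off conversion from UCS-2 to UTF-8 (UCS-2 is close enough to UTF-16 for
// mb_convert_encoding); with a file this size it may be better done with a
// command-line tool such as iconv.
file_put_contents('gb1900_utf8.csv', mb_convert_encoding(file_get_contents('gb1900.csv'), 'UTF-8', 'UTF-16LE'));

$handle = fopen('gb1900_utf8.csv', 'r');
fgetcsv($handle); // skip the header row
while (($row = fgetcsv($handle, 0, ',', '"')) !== false) {
    // The third and fourth arguments make the delimiter and enclosure explicit.
    $insert->execute([$row[0], $row[1], $row[2], $row[3], $row[4]]);
}
fclose($handle);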

After that I was then able to extract all of the place-names for the three parishes we’re interested in, which came to a rather more modest 3,908 rows.  I then wrote another script that would take this data and insert it into the project’s place-name table.  The place-names are a mixture of Gaelic and English (e.g. ‘A’ Mhaol Mhòr’ is pretty clearly Gaelic while ‘Argyll Terrace’ is not) and for now I set the script to just add all place-names to the ‘English’ rather than the ‘Gaelic’ field.  The script also inserts the latitude and longitude values from the GB1900 data, and associates the appropriate parish.  I also found a bit of code that takes latitude and longitude figures and generates a 6-figure OS grid reference from them.  I tested this out and it seemed pretty accurate, so I added this to my script too, meaning all names also have the grid reference field populated.

The other thing I tried to do was to grab the altitude for each name via the Google Maps service.  This proved to be a little tricky as the service blocks you if you make too many requests all at once.  Our server was also blacklisting my computer for making too many requests in a short space of time, meaning for a while afterwards I was unable to access any page on the site or the database.  Thankfully Arts IT Support managed to stop me getting blocked and I set the script to query Google Maps at a rate that was acceptable to it, so I was able to grab the altitudes for all 3,908 place-names (although 16 of them are at 0m, so it may look as though the lookup failed for these); a rough sketch of the lookup appears below.  I also added in a facility to upload, edit and delete one or more sound files for each place-name, together with optional captions for them in English and Gaelic.  Sound files must be in the MP3 format.
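
The sketch shows the general shape of that altitude lookup, using the Google Maps Elevation API with a pause between requests; the API key, table and column names are placeholders:

<?php
// Sketch of the altitude lookup: one throttled request per place-name against
// the Google Maps Elevation API.  The API key, table and column names are
// placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mull_placenames;charset=utf8', 'user', 'pass');
$update = $pdo->prepare('UPDATE placenames SET altitude = ? WHERE id = ?');

$rows = $pdo->query('SELECT id, latitude, longitude FROM placenames')->fetchAll();
foreach ($rows as $row) {
    $url = 'https://maps.googleapis.com/maps/api/elevation/json?locations='
         . $row['latitude'] . ',' . $row['longitude'] . '&key=YOUR_API_KEY';
    $response = json_decode(file_get_contents($url), true);
    if (!empty($response['results'][0])) {
        $update->execute([round($response['results'][0]['elevation']), $row['id']]);
    }
    sleep(1); // keep the request rate low enough that nothing starts blocking us
}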

The second project I worked on this week was my redevelopment of the ‘Digital Humanities at Glasgow’ site.  I have now finished going through the database of DH projects, trimming away the irrelevant or broken ones and creating new banners, icons, screenshots, keywords and descriptions for the rest.  There are now 75 projects listed, including 15 that are currently set as ‘Showcase’ projects, meaning they appear in the banner slideshow and on the ‘showcase’ page.  I also changed the site header font and fixed an issue with the banner slideshow and images getting too small on narrow screens.  I’ve asked Marc Alexander and Lorna Hughes to give me some feedback on the new site and I hope to be able to launch it in two weeks or so.

My third major project of the week was the Historical Thesaurus.  Marc, Fraser and I met last Friday to discuss a new way of storing dates that I’ve been wanting to implement for a while, and this week I began sorting this out.  I managed to create a script that can process any date, including associating labels with the appropriate date.  Currently the script allows you to specify a category (or to load a random category) and the dates for all lexemes therein are then processed and displayed on screen.  As yet nothing is inserted into the database.  I have also updated the structure of the (currently empty) dates table to remove the ‘date order’ field.  I have also changed all date fields to integers rather than varchars to ensure that ordering on these columns is handled correctly.  At last Friday’s meeting we discussed replacing ‘OE’ and ‘_’ with numerical values.  We had mentioned using ‘0000’ for OE, but I’ve realised this isn’t a good idea as ‘0’ can easily be confused with null.  Instead I’m using ‘1100’ for OE and ‘9999’ for ‘current’.  I’ve also updated the lexeme table to add in new fields for ‘firstdate’ and ‘lastdate’ that will be the cached values of the first and last dates stored in the new dates table.
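
The revised structure looks something like the sketch below.  This is illustrative only (the column names are approximations rather than the live HT schema), but it shows the 1100 / 9999 convention and the cached first and last dates:

<?php
// A sketch only: the shape of the new dates table and the cached first/last
// date fields, using 1100 to stand for OE and 9999 for 'current' as decided at
// the meeting.  Column names are approximations, not the live HT schema.
$pdo = new PDO('mysql:host=localhost;dbname=ht;charset=utf8', 'user', 'pass');

$dates = <<<SQL
CREATE TABLE lexeme_dates (
    id INT AUTO_INCREMENT PRIMARY KEY,
    htid INT NOT NULL,              -- the lexeme this date belongs to
    year INT NOT NULL,              -- 1100 = OE, 9999 = current
    year_b INT DEFAULT NULL,        -- the 'b' date after a slash, if any
    prefix VARCHAR(2) DEFAULT NULL, -- e.g. 'a' or 'c'
    label VARCHAR(255) DEFAULT NULL
)
SQL;
$pdo->exec($dates);

$cache = <<<SQL
ALTER TABLE lexemes
    ADD COLUMN firstdate INT DEFAULT NULL,
    ADD COLUMN lastdate INT DEFAULT NULL
SQL;
$pdo->exec($cache);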

The script displays each lexeme in a category with its ‘full date’ column.  It then displays what each individual entry in the new ‘date’ table would hold for the lexeme in boxes beneath this, and then finishes off with displaying what the new ‘firstdate’ and ‘lastdate’ fields would contain.  Processing all of the date variations turned out to be somewhat easier than it was for generating timeline visualisations, as the former can be treated as individual dates (an OE, a first, a mid, a last, a current) while the latter needed to transform the dates into ranges, meaning the script had to check how each individual date connected to the next, possibly had to use ‘b’ dates, and so on.

I’ve tested the script out and have so far only encountered one issue: there are 10 rows that have first, mid and last dates but where, instead of the ‘firmidcon’ field joining the first and mid dates together, the ‘firlastcon’ field is used.  The ‘midlascon’ field is then used to join the mid date to the last.  This is an error, as ‘firlastcon’ should not be used to join first and mid dates.  An example of this happening is htid 28903 in catid 8880, where the ‘full date’ is ‘1459–1642/7 + 1856’.  There may be other occasions where the wrong joining column has been used, but I haven’t checked for these so far.

After getting the script to sort out the dates I then began to look at labels.  I started off using the ‘label’ field in order to figure out where in the ‘full date’ the label appeared.  However, I noticed that where there are multiple labels these all appear joined together in the label field, meaning in such cases the contents of the label field will never match any text in the ‘full date’ field.  E.g. htid 6463 has the full date ‘1611 Dict. + 1808 poet. + 1826 Dict.’ and the label field is ‘Dict. poet. Dict.’, which is no help at all.

Instead I abandoned the ‘label’ field and just used the ‘full date’ field.  Actually, I still use the ‘label’ field to check whether the script needs to process labels or not.  Here’s a description of the logic for working out where a label should be added:

The dates are first split up into their individual boxes.  Then, if there is a label for the lexeme, I go through each date in turn.  I split the full date field and look at the part after the date.  I go through each character of this in turn.  If the character is a ‘+’ then I stop.  If I have yet to find label text (they all start with an a-z character) and the character is a ‘-’ and the following character is a number then I stop.  Otherwise if the character is a-z I note that I’ve started the label.  If I’ve started the label and the current character is a number then I stop.  Otherwise I add the current character to the label and proceed to the next character until all remaining characters are processed or a ‘stop’ criterion is reached.  After that, if there is any label text it’s added to the date.  This process seems to work.  I did, however, have to fix how labels applied to ‘current’ dates are processed.  For a current date my algorithm was adding the label to the final year and not the current date (stored as 9999), as the label is found after the final year and ‘9999’ isn’t found in the full date string.  I added a further check for ‘current’ dates after the initial label processing that moves labels from the penultimate date to the current date in such cases.
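
Here’s that character-by-character walk as a standalone sketch.  It is simplified: it works on plain strings, checks a plain hyphen where the real data uses an en dash (which would need multibyte-safe handling), and leaves out the OE and ‘current’ special cases:

<?php
// The label-extraction walk described above, as a simplified standalone sketch.
// The real script works against the database fields, handles the en dash used
// in the data (rather than the plain hyphen checked here) and deals with OE
// and 'current' dates separately.
function extractLabel(string $fullDate, string $date): string {
    $pos = strpos($fullDate, $date);
    if ($pos === false) {
        return '';
    }
    $rest    = substr($fullDate, $pos + strlen($date));
    $label   = '';
    $started = false;

    for ($i = 0; $i < strlen($rest); $i++) {
        $char = $rest[$i];
        if ($char === '+') {
            break; // the next date is starting
        }
        if (!$started && $char === '-' && isset($rest[$i + 1]) && ctype_digit($rest[$i + 1])) {
            break; // a date range rather than a label
        }
        if (ctype_alpha($char)) {
            $started = true;
        }
        if ($started && ctype_digit($char)) {
            break; // the label has ended and another date begins
        }
        if ($started) {
            $label .= $char;
        }
    }
    return trim($label);
}

echo extractLabel('1611 Dict. + 1808 poet. + 1826 Dict.', '1808'); // poet.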

In addition to these three big projects I also had an email conversation with Jane Roberts about some issues she’d been having with labels in the Thesaurus of Old English, I liaised with Arts IT Support to get some server space set up for Rachel Smith and Ewa Wanat’s project, I gave some feedback on a job description for an RA for the Books and Borrowing project, helped Carole Hough with an issue with a presentation of the Berwickshire Place-names resource, gave the PI response for Thomas’s Iona project a final once-over, gave a report on the cost of Mapbox to Jennifer Smith for the SCOSYA project and arranged to meet Matthew Creasey next week to discuss his Decadence and Translation project.