I worked on several different projects this week. One of the major tasks I tackled was to continue with the implementation of a new way of recording dates for the Historical Thesaurus. Last week I created a script that generated dates in the new format for a specified (or random) category, including handling labels. This week I figured that we would also need a method to update the fulldate field (i.e. the full date as a text string, complete with labels etc that is displayed on the website beside the word) based on any changes that are subsequently made to dates using the new system, so I updated the script to generate a new fulldate field using the values that have been created during the processing of the dates. I realised that if this newly generated fulldate field is not exactly the same as the original fulldate field then something has clearly gone wrong somewhere, either with my script or with the date information stored in the database. Where this happens I added the text ‘full date mismatch’ with a red background at the end of the date’s section in my script.
Following on from this I created a script that goes through every lexeme in the database, temporarily generates the new date information and from this generates a new fulldate field. Where this new fulldate field is not an exact match for the original fulldate field the lexeme is added to a table, which I then saved as a spreadsheet.
The spreadsheet contains 1,116 rows containing lexemes that have problems with their dates, which out of 793,733 lexemes is pretty good going, I’d say. Each row includes a link to the category on the website and the category name, together with the HTID, word, original fulldate, generated fulldate and all original date fields for the lexeme in question. I spent several hours going through previous, larger outputs and fixing my script to deal with a variety of edge cases that were not originally taken into consideration (e.g. purely OE dates with labels were not getting processed and some ‘a’ and ‘c’ dates were confusing the algorithm that generated labels). The remaining rows can mostly be split into the following groups:
- Original and generated fulldate appear to be identical but there must be some odd invisible character encoding issue that is preventing them being evaluated as identical. E.g. ‘1513(2) Scots’ and ‘1513(2) Scots’.
- Errors in the original fulldate. E.g. ‘OE–1614+ 1810 poet.’ doesn’t have a gap between the plus and the preceding number, another lexeme has ‘1340c’ instead of ‘c1340’
- Corrections made to the original fulldate that were not replicated in the actual date columns, E,g, ‘1577/87–c1630’ has a ‘c’ in the fulldate but this doesn’t appear in any of the ‘dac’ fields, and a lexeme has the date ‘c1480 + 1485 + 1843’ but the first ‘+’ is actually stored as a ‘-‘ in the ‘con’ column.
- Inconsistent recording of the ‘b’ dates where a ‘b’ date in the same decade does not appear as a single digit but as two digits. There are lots of these, e.g. ‘1430/31–1630’ should really be ‘1430/1–1630’ following the convention used elsewhere.
- Occasions where two identical dates appear with a label after the second date, resulting in the label not being found, as the algorithm finds the first instance of the date with no label after it. E,g, a lexeme with the fulldate ‘1865 + 1865 rare’.
- Any dates that have a slash connector and a label associated with the date after the slash end up with the label associated with the date before the slash too. E.g. ‘1731/1800– chiefly Dict.’. This is because the script can’t differentiate between a slash used to split a ‘b’ date (in which case a following label ‘belongs’ to the date before the slash) and a slash used to connect a completely different date (in which case the label ‘belongs’ to the other date). I tried fixing this but ended up breaking other things so this is something that will need manual intervention. I don’t think it occurs very often, though. It’s a shame the same symbol was used to mean two different things.
It’s now down to some manual fixing of these rows, probably using the spreadsheet to make any required changes. Another column could be added to note where no changes to the original data are required and then for the remainder make any changes that are necessary (e.g. fixing the original first date, or any of the other date fields). Once that’s done I will be able to write a script that will take any rows that need updating and perform the necessary updates. After that we’ll be ready to generate the new date fields for real.
I also spent some time this week going through the sample data that Katie Halsey had sent me from a variety of locations for the Books and Borrowing project. I went through all of the sample data and compiled a list of all of the fields found in each. This is a first step towards identifying a core set of fields and of mapping the analogous fields across different datasets. I also included the GU students and professors from Matthew’s pilot project but I have not included anything from the images from Inverness as deciphering the handwriting in the images is not something I can spend time doing. With this mapping document in place I can now think about how best to store the different data recorded at the various locations in a way that will allow certain fields to be cross-searched.
I also continued to work on the Place-Names of Mull and Ulva project. I copied all of the place-names taken from the GB1900 data to the Gaelic place-name field, added in some former parishes and updated the Gaelic classification codes and text. I also began to work on the project’s API and front end. By the end of the week I managed to get an ‘in development’ version of the quick search working. Markers appear with labels and popups and you can change base map or marker type. Currently only ‘altitude’ categorisation gives markers that are differentiated from each other, as there is no other data yet (e.g. classification, dates). The links through to the ‘full record’ also don’t currently work, but it is handy to have the maps to be able to visualise the data.
Also this week I had a further email conversation with Heather Pagan about the Anglo-Norman Dictionary, spoke to Rhona Alcorn about a new version of the Scots School Dictionary app, met with Matthew Creasey to discuss the future of his Decadence and Translation Network recourse and a new project of his that is starting up soon, responded to a PhD student who had asked me for some advice about online mapping technologies, arranged a coffee meeting for the College of Arts Developers and updated the layout of the video page of SCOSYA.