Week Beginning 4th October 2021

I spent a fair amount of time on the new ‘Speak for Yersel’ project this week, reading through materials produced by similar projects, looking into ArcGIS Online as a possible tool to use to create the map-based interface and thinking through some of the technical challenges the project will face.  I also participated in a project Zoom call on Thursday where we discussed the approaches we might take and clarified the sorts of outputs the project intends to produce.

I also had further discussions with Sofia from the Iona place-names project about their upcoming conference in December and how the logistics for this might work, as it’s going to be an online-only conference.  I had a Zoom call with Sofia on Thursday to go through these details, which really helped us to shape up a plan.  I also dealt with a request from another project that wants to set up a top-level ‘ac.uk’ domain, which makes three over the past couple of weeks, and made a couple of tweaks to the text of the Decadence and Translation website.

I had a chat with Mike Black about the new server that Arts IT Support are currently setting up for the Anglo-Norman Dictionary, and another with Eleanor Lawson about adding around 100 Gaelic videos to the Seeing Speech resource on a new dedicated page.

For the Books and Borrowing project I was sent a batch of images of a register from Dumfries Presbytery Library and I needed to batch process them in order to fix the lighting levels and rename them prior to upload.  It took me a little time to figure out how to run a batch process in the ancient version of Photoshop I have.  After much hopeless Googling I found some pages from ‘Photoshop CS2 For Dummies’ on Google Books that discussed Photoshop Actions (see https://books.google.co.uk/books?id=RLOmw2omLwgC&lpg=PA374&dq=&pg=PA332#v=onepage&q&f=false), which made me realise that the ‘Actions’, which I’d failed to find in any of the menus, were available via the tabs on the right of the screen, and that I could ‘record’ an action via this.  After running the images through the batch I uploaded them to the server and generated the page records for each corresponding page in the register.

I spent the rest of the week working on the Anglo-Norman Dictionary, considering how we might be able to automatically fix entries with erroneous citation dates caused by a varlist being present in the citation with a different date that should be used instead of the main citation date.  I had been wondering whether we could use a Levenshtein test (https://en.wikipedia.org/wiki/Levenshtein_distance) to automatically ascertain which citations may need manual editing, or even as a means of automatically adding in the new tags after testing.  I can already identify all entries that feature a varlist, so I can create a script that can iterate through all citations that have a varlist in each of these entries. If we can assume that the potential form in the main citation always appears as the word directly before the varlist then my script can extract this form and then each <ms_form> in the <varlist>.  I can also extract all forms listed in the <head> of the XML.
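As a rough illustration of the extraction step, here is a minimal Python sketch.  The sample citation markup, the wrapping <quotation> tag and the function names are all made up for the example (only <varlist>, <ms_form>, <head> and <lemma> are mentioned above), and it assumes the main form really is the last word before <varlist>.  It only picks up <lemma> from the head, so any other tags that hold forms would need to be added.

```python
import re

# Invented sample markup: the real dictionary XML will differ in its details.
SAMPLE_CITATION = (
    "<quotation>xxx yyy gabez "
    "<varlist><ms_var><ms_form>babedez</ms_form></ms_var>"
    "<ms_var><ms_form>bauboiez</ms_form></ms_var></varlist>"
    "</quotation>"
)
SAMPLE_HEAD = "<head><lemma>babeder</lemma></head>"


def form_before_varlist(citation_xml):
    """Return the last word of the text preceding <varlist>, or None."""
    before, sep, _ = citation_xml.partition("<varlist")
    if not sep:
        return None  # no varlist in this citation
    text = re.sub(r"<[^>]+>", " ", before)  # strip any tags before the varlist
    words = re.findall(r"[^\W\d_]+", text)  # runs of (Unicode) letters
    return words[-1] if words else None


def varlist_forms(citation_xml):
    """Return the text of every <ms_form> in the citation."""
    return re.findall(r"<ms_form>(.*?)</ms_form>", citation_xml, flags=re.S)


def head_forms(head_xml):
    """Return the forms listed in the head (only <lemma> is handled here)."""
    return re.findall(r"<lemma>(.*?)</lemma>", head_xml, flags=re.S)


main_form = form_before_varlist(SAMPLE_CITATION)
variants = varlist_forms(SAMPLE_CITATION)
heads = head_forms(SAMPLE_HEAD)
```

A regular expression is used here only to keep the sketch short; a proper XML parser would be the safer choice for real data, and forms containing apostrophes or hyphens would need a more careful word pattern.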

So for example for https://anglo-norman.net/entry/babeder my script would extract the term ‘gabez’ from the citation as it is the last word before <varlist>.  It would then extract ‘babedez’ and ‘bauboiez’ from the <varlist>.  There is only one form for this entry: <lemma>babeder</lemma> so this would get extracted too.  The script would then run a Levenshtein test on each possible option, comparing them to the form ‘babeder’, the results of which would be:

gabez: 4

babedez: 1

bauboiez: 4

The script would then pick out ‘babedez’ as the form to use (only one character different to the form ‘babeder’) and would then update the XML to note that the date from this <ms_form> is the one that needs to be used.
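The comparison for this example can be sketched in a few lines of Python.  This is only an illustration under the assumptions described above, not the actual script: the function names are hypothetical, and the distance is a standard dynamic-programming Levenshtein implementation.  Note that it compares Unicode characters, so scores for accented forms may differ from those given by a byte-based implementation.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute = 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def best_score(candidate, head_forms):
    """Lowest distance between a candidate and any form in the head."""
    return min(levenshtein(candidate, form) for form in head_forms)


head_forms = ["babeder"]
candidates = ["gabez", "babedez", "bauboiez"]  # main form first, then the variants
scores = {c: best_score(c, head_forms) for c in candidates}
winner = min(candidates, key=scores.get)

print(scores)  # {'gabez': 4, 'babedez': 1, 'bauboiez': 4}
print(winner)  # babedez
```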

With a more complicated example such as https://anglo-norman.net/entry/bochet_1 that has multiple forms in <head> the test would be run against each and the lowest score for each variant would be used.  So for example for the citation where ‘buchez’ is the last word before the <varlist> the two <ms_form> words would be extracted (huchez and buistez) and these plus ‘buchez’ would be compared against every form in <head>, with the overall lowest Levenshtein score getting logged.  The overall calculations in this case would be:

buchez:

bochet = 2

boket = 4

bouchet = 2

bouket = 4

bucet = 2

buchet = 1

buket = 3

bokés = 5

boketes = 5

bochésç = 6

buchees = 2

huchez:

bochet = 3

boket = 5

bouchet = 3

bouket = 5

bucet = 3

buchet = 2

buket = 4

bokés = 6

boketes = 6

bochésç = 7

buchees = 3

buistez:

bochet = 5

boket = 5

bouchet = 5

bouket = 5

bucet = 4

buchet = 4

buket = 4

bokés = 6

boketes = 4

bochésç = 8

buchees = 4

This means ‘buchez’ would win with a score of 1, so in this case no <varlist> form would be marked.  If the main citation form and a varlist form share the same lowest score then I guess we’d let the main citation form ‘win’, although such citations could be flagged for manual checking.  However, this algorithm depends entirely on the main citation form being the word directly before the <varlist> tag, and the editor confirmed that this is not always the case.  Despite this, I think the algorithm could correctly identify the majority of cases, and if the output were placed in a CSV it would then be possible for someone to quickly check through each citation, tick off those that should be automatically updated and manually fix the rest.  I made a start on the script that would work through all of the entries and output the CSV during the remainder of the week, but didn’t have the time to finish it.  I’m going to be on holiday next week but will continue with this when I return.
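For reference, the tie-breaking rule and the CSV output described above could look something like the following Python sketch.  It is not the unfinished script itself: the function names and CSV column names are hypothetical, the head forms for ‘bochet_1’ are just the unaccented subset of those listed above, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io


def levenshtein(a, b):
    """Character-based edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def choose_form(main_form, ms_forms, head_forms):
    """Return (chosen_form, score, needs_check).

    The main citation form wins ties; any tie for the lowest score
    is flagged so that the citation can be checked manually.
    """
    def best(candidate):
        return min(levenshtein(candidate, form) for form in head_forms)

    chosen, chosen_score, tie = main_form, best(main_form), False
    for form in ms_forms:
        score = best(form)
        if score < chosen_score:
            chosen, chosen_score, tie = form, score, False
        elif score == chosen_score:
            tie = True  # keep the earlier form, but flag for checking
    return chosen, chosen_score, tie


# (entry slug, main form, varlist forms, head forms) taken from the examples
citations = [
    ("babeder", "gabez", ["babedez", "bauboiez"], ["babeder"]),
    ("bochet_1", "buchez", ["huchez", "buistez"],
     ["bochet", "boket", "bouchet", "bouket", "bucet", "buchet", "buket"]),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["entry", "main_form", "chosen_form", "score", "needs_check"])
for slug, main_form, ms_forms, head_forms in citations:
    chosen, score, tie = choose_form(main_form, ms_forms, head_forms)
    writer.writerow([slug, main_form, chosen, score, "yes" if tie else "no"])

output = buffer.getvalue()
print(output)
```

A row where chosen_form differs from main_form is one where a <varlist> form would be marked, and rows with needs_check set to ‘yes’ are the ones to look at by hand.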