I then spent some time investigating why the part of speech in the <senseInfo> element of senses sometimes used underscores and other times used spaces. This discrepancy was messing up the numbering of senses, as this depends on the POS, with the number resetting to 1 when a new POS is encountered. If the POS is sometimes recorded as ‘p.p._as_a.’ (for example) and other times as ‘p.p. as a.’ then the code thinks these are different parts of speech and resets the counter to 1. I looked at the DTD, which sets the rules for creating or editing the XML files, and it uses the underscore form of POS. However, this rule only applies to the ‘type’ attribute of the <pos> element and not to the ‘pos’ attribute of the <senseInfo> element. After investigating, it turned out that these ‘pos’ attributes that the numbering system relies on are not manually added by the editors, but are added by my scripts at the point of upload. I set up my scripts to add these because the old systems also added them automatically during the conversion of the editors’ XML into the XML published via the old Dictionary Management System. However, this old system reformatted the POS, replacing underscores with spaces and thus storing two different formats of POS within the XML. My upload scripts didn’t do this but instead kept things consistent, which meant that when an entry was edited to add a new sense, the new sense was added with the underscore form of POS while the existing senses still had the space form.
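To illustrate why the mixed formats break the numbering, here is a minimal sketch in Python. It is not the actual site code: the function name `number_senses` and the idea of passing the POS values as a plain list are assumptions made purely for illustration.

```python
def number_senses(pos_values):
    """Number senses in order, restarting at 1 whenever the POS string changes.

    Hypothetical helper: the comparison is a plain string comparison, so
    'p.p. as a.' and 'p.p._as_a.' count as two different parts of speech.
    """
    numbers = []
    previous = None
    counter = 0
    for pos in pos_values:
        if pos != previous:
            # A "new" POS has been encountered, so the counter restarts.
            counter = 0
            previous = pos
        counter += 1
        numbers.append(counter)
    return numbers


# Two existing senses in the space form, plus a newly added sense in the underscore form.
mixed = ["p.p. as a.", "p.p. as a.", "p.p._as_a."]
# The same three senses once the POS is recorded consistently.
consistent = ["p.p._as_a.", "p.p._as_a.", "p.p._as_a."]

print(number_senses(mixed))       # [1, 2, 1]
print(number_senses(consistent))  # [1, 2, 3]
```

With the mixed input the third sense is wrongly numbered 1, which is the behaviour described above.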
There were two possible ways I could fix this. I could either write a script that regenerates the <senseInfo> pos for every sense and subsense in every entry, replacing all existing ‘pos’ values with the value of the preceding <pos type=""> (i.e. removing all old space forms of POS and ensuring all POS references were consistent), or I could adapt my upload script so that the assignment of <senseInfo> pos treats both ‘underscore’ and ‘space’ versions as the same. I decided on the former approach and wrote a script to first identify and then update all of the affected dictionary entries.
The script goes through each entry and finds all those that have a <senseInfo> pos containing a space. There are 2,538 such entries. I then adapted the script so that for each <senseInfo> in an entry all spaces are changed to underscores and the result is then compared with the preceding <pos> type. I set the script to output content if there was a mismatch between the <senseInfo> pos and the <pos> type, because when the script is set to update it uses the value from <pos> type, so as to ensure consistency. The script identified 41 entries where there was a mismatch between <senseInfo> pos and the preceding <pos> type. These were often due to a question mark being added to the <senseInfo> pos, e.g. ‘a. as s. ?’ vs ‘a._as_s._’, but there were also some where the POS was completely different, e.g. ‘sbst. inf.’ and ‘v.n.’. I spoke to the editor Geert about this and it turned out that these were due to a locution being moved in the XML without having the pos value updated. Geert fixed these and I ran the update to bring all of the POS references into alignment.
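The check-then-update process described above can be sketched roughly as follows in Python. This is an illustration under stated assumptions rather than the script I actually ran: the sample XML is a simplified, made-up structure (the real entries are more complex), and the function name `align_pos` is hypothetical. The only details taken from the real data are the <pos> ‘type’ attribute, the <senseInfo> ‘pos’ attribute and the example POS values.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up entry structure for illustration only.
SAMPLE = """<entry>
  <pos type="p.p._as_a."/>
  <sense><senseInfo pos="p.p. as a."/></sense>
  <sense><senseInfo pos="p.p._as_a."/></sense>
  <pos type="v.n."/>
  <sense><senseInfo pos="sbst. inf."/></sense>
</entry>"""


def align_pos(root, update=False):
    """Compare each <senseInfo> pos with the preceding <pos> type.

    Returns a list of (senseInfo pos, pos type) pairs that still differ
    after spaces are changed to underscores. If update is True, every
    <senseInfo> pos is overwritten with the preceding <pos> type.
    """
    current_type = None
    mismatches = []
    # iter() walks the tree in document order, so current_type always
    # holds the most recent <pos> seen before a given <senseInfo>.
    for element in root.iter():
        if element.tag == "pos":
            current_type = element.get("type")
        elif element.tag == "senseInfo" and current_type is not None:
            old = element.get("pos", "")
            if old.replace(" ", "_") != current_type:
                mismatches.append((old, current_type))
            if update:
                element.set("pos", current_type)
    return mismatches


root = ET.fromstring(SAMPLE)
print(align_pos(root))  # [('sbst. inf.', 'v.n.')]
```

Running with `update=False` first gives a report of genuine mismatches that can be passed to an editor for checking, and only once those are resolved would `update=True` be used to overwrite the values.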
My final AND task was to look into some issues regarding the variant and deviant section of the entry (where alternative forms of the headword are listed). Legiturs in this section were not being displayed, and there were several formatting issues that needed to be addressed, such as brackets not appearing in the right place and line breaks not working as they should. This was a very difficult task to tackle as there is so much variety in the structure of this section, and the XML is not laid out in the most logical of manners. For example, references are not added as part of a <variant> or <deviant> tag but are added after the corresponding tag as a sibling <varref> element. This really complicates navigating through the variants and deviants as there may be any number of varrefs at the same level. However, I managed to address the issues with this section, ensuring the legiturs appeared, repositioning semi-colons outside of the <deviant> brackets, ensuring line breaks always occur when a new POS is encountered and don’t occur anywhere else, ensuring multiple occurrences of the same POS label don’t get displayed and fixing the issue with double closing brackets sometimes appearing. It’s likely that there will be other issues with this section, as the content and formatting are so varied, but for now all of the known issues are sorted.
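One way of coping with references that are siblings rather than children is to walk the elements in order and attach each <varref> to the most recent <variant> or <deviant>. The Python sketch below shows the idea only. The wrapper element <forms>, the placeholder text content and the function name `group_forms` are all assumptions for illustration, and the real section contains many more element types than this.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified structure: <forms> is a hypothetical wrapper.
SAMPLE = """<forms>
  <variant>form1</variant>
  <varref>ref A</varref>
  <varref>ref B</varref>
  <deviant>form2</deviant>
  <varref>ref C</varref>
</forms>"""


def group_forms(parent):
    """Attach each <varref> to the <variant> or <deviant> it follows.

    Returns a list of (tag, form text, [reference texts]) tuples in
    document order. A <varref> that appears before any form is ignored.
    """
    groups = []
    for child in parent:
        if child.tag in ("variant", "deviant"):
            # Start a new group for this form.
            groups.append((child.tag, child.text, []))
        elif child.tag == "varref" and groups:
            # Any number of varrefs may follow the same form.
            groups[-1][2].append(child.text)
    return groups


for tag, form, refs in group_forms(ET.fromstring(SAMPLE)):
    print(tag, form, refs)
# variant form1 ['ref A', 'ref B']
# deviant form2 ['ref C']
```

Grouping the siblings up front like this means the later formatting steps can deal with one form and its references at a time.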
The only other project I worked on this week was the Iona place-names project, for which I helped the RA Sofia with the formatting of this month’s ‘name of the month’ feature (https://iona-placenames.glasgow.ac.uk/names-of-the-month/). Next week I’ll continue with the outstanding AND tasks, of which there are still several.