Week Beginning 7th June 2021

This week I finished an initial version of the ‘Browse Textbase’ feature for the Anglo-Norman Dictionary. Processing the XML proved to be rather tricky. I couldn’t just use the old XSLT file, as it included a lot of stuff that wasn’t needed in the new site (e.g. formatting headers and footers) and gave errors when plugged directly into the new system, so I had to adapt the XSLT. Also, I’d split up the full XML files into chunks for each page, resulting in more than 12,700 chunks. However, the XML often included elements that extended across pages, and when the content was extracted on a per-page basis this led to XML that was not well-formed, as some tags ended up missing their closing tags, or closed without featuring an opening tag. XSLT only works on well-formed XML so I needed to find a way to fix this tag issue. After some Googling I discovered that there is a PHP extension called Tidy that can take a malformed XML file and fix it. What this does is strip out all tags that lack a matching opening or closing tag, which is exactly what I wanted. I wrote a little script that used the extension, tested it successfully on a few files and then ran all of the 12,700-plus pages through it.
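
To illustrate the kind of repair described above, here is a minimal Python sketch. It is not the Tidy extension and not the script that was actually run; the function name `strip_unmatched_tags` is hypothetical, and the approach (a simple stack of open tags) is only an assumption about what "stripping tags without a partner" amounts to. It ignores comments, CDATA sections and attribute values containing `>`.

```python
import re

# Matches opening, closing and self-closing tags (simplified; see caveats above).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)((?:\s[^<>]*?)?)(/?)>")


def strip_unmatched_tags(fragment: str) -> str:
    """Remove tags in a page fragment that have no matching partner."""
    matches = list(TAG.finditer(fragment))
    drop = set()   # indexes (into matches) of tags to remove
    stack = []     # (match index, tag name) for tags still open

    for i, m in enumerate(matches):
        closing, name, self_closing = m.group(1), m.group(2), m.group(4)
        if self_closing:
            continue  # e.g. <anchor .../> needs no partner
        if not closing:
            stack.append((i, name))
            continue
        # Closing tag: look for the most recent open tag with the same name.
        for pos in range(len(stack) - 1, -1, -1):
            if stack[pos][1] == name:
                # Tags opened after it were never closed, so drop them.
                drop.update(idx for idx, _ in stack[pos + 1:])
                del stack[pos:]
                break
        else:
            drop.add(i)  # closing tag whose opening tag is on another page

    # Whatever is still open at the end of the page has no closing tag.
    drop.update(idx for idx, _ in stack)

    out, last = [], 0
    for i, m in enumerate(matches):
        if i in drop:
            out.append(fragment[last:m.start()])
            last = m.end()
    out.append(fragment[last:])
    return "".join(out)
```

For example, a page that begins inside a footnote and ends inside a paragraph, such as `end of note</note> A <hi rend="sup">e</hi> jour <p>text`, keeps its text and the balanced `<hi>` element but loses the stray `</note>` and `<p>`.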

With a full set of valid XML page files I then began work on the XSLT to display the documents as required.  This has been a very laborious process as I needed to go through each of the 77 documents and check the layout for any issues, and fix these as they cropped up.  With more than 12,700 pages I couldn’t look at each individually, but instead I generally looked at every page of the front matter, and then a random selection of pages in the main body of the text, as generally the structure is more consistent here.  I think this approach has worked well as most formatting issues were to be found in the front matter (e.g. some tables were split across multiple pages and needed table tags to be inserted at the top and bottom).

With regard to the main body of the texts, the largest challenge has been getting the explanatory notes to appear correctly, as these had been tagged in at least nine different ways throughout the documents, sometimes with entirely different XML structures and content. One possible issue is that I dealt with new XML features as they cropped up while working through the books, and in doing so I may have inadvertently messed up how things looked in earlier books. One example that I thankfully spotted is that I wanted <bibl> tags to start on a new line as this would make the bibliographies easier to read, but other texts have the <bibl> tag mid-sentence and my change resulted in lines breaking where they shouldn’t.

There are some other issues that have cropped up that we may still need to address. There are many spacing issues caused by whoever tagged the documents not leaving spaces between tags, or adding spaces between tags where there shouldn’t be any. It’s a bit of a strange issue as it doesn’t seem to exhibit itself on the old site, yet it isn’t something that is dealt with by the scripts I have access to. I don’t know if perhaps the texts were ‘fixed’ at some point and I just don’t have access to the fixed versions. It’s not something that can be fixed automatically (at least not without coming up with a set of rules for fixing) as there is no single rule for whether a given tag should (or should not) have a space after it. Here are some examples, with the text as displayed before the colon and the XML after:

  1. ‘M cMoroug’: M <hi rend="sup">c</hi>Moroug
  2. ‘Lettres et pétitions( Legge’: <title lang="FR" rend="italic">Lettres et p&#xE9;titions</title>( <editor>Legge</editor>
  3. ‘CDqui’: <title type="MS">CD</title>qui
  4. ‘( 17et 22)’: ( <ref target="D1396_17">17</ref>et <ref target="D1396_22">22</ref>)
  5. ‘n o2’: n <hi rend="sup">o</hi>2</ref>
  6. ‘Sire’: <hi rend="bold">S</hi>ire
  7. ‘T hepresent’: T <hi rend="sc">he</hi>present
  8. ‘Le xxx eiour’: Le xxx <hi rend="sup">e</hi>iour
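
The examples above can be reproduced, and similar cases located for manual review, with a short Python sketch. The function names `displayed_text` and `suspicious_spans` are hypothetical, and the heuristic (a space before an opening tag combined with a letter or digit directly after the closing tag) is only an assumption about one common pattern. It flags candidates and does not attempt to fix anything, since no single rule applies to every tag.

```python
import html
import re

# Any tag, used to flatten a snippet to the text a reader would see.
TAG = re.compile(r"<[^<>]+>")

# Heuristic only: whitespace before an opening tag, and a word character
# immediately after the matching closing tag (e.g. "M <hi>c</hi>Moroug").
SUSPECT = re.compile(r"\s<(\w+)[^<>]*>[^<>]*</\1>\w")


def displayed_text(xml_snippet: str) -> str:
    """Strip tags and decode entities to approximate the displayed text."""
    return html.unescape(TAG.sub("", xml_snippet))


def suspicious_spans(xml_snippet: str) -> list:
    """Return the parts of the snippet that match the spacing heuristic."""
    return [m.group(0) for m in SUSPECT.finditer(xml_snippet)]


snippet = '( <ref target="D1396_17">17</ref>et <ref target="D1396_22">22</ref>)'
print(displayed_text(snippet))    # ( 17et 22)
print(suspicious_spans(snippet))  # [' <ref target="D1396_17">17</ref>e']
```

Only the first reference is flagged here, because the second is followed by a bracket rather than a letter. Cases such as example 3, where the space is missing on both sides, would need a different check.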

Another issue is that the speed of loading a page is erratic. Sometimes it’s instant, other times it takes several agonising seconds. It’s really frustrating, and it’s not caused by my code. I’m hoping when we get the new server (which we now have a quote for) this issue will resolve itself. Also, some of the pages are split at different points in two texts. This must be due to the structure of the XML. However, despite this all of the content is still included. In addition, a couple of texts in the old system were broken: either the navigation just did not work or page contents were displaying multiple times. I’m afraid I didn’t make a note of which these were, but they’re all sorted in the new system anyway.

There are currently some issues with footnote numbers due to all of the different ways these are tagged (sometimes with multiple ways being used on a single page).  Some examples:

  1. If multiple ways of tagging are used in the same page this can result in footnotes appearing out of order, for instance because some notes are <note> and others are <app>. This is also causing some issues with the numbering (e.g. there are two [1] footnotes but the first listed should actually be [3]). This clearly needs some work, but I’m not sure how best to fix the issue. On the old site notes of different types are given letters, but I’m not sure which letters to use for what, or whether we want to continue using letters.
  2. In some places note numbers are being displayed where they weren’t previously being displayed. I’m not sure what should be done about this – I could for example add in an option to show / hide the notes.
  3. I’ve ensured all footnotes appear on a new line rather than having some that run on one line and others (sometimes in the same page) that have their own line.
  4. Sometimes an extended form of a footnote number appears where one didn’t previously (e.g. ‘[p2n5]’ rather than just ‘[5]’).
  5. Sometimes multiple notes appear straight after each other, and currently in such cases the numbering appears correctly in the text, but in the footnotes the first number in the line is duplicated. For example [2] and [3] in the text appear as [2] and [2] in the footnotes.
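
One possible way to approach the ordering problem in point 1 is to number every note in a single pass over the page in document order, whatever element it happens to be tagged with. The Python sketch below is only an illustration of that idea, not a fix that has been applied: the function name `number_notes` is hypothetical, and the assumption that all `<note>` and `<app>` elements should share one numeric sequence is exactly the editorial question that is still open (marginal notes, for instance, would probably need to be excluded).

```python
import xml.etree.ElementTree as ET


def number_notes(page_xml: str, note_tags=("note", "app")):
    """Number note-like elements in the order they occur on the page.

    Returns a list of (number, tag name, note text) tuples.
    """
    root = ET.fromstring(page_xml)
    numbered = []
    # Element.iter() walks the tree in document order, so differently
    # tagged notes receive consecutive numbers as they appear in the text.
    for element in root.iter():
        if element.tag in note_tags:
            text = "".join(element.itertext()).strip()
            numbered.append((len(numbered) + 1, element.tag, text))
    return numbered


page = "<page><p>a<note>first</note> b<app>second</app> c<note>third</note></p></page>"
for number, tag, text in number_notes(page):
    print(f"[{number}] ({tag}) {text}")
```

Using the same list to render both the in-text markers and the footnotes at the bottom of the page would keep the two sets of numbers in step.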

After spending a lot of time over the past two weeks working through the XML texts and wondering why the old site doesn’t display the spacing errors found in the texts I had access to, I did some further investigation into this.  It would appear that the old site uses different versions of the XML files to the ones I’ve been using.  I’m not sure why there are multiple versions of the XML files, but I’ve discovered that there are XML files in the ‘reduce’ folder that Heather gave me access to a couple of weeks ago, and these are different to the ones I have been using and must have been stored somewhere else on the server.

For example, the file ‘kingscouncil.xml’ that I have been using exhibits the spacing issue, see for example ‘M <hi rend="sup">c</hi>Moroug’ and ‘xxx <hi rend="sup">e</hi>jour’ in this snippet:


<p> <hi lang="LA" rend="italic">indorsacio</hi>. Eient les supplians la garde et la mariage dedens cestes contenues, selonc la purport de ceste peticion, pour xx. s. vi. d. appaier en le Haneper pur le fyn, par les lettres patentes notre Seignour le Roy souz son grant seel en Irland en due fourme. Doune a Dyvelyn le xxx <hi rend="sup">e</hi>jour Doctobre, lan notre dit Seignour le Roy Richard Seconde seszisme. <anchor id="P4A1" type="note"/> <note place="foot" target="P4A1">The <date>30th of October 1392</date>. The regnal years of this king commenced on the <date>22nd of June</date>in each year. Here, and elsewhere throughout the Roll, the year of the present style is used, but no rectification of the day of the month has been attempted.</note>A tresreverent pere <anchor id="P4A2" type="note"/> <note place="foot" target="P4A2">As letters patent were to issue, Robert Archbishop of Dublin, Chancellor of Ireland, must have been the person here addressed. See enrolment No. 15, <hi rend="italic">infra</hi>.</note>&amp;c., comme desus.</p> <div n="2"> <p> <note place="omargin"> <date>A.D. 1392</date> </note>A tresnobles Justice et Consel notre Seignour le Roy en Irland supplie Johan Creef de Ballaghmoun, que comme sa ville, sa mansion, ses blees et diverses autres benes furent arses, degastes et destruys par M <hi rend="sup">c</hi>Moroug et autres Irrois enemys notre Seignour le Roy, comme est comme est cognuz et notifie a vous, tresnobles</p> </div>


But in the ‘reduce’ folder there are two further versions of this (and every other) textbase file. One is named ‘kingscouncil.xml’ but is different to the one I’ve been using. It has different TEIHeader data and doesn’t exhibit the spacing issue, see for example:


<p><hi lang="LA" rend="italic">indorsacio</hi>. Eient les supplians la garde et la mariage dedens cestes contenues, selonc la purport de ceste peticion, pour xx. s. vi. d. appaier en le Haneper pur le fyn, par les lettres patentes notre Seignour le Roy souz son grant seel en Irland en due fourme. Doune a Dyvelyn le xxx<hi rend="sup">e</hi> jour Doctobre, lan notre dit Seignour le Roy Richard Seconde seszisme.<anchor id="P4A1" type="note"/><note place="foot" target="P4A1">The <date>30th of October 1392</date>. The regnal years of this king commenced on the <date>22nd of June</date> in each year. Here, and elsewhere throughout the Roll, the year of the present style is used, but no rectification of the day of the month has been attempted.</note> A tresreverent pere<anchor id="P4A2" type="note"/><note place="foot" target="P4A2">As letters patent were to issue, Robert Archbishop of Dublin, Chancellor of Ireland, must have been the person here addressed. See enrolment No. 15, <hi rend="italic">infra</hi>.</note> &amp;c., comme desus.</p></div>

<div n="2"><p><note place="omargin"><date>A.D. 1392</date></note> A tresnobles Justice et Consel notre Seignour le Roy en Irland supplie Johan Creef de Ballaghmoun, que comme sa ville, sa mansion, ses blees et diverses autres benes furent arses, degastes et destruys par M<hi rend="sup">c</hi>Moroug et autres Irrois enemys notre Seignour le Roy, comme est comme est cognuz et notifie a vous, tresnobles


Finally, there is a further version named ‘kingscouncil-apps.xml’ that appears to be just the text (no TEIHeader), again doesn’t exhibit the spacing issue, but in addition seems to use different tags in places.  See the tag around ‘indorsacio’, for example:


<p><term lang="LA" rend="i">Indorsacio</term>. Eient les supplians la garde et la mariage dedens cestes contenues, selonc la purport de ceste peticion, pour xx. s. vi. d. appaier en le Haneper pur le fyn, par les lettres patentes notre Seignour le Roy souz son grant seel en Irland en due fourme. Doune a Dyvelyn le xxx<hi rend="sup">e</hi> jour Doctobre, lan notre dit Seignour le Roy Richard Seconde seszisme.<anchor id="P4A1" type="note"/><note place="foot" target="P4A1">The <date>30th of October 1392</date>. The regnal years of this king commenced on the <date>22nd of June</date> in each year. Here, and elsewhere throughout the Roll, the year of the present style is used, but no rectification of the day of the month has been attempted.</note> A tresreverent pere<anchor id="P4A2" type="note"/><note place="foot" target="P4A2">As letters patent were to issue, Robert Archbishop of Dublin, Chancellor of Ireland, must have been the person here addressed. See enrolment No. 15, <hi rend="italic">infra</hi>.</note> &amp;c., comme desus.</p></div>

<div n="2"><p><note place="omargin"><date>A.D. 1392</date></note> A tresnobles Justice et Consel notre Seignour le Roy en Irland supplie Johan Creef de Ballaghmoun, que comme sa ville, sa mansion, ses blees et diverses autres benes furent arses, degastes et destruys par M<hi rend="sup">c</hi>Moroug et autres Irrois enemys notre Seignour le Roy, comme est comme est cognuz et notifie a vous, tresnobles


So yet again the old site has me wanting to tear my hair out in exasperation at how badly organised, maintained and thought out it is. It’s looking like I’ll have to replace all of the content I’ve been working on over the past couple of weeks with different versions. But the question is which version? Should it be the ‘apps’ version or the other version? I realise now that the ‘apps’ version is referenced in the URLs used by the old site. However, what is confusing is that the ‘apps’ version doesn’t include the front matter, yet this is included in the old site, meaning it can’t be purely using the ‘apps’ version of the XML. Even more strangely, the ‘kingscouncil.xml’ file in the ‘reduce’ folder has a different structure to the version published on the old site, which is in fact closer to the version of the XML I have been using. On the old site the first page begins:


“[p.xxvi]

INTRODUCTION.

[…]

Whether the Roll…”


But the ‘reduce’ version of ‘kingscouncil.xml’ includes two previous pages:


<pb n="ix"/><div lang="EN" type="Introduction"><head>INTRODUCTION.</head>

<pb n="xxv"/><p>It may be mentioned here that the folios are all mounted on linen guards, and that no part of the parchment has been inserted into the back, and none cut away at the fore-edge, top, or bottom, of the volume.</p>

<pb n="xxvi"/><p>Whether the Roll…


Whereas the XML I’ve been using matches the published text:


<pb n="xxvi" ed="base"/><div lang="EN" type="Introduction"><head>INTRODUCTION.</head>

<p>[…]</p>

<p>Whether the Roll…


I had been intending to extract pages from the non-apps files in the ‘reduce’ folder and to present these alongside the existing pages in the front-end so the editors could look at them, but I’m encountering difficulties right from the start. The first XML file in the data I originally had is ‘albus.xml’, which I expected to find as ‘albus-apps.xml’, yet there is no such file in the ‘reduce’ folder, nor a non-apps ‘albus.xml’ file. There are files called ‘libalbapp.xml’ and ‘libalbapp-apps.xml’, which would seem to correspond to the AND Source reference (Lib_Alb). However, the contents of these files in no way correspond to the contents of the ‘albus.xml’ file I have, nor do they correspond to the text that is displayed on the old site at the above URL.

I can only conclude that there is yet another version of the files stored in another location that the old site uses.  It’s definitely not the same file as I have been using as the text on the old site has the spacing issue corrected.  I have done a ‘find in files’ for certain strings found in the ‘Albus’ text across all files in the ‘reduce’ folder and the text is definitely not found there.  It’s very confusing as the scripts suggest they are processing files only in this folder.  The script ‘and-getloc’ uses the variable ‘filename’ from the URL and passes this to the script ‘and-fetcher’ in the ‘reduce’ folder.  This in turn loads the file, finds and processes the required page.
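
A ‘find in files’ check of this kind can also be scripted. The following Python sketch is an illustration only: the function name `find_in_files` and the folder and search string in the usage note are placeholders, not the paths or strings that were actually used.

```python
from pathlib import Path


def find_in_files(folder, needle, pattern="*.xml"):
    """Return the files under `folder` whose contents include `needle`.

    rglob() searches subfolders as well, so nothing nested is missed.
    """
    hits = []
    for path in sorted(Path(folder).rglob(pattern)):
        # errors="replace" avoids crashing on files in an unexpected encoding
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(path)
    return hits
```

Calling, say, `find_in_files("reduce", "a distinctive phrase from the text")` returns an empty list when no file in the folder (or any of its subfolders) contains the phrase.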

As I was working through this I managed to figure things out.  It looks like I was right – there is yet another version of the files stored somewhere else that the old system actually uses.  Buried towards the end of the ‘and-fetcher’ script is this:

##############################################
## TODO !!!!
## HARDCODED TEXTS LOCATION HERE!
## SHIFT THIS TO CONSTANTS SYSTEM!!!
##
my $textpath = "/and/reduce/ready1/$text";
##
##############################################

So the texts that are used are in a folder called ‘ready1’ within the ‘reduce’ folder.  However, there were no subfolders in the zip file of the ‘reduce’ folder that Heather sent me a couple of weeks ago.  If we can somehow track down this fourth(!) version of the files then perhaps I’ll be able to make some progress.  Heather managed to get access to the server again and located the additional folder, which did indeed include yet another version of the XML files.  It looks like this fourth version is the correct version.  It would appear to be the files that appear on the old website, including correction of spacings and all front matter (despite all files ending in ‘apps’, whereas the other ‘apps’ versions didn’t include the front matter).  Looking at the files discussed above:

The file ‘albus-apps.xml’ is present and includes all front matter, the same as both the file I was previously working with and the old site, but with spacing issues fixed. The file ‘kingscouncil-apps’ also appears to be structurally identical to the ‘kingscouncil’ file I was originally working with (unlike the other two versions in ‘reduce’) and has the spacing issues fixed (e.g. M<hi rend="sup">c</hi>Moroug).

So now I’ll be able to begin again with the process I started a couple of weeks ago.  It’s going to take some time again, although hopefully most of the XSLT issues will be the same as before and will already be sorted.

Also this week I read through the bid documentation for Craig Lamont’s project and had a chat with him about a data management plan, which I’ll have to work on next week. I also fixed a couple of issues on the SCOCO website for Matthew Creasy and spoke to Mike Black about the quote for a new server, which will hopefully be purchased soon. I gave some advice to Katie Halsey about file formats and data transfer options for a new digitisation unit that will be working with the Books and Borrowing project, and also spent some time trying to sort out access to the server at Stirling for this project, as it turned out that my access privileges had been removed midway through last month.

I also fixed an issue with the bibliography search on the new DSL website.  This was occurring when a search for ‘author or title’ was performed, which prefixes ‘Author: ‘ or ‘Title: ‘ to each entry in the autocomplete to help users differentiate between the two.  Selecting from the autocomplete list ran the search fine as this was based on the bibliographical ID hidden in the autocomplete, but if you pressed the ‘search’ button before the event was fired the search was looking for the full contents of the box – i.e. looking for authors and titles that begin with ‘Author: ‘ or ‘Title: ‘.  This was also happening if you pressed the browser’s back button from the results as the textbox would still then contain the full text.  I fixed this issue.  So it’s been a pretty busy week.