Week Beginning 7th June 2021

This week I finished an initial version of the ‘Browse Textbase’ feature for the Anglo-Norman Dictionary. Processing the XML proved to be rather tricky: I couldn’t just use the old XSLT file, as it included a lot of stuff that wasn’t needed in the new site (e.g. formatting headers and footers) and gave errors when plugged directly into the new system.  For these reasons I had to adapt the XSLT.  Also, I’d split up the full XML files into chunks for each page, resulting in more than 12,700 chunks.  However, the XML often included elements that extended across pages, and when the content was extracted on a per-page basis this led to an invalid XML structure, as some tags ended up missing their closing tags, or closed without featuring an opening tag.  XSLT only works on well-formed XML so I needed to find a way to fix this tag issue.  After some Googling I discovered that there is a PHP extension called Tidy that can take an invalid XML file and fix it.  What this does is strip out any tag that lacks its matching opening or closing tag, which is exactly what I wanted.  I wrote a little script that used the extension, tested it successfully on a few files and then ran all of the 12,700 pages through it.
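To make the tag problem concrete, here is a minimal Python sketch of the effect described above: removing opening tags that never close and closing tags that never opened within a page chunk. It is not the Tidy extension and does not claim to reproduce its algorithm; the function name strip_unmatched_tags and the sample fragment are made up for illustration, and comments, CDATA and '>' inside attribute values are ignored.

```python
import re

# Matches opening, closing and self-closing tags (comments/PIs are left alone).
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*?)(/?)>")


def strip_unmatched_tags(fragment: str) -> str:
    """Drop opening tags with no close and closing tags with no open."""
    tokens = []      # [text, keep] pairs in document order
    open_stack = []  # (tag name, index into tokens) for tags still open
    pos = 0
    for m in TAG_RE.finditer(fragment):
        tokens.append([fragment[pos:m.start()], True])  # text before the tag
        pos = m.end()
        closing, name, _attrs, self_closing = m.groups()
        if self_closing:
            tokens.append([m.group(0), True])  # e.g. <pb n="x"/> is always fine
        elif closing:
            # Look for the most recent still-open tag with the same name.
            for i in range(len(open_stack) - 1, -1, -1):
                if open_stack[i][0] == name:
                    # Tags opened after it but never closed are dropped.
                    for _, idx in open_stack[i + 1:]:
                        tokens[idx][1] = False
                    del open_stack[i:]
                    tokens.append([m.group(0), True])
                    break
            else:
                tokens.append([m.group(0), False])  # close without an open
        else:
            tokens.append([m.group(0), True])
            open_stack.append((name, len(tokens) - 1))
    tokens.append([fragment[pos:], True])
    for _, idx in open_stack:  # opens that never closed on this page
        tokens[idx][1] = False
    return "".join(text for text, keep in tokens if keep)


# Hypothetical page chunk: starts mid-<hi> and ends inside <p> and <note>.
chunk = 'est</hi> cognuz <p>A <hi rend="sup">c</hi> B <note>open'
print(strip_unmatched_tags(chunk))
```

The cleaned chunk can then be wrapped in a single root element and handed to an XML parser or XSLT processor. Note that dropping a tag keeps its text content, which is why the words from the unclosed elements still appear in the output.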

With a full set of valid XML page files I then began work on the XSLT to display the documents as required.  This has been a very laborious process as I needed to go through each of the 77 documents and check the layout for any issues, and fix these as they cropped up.  With more than 12,700 pages I couldn’t look at each individually, but instead I generally looked at every page of the front matter, and then a random selection of pages in the main body of the text, as generally the structure is more consistent here.  I think this approach has worked well as most formatting issues were to be found in the front matter (e.g. some tables were split across multiple pages and needed table tags to be inserted at the top and bottom).

With regard to the main body of the texts the largest challenge has been getting the explanatory notes to appear correctly, as these had been tagged in at least nine different ways throughout the documents, sometimes with entirely different XML structures and content.  One possible issue is that I dealt with new XML features as they cropped up as I worked through the books, but in dealing with these features I may have inadvertently messed up how things looked in earlier books.  One example that I thankfully spotted is that I wanted <bibl> tags to start on a new line as this would make the bibliographies easier to read, but other texts have the <bibl> tag mid-sentence and my change resulted in lines breaking where they shouldn’t.

There are some other issues that have cropped up that we may still need to address.  There are many spacing issues caused by whoever tagged the documents not leaving spaces between tags, or adding spaces between tags where there shouldn’t be spaces.  It’s a bit of a strange issue as it doesn’t seem to exhibit itself on the old site, but isn’t something that is dealt with by the scripts I have access to.  I don’t know if perhaps the texts were ‘fixed’ at some point and I just don’t have access to the fixed versions.  It’s not something that can be fixed automatically (at least not without coming up with a set of rules for fixing) as it’s not the case that a given tag should always have (or never have) a space after it.  Here are some examples, with the text as displayed before the colon and the XML after:

  1. ‘M cMoroug’: M <hi rend=”sup”>c</hi>Moroug
  2. ‘Lettres et pétitions( Legge’: <title lang=”FR” rend=”italic”>Lettres et p&#xE9;titions</title>( <editor>Legge</editor>
  3. ‘CDqui’: <title type=”MS”>CD</title>qui
  4. ‘( 17et 22)’: ( <ref target=”D1396_17″>17</ref>et <ref target=”D1396_22″>22</ref>)
  5. ‘n o2’: n <hi rend=”sup”>o</hi>2</ref>
  6. ‘Sire’: <hi rend=”bold”>S</hi>ire
  7. ‘T hepresent’: T <hi rend=”sc”>he</hi>present
  8. ‘Le xxx eiour ‘: Le xxx <hi rend=”sup”>e</hi>iour
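Although the spacing can’t be fixed automatically, suspicious places could at least be flagged for an editor to review.  The following is a rough Python sketch of that idea; the two rules, the names flag_spacing, GLUED_AFTER_CLOSE and SPACE_BEFORE_SUP are my own inventions for illustration, and it assumes the real files use straight quotes around attribute values.

```python
import re

# Rule 1: a closing tag immediately followed by a letter or digit, e.g. "</ref>et".
GLUED_AFTER_CLOSE = re.compile(r"</\w+>(?=\w)")
# Rule 2: whitespace directly before a superscript, e.g. 'M <hi rend="sup">c</hi>'.
SPACE_BEFORE_SUP = re.compile(r'\s<hi rend="sup">')


def flag_spacing(xml: str) -> list[tuple[str, str]]:
    """Return (rule, snippet) pairs for a human to check; nothing is changed."""
    findings = []
    for m in GLUED_AFTER_CLOSE.finditer(xml):
        findings.append(("no space after closing tag", xml[m.start():m.end() + 8]))
    for m in SPACE_BEFORE_SUP.finditer(xml):
        findings.append(("space before superscript", xml[max(0, m.start() - 8):m.end()]))
    return findings


for rule, snippet in flag_spacing('( <ref target="D1396_17">17</ref>et <ref target="D1396_22">22</ref>)'):
    print(rule, "->", snippet)
```

Example 6 above shows why this can only ever be a review list: the first rule also flags ‘<hi rend="bold">S</hi>ire’, where having no space is correct.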

Another issue is that the speed of loading a page is erratic.  Sometimes it’s instant, other times it takes several agonising seconds.  It’s really frustrating, and it’s not caused by my code.  I’m hoping when we get the new server (which we now have a quote for) this issue will resolve itself.  Also, some of the pages are split at different points in two texts.  This must be due to the structure of the XML.  However, despite this all of the content is still included.  In addition, a couple of texts in the old system were broken – either the navigation just did not work or page contents were displaying multiple times.  I’m afraid I didn’t make a note of which these were, but they’re all sorted in the new system anyway.

There are currently some issues with footnote numbers due to all of the different ways these are tagged (sometimes with multiple ways being used on a single page).  Some examples:

  1. If multiple ways of tagging are used in the same page this can result in footnotes appearing out of order. This can be because some notes are <note> and others are <app>.  This is also causing some issues with the numbering (e.g. there are two [1] footnotes but the first listed should actually be [3]).  This clearly needs some work, but I’m not sure how best to fix the issue.  On the old site notes of different types are given letters, but I’m not sure which letters to use for what, and if we want to continue using letters.
  2. In some places note numbers are being displayed where they weren’t previously being displayed. I’m not sure what should be done about this – I could for example add in an option to show / hide the notes.
  3. I’ve ensured all footnotes appear on a new line rather than having some that run on one line and others (sometimes in the same page) that have their own line.
  4. Sometimes an extended form of a footnote number appears where one didn’t previously (e.g. ‘[p2n5]’ rather than just ‘[5]’).
  5. Sometimes multiple notes appear straight after each other, and currently in such cases the numbering appears correctly in the text, but in the footnotes the first number in the line is duplicated. For example [2] and [3] in the text appear as [2] and [2] in the footnotes.
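One possible way to avoid the out-of-order and duplicate numbers would be to number every note in a single pass over the page in document order, regardless of which tag it uses.  The sketch below shows the idea in Python; it is not what the site currently does, and it assumes footnotes are either <app> elements or <note> elements with place="foot" (the function name number_footnotes and the sample page are hypothetical).

```python
import xml.etree.ElementTree as ET


def is_footnote(elem) -> bool:
    # Assumption: <app> is always a footnote; <note> only when place="foot",
    # so that margin notes such as place="omargin" are not numbered.
    return elem.tag == "app" or (elem.tag == "note" and elem.get("place") == "foot")


def number_footnotes(page_xml: str) -> list[tuple[int, str, str]]:
    """Return (number, tag, text) for each footnote, in document order."""
    root = ET.fromstring(page_xml)
    numbered = []
    for elem in root.iter():  # iter() walks the tree in document order
        if is_footnote(elem):
            text = "".join(elem.itertext()).strip()
            numbered.append((len(numbered) + 1, elem.tag, text))
    return numbered


page = ('<p>a<note place="foot">first</note> b<app>second</app> '
        'c<note place="foot">third</note><note place="omargin">margin</note></p>')
for number, tag, text in number_footnotes(page):
    print(f"[{number}] ({tag}) {text}")
```

Using one counter for both tag types gives each note a unique number on the page.  Whether the editors would prefer separate sequences (or letters, as on the old site) is still an open question.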

After spending a lot of time over the past two weeks working through the XML texts and wondering why the old site doesn’t display the spacing errors found in the texts I had access to, I did some further investigation into this.  It would appear that the old site uses different versions of the XML files to the ones I’ve been using.  I’m not sure why there are multiple versions of the XML files, but I’ve discovered that there are XML files in the ‘reduce’ folder that Heather gave me access to a couple of weeks ago, and these are different to the ones I have been using and must have been stored somewhere else on the server.

For example, the file ‘kingscouncil.xml’ that I have been using exhibits the spacing issue, see for example ‘M <hi rend=”sup”>c</hi>Moroug‘ and ‘xxx <hi rend=”sup”>e</hi>jour’ in this snippet:


<p> <hi lang=”LA” rend=”italic”>indorsacio</hi>. Eient les supplians la garde et la mariage dedens cestes contenues, selonc la purport de ceste peticion, pour xx. s. vi. d. appaier en le Haneper pur le fyn, par les lettres patentes notre Seignour le Roy souz son grant seel en Irland en due fourme. Doune a Dyvelyn le xxx <hi rend=”sup”>e</hi>jour Doctobre, lan notre dit Seignour le Roy Richard Seconde seszisme. <anchor id=”P4A1″ type=”note”/> <note place=”foot” target=”P4A1″>The <date>30th of October 1392</date>. The regnal years of this king commenced on the <date>22nd of June</date>in each year. Here, and elsewhere throughout the Roll, the year of the present style is used, but no rectification of the day of the month has been attempted.</note>A tresreverent pere <anchor id=”P4A2″ type=”note”/> <note place=”foot” target=”P4A2″>As letters patent were to issue, Robert Archbishop of Dublin, Chancellor of Ireland, must have been the person here addressed. See enrolment No. 15, <hi rend=”italic”>infra</hi>.</note>&amp;c., comme desus.</p> <div n=”2″> <p> <note place=”omargin”> <date>A.D. 1392</date> </note>A tresnobles Justice et Consel notre Seignour le Roy en Irland supplie Johan Creef de Ballaghmoun, que comme sa ville, sa mansion, ses blees et diverses autres benes furent arses, degastes et destruys par M <hi rend=”sup”>c</hi>Moroug et autres Irrois enemys notre Seignour le Roy, comme est comme est cognuz et notifie a vous, tresnobles</p> </div>


But in the ‘reduce’ folder there are two further versions of this (and every other) textbase file.  One is named ‘kingscouncil.xml’ but is different to the one I’ve been using.  It has different TEIHeader data and doesn’t exhibit the spacing issue, see for example:


<p><hi lang=”LA” rend=”italic”>indorsacio</hi>. Eient les supplians la garde et la mariage dedens cestes contenues, selonc la purport de ceste peticion, pour xx. s. vi. d. appaier en le Haneper pur le fyn, par les lettres patentes notre Seignour le Roy souz son grant seel en Irland en due fourme. Doune a Dyvelyn le xxx<hi rend=”sup”>e</hi> jour Doctobre, lan notre dit Seignour le Roy Richard Seconde seszisme.<anchor id=”P4A1″ type=”note”/><note place=”foot” target=”P4A1″>The <date>30th of October 1392</date>. The regnal years of this king commenced on the <date>22nd of June</date> in each year. Here, and elsewhere throughout the Roll, the year of the present style is used, but no rectification of the day of the month has been attempted.</note> A tresreverent pere<anchor id=”P4A2″ type=”note”/><note place=”foot” target=”P4A2″>As letters patent were to issue, Robert Archbishop of Dublin, Chancellor of Ireland, must have been the person here addressed. See enrolment No. 15, <hi rend=”italic”>infra</hi>.</note> &amp;c., comme desus.</p></div>

<div n=”2″><p><note place=”omargin”><date>A.D. 1392</date></note> A tresnobles Justice et Consel notre Seignour le Roy en Irland supplie Johan Creef de Ballaghmoun, que comme sa ville, sa mansion, ses blees et diverses autres benes furent arses, degastes et destruys par M<hi rend=”sup”>c</hi>Moroug et autres Irrois enemys notre Seignour le Roy, comme est comme est cognuz et notifie a vous, tresnobles


Finally, there is a further version named ‘kingscouncil-apps.xml’ that appears to be just the text (no TEIHeader), again doesn’t exhibit the spacing issue, but in addition seems to use different tags in places.  See the tag around ‘indorsacio’, for example:


<p><term lang=”LA” rend=”i”>Indorsacio</term>. Eient les supplians la garde et la mariage dedens cestes contenues, selonc la purport de ceste peticion, pour xx. s. vi. d. appaier en le Haneper pur le fyn, par les lettres patentes notre Seignour le Roy souz son grant seel en Irland en due fourme. Doune a Dyvelyn le xxx<hi rend=”sup”>e</hi> jour Doctobre, lan notre dit Seignour le Roy Richard Seconde seszisme.<anchor id=”P4A1″ type=”note”/><note place=”foot” target=”P4A1″>The <date>30th of October 1392</date>. The regnal years of this king commenced on the <date>22nd of June</date> in each year. Here, and elsewhere throughout the Roll, the year of the present style is used, but no rectification of the day of the month has been attempted.</note> A tresreverent pere<anchor id=”P4A2″ type=”note”/><note place=”foot” target=”P4A2″>As letters patent were to issue, Robert Archbishop of Dublin, Chancellor of Ireland, must have been the person here addressed. See enrolment No. 15, <hi rend=”italic”>infra</hi>.</note> &amp;c., comme desus.</p></div>

<div n=”2″><p><note place=”omargin”><date>A.D. 1392</date></note> A tresnobles Justice et Consel notre Seignour le Roy en Irland supplie Johan Creef de Ballaghmoun, que comme sa ville, sa mansion, ses blees et diverses autres benes furent arses, degastes et destruys par M<hi rend=”sup”>c</hi>Moroug et autres Irrois enemys notre Seignour le Roy, comme est comme est cognuz et notifie a vous, tresnobles


So yet again the old site has me wanting to tear my hair out in exasperation at how badly organised, maintained and thought out it is.  It’s looking like I’ll have to replace all of the content I’ve been working on over the past couple of weeks with different versions.  But the question is which version?  Should it be the ‘apps’ version or the other version?  I realise now that the ‘apps’ version is referenced in the URLs used by the old site.  However, what is confusing is that the ‘apps’ version doesn’t include the front-matter, but this is included in the old site, meaning it can’t be purely using the ‘apps’ version of the XML.  Even more strangely, the ‘kingscouncil.xml’ file in the ‘reduce’ folder has a different structure to the version published on the old site, which is in fact closer to the version of the XML I have been using.  On the old site the first page begins:


“[p.xxvi]

INTRODUCTION.

[…]

Whether the Roll…”


But the ‘reduce’ version of ‘kingscouncil.xml’ includes two previous pages:


<pb n=”ix”/><div lang=”EN” type=”Introduction”><head>INTRODUCTION.</head>

<pb n=”xxv”/><p>It may be mentioned here that the folios are all mounted on linen guards, and that no part of the parchment has been inserted into the back, and none cut away at the fore-edge, top, or bottom, of the volume.</p>

<pb n=”xxvi”/><p>Whether the Roll…


Whereas the XML I’ve been using matches the published text:


<pb n=”xxvi” ed=”base”/><div lang=”EN” type=”Introduction”><head>INTRODUCTION.</head>

<p>[…]</p>

<p>Whether the Roll…


I had been intending to extract pages from the non-apps files in the ‘reduce’ folder and to present these alongside the existing pages in the front-end so the editors could look at them, but I’m encountering difficulties right from the start.  The first XML file in the data I originally had is ‘albus.xml’, which I expected to find as ‘albus-apps.xml’, yet there is no such file in the ‘reduce’ folder, nor a non-apps ‘albus.xml’ file.  There are files called ‘libalbapp.xml’ and ‘libalbapp-apps.xml’, which would seem to correspond to the AND Source reference (Lib_Alb).  However, the contents of these files in no way correspond to the contents of the ‘albus.xml’ file I have, nor do they correspond to the text that the old site displays for this document.

I can only conclude that there is yet another version of the files stored in another location that the old site uses.  It’s definitely not the same file as I have been using as the text on the old site has the spacing issue corrected.  I have done a ‘find in files’ for certain strings found in the ‘Albus’ text across all files in the ‘reduce’ folder and the text is definitely not found there.  It’s very confusing as the scripts suggest they are processing files only in this folder.  The script ‘and-getloc’ uses the variable ‘filename’ from the URL and passes this to the script ‘and-fetcher’ in the ‘reduce’ folder.  This in turn loads the file, finds and processes the required page.
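For reference, the ‘find in files’ check is easy to script.  This is a small Python sketch of the kind of search I mean, not the tool I actually used; the function name find_in_files and the folder and search string in the example are placeholders.

```python
from pathlib import Path


def find_in_files(folder: str, needle: str, pattern: str = "*.xml") -> list[str]:
    """Return the names of files under folder whose text contains needle."""
    hits = []
    # rglob also descends into subfolders, so nested directories are covered.
    for path in sorted(Path(folder).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    # Placeholder folder and search string.
    print(find_in_files("reduce", "a distinctive phrase from the text"))
```

An empty list for a phrase that definitely appears on the old site is what points to the files living somewhere else.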

As I was working through this I managed to figure things out.  It looks like I was right – there is yet another version of the files stored somewhere else that the old system actually uses.  Buried towards the end of the ‘and-fetcher’ script is this:

##############################################
## TODO !!!!
## HARDCODED TEXTS LOCATION HERE!
## SHIFT THIS TO CONSTANTS SYSTEM!!!
##
my $textpath = "/and/reduce/ready1/$text";
##
##############################################

So the texts that are used are in a folder called ‘ready1’ within the ‘reduce’ folder.  However, there were no subfolders in the zip file of the ‘reduce’ folder that Heather sent me a couple of weeks ago.  If we can somehow track down this fourth(!) version of the files then perhaps I’ll be able to make some progress.  Heather managed to get access to the server again and located the additional folder, which did indeed include yet another version of the XML files.  It looks like this fourth version is the correct version.  It would appear to be the files that appear on the old website, including correction of spacings and all front matter (despite all files ending in ‘apps’, whereas the other ‘apps’ versions didn’t include the front matter).  Looking at the files discussed above:

The file ‘albus-apps.xml’ is present and includes all front-matter the same as both the file I was previously working with and the old site, but with spacing issues fixed.  The file ‘kingscouncil-apps’ also appears to be structurally identical to the ‘kingscouncil’ file I was originally working with (unlike the other two versions in ‘reduce’) and has the spacing issues fixed (e.g. M<hi rend=”sup”>c</hi>Moroug).

So now I’ll be able to begin again with the process I started a couple of weeks ago.  It’s going to take some time again, although hopefully most of the XSLT issues will be the same as before and will already be sorted.

Also this week I read through the bib documentation for Craig Lamont’s project and had a chat with him about a data management plan, which I’ll have to work on next week.  I also fixed a couple of issues on the SCOCO website for Matthew Creasy and spoke to Mike Black about the quote for a new server, which will hopefully be purchased soon.  I gave some advice to Katie Halsey about file formats and data transfer options for a new digitisation unit that will be working with the Books and Borrowing project, and also spent some time trying to sort out access to the server at Stirling for this project as it turned out that my access privileges had been removed midway through last month.

I also fixed an issue with the bibliography search on the new DSL website.  This was occurring when a search for ‘author or title’ was performed, which prefixes ‘Author: ‘ or ‘Title: ‘ to each entry in the autocomplete to help users differentiate between the two.  Selecting from the autocomplete list ran the search fine as this was based on the bibliographical ID hidden in the autocomplete, but if you pressed the ‘search’ button before the event was fired the search was looking for the full contents of the box – i.e. looking for authors and titles that begin with ‘Author: ‘ or ‘Title: ‘.  This was also happening if you pressed the browser’s back button from the results as the textbox would still then contain the full text.  I fixed this issue.  So it’s been a pretty busy week.
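One way to guard against this kind of problem is to strip the display prefix from whatever is in the box before running a text search.  The Python sketch below illustrates that approach only; it is not necessarily how the DSL fix was implemented, and the function name clean_search_term is hypothetical.

```python
# Prefixes the autocomplete adds purely for display purposes.
PREFIXES = ("Author: ", "Title: ")


def clean_search_term(box_text: str) -> str:
    """Remove a leading 'Author: ' or 'Title: ' label, if present."""
    for prefix in PREFIXES:
        if box_text.startswith(prefix):
            return box_text[len(prefix):]
    return box_text


print(clean_search_term("Author: Burns, Robert"))  # Burns, Robert
```

Doing this on the server side as well as in the browser would also cover the back-button case, since the cleaned value is used however the search is triggered.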

Week Beginning 31st May 2021

It was the late May bank holiday on Monday, so this was a four-day week.  On Tuesday I decided to try working at my office at the University – my first full day back at my office since the first lockdown began.  All went very smoothly; I didn’t meet anyone in the building and it seemed very quiet on campus generally.  The only issue was the number of updates my computer had to install, which caused some delays.  I’m probably going to try and come back to work on Tuesdays on a semi-regular basis now to see how things go.

I had some discussions with Marc and Arts IT Support this week about the possibility of purchasing a new server, and some progress is being made there.  I also responded to a query regarding the Scots Syntax Atlas that Jennifer Smith forwarded on to me and spoke to Roslyn Potter about a project that a lecturer in History is needing a website for.

Other than these tasks I spent the week continuing to work on the Textbase feature of the Anglo-Norman Dictionary.  Last week I’d left off with the infrastructure in place to browse texts, display the raw XML of pages and navigate between pages.  My task for this week was to ensure that the XML displayed properly.  This proved to be rather tricky as although I had managed to get access to the XSLT file that the Textbase on the old site used to transform the XML to HTML, it included a lot of stuff that wasn’t needed in the new site (e.g. formatting headers and footers) and also gave errors when plugged directly into the new system.  For these reasons I had to adapt the XSLT.  Also, I’d split up the full XML files into chunks for each page, resulting in more than 12,000 chunks.  However, the XML often included elements that extended across pages, and when the content was extracted on a per-page basis this led to an invalid XML structure, as some tags ended up missing their closing tags, or closed without featuring an opening tag.  XSLT only works on well-formed XML so I needed to find a way to fix this tag issue.  After some Googling I discovered that there is a PHP extension called Tidy (https://www.php.net/manual/en/intro.tidy.php) that can take an invalid XML file and fix it.  What this does is strip out any tag that lacks its matching opening or closing tag, which is exactly what I wanted.  I wrote a little script that used the extension, tested it successfully on a few files and then ran all of the 12,000 pages through it.
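The reason the per-page chunks end up broken is easiest to see with a small example.  Below is a Python sketch of splitting a text at its <pb> page-break tags; it is not the script I actually wrote, the function name split_into_pages is made up, and it assumes every page break is a self-closing <pb> with an n attribute in straight quotes.

```python
import re

# A self-closing page break with an n attribute, e.g. <pb n="xxvi" ed="base"/>
PB_RE = re.compile(r'<pb\b[^>]*\bn="([^"]*)"[^>]*/>')


def split_into_pages(text_xml: str) -> list[tuple[str, str]]:
    """Return (page number, raw XML chunk) pairs, one per <pb>."""
    matches = list(PB_RE.finditer(text_xml))
    pages = []
    for i, m in enumerate(matches):
        # A page runs from the end of its <pb> to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text_xml)
        pages.append((m.group(1), text_xml[m.end():end]))
    return pages


sample = ('<pb n="ix"/><div><head>INTRODUCTION.</head>'
          '<pb n="xxv"/><p>It may be mentioned</p>'
          '<pb n="xxvi"/><p>Whether the Roll</p></div>')
for number, chunk in split_into_pages(sample):
    print(number, chunk)
```

In this sample the first chunk contains an opening <div> with no close and the last chunk a closing </div> with no open, which is exactly the situation that needed repairing before XSLT could be applied.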

With a full set of valid XML page files I then began work on the XSL to display the documents as required.  This has been a very laborious process as I needed to go through each of the more than 70 documents and check the layout for any issues, and fix these as they cropped up.  With more than 12,000 pages I couldn’t look at each individually, but instead took a random selection, a process that is working pretty well so far.  The largest challenge was getting the explanatory notes to appear correctly, as these had been tagged in at least eight different ways throughout the documents, sometimes with entirely different XML structures and content.  So far all is looking good, and I’m about halfway through checking the documents.  I’ll continue with this task next week.

Week Beginning 24th May 2021

I had my first dose of the Covid vaccine on Tuesday morning this week (the AstraZeneca one), so I lost a bit of time whilst going to get that done.  Unfortunately I had a bit of a bad reaction to it and ended up in bed all day Wednesday with a pretty nasty fever.  I had Covid in October last year but only experienced mild symptoms and wasn’t even off work for a day with it, so in my case the cure has been much worse than the disease.  However, I was feeling much better again by Thursday, so I guess I lost a total of about a day and a half of work, which is a small price to pay if it helps to ensure I don’t catch Covid again and (what would be worse) pass it on to anyone else.

In terms of work this week I continued to work on the Anglo-Norman Dictionary, beginning with a few tweaks to the data builder that I had completed last week.  I’d forgotten to add a bit of processing to the MS date that was present in the Text Date section to handle fractions, so I added that in.  I also updated the XML output so that ‘pref’ and ‘suff’ only appear if they have content now, as the empty attributes were causing issues in the XML editor.

I then began work on the largest outstanding task I still have to tackle for the project: the migration of the textbase texts to the new site.  There are about 80 lengthy XML digital editions on the old site that can be searched and browsed, and I need to ensure these are also available on the new site.  I managed to grab a copy of all of the source XML files and I tracked down a copy of the script that the old site used to process the files.  At least I thought I had.  It turned out that this file actually references another file that must do most of the processing, including the application of an XSLT file to transform the XML into HTML, which is the thing I really could do with getting access to.  Unfortunately this file was not in the data from the server that I had been given access to, which somewhat limited what I could do.  I still have access to the old site and whilst experimenting with the old textbase I managed to make it display an error message that gives the location of the file: [DEBUG: Empty String at /var/and/reduce/and-fetcher line 486. ].  With this location available I asked Heather, the editor who has access to the server, if she might be able to locate this file and others in the same directory.  She had to travel to her University in order to be able to access the server, but once she did she was able to track the necessary directory down and get a copy to me.  This also included the XSLT file, which will help a lot.

I wrote a script to process all of the XML files, extracting titles, bylines, imprints, dates, copyright statements and splitting each file up into individual pages.  I then updated the API to create the endpoints necessary to browse the texts and navigate through the pages, for example the retrieval of summary data for all texts, or information about a specified text, or information about a specific page (including its XML).  I also began working on a front-end for the textbase, which is still very much in progress.  Currently it lists all texts with options to open a text at the first available page or select a page from a drop-down list of pages.  There are also links directly into the AND bibliography and DEAF where applicable, as the following screenshot demonstrates:

It is also possible to view a specific page, and I’ve completed work on the summary information about the text and a navbar through which it’s possible to navigate through the pages (or jump directly to a different page entirely).  What I haven’t yet tackled is the processing of the XML, which is going to be tricky and which I hope to delve into next week.  Below is a screenshot of the page view as it currently looks, with the raw XML displayed.

I also investigated and fixed an issue the editor Geert spotted, whereby the entire text of an entry was appearing in bold.  The issue was caused by an empty <link_form/> tag.  In the XSLT each <link_form> becomes a bold tag <b> with the content of the link form in the middle.  As there was no content it became a self-closed tag <b/> which is valid in XML but not valid in HTML, where it was treated as an opening tag with no corresponding closing tag, resulting in the remainder of the page all being bold.  I got around this by placing the space that preceded the bold tag “ <b></b>” within the bold tag instead “<b> </b>” meaning the tag is no longer considered empty and the XSLT doesn’t self-close it, but ideally if there is no <link_form> then the tag should just be omitted, which would also solve the problem.
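The difference between XML and HTML serialisation of an empty element can be demonstrated outside of XSLT.  The sketch below uses Python’s built-in ElementTree purely as an illustration (it is not the XSLT processor the site uses): an empty <b> is self-closed under XML rules, written with an explicit closing tag under HTML rules, and no longer counts as empty once it contains a space.

```python
import xml.etree.ElementTree as ET

empty_bold = ET.Element("b")

# XML rules: an element with no content is written as a self-closing tag.
as_xml = ET.tostring(empty_bold, encoding="unicode", method="xml")

# HTML rules: non-void elements always get an explicit closing tag.
as_html = ET.tostring(empty_bold, encoding="unicode", method="html")

# The workaround described above: a space inside the tag means it has content.
spaced_bold = ET.Element("b")
spaced_bold.text = " "
with_space = ET.tostring(spaced_bold, encoding="unicode", method="xml")

print(as_xml)      # <b />
print(as_html)     # <b></b>
print(with_space)  # <b> </b>
```

XSLT has a comparable switch in the method attribute of <xsl:output>, so producing HTML output rather than XML output may be another way to avoid self-closed tags, although I haven’t checked what that would change elsewhere in the stylesheet.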

I also looked into an issue with the proofreader that Heather encountered.  When she uploaded a ZIP file with around 50 entries in it some of the entries wouldn’t appear in the output, but would just display their title.  The missing entries would be random without any clear reason as to why some were missing.    After some investigation I realised what the problem was:  each time an XML file is processed for display the DTD referenced in the file was being checked.  When processing lots of files all at once this was exceeding the maximum number of file requests the server was allowing from a specific client and was temporarily blocking access to the DTD, causing the processing of some of the XML files to silently fail.  The maximum number would be reached at a different point each time, thus meaning a different selection of entries would be blank.  To fix this I updated the proofreader script to remove the reference to the DTD from the XML files in the uploaded ZIP before they are processed for display.  The DTD isn’t actually needed for the display of the entry – all it does is specify the rules for editing it.  With the DTD reference removed it looks like all entries are getting properly displayed.
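Removing the DTD reference comes down to deleting the DOCTYPE declaration before the file is parsed.  Here is a minimal Python sketch of that step, assuming a simple external DOCTYPE with no internal subset; the proofreader itself isn’t written in Python, and the function name strip_doctype and the example URL are placeholders.

```python
import re

# A DOCTYPE that only points at an external DTD (no [...] internal subset).
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>\s*")


def strip_doctype(xml_text: str) -> str:
    """Remove the first DOCTYPE declaration so no DTD is fetched on parsing."""
    return DOCTYPE_RE.sub("", xml_text, count=1)


entry = ('<?xml version="1.0"?>\n'
         '<!DOCTYPE entry SYSTEM "http://example.org/entry.dtd">\n'
         '<entry>x</entry>')
print(strip_doctype(entry))
```

One caveat with this approach in general: if a file relies on named entities that are defined in the DTD, removing the reference would make those entities undefined, so it only works when the DTD is needed for editing rules alone, as is the case here.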

Also this week I gave some further advice to Luca Guariento about a proposal he’s working on, fixed a small display issue with the Historical Thesaurus and spoke to Craig Lamont about the proposal he’s putting together.  Other than that I spent a bit of time on the Dictionary of the Scots Language, creating four different mockups of how the new ‘About this entry’ box could look and investigating why some of the bibliographical links in entries in the new front-end were not working.  The problem was being caused by the reworking of cref contents that the front-end does in order to ensure only certain parts of the text become a link.  In the XML the bib ID is applied to the full cref, (e.g. <cref refid=”bib018594″><geo>Sc.</geo> <date>1775</date> <title>Weekly Mag.</title> (9 Mar.) 329: </cref>) but we wanted the link to only appear around titles and authors rather than the full text.  The issue with the missing links was cropping up where there is no author or title for the link to be wrapped around (e.g. <cit><cref refid=”bib017755″><geo>Ayr.</geo><su>4</su> <date>1928</date>: </cref><q>The bag’s fu’ noo’ we’ll sadden’t.</q></cit>).  In such cases the link wasn’t appearing anywhere.  I’ve updated this now so that if no author or title is found then the link gets wrapped around the <geo> tag instead, and if there is no <geo> tag the link gets wrapped around the whole <cref>.
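The fallback order for the link can be summarised in a few lines.  This Python sketch shows the selection logic only, not the front-end code itself; the function name link_targets is hypothetical and the ‘author’ tag name is an assumption, as only <title>, <geo> and <date> appear in the examples above.

```python
import xml.etree.ElementTree as ET


def link_targets(cref):
    """Pick the element(s) inside a <cref> that the bibliography link wraps."""
    # First choice: any authors or titles ('author' is an assumed tag name).
    preferred = [el for el in cref.iter() if el.tag in ("author", "title")]
    if preferred:
        return preferred
    # Second choice: the <geo> tag.
    geo = cref.find(".//geo")
    if geo is not None:
        return [geo]
    # Last resort: the whole <cref>.
    return [cref]


with_title = ET.fromstring(
    '<cref refid="bib018594"><geo>Sc.</geo> <date>1775</date> '
    '<title>Weekly Mag.</title> (9 Mar.) 329: </cref>')
geo_only = ET.fromstring(
    '<cref refid="bib017755"><geo>Ayr.</geo><su>4</su> <date>1928</date>: </cref>')

print([el.tag for el in link_targets(with_title)])  # ['title']
print([el.tag for el in link_targets(geo_only)])    # ['geo']
```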

I also fixed a couple of advanced search issues that had been encountered with the new (and as yet not publicly available) site.  There was a 404 error that was being caused by a colon in the title.  The selected title gets added into the URL and colons are special characters in URLs, which was causing a problem.  However, I updated the scripts to allow colons to appear and the search now works.  It also turned out that the full-text searches were searching the contents of the <meta> tag in the entries, which is not something that we want.  I knew there was some other reason why I stripped the <meta> section out of the XML and this is it.  The contents of <meta> end up in the free-text search and are therefore both searchable and returned in the snippets.  To fix this I updated my script that generates the free-text search data to remove <meta> before the free-text search is generated.  This doesn’t remove it permanently, just in the context of the script executing.  I regenerated the free-text data and it no longer includes <meta>, and I then passed this on to Arts IT Support who have the access rights to update the Solr collection.  With this in place the advanced search no longer does anything with the <meta> section.
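Stripping <meta> for the purposes of the search data only, without touching the stored entry, could look something like the following Python sketch.  It is an illustration rather than the actual generation script; the function name without_meta and the sample entry are made up, and it assumes <meta> sections are never nested inside one another.

```python
import re

# Either a self-closing <meta/> or a <meta>...</meta> section (not nested).
META_RE = re.compile(r"<meta\b[^>]*/>|<meta\b[^>]*>.*?</meta>", re.DOTALL)


def without_meta(entry_xml: str) -> str:
    """Return a copy of the entry XML with any <meta> section removed."""
    return META_RE.sub("", entry_xml)


entry = "<entry><meta><editor>x</editor></meta><sense>word</sense></entry>"
search_text = without_meta(entry)

print(search_text)  # <entry><sense>word</sense></entry>
print(entry)        # the original entry still has its <meta> section
```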