It was the late May bank holiday on Monday, so this was a four-day week. On Tuesday I decided to try working at my office at the University – my first full day back at my office since the first lockdown began. All went very smoothly; I didn’t meet anyone in the building and it seemed very quiet on campus generally. The only issue was the number of updates my computer had to install, which caused some delays. I’m probably going to try and come back to work on Tuesdays on a semi-regular basis now to see how things go.
I had some discussions with Marc and Arts IT Support this week about the possibility of purchasing a new server, and some progress is being made there. I also responded to a query regarding the Scots Syntax Atlas that Jennifer Smith forwarded on to me and spoke to Roslyn Potter about a project that a lecturer in History is needing a website for.
Other than these tasks I spent the week continuing to work on the Textbase feature of the Anglo-Norman Dictionary. Last week I’d left off with the infrastructure in place to browse texts, display the raw XML of pages and navigate between pages. My task for this week was to ensure that the XML displayed properly. This proved to be rather tricky: although I had managed to get access to the XSLT file that the Textbase on the old site used to transform the XML to HTML, it included a lot of stuff that wasn’t needed in the new site (e.g. formatting headers and footers) and also gave errors when plugged directly into the new system. For these reasons I had to adapt the XSLT. Also, I’d split up the full XML files into chunks for each page, resulting in more than 12,000 chunks. However, the XML often included elements that extended across pages, and when the content was extracted on a per-page basis this led to a broken XML structure, as some tags ended up missing their closing tags, or closed without featuring an opening tag. XSLT only works on well-formed XML, so I needed to find a way to fix this tag issue. After some Googling I discovered that there is a PHP extension called Tidy (https://www.php.net/manual/en/intro.tidy.php) that can take a malformed XML file and fix it. What this does is strip out any tag that lacks a matching opening or closing tag, which is exactly what I wanted. I wrote a little script that used the extension, tested it successfully on a few files and then ran all of the 12,000 pages through it.
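As an illustration of the tag-stripping idea, here is a minimal Python sketch. It is not the Tidy extension, nor the script described above; it is a simplified stand-in that assumes attribute values never contain `>` and that the chunk has no comments or CDATA sections containing tag-like text. The function names and the `<page>` wrapper element are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches opening, closing and self-closing tags.
# Assumption: attribute values never contain ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*)>")


def strip_unmatched_tags(fragment: str) -> str:
    """Remove tags whose partner lies outside this page chunk."""
    matches = list(TAG_RE.finditer(fragment))
    drop = set()        # indexes into `matches` of tags to remove
    open_stack = []     # (tag name, index into `matches`) of open tags

    for i, m in enumerate(matches):
        closing, name, rest = m.group(1), m.group(2), m.group(3)
        if closing:
            if name not in [n for n, _ in open_stack]:
                drop.add(i)  # its opening tag is on an earlier page
                continue
            # Anything opened after the matching opener was never closed.
            while open_stack[-1][0] != name:
                drop.add(open_stack.pop()[1])
            open_stack.pop()
        elif not rest.rstrip().endswith("/"):
            open_stack.append((name, i))  # ordinary opening tag

    # Openers still on the stack are closed on a later page.
    drop.update(idx for _, idx in open_stack)

    pieces, pos = [], 0
    for i, m in enumerate(matches):
        if i in drop:
            pieces.append(fragment[pos:m.start()])
            pos = m.end()
    pieces.append(fragment[pos:])
    return "".join(pieces)


def is_well_formed(fragment: str) -> bool:
    """Check the chunk parses once wrapped in a single (hypothetical) root."""
    try:
        ET.fromstring(f"<page>{fragment}</page>")
        return True
    except ET.ParseError:
        return False
```

Only the tags are removed; the text between them is kept, so a paragraph that starts on one page and ends on the next loses its `<p>` wrapper on both pages but none of its content.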
With a full set of well-formed XML page files I then began work on the XSLT to display the documents as required. This has been a very laborious process as I needed to go through each of the more than 70 documents and check the layout for any issues, and fix these as they cropped up. With more than 12,000 pages I couldn’t look at each individually, but instead took a random selection, a process that is working pretty well so far. The largest challenge was getting the explanatory notes to appear correctly, as these had been tagged in at least eight different ways throughout the documents, sometimes with entirely different XML structures and content. So far all is looking good, and I’m about halfway through checking the documents. I’ll continue with this task next week.
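The spot-checking approach could be sketched as follows. This is only an illustration in Python: the function name, the document and page identifiers and the sample size are all hypothetical, and the actual selection described above may well have been done by hand.

```python
import random


def sample_pages(pages_by_text, per_text=5, seed=0):
    """Pick up to `per_text` random pages from each document to check."""
    rng = random.Random(seed)  # fixed seed so the same sample can be revisited
    return {
        text_id: sorted(rng.sample(pages, min(per_text, len(pages))))
        for text_id, pages in pages_by_text.items()
    }


# Hypothetical identifiers, for illustration only.
pages_by_text = {
    "text-a": [f"text-a-p{n}" for n in range(1, 201)],
    "text-b": [f"text-b-p{n}" for n in range(1, 4)],
}
to_check = sample_pages(pages_by_text, per_text=5)
```

Sampling within each document, rather than across all pages at once, ensures that short texts are not skipped entirely.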