Week Beginning 5th November 2012

I spent a further 1-2 days this week working on the fixes for the advanced search of the SCOTS Corpus, and the test version can be found here: http://www.scottishcorpus.ac.uk/corpus/search/advanced-test-final.php. (Note that this is only a temporary URL; once the updates are signed off I'll replace the existing advanced search with this new version and the above URL will no longer work.) The functionality and the results displayed by the new version should be identical to the old version. There are, however, a few things that are slightly different:

1.  The processing of the summary, map, concordance and document list is now handled asynchronously, meaning that these elements all load independently of each other, potentially at different speeds. For this reason each of these sections now has its own 'loading' icon: the summary has the old-style animated book icon while the other sections have small spinners. I'm not altogether happy with this approach and I might try to get one overall 'loading' icon working instead.

2.  Choosing to display or hide the map, concordance or document list now takes effect immediately rather than requiring the query to be re-executed. Similarly, updating the map flags reloads only the map rather than every part of the search results. This approach is faster.

3.  I've encountered some difficulty with upper/lower case words in the concordance. The XSLT processor used by PHP performs a case-sensitive sort, which means (for example) that ordering the concordance table by the word to the left of the node sorts A-Z and then a-z rather than interleaving upper and lower case. I haven't settled on a solution to this yet but I am continuing to investigate; one possible workaround is sketched after this list.
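
One possibility I'm investigating is the common XSLT 1.0 trick of lower-casing the sort key with translate() so that the comparison ignores case (PHP's XSLTProcessor only supports XSLT 1.0). A minimal, self-contained sketch, with invented element names:

```php
<?php
// Demonstrates the translate() workaround for case-sensitive sorting
// in PHP's XSLTProcessor. Element names are invented for illustration.
$xml = new DOMDocument();
$xml->loadXML('<words><w>banana</w><w>Apple</w><w>cherry</w></words>');

$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/words">
    <ul>
      <xsl:for-each select="w">
        <!-- Lower-case the sort key so 'Apple' sorts before 'banana' -->
        <xsl:sort select="translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                                       'abcdefghijklmnopqrstuvwxyz')"/>
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($xml); // outputs Apple, banana, cherry in that order
?>
```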

I've tried to keep as close as I can to the original structure of the advanced search code (PHP queries the database and generates an XML file, which is then transformed by a series of XSLT scripts to create fragments of HTML content for display). Now that all the processing is being done on the server side this isn't necessarily the most efficient way to go about things; for example, in some places we could bypass the XML and XSLT stage entirely and just use PHP to create the HTML fragments directly from the database query.
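
As a flavour of what that shortcut might look like (the field names here are made up, and $rows stands in for the result of the existing database query):

```php
<?php
// Sketch of building an HTML fragment directly in PHP, with no
// intermediate XML/XSLT stage. $rows stands in for the rows returned
// by the existing database query; the field names are invented.
$rows = array(
    array('title' => 'Document one', 'hits' => 12),
    array('title' => 'Document two', 'hits' => 3),
);

$html = '<ul class="doclist">';
foreach ($rows as $row) {
    $html .= '<li>' . htmlspecialchars($row['title'])
           . ' (' . (int)$row['hits'] . ' hits)</li>';
}
$html .= '</ul>';
echo $html;
?>
```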

If the website is more thoroughly redeveloped I'd like to return to the search functionality to try to make things faster and more efficient. However, for the time being I'm hoping the current solution will suffice (depending on whether the issues mentioned above are a big concern or not).

It should also be noted that the advanced search (in both its original and 'fixed' forms) isn't particularly scalable: there is no pagination of results, so a search for a word that brings back a large number of hits will cause both old and new versions to fall over. For example, a search for 'the' brings back about a quarter of a million hits, and the advanced search attempts to process and display all of these in the document list, concordance and map on a single page, which is far too much data for one page to realistically handle. Another thing to address if the site gets more fully redeveloped!
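
For what it's worth, pagination would mean only ever fetching and rendering one screenful of matches per request; a rough sketch, with an entirely hypothetical table of pre-computed matches:

```php
<?php
// Hypothetical pagination sketch: fetch one page of matches at a
// time rather than all ~250,000 rows at once. The table and column
// names are invented for illustration.
$perPage = 50;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$offset  = ($page - 1) * $perPage;

$db  = new mysqli('localhost', 'user', 'pass', 'scots');
$sql = sprintf(
    "SELECT doc_id, left_context, node, right_context
     FROM matches WHERE word = 'the' LIMIT %d OFFSET %d",
    $perPage, $offset
);
$result = $db->query($sql);
while ($row = $result->fetch_assoc()) {
    // render one concordance line per row here
}
?>
```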

I spent about half a day this week working for the Burns project, completing the migration of the data from their old website to the new one. This is now fully up and running (http://burnsc21.glasgow.ac.uk/) and I've made some further tweaks to the site, implementing nicer title-based URLs and fixing a few CSS issues, such as the background image not displaying properly on widescreen monitors.

I dedicated about a day this week to looking into the updates required for the Digital Humanities Network pages, which were decided upon at the meeting a couple of weeks ago with Graeme, Ann and Marc. I've updated the existing database to incorporate the required additional fields and tables and I've created a skeleton structure for the new site. I also used this time to look into a more secure way of running database queries in PHP: PHP Data Objects (PDO). It's an interface that sits between PHP and the underlying database and supports prepared statements and stored procedures. Prepared statements are very effective at preventing SQL injection attacks, and I intend to use this interface for all database queries in future.
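
As a quick illustration of why I like it (connection details, table and column names below are just placeholders): the SQL is compiled with a placeholder and the user-supplied value is bound separately, so it can never be interpreted as part of the query.

```php
<?php
// Minimal PDO sketch: prepared statement with a bound parameter.
// Credentials, table and column names are placeholders.
$db = new PDO('mysql:host=localhost;dbname=dhnetwork', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The unsafe equivalent would splice $_GET['surname'] straight into
// the SQL string; here it is passed as data, not as SQL.
$stmt = $db->prepare('SELECT forename, surname FROM people WHERE surname = :surname');
$stmt->execute(array(':surname' => $_GET['surname']));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo htmlspecialchars($row['forename'] . ' ' . $row['surname']) . '<br>';
}
?>
```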

I spent the remainder of the week getting back into the Open Corpus Workbench server that I am working on with Stevie Barrett in Celtic. My main aim this week was to get a large number of the Hansard texts Marc had given me uploaded into the corpus. As is often the case, this proved to be trickier than first anticipated. The Hansard texts have been tagged with part-of-speech, lemma and semantic tags, all set up nicely in tab-delimited text files which also contain the <s> tags that the server needs. They also include a lot of XML tags containing metadata that can be used to provide limiting options in the restricted query. Unfortunately the <s> tags have been added to the existing XML files in a rather brutal manner – stuck in between tags, at the start of the file before the initial XML declaration, and so on. This means the files are very far from being valid XML.
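
To give a flavour of the problem, here is an invented fragment in the same spirit (not a real Hansard file): tab-delimited token lines with part-of-speech, lemma and semantic tags, and <s> markers dropped in wherever they happened to land, including before the XML declaration.

```
<s><?xml version="1.0"?>
<debate><s>
<speech id="s1"><s>
The	AT0	the	Z5
House	NN1	house	H2
met	VVD	meet	S1.1
</s></speech></debate>
```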

I had intended to develop an XSLT script that would reformat the texts for input, but XSLT requires its input to be well-formed XML, so that idea was a non-starter. Instead I decided to read the files into PHP and split them up by the <s> tag, processing the contents of each section in turn to extract the metadata we want to include and the actual text we want stored in the corpus. As the <s> tags were placed so arbitrarily it was very difficult to develop a script that caught every possible permutation. However, by the end of the week I had constructed a script that could successfully process all 428 text files Marc had given me (and that will hopefully be able to cope with the remaining data when I get it). Next week I will update the script to complete the saving of the extracted metadata in a suitable text file, and I will then attempt the actual upload to the corpus.
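
The core of the script is straightforward even if the permutations aren't; a simplified sketch of the splitting stage (the file name and the metadata/token test are simplified for illustration):

```php
<?php
// Simplified sketch of the splitting stage: treat each file as plain
// text, split on the <s> markers, and sort each line into metadata
// (XML-ish tags) or tab-delimited token lines.
$raw    = file_get_contents('hansard-file.txt');
$chunks = preg_split('/<\/?s>/', $raw);

$metadata = array();
$tokens   = array();
foreach ($chunks as $chunk) {
    foreach (preg_split('/\r?\n/', trim($chunk)) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if ($line[0] === '<') {
            $metadata[] = $line; // a stray XML tag: keep for the metadata file
        } else {
            $tokens[] = $line;   // word<TAB>POS<TAB>lemma<TAB>semtag line
        }
    }
}
?>
```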

I’m afraid I have been unable to find the time this week to get started on the redevelopment of any of the STELLA applications.  Once Hansard is out of the way next week I should hopefully have the time to get started on these in earnest.
