Week Beginning 26th November 2012

This was my first week in my new office and while the room is lovely the heating has been underperforming somewhat, meaning I’ve been chilled to the bone by lunchtime most days.  Thankfully some heating engineers worked their magic on Thursday and after that the office has been nice and toasty.

I spent a lot of time this week continuing to develop the ‘Readings’ app that I started last week.  In fact, I have now completed this app in standard HTML5.  This version can be accessed here: http://www.arts.gla.ac.uk/STELLA/briantest/readings/ (but note that this is a test URL and will probably be taken down at some point).  All the content for Old, Middle and Early Modern English is present, including all sound files, texts, translations and notes.  After completing this work I started to look into wrapping the website up as an app and deploying it.  Unfortunately I haven’t quite managed to get PhoneGap (or Apache Cordova, as the open source version is properly known: http://docs.phonegap.com/en/2.2.0/index.html) working on my PC yet.  I spent a frustrating couple of hours on Friday afternoon trying to set it up, but by the end of the day I was still getting errors.  Next week I will continue with this task.

One limitation of app development will be that developing apps for iOS requires not only a Mac but also paying Apple $99 per year for a developer certificate.  I’ll have to see whether this is going to be feasible.  It might be possible to arrange something through STELLA and Marc.

Also this week I continued to develop the Digital Humanities Network website, fixing a few issues, such as ‘subjects’ not working properly.  I also created a new way of recording project PIs, as the existing system was a bit inefficient and led to people being recorded under different names (e.g. sometimes with ‘Professor’, other times without).  Now each PI is recorded in the system only once and then linked to as many projects as required.  I also updated the ‘projects’ page so that it is possible to view projects linked to a specific PI.  And finally, I asked some people to sign up with the site and we now have a decent selection of people represented.  More would still be good though!
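
For the curious, the shape of this arrangement is roughly as follows: a minimal sketch using a joining table, with the table and column names made up for illustration rather than taken from the actual site.

```php
<?php
// Minimal sketch of the PI / project linking described above.  The table
// and column names are illustrative, not the site's actual schema.
//
// CREATE TABLE pis (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255));
// CREATE TABLE project_pis (project_id INT, pi_id INT,
//                           PRIMARY KEY (project_id, pi_id));

$pdo = new PDO('mysql:host=localhost;dbname=dhnetwork', 'user', 'password');

// Each PI exists once in 'pis' and can be joined to any number of projects.
$stmt = $pdo->prepare(
    'SELECT p.title
       FROM projects p
       JOIN project_pis pp ON pp.project_id = p.id
      WHERE pp.pi_id = :pi'
);
$stmt->execute(array(':pi' => 1)); // 1 is a placeholder PI id
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $title) {
    echo $title . "\n";
}
```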

My other major task this week was to work some more on the Burns website.  I started last week to look into having sub-pages for each song, and this week I found a solution which I have now implemented on my local test installation of the website.  I reached the solution in a bit of a roundabout way, unfortunately.  I initially intended song ‘pages’ to be blog posts, with a category listing in the menu to enable drop-down access to the individual song ‘pages’.  I thought this would work quite nicely as it would allow commenting on the song pages, and it would still allow an HTML5 player to be embedded within the blog content.  However, the more I looked into this solution the more I realised it was far from ideal.  You can’t have a drop-down list of blog posts from a menu in WordPress (which is understandable as there could be thousands of blog posts), so I had to create sub-categories that would only ever be used for one single post.  Plus, when viewing the blog archives or other blog views, the song pages would be all mixed in with the proper blog posts.  In the end I found a much easier way of having sub-pages represented in the menu bar as drop-down items and added these instead.  At the moment I’ve had to activate commenting on all pages in order for users to be able to post comments about songs.  There will be a way to turn comments off on certain pages, but I still need to find it.
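
One possible approach, sketched below but not yet tested on the site, would be a small filter in the theme’s functions.php; the page IDs shown are placeholders, not the real ones.

```php
<?php
// A sketch only (not yet tested on the site): close comments on specific
// pages while leaving the song pages open.  The IDs below are placeholders.
function burns_close_comments_on_pages( $open, $post_id ) {
    $pages_without_comments = array( 2, 7, 15 ); // hypothetical page IDs
    if ( in_array( $post_id, $pages_without_comments ) ) {
        return false;
    }
    return $open;
}
add_filter( 'comments_open', 'burns_close_comments_on_pages', 10, 2 );
```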

Also this week I attended a further meeting of the Corpus Workgroup, which was useful. We are all very happy with the way the test server is working out and we now need to get a dedicated server for the Corpus software.  The next step of development will be to try and get multiple front-ends working with the data, which should be an interesting task.

Week Beginning 19th November 2012

I am writing this week’s post from the delightful surroundings of my new office.  It’s been almost three months since I started the job, and although it has been great spending that time with my old HATII colleagues it feels very pleasant to finally be in my own office!

I began this week by completing work on the revamped Digital Humanities Network pages that I was working on last week.  I spent most of Monday tweaking the pages, adding sample content and fixing a few bugs that had reared their heads.  By the end of the day I had emailed Ann, Jeremy, Marc and Graeme about the pages and received favourable feedback during the course of the week.  On Friday Marc, Ann, Graeme and I met to discuss the pages and to decide who should write the site text that still needs to be supplied.

I spent the majority of this week working on the STELLA apps, something I’ve been meaning to start for several weeks.  Initially I focussed on looking into some possible JavaScript / HTML5 / CSS3 frameworks that I might be able to utilise to develop the apps.  I’m a big fan of jQuery and I really liked the look of jQuery Mobile (http://jquerymobile.com/).  It provides a wide array of UI widgets and structural page elements that can be configured to work really quickly and effectively.  After reading up on the framework I began creating some actual pages and really liked how it worked.  My only concern is that it uses a variety of custom HTML5 attributes to handle themes and structure, e.g. <div data-role="page" data-theme="b"> specifies that this div is a page-level element and has the styles for theme ‘b’ applied to it.  Although jQuery Mobile has been developed to be cross-platform and to fail gracefully when used with older browsers or with JavaScript disabled, I do still have a slight concern that in a few years’ time these attributes will be completely outdated.  Having said that, jQuery has a massive amount of support and is very widely adopted, so I’m hopeful their custom attributes will continue to work for many years to come.

I decided to start developing the ‘Readings in Early English’ app as I figured this would be the simplest to tackle, seeing as it has no exercises built into it.  I familiarised myself with the jQuery Mobile framework and built some test pages, and by the end of the week I had managed to put together an interface that was pretty much identical to the PowerPoint-based mock-ups that I had made previously.  Currently only the ‘Old English’ section contains content, but within this section you can open a ‘reading’ and play the sound clip using HTML5’s <audio> tag, through which the user’s browser embeds an audio player within the page.  It works really smoothly and requires absolutely no plug-ins.  The ‘reading’ pages also feature original texts and translations / notes.  I created a little bit of adaptive CSS using jQuery to position the translation to the right of the original text if the browser’s window is over 500px wide, or underneath the original text if the window is narrower than this.  It works really well and allows the original text and the translation to be displayed side by side when the user has their phone in landscape mode, automatically switching to displaying the translation beneath the original text when they flip their phone to portrait mode.  I’m really happy with how things are working out so far, although I still need to see about wrapping the website as an app.  Plus the websites that have a lot of user interaction (i.e. exercises) are going to be a lot more challenging to implement.

The test version of the site can be found here: http://www.arts.gla.ac.uk/STELLA/briantest/readings/ although you should note that this is a test URL and content is liable to be removed or broken in future.

Also this week I met with Marc to discuss the Hansard texts and the Test Corpus Server.  Although I managed to get over 400 texts imported into the corpus, this really is just a drop in the ocean as there are more than 2.3 million pages of text in the full body.  It’s going to be a massive undertaking to get all these texts and their metadata formatted for display and searching, and we are thinking of developing a Chancellor’s Fund bid to get some dedicated funds to tackle the issue.  There may be as many as 2 billion words in the corpus!

I also found some time this week to look into some of the outstanding issues with the Burns website.  I set up a local instance of the website so I could work on things without messing up the live content.  What I’m trying to do at the moment is make individual pages for each song that is listed on the ‘Song & Music’ page.  It sounds like a simple task but it’s taking a little bit of work to get right.  I will continue with this task on Monday and will hopefully have something ready to deploy on the main site next week.

Week Beginning 12th November 2012

I devoted the beginning of this week to corpus matters, continuing to work with the Hansard texts that I spent a couple of days looking into last week.  By the end of last week I had managed to write a PHP script that could read all 428 sample Hansard texts and hopefully spit them out in a format that would be suitable for upload into our Open Corpus Workbench server.

I ran into a few problems when I came to upload these files on Monday morning.  Firstly, I hadn’t wrapped each text in the necessary <text id=""> tags, something that was quickly rectified.  I then had to deal with selecting all 400-odd files for import.  The front-end lists all files in the upload area with a checkbox beside each, and no ‘select all’ option.  Thankfully the Firefox developer’s toolbar has an option allowing you to automatically tick all checkboxes, but unfortunately the corpus front-end posts forms using the GET rather than the POST method, so any form variables are appended to the URL when the request is sent to the server.  A URL with 400 filenames attached is too long for the server to process and results in an error, so it was back to the drawing board.  Thankfully a solution presented itself fairly quickly: you don’t need to have a separate file for every text in your corpus; you can have any number of texts bundled together into one text file, separated by those handy <text> tags.  A quick update of my PHP script later and I had one 2.5MB text file rather than 428 tiny text files, and this was quickly and successfully imported into the corpus server, with part-of-speech, lemma and semantic tags all present and correct.
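
For illustration, the bundling step boils down to something like the following sketch; the file paths and the way the IDs are derived are assumptions rather than the actual script.

```php
<?php
// Sketch of bundling many individual corpus files into a single file,
// wrapping each one in <text id=""> tags.  Paths and the id scheme are
// illustrative only.
$files  = glob('hansard-sample/*.txt');       // the individual input files
$output = fopen('hansard-combined.txt', 'w'); // the single combined file

foreach ($files as $file) {
    $id = pathinfo($file, PATHINFO_FILENAME); // use the filename as the text id
    fwrite($output, '<text id="' . $id . '">' . "\n");
    fwrite($output, rtrim(file_get_contents($file)) . "\n");
    fwrite($output, "</text>\n");
}
fclose($output);
```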

After dealing with the texts came the issue of the metadata.  Metadata is hugely important if you want to be able to restrict searches to particular texts, speakers, dates etc.  For Hansard I identified nine metadata fields from the XML tags stored in the text files: Day, Decade, Description, Member, Member Title, Month, Section, URL and Year.  I created another PHP script that read through the sample files, extracted the necessary information and created a tab-delimited text file with one row per input file and one column per metadata item.  It wasn’t quite this simple though, as the script also had to create IDs for each distinct metadata value and use these IDs in the tab-delimited file rather than the values, as this is what the front-end expects.  In the script I held the actual values in arrays so that I could then use these to insert both the values and their corresponding IDs directly into the underlying database after the metadata text file was uploaded.
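
In simplified form, the ID-assignment part of the script worked along these lines.  Only two of the nine fields are shown, and the tag names, file paths and output layout are illustrative rather than the real thing.

```php
<?php
// Simplified sketch of turning metadata values into numeric IDs and writing
// one tab-delimited row per input file.  The real script handled nine fields;
// only two are shown here, and the tag names are illustrative.
$valueIds = array('member' => array(), 'year' => array());
$rows     = array();

foreach (glob('hansard-sample/*.txt') as $file) {
    $contents = file_get_contents($file);

    // Pull out the metadata values for this file.
    $meta = array('member' => '', 'year' => '');
    foreach ($meta as $field => $unused) {
        if (preg_match('/<' . $field . '>(.*?)<\/' . $field . '>/s', $contents, $m)) {
            $meta[$field] = trim($m[1]);
        }
    }

    // Build the row: filename followed by the ID for each metadata value.
    $row = array(pathinfo($file, PATHINFO_FILENAME));
    foreach ($meta as $field => $value) {
        if (!isset($valueIds[$field][$value])) {
            // first time this value has been seen: give it the next ID
            $valueIds[$field][$value] = count($valueIds[$field]) + 1;
        }
        $row[] = $valueIds[$field][$value];
    }
    $rows[] = implode("\t", $row);
}

file_put_contents('hansard-metadata.txt', implode("\n", $rows) . "\n");
```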

Metadata file upload took place in several stages:

  1. The tab delimited text file with one row per text and one column per metadata element (represented as IDs rather than values) was uploaded.
  2. A form was filled in telling the system the names of each metadata column in the input file.
  3. At this point the metadata was searchable in the system, but only using IDs rather than actual values, e.g. you could limit your search by ‘Member’ but each member was listed as a number rather than an actual name.
  4. It is possible to use the front end to manually specify which IDs have which value, but as my test files had more than 200 distinct metadata values this would have taken too long.
  5. So instead I created a PHP script that inserted these values directly into the database (see the sketch after this list).
  6. After doing the above, restricted queries worked, e.g. you can limit a search to speaker ‘Donald Dewar’ or topic ‘Housing (Scotland)’ or a year or decade (less interesting for the sample texts as they are all from one single day!).
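
The direct insert mentioned in step 5 amounted to something like the sketch below; the table and column names are guesses at the sort of structure the front-end uses rather than the real ones, and $valueIds is the array built in the metadata sketch further up.

```php
<?php
// Sketch of step 5: pushing the ID/value pairs straight into the database
// behind the corpus front-end.  The table and column names are hypothetical,
// and $valueIds is the array built in the earlier metadata sketch.
$pdo  = new PDO('mysql:host=localhost;dbname=corpus', 'user', 'password');
$stmt = $pdo->prepare(
    'INSERT INTO metadata_values (field, value_id, value) VALUES (:field, :id, :value)'
);

foreach ($valueIds as $field => $values) {
    foreach ($values as $value => $id) {
        $stmt->execute(array(':field' => $field, ':id' => $id, ':value' => $value));
    }
}
```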

After I finished the above I spent the majority of the rest of the week working on the new pages for the Digital Humanities Network.  The redevelopment of these pages has taken a bit longer than I had anticipated as I decided to take the opportunity to get to grips with PHP Data Objects (http://php.net/manual/en/book.pdo.php) as a more secure means of executing database queries.  It has been a great opportunity to learn more about these, but it has meant I’ve needed to redevelop all of the existing scripts to use this new method, and also spend more time than I would normally need getting the syntax of the queries right.  By the end of the week I had just about completed every feature that was discussed at the Digital Humanities meeting a few weeks ago.  I should be able to complete the pages next Monday and will email the relevant people for feedback after that.
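
For anyone unfamiliar with PDO, the basic pattern looks something like this; the database, table and column names here are made up for illustration.

```php
<?php
// Minimal PDO example: a prepared statement keeps user-supplied values
// separate from the SQL itself, which is what guards against injection.
// The database, table and column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=dhnetwork', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('SELECT id, title FROM projects WHERE subject = :subject');
$stmt->execute(array(':subject' => $_GET['subject'])); // value is bound, never concatenated
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $project) {
    echo htmlspecialchars($project['title']) . "<br>";
}
```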

I also spent a little time this week reading through some materials for the Burns project and thinking about how to implement some of the features they require.

I still haven’t managed to make a start on the STELLA apps, but the path should now be clear to begin on these next week (although I know it’s not the first time I’ve said that!).

Week Beginning 5th November 2012

I spent a further 1-2 days this week working on the fixes for the advanced search of the SCOTS Corpus, and the test version can be found here:  http://www.scottishcorpus.ac.uk/corpus/search/advanced-test-final.php. (Note that this is only a temporary URL and once the updates get signed off I’ll replace the existing advanced search with this new one and the above URL will no longer work).  The functionality and the results displayed by the new version should be identical to the old version.  There are, however, a few things that are slightly different:

1.  The processing of the summary, map, concordance and document list is handled asynchronously, meaning that these elements all load independently of each other, potentially at different speeds.  For this reason each of these sections now has its own ‘loading’ icon.  The summary has the old-style animated book icon while the other sections have little spinning things.  I’m not altogether happy with this approach and I might try to get one overall ‘loading’ icon working instead.

2.  Selecting to display or hide the map, concordance or document list now does so immediately rather than having to re-execute the query.  Similarly, updating the map flags only reloads the map rather than every part of the search results.  This approach is faster.

3.  I’ve encountered some difficulty with upper / lower case words in the concordance.  The XSLT processor used by PHP performs a case-sensitive sort, which means (for example) that ordering the concordance table by the word to the left of the node sorts A-Z and then a-z.  I haven’t found a solution to this yet but I am continuing to investigate.

I’ve tried to keep as close as I can to the original structure of the advanced search code (PHP queries the database and generates an XML file, which is then transformed by a series of XSLT scripts to create fragments of HTML content for display).  Now that all the processing is being done on the server side this isn’t necessarily the most efficient way to go about things; for example, in some places we could completely bypass the XML and XSLT stage and just use PHP to create the HTML fragments directly from the database query.
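
The server-side transformation step itself is fairly compact in PHP; a stripped-down sketch, with placeholder file names, looks something like this.

```php
<?php
// Stripped-down sketch of the server-side XSLT step: load the XML that the
// database query produced, apply one of the existing stylesheets, and send
// back an HTML fragment for the page to pull in via AJAX.  File names are
// placeholders.
$xml = new DOMDocument();
$xml->load('results.xml');        // the XML generated from the database query

$xsl = new DOMDocument();
$xsl->load('concordance.xsl');    // one of the existing XSLT scripts

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

header('Content-Type: text/html; charset=utf-8');
echo $processor->transformToXML($xml);   // the HTML fragment returned to the browser
```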

If the website is more thoroughly redeveloped I’d like to return to the search functionality to try and make things faster and more efficient.  However, for the time being I’m hoping the current solution will suffice (depending on whether the issues mentioned above are a big concern or not).

It should also be noted that the advanced search (in both its original and ‘fixed’ formats) isn’t particularly scalable – there is no pagination of results and a search for a word that brings back large numbers of results will cause both old and new versions to fall over.  For example, a search for ‘the’ brings back about a quarter of a million hits, and the advanced search attempts to process and display all of these in the doclist, concordance and map on one page, which is far too much data for one page to realistically handle.  Another thing to address if the site gets more fully redeveloped!

I spent about half a day this week working for the Burns project, completing the migration of the data from their old website to the new.  This is now fully up and running (http://burnsc21.glasgow.ac.uk/) and I’ve made some further tweaks to the site, implementing nicer title-based URLs and fixing a few CSS issues, such as the background image not displaying properly on widescreen monitors.

I dedicated about a day this week to looking into the updates required for the Digital Humanities Network pages, which were decided upon at the meeting a couple of weeks ago with Graeme, Ann and Marc.  I’ve updated the existing database to incorporate the required additional fields and tables and I’ve created a skeleton structure for the new site.  I also used this time to look into a more secure manner of running database queries in PHP – PHP Data Objects (PDO).  It’s an interface that sits between PHP and the underlying database and allows prepared statements and stored procedures.  It is very good at preventing SQL injection attacks and I intend to use this interface for all database queries in future.

I spent the remainder of the week getting back into the Open Corpus Workbench server that I am working on with Stevie Barrett in Celtic.  My main aim this week was to get a large number of the Hansard texts Marc had given me uploaded into the corpus.  As is often the case, this proved to be trickier than first anticipated.  The Hansard texts have been tagged with part-of-speech, lemma and semantic tags, all set up nicely in tab-delimited text files which also contain the <s> tags that the server needs.  They also include a lot of XML tags containing metadata that can be used to provide limiting options in the restricted query.  Unfortunately the <s> tags have been added to the existing XML files in a rather brutal manner – stuck in between tags, at the start of the file before the initial XML declaration, etc.  This means the files are very far from being valid XML.

I was intending to develop an XSLT script that would reformat the texts for input, but XSLT requires XML input files to be well formed, so that idea was a no-go.  I decided instead to read the files into PHP and to split them up by the <s> tag, processing the contents of each section in turn in order to extract the metadata we want to include and the actual text we want to be logged in the corpus.  As the <s> tags were placed so arbitrarily it was very difficult to develop a script that caught all possible permutations.  However, by the end of the week I had constructed a script that could successfully process all 428 text files that Marc had given me (and will hopefully be able to cope with the remaining data when I get it).  Next week I will update the script to complete the saving of the extracted metadata in a suitable text file and I will then attempt the actual upload to the corpus.
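
In outline the approach was along the lines of the sketch below, though heavily simplified; the real script has to cope with many more permutations, and the file path and metadata tag names shown are illustrative.

```php
<?php
// Heavily simplified sketch of processing one Hansard file: because the
// files are not valid XML they are split on the <s> tags and each chunk
// is inspected with regular expressions instead of a proper XML parser.
$contents = file_get_contents('hansard-sample/example.txt'); // placeholder path
$chunks   = preg_split('/<s(?:\s[^>]*)?>/', $contents);

foreach ($chunks as $chunk) {
    // Pull out any metadata tags that happen to sit in this chunk,
    // e.g. the member's name (the tag name here is illustrative).
    if (preg_match('/<member>(.*?)<\/member>/s', $chunk, $match)) {
        $member = trim($match[1]);
    }
    // Strip the remaining tags, leaving the tagged running text
    // that will actually go into the corpus.
    $text = trim(strip_tags($chunk));
    // ... save $text and the collected metadata for later output ...
}
```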

I’m afraid I have been unable to find the time this week to get started on the redevelopment of any of the STELLA applications.  Once Hansard is out of the way next week I should hopefully have the time to get started on these in earnest.

Week Beginning 29th October 2012

Back from my week off this week, with lots to do.  Whilst I was away Justin Livingstone of SCS got in touch with HATII about a project proposal related to David Livingstone.  Thankfully my colleagues in HATII were able to provide Justin with some helpful advice before his deadline, and I contacted Justin to offer my services if he needs them in future.

One of the pressing outstanding issues on my ‘to do’ list has been to fix the Scottish Corpus Advanced Search page, which is currently broken in the most recent versions of IE and also in Chrome.  As I had received access to the server the week before my holiday and had a Corpus meeting set up with Wendy, Jane and Marc this Thursday, it seemed like a good time to tackle this issue.  I had been hoping that the problem would be a relatively simple JavaScript issue, but as I began to delve into the code it became clear that solving it was going to be a larger undertaking.

The SCOTS advanced search page (http://www.scottishcorpus.ac.uk/corpus/search/advanced.php) allows a massive amount of customisation, with a range of display options for the results including a Google map, a concordance and a document list.  The page works by creating a query, feeding it to the database and generating an XML file from the database results.  This XML file, together with several XSLT files, is then pulled into the user’s browser for processing using a JavaScript plugin called Sarissa (http://sourceforge.net/projects/sarissa/).

Unfortunately it is this plugin that doesn’t work with Chrome and IE – see this test page: http://dev.abiss.gr/sarissa/test/testsarissa.html.  Sarissa is needed because different browsers have different processors for working with XML and XSLT files.  But although it set out to solve some of these incompatibility problems, it has introduced others.  These days it’s generally considered a bit messy to rely on client-side browsers to process XSLT – far better to process the XML using XSLT on the server and then use AJAX to pull in the transformed text from the server and display it.  This is what I set out to do with the advanced search page.

This has basically required me to rip out the heart of the advanced search page and build a new one.  This task has taken up most of the week, but I am just about ready to launch the new version.  It works in both Chrome and IE and handles all XSLT on the server side.  It also uses a JSON file for populating the Google Map, as JSON is much easier than XML to work with as a data source for JavaScript.  I’ve also introduced the jQuery JavaScript library as it vastly simplifies the JavaScript needed to work with page elements and AJAX.  I still need to add the ‘loading’ spinner to my new version of the advanced search and to properly test it, but it should be possible to go live with this new version next week.
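
The JSON side of things is straightforward: the map data is essentially just an array of points passed through json_encode, along the lines of the sketch below.  The field names and sample data are illustrative only.

```php
<?php
// Sketch of the JSON that feeds the Google Map: each search result is
// reduced to a point with a title, and the lot is returned as JSON for
// the JavaScript on the page to plot.  Field names and sample data are
// illustrative only.
$results = array( // in the real page these rows come from the search query
    array('document_title' => 'Sample document', 'latitude' => 55.87, 'longitude' => -4.29),
);

$flags = array();
foreach ($results as $row) {
    $flags[] = array(
        'title'     => $row['document_title'],
        'latitude'  => (float) $row['latitude'],
        'longitude' => (float) $row['longitude'],
    );
}

header('Content-Type: application/json');
echo json_encode($flags);
```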

I had been hoping to work some more on the Open Corpus Workbench server this week, plus make a start on the STELLA applications, but due to getting bogged down in the SCOTS advanced search both of these tasks will have to wait until next week.

I did manage to do some other tasks this week, however.  As mentioned earlier, I had a meeting with Wendy, Marc and Jane to discuss Corpora in general at the university.  This was a useful meeting and it was especially interesting to hear about the sound corpora that Jane is working with and the software her team are using (something called LabCat).  I’ll need to investigate this further.

Also this week I had a further meeting with Alison Wiggins regarding possible mobile and tablet versions of Bess of Hardwick.  We had a bit of a brainstorming session and came up with some possible ideas for the tablet version that could form the basis of a bid.  We also got in touch with Sheffield to ask about access to the server in order to develop a simple mobile interface to the existing site.  Although it won’t be possible for me to get access to the server directly, we reached an agreement whereby I will develop the interface around static HTML pages here in Glasgow and then I’ll send my updated interface to Sheffield for inclusion.

My final task of the week was to migrate the ‘Editing Robert Burns for the 21st Century’ website to a more official sounding URL.  I haven’t quite managed to complete this task yet but I’ll get it all sorted on Monday next week.