Another short week as I'm off on Friday, and all of next week too. There's not a massive amount to report this week. I spent a bit more time working on the AHRC bid for Nigel, mostly just tying up a few loose ends and gathering some final costings. On Tuesday I met with Stevie Barratt to help him with the Corpus Server. He wanted to set up a separate corpus instance specifically for DASG but ran into problems getting things set up. I spent a fair bit of time going through old emails and trying to remember how we dealt with the software last time, as it has been several months since I last looked into Corpus issues. Stevie and I spent a couple of hours going through things, but although a little bit of progress was made we didn't manage to crack the problem by the end of this time. There is some issue preventing the install script from creating the required tables in MySQL, which is very odd as we have updated the privileges to be exactly the same as when we set things up last time and reloaded MySQL so the changes would take effect. It was all a bit frustrating and it was disappointing not to be able to get to the bottom of things. I hope Stevie manages to find the cause of the problem.
Other than corpus and AHRC related matters I spent some more time working on the mock-ups for the DSL website, including a fifth version of the interface plus adding homepage content to the mock-ups to see how the layouts look with actual content in place. I emailed the URLs for the designs to the SLD people on Thursday and will continue to work on the mock-ups when I’m back from my holidays based on any feedback I hear from them.
Monday was a holiday this week, so I plunged back into the redevelopment of the Historical Thesaurus website on Tuesday, after being ill and then working on other things last week. The big thing I tackled this week was the implementation of the Advanced Search. This is now fully operational, although it is a bit slow when category information is added to the search criteria – I'll need to look into this further. But it does now work – you can search for words, parts of speech, labels, categories and dates (and any combination of these). I have updated the layout of the search page, adding jQuery UI tabs to split up quick, advanced and 'jump to category' searches and adding in help text as hover-overs. I may have to look at alternatives to this, though, as hover-overs don't work on touchscreens. I also tweaked the 'jump to category' page to make the 't' boxes automatically focus the next field when two characters are entered in a box, which vastly speeds up the entering of information. I've also made the search form 'remember' what a user has searched for, enabling search term refinement, and I made sure that the correct tab loads when following links from the 'category selection' page, so for example if you've done an advanced search you don't end up looking at the quick search.
I spent quite a bit of time this week working on improvements to the user interface of the website, which has been enjoyable. I've updated the homepage now so there are three blocks of content – introductory text, the quick search and a random category. The random category feature pulls back a category from the database that has at least one word in it each time the page loads, displaying up to 10 words from the category each time. It's quite a nice feature, and a good way to jump straight into the category browse pages. Also this week I created a new script that generates the full HT category hierarchy from a given point in the system. Simply pass a category ID to the function and get back an array of all parent categories. This is a very useful piece of code, and I've added it to both the random category feature and the category page, allowing users to jump straight to any point higher up in the HT hierarchy.
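The parent-lookup logic can be sketched like this – in Python rather than the PHP used on the site, with a small dict standing in for the real database table. The category IDs, headings and the `parent_id` column name are all invented for illustration, not the actual HT schema:

```python
# Sketch of the 'full hierarchy from a category ID' lookup, using a dict
# in place of the real database table. IDs and column names are illustrative.
CATEGORIES = {
    1: {"heading": "The world", "parent_id": None},
    2: {"heading": "The earth", "parent_id": 1},
    3: {"heading": "Land", "parent_id": 2},
    4: {"heading": "Island", "parent_id": 3},
}

def ancestors(cat_id):
    """Return the chain of parent category IDs, nearest parent first."""
    chain = []
    parent = CATEGORIES[cat_id]["parent_id"]
    while parent is not None:
        chain.append(parent)
        parent = CATEGORIES[parent]["parent_id"]
    return chain

print(ancestors(4))  # [3, 2, 1]
```

Walking up the chain one parent at a time like this is cheap as long as the hierarchy is shallow, which is why one function can serve both the random category feature and the category page.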
I also implemented some major updates to the category pages. I completely overhauled the way subcategories are displayed. Previously they were just displayed in a long line with no indication as to which subcategories were direct children of the main category and which were actually subcategories of other subcategories. This has now been rectified using indentation and changes in shade. Top level subcategories have no indent and a white background, level two subcategories are indented and have a slightly darker background, level three more so etc. I think it works pretty well. It can take up a lot of space for categories that have many subcategories though, and for this reason I’ve used a bit of jQuery to hide the list until a button is clicked on. I’ve also updated the links back to the main category from a subcategory so that the user is taken back to the open list of subcategories if following this navigation path, rather than being taken back to the top of the main category page and then having to scroll down to the list of subcategories and open it again.
I further updated the category page to improve the layout of the hierarchy traversal options and the options to view different parts of speech at the level currently being viewed. I think these navigation options work pretty well, but will await feedback from others.
I still need to do some further work with subcategories. Although the subcategory list is now hierarchical, subcategory pages are still actually all at one level. No matter how ‘deep’ the subcategory the only link back is to the main category, rather than to a parent subcategory. I will need to tackle this next week.
I met with Marc and Christian on Thursday and we spent a very useful couple of hours going through the site and tackling some of the questions that had accumulated since the last meeting. One outcome of the meeting is that I will need to update the way dates are searched for in the advanced search. Currently dates such as 1400/50 are recorded with 1400 in one column and 50 in another. I will need to update the database so that (for last cited dates) the later date is used. I will also need to update the search boxes to incorporate OE and Current options in both the first and the last cited date lines. There is also still a massive amount to do with the refresh of data from Access. That is going to be a rather large and somewhat daunting task.
Other than HT stuff this week I met with Stevie Barratt for a catch-up regarding the Corpus server. He had posted a question about redeveloping the user interface on the cqpweb mailing list and Andrew Hardie replied stating that separating out the user interface from the rest of the code to allow different layouts to be plugged in is not something that they are planning to tackle. They are hoping to develop an API, which would be hugely useful, but there is no timetable for this at the moment. It looks like Stevie is going to have to try and delve into the code and make changes directly to it, and he’s going to keep me posted on his progress, as eventually SCS will be wanting to use the same infrastructure. I also attended the HATII developers meeting on Thursday, which has now grown to encompass developers across the College of Arts, which is great. It is really useful to keep up to speed with projects and technical staff and know what people are working on.
I had an afternoon of meetings on Friday so it's another Monday morning blog post from me. It was another busy week for me, more so because my son was ill and I had to take Tuesday off as holiday to look after him. This meant trying to squeeze into four days what I had hoped to tackle in five, which led to me spending a bit less time than I would otherwise have liked on the STELLA app development this week. I did manage to spend a few hours continuing to migrate the Grammar book to HTML5 but there are still a couple of sections to do. I'm currently at the beginning of Section 8.
I did have a very useful meeting with Christian Kay regarding the ARIES app on Monday, however. Christian has been experiencing some rather odd behaviour with some of the ARIES exercises in the web browser on her office PC and I offered to pop over and investigate. It all centres around the most complicated exercise of all – the dreaded ‘Test yourself’ exercise in the ‘Further Punctuation’ section (see how it works for you here: http://www.arts.gla.ac.uk/STELLA/briantest/aries/further-punctuation-6-test-yourself.html). In stage 2 of the exercises clicking on words fails to capitalise them while in stage 3 adding an apostrophe also makes ‘undefined’ appear in addition to the apostrophe. Of course these problems are only occurring in Internet Explorer, but very strangely I am unable to replicate the problems in IE9 in Windows 7, IE9 in Windows Vista and IE8 in Windows XP! Christian is using IE8 in Windows 7, and it looks like I may have to commandeer her computer to try and fix the issue. As I am unable to replicate it on the three Windows machines I have access to it’s not really possible to try and fix the issue any other way.
Christian also noted that clicking quickly multiple times to get apostrophes or other punctuation to appear was causing the text to highlight, which is a bit disconcerting. I’ve implemented a fix for this that blocks the default ‘double click to highlight’ functionality for the exercise text. It’s considered bad practice to do such a thing (jQuery UI used to provide a handy function that did this very easily but they removed it – see http://api.jqueryui.com/disableSelection/ ) but in the context of the ARIES exercise its use is justifiable.
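For reference, in newer browsers the same effect can be achieved without any script at all, using the CSS `user-select` property (the class name here is just an example). Older versions of IE don't support it, which is presumably part of why jQuery UI's script-based helper existed in the first place:

```css
/* Prevent double-click text selection on the exercise text.
   .exercise-text is an example class name, not one from ARIES. */
.exercise-text {
    -webkit-user-select: none; /* Chrome, Safari */
    -moz-user-select: none;    /* Firefox */
    -ms-user-select: none;     /* IE 10+ */
    user-select: none;
}
```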
I also spent a little bit of time this week reworking the layout for the ICOS2014 conference website, although there is still some work to do with this. I’ve been experimenting with responsive web design, whereby the interface automatically updates to be more suitable on smaller screens (e.g. mobile devices). This is currently a big thing in interface design so it’s good for me to get a bit of experience with the concepts.
Following on from my meeting with Susan Rennie last week I created a three page technical specification document for the project that she is hoping to get funding for. This should hopefully include sufficient detail for the bid she is putting together and gives us a decent amount of information about how the technology used for the project will operate. Susan has also sent me some sample data and I will begin working with this to get some further, more concrete ideas for the project.
I also began work on the technical materials for the bid for the follow-on project for Bess of Hardwick. This is my first experience with the AHRC’s ‘Technical Plan’, which replaced the previous ‘Technical Appendix’ towards the end of last year. In addition to the supporting materials found on the AHRC’s website, I’m also using the Digital Curation Centre’s Data Management Planning Tool (https://dmponline.dcc.ac.uk/) which provides additional technical guidance tailored to many different funding applications, including the AHRC.
On Thursday I had a meeting with the Burns people about the choice of timeline software for the Burns Timeline that I will be putting together for them. In last week’s post I listed a few of the pieces of timeline software that I had been looking at as possibilities and at the meeting we went through the features the project requires. More than 6 categories are required, and the ability to search is a must, therefore the rather nice looking VeriteCo Timeline was ruled out. It was also decided that integration with WordPress would not be a good thing as they don’t want the Timeline to be too tightly coupled with the WordPress infrastructure, thus enabling it to have an independent existence in future if required. We decided that Timeglider would be a good solution to investigate further and the team is going to put together a sample of about 20 entries over two categories in the next couple of weeks so I can see how Timeglider may work. I think it’s going to work really well.
On Friday I met with Mark Herraghty to discuss some possibilities for further work for him and also for follow-on funding for Cullen. After that I met with Marc Alexander to discuss the bid we’re going to put together for the Chancellors’ fund to get someone to work on migrating the STELLA corpora to the Open Corpus Workbench. We also had a brief chat about the required thesaurus work and the STELLA apps. Following this meeting I had a conference call with Marc, Jeffrey Robinson and Karen Jacobs at Colorado University about Jeffrey’s Wordsworth project. It was a really useful call and Jeffrey and Karen are going to create a ‘wishlist’ of interactive audio-visual ideas for the project that I will then give technical input, in preparation for a face to face meeting in May.
I started off this week by continuing with the mobile interface for Bess of Hardwick, which I had begun last week. I managed to complete a first version of this interface that I'm pretty happy with. In addition to making the interface work with the full width of the browser and show / hide any navigation options I also managed to use jQuery to position some icons at the end of the sections of the letters, enabling a user to tap the icon to display a pop-up containing information about scribal hands. With the main site this information only appears when the user hovers the cursor over a section of a letter, so with a touchscreen there was no way to access this information. My little icons fix this, although it may well be that having icons dotted throughout each letter is considered a bit intrusive. I completed a mock-up version of the site that covers all of the pages (though obviously not all of the content – generally only one page per type, e.g. one letter, one search results page etc) and sent the URL to Alison for comment. I'm not going to make the URL available here as the main Bess site is still not publicly available and I don't want to spoil the surprise of the main site!
After polishing off Bess I moved on to Burns. I spent a bit of time this week going through the existing site and compiling a list of possible improvements, combining this with the website document previously created by Pauline. I also got my local test version of the site installed on my laptop so I could demonstrate the changes I’d made at the Burns project meeting on Thursday. I also engaged in an email debate with one of the project’s US partners about the use of TEI and XML in general when creating digital editions. The project meeting on Thursday was very useful and took up most of the morning. We went through the website and discussed what should be changed and who should provide content for each section and it was a very positive meeting.
I had a further meeting with Stevie about the Corpus server on Friday morning, which was also very productive. Stevie wanted to set up a local instance of the server on his laptop and we tackled this together. It was a good way to revise how to set up the server as we'll have to do this again some time soon when we move from the test server to a proper server. I also spent a little time on Friday morning looking at the user interface for the front end. It shouldn't be too difficult to adapt this interface, but there will be issues in doing so. A lot of the HTML is buried within functions deep within the code for the front end. Initially Stevie had an older version of the interface installed on his laptop and comparing this code to the more up to date version of the code we have on the server demonstrated that significant changes had been made between versions. If we create a new, bespoke interface for the College of Arts it will work perfectly with the current version of the front end (hopefully!) but when (or if) new versions of the front end are released there is no guarantee that our interface will continue to work. Ideally the front end would have its layout located in one place, so that changing it would be a simple process of replacing one set of layout scripts with another, but as a lot of the layout is buried in the code it's going to be a bit messier and not really a sustainable solution. We've emailed Marc about this with the hope that he can initiate a dialogue with the creator of the front end to see where future developments may be headed and how our work may fit in with these.
For the rest of the week I began working on the mobile interface for the STELLA app ‘ARIES’. This is going to be interesting because it will be the first app that will require a lot of user interaction, e.g. dragging full stops into sentences and evaluating the results. At the moment I’m only putting up the site structure but next week I’ll start to look into how to handle the exercises.
This was my first week in my new office and while the room is lovely the heating has been underperforming somewhat, meaning I’ve been chilled to the bone by lunchtime most days. Thankfully some heating engineers worked their magic on Thursday and after that the office has been nice and toasty.
I spent a lot of time this week continuing to develop the 'Readings' app that I started last week. In fact, I have completed this app now, in standard HTML5. This version can be accessed here: http://www.arts.gla.ac.uk/STELLA/briantest/readings/ (but note that this is a test URL and will probably be taken down at some point). All the content for Old, Middle and Early Modern English is present, including all sound files, texts, translations and notes. After completing this work I started to look into wrapping the website up as an app and deploying it. Unfortunately I haven't quite managed to get Phonegap (or Apache Cordova, as the open source version is properly known: http://docs.phonegap.com/en/2.2.0/index.html) working on my PC yet. I spent a frustrating couple of hours on Friday afternoon trying to set it up but by the end of the day I was still getting errors. Next week I will continue with this task.
One limitation to app development will be that developing apps for iOS requires not only a Mac but also paying Apple $99 per year for a developer certificate. I’ll have to see whether this is going to be feasible. It might be possible to arrange something through STELLA and Marc.
Also this week I continued to develop the Digital Humanities Network website, fixing a few issues, such as 'subjects' not working properly. I also created a new way of recording project PIs as the current system was a bit inefficient and led to people being recorded with different names (e.g. sometimes with 'Professor', other times without). Now PIs are only recorded in the system once and then linked to as many projects as required. I also updated the 'projects' page so that it is possible to view projects linked to a specific PI. And finally, I asked some people to sign up with the site and we now have a decent selection of people represented. More would still be good though!
My other major task this week was to work some more with the Burns website. I started last week to look into having sub-pages for each song, and this week I found a solution which I have now implemented on my local test installation of the website. I reached the solution in a bit of a roundabout way unfortunately. I initially intended song 'pages' to be blog posts and to have a category listing in the menu to enable drop-down access to the individual song 'pages'. I thought this would work quite nicely as it would allow commenting on the song pages, and it would still also allow an HTML5 player to be embedded within the blog content. However, the more I looked into this solution the more I realised it was far from ideal. You can't have a drop-down list of blog posts from a menu in WordPress (which is understandable as there could be thousands of blog posts) so I had to create subcategories that would only be used for one single post. Plus when viewing the blog archives or other blog views the song pages would be all mixed in with the proper blog posts. Instead I found a much easier way of having sub-pages represented in the menu bar as drop-down items and added these instead. At the moment I've had to activate commenting on all pages in order for users to be able to post comments about songs. There should be a way to disable comments on specific pages, but I still need to find it.
Also this week I attended a further meeting of the Corpus Workgroup, which was useful. We are all very happy with the way the test server is working out and we now need to get a dedicated server for the Corpus software. The next step of development will be to try and get multiple front-ends working with the data, which should be an interesting task.
I am writing this week’s post from the delightful surroundings of my new office. It’s been almost three months since I started the job, and although it has been great spending that time with my old HATII colleagues it feels very pleasant to finally be in my own office!
I began this week by completing work on the revamped Digital Humanities Network pages that I was working on last week. I spent most of Monday tweaking the pages, adding sample content and fixing a few bugs that had reared their heads. By the end of the day I had emailed Ann, Jeremy, Marc and Graeme about the pages and received favourable feedback during the course of the week. On Friday Marc, Ann, Graeme and I met to discuss the pages and to decide who should write the site text that still needs to be supplied.
I decided to start developing the 'Readings in Early English' app as I figured this would be the simplest to tackle, seeing as it has no exercises built into it. I familiarised myself with the jQuery Mobile framework and built some test pages, and by the end of the week I had managed to put together an interface that was pretty much identical to the PowerPoint based mock-ups that I had made previously. Currently only the 'Old English' section contains content, but within this section you can open a 'reading' and play the sound clip using HTML5's <audio> tag, through which the user's browser embeds an audio player within the page. It works really smoothly and requires absolutely no plug-in to work. The 'reading' pages also feature original texts and translations / notes. I created a little bit of adaptive CSS using jQuery to position the translation to the right of the original text if the browser's window is over 500px wide, or underneath the original text if the window is smaller than this. It works really well and allows the original text and the translation to be displayed side by side when the user has their phone in landscape mode, automatically switching to displaying the translation beneath the original text when they flip their phone to portrait mode. I'm really happy with how things are working out so far, although I still need to see about wrapping the website as an app. Plus the websites that have a lot of user interaction (i.e. exercises) are going to be a lot more challenging to implement.
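This sort of width-based layout switch is also the kind of thing that can be expressed declaratively with a CSS media query, without any jQuery. A sketch of the idea, with class names invented for illustration:

```css
/* Translation sits below the original text by default (narrow screens). */
.original,
.translation { display: block; }

/* On windows 500px or wider, place them side by side instead. */
@media (min-width: 500px) {
    .original    { float: left;  width: 48%; }
    .translation { float: right; width: 48%; }
}
```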
The test version of the site can be found here: http://www.arts.gla.ac.uk/STELLA/briantest/readings/ although you should note that this is a test URL and content is liable to be removed or broken in future.
Also this week I met with Marc to discuss the Hansard texts and the Test Corpus Server. Although I managed to get over 400 texts imported into the corpus this really is just a drop in the ocean as there are more than 2.3 million pages of text in the full body. It's going to be a massive undertaking to get all these texts and their metadata formatted for display and searching, and we are thinking of developing a Chancellor's Fund bid to get some dedicated funds to tackle the issue. There may be as many as 2 billion words in the corpus!
I also found some time this week to look into some of the outstanding issues with the Burns website. I set up a local instance of the website so I could work on things without messing up the live content. What I’m trying to do at the moment is make individual pages for each song that is listed in the ‘Song & Music’ page. It sounds like a simple task but it’s taking a little bit of work to get right. I will continue with this task on Monday next week and will hopefully get something ready to deploy on the main site next week.
I devoted the beginning of this week to corpus matters, continuing to work with the Hansard texts that I spent a couple of days looking into last week. By the end of last week I had managed to write a PHP script that could read all 428 sample Hansard texts and hopefully spit them out in a format that would be suitable for upload into our Open Corpus Workbench server.
I ran into a few problems when I came to upload these files on Monday morning. Firstly, I hadn't wrapped each text in the necessary <text id=""> tags, something that was quickly rectified. I then had to deal with selecting all 400-odd files for import. The front-end lists all files in the upload area with a checkbox beside each, and no 'select all' option. Thankfully the Firefox developer's toolbar has an option allowing you to automatically tick all checkboxes, but unfortunately the corpus front-end posts forms using the GET rather than POST method, so any form variables are appended to the URL when the request is sent to the server. A URL with 400 filenames attached is too long for the server to process and results in an error, so it was back to the drawing board. Thankfully a solution presented itself to me fairly quickly: you don't need to have a separate text file for every text in your corpus, you can have any number of texts bundled together into one text file, separated by those handy <text> tags. A quick update of my PHP script later and I had one 2.5MB text file rather than 428 tiny text files, and this was quickly and successfully imported into the corpus server, with part-of-speech, lemma and semantic tags all present and correct.
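The bundling step can be sketched as follows – in Python rather than the PHP actually used, with made-up filenames. The importer just needs each text wrapped in its own `<text id="…">` element within the single combined file; here the file's base name is used as the id, which is an assumption rather than the scheme from the real script:

```python
# Combine many per-text files into one corpus input file, wrapping each
# in <text id="..."> tags as the Open Corpus Workbench importer expects.
# File names and the id scheme are illustrative.
import os

def bundle(filenames, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for name in filenames:
            text_id = os.path.splitext(os.path.basename(name))[0]
            with open(name, encoding="utf-8") as f:
                body = f.read().strip()
            out.write('<text id="%s">\n%s\n</text>\n' % (text_id, body))
```

One combined file also neatly sidesteps the too-long GET URL, since there is only a single checkbox to tick at upload time.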
After dealing with the texts came the issue of the metadata. Metadata is hugely important if you want to be able to restrict searches to particular texts, speakers, dates etc. For Hansard I identified 9 metadata fields from the XML tags stored in the text files: Day, Decade, Description, Member, Member Title, Month, Section, URL, Year. I created another PHP script that read through the sample files, extracted the necessary information and created a tab-delimited text file with one row per input file and one column per metadata item. It wasn't quite that simple though, as the script also had to create IDs for each distinct metadata value and use these IDs in the tab-delimited file rather than the values, as this is what the front-end expects. In the script I held the actual values in arrays so that I could then use these to insert both the values and their corresponding IDs directly into the underlying database after the metadata text file was uploaded.
Metadata file upload took place in several stages:
- The tab-delimited text file with one row per text and one column per metadata element (represented as IDs rather than values) was uploaded.
- A form was filled in telling the system the names of each metadata column in the input file.
- At this point the metadata was searchable in the system, but only using IDs rather than actual values, e.g. you could limit your search by 'Member' but each member was listed as a number rather than an actual name.
- It is possible to use the front end to manually specify which IDs have which value, but as my test files had more than 200 distinct metadata values this would have taken too long.
- So instead I created a PHP script that inserted these values directly into the database.
- After doing the above, restricted queries worked, e.g. you can limit a search to speaker 'Donald Dewar' or topic 'Housing (Scotland)' or a year or decade (less interesting for the sample texts as they are all from one single day!)
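The ID-assignment part of that process can be sketched like this, in Python rather than the PHP actually used. The field names come from the list above, but the sample values and the IDs-start-at-1 convention are assumptions for the example:

```python
# Build per-field value->ID maps while producing the rows for the
# tab-delimited metadata file, so the same value always gets the same ID.
def assign_ids(rows, fields):
    """rows: list of dicts of metadata values. Returns (id_rows, value_maps)."""
    value_maps = {f: {} for f in fields}
    id_rows = []
    for row in rows:
        out = []
        for f in fields:
            vmap = value_maps[f]
            val = row[f]
            if val not in vmap:
                vmap[val] = len(vmap) + 1  # IDs start at 1 (an assumption)
            out.append(vmap[val])
        id_rows.append(out)
    return id_rows, value_maps

rows = [
    {"Member": "Donald Dewar", "Year": "1999"},
    {"Member": "Donald Dewar", "Year": "2000"},
]
id_rows, maps = assign_ids(rows, ["Member", "Year"])
print(id_rows)  # [[1, 1], [1, 2]]
```

Keeping the value maps around after writing the file is the key point: they are exactly what is needed to insert the ID-to-value pairs into the database afterwards.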
After I finished the above I spent the majority of the rest of the week working on the new pages for the Digital Humanities Network. The redevelopment of these pages has taken a bit longer than I had anticipated as I decided to take the opportunity to get to grips with PHP Data Objects (http://php.net/manual/en/book.pdo.php) as a more secure means of executing database queries. It has been a great opportunity to learn more about these, but it has meant I've needed to redevelop all of the existing scripts to use this new method, and also spend more time than I would normally need getting the syntax of the queries right. By the end of the week I had just about completed every feature that was discussed at the Digital Humanities meeting a few weeks ago. I should be able to complete the pages next Monday and will email the relevant people for feedback after that.
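The security benefit comes from prepared statements keeping the SQL and its parameters separate. PDO is PHP-specific, but the same pattern in Python's sqlite3 module looks like this (the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT)")

# The parameter is passed separately from the SQL, so a hostile value
# cannot alter the query itself -- the same idea as a PDO prepared statement.
title = "Robert'); DROP TABLE projects;--"
conn.execute("INSERT INTO projects (title) VALUES (?)", (title,))

row = conn.execute(
    "SELECT title FROM projects WHERE title = ?", (title,)
).fetchone()
print(row[0] == title)  # True: stored and retrieved verbatim, no injection
```

The trade-off noted above is real: every query has to be rewritten as a statement-plus-parameters pair, which is more typing up front but much safer.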
I also spent a little time this week reading through some materials for the Burns project and thinking about how to implement some of the features they require.
I still haven’t managed to make a start on the STELLA apps but the path should now be clear to begin on these next week (although I know it’s not the first time I’ve said that!)
I spent a further 1-2 days this week working on the fixes for the advanced search of the SCOTS Corpus, and the test version can be found here: http://www.scottishcorpus.ac.uk/corpus/search/advanced-test-final.php. (Note that this is only a temporary URL and once the updates get signed off I’ll replace the existing advanced search with this new one and the above URL will no longer work). The functionality and the results displayed by the new version should be identical to the old version. There are however a couple of things that are slightly different:
1. The processing of the summary, map, concordance and document list is handled in an asynchronous manner, meaning that these elements all load in independently of each other, potentially at different speeds. For this reason each of these sections now has its own ‘loading’ icon. The summary has the old style animated book icon while the other sections have little spinning things. I’m not altogether happy with this approach and I might try to get one overall ‘loading’ icon working instead.
2. Selecting to display or hide the map, concordance or document list now does so immediately rather than having to re-execute the query. Similarly, updating the map flags only reloads the map rather than every part of the search results. This approach is faster.
3. I’ve encountered some difficulty with upper / lower case words in the concordance. The XSLT processor used by PHP uses a case sensitive sort, which means (for example) that ordering the concordance table by the word to the left of the node is ordering A-Z and then a-z. I haven’t found a solution to this yet but I am continuing to investigate.
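In XSLT 1.0 (which is what PHP's XSLT processor supports) the usual workaround is to sort on a `translate()` of the word that maps upper-case letters to lower-case, folding case before comparison. The idea, sketched in Python with invented sample words:

```python
# A case-sensitive sort puts all capitals before all lower-case (A-Z, then
# a-z); folding case in the sort key fixes the ordering without changing
# the displayed data -- the same trick as XSLT's translate() workaround.
words = ["apple", "Banana", "cherry", "Apricot"]

case_sensitive = sorted(words)
case_folded = sorted(words, key=str.lower)

print(case_sensitive)  # ['Apricot', 'Banana', 'apple', 'cherry']
print(case_folded)     # ['apple', 'Apricot', 'Banana', 'cherry']
```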
I’ve tried to keep as close as I can to the original structure of the advanced search code (PHP queries the database and generates an XML file which is then transformed by a series of XSLT scripts to create fragments of HTML content for display). Now that all the processing is being done at the server side this isn’t necessarily the most efficient way to go about things, for example in some places we could completely bypass the XML and XSLT stage and just use PHP to create the HTML fragments directly from the database query.
If the website is more thoroughly redeveloped I’d like to return to the search functionality to try and make things faster and more efficient. However, for the time being I’m hoping the current solution will suffice (depending on whether the issues mentioned above are a big concern or not).
It should also be noted that the advanced search (in both its original and ‘fixed’ formats) isn’t particularly scalable – there is no pagination of results and a search for a word that brings back large numbers of results will cause both old and new versions to fall over. For example, a search for ‘the’ brings back about a quarter of a million hits, and the advanced search attempts to process and display all of these in the doclist, concordance and map on one page, which is far too much data for one page to realistically handle. Another thing to address if the site gets more fully redeveloped!
I spent about half a day this week working for the Burns project, completing the migration of the data from their old website to the new. This is now fully up and running (http://burnsc21.glasgow.ac.uk/) and I’ve made some further tweaks to the site, implementing nicer title-based URLs and fixing a few CSS issues, such as the background image not displaying properly on widescreen monitors.
I dedicated about a day this week to looking into the updates required for the Digital Humanities Network pages, which were decided upon at the meeting a couple of weeks ago with Graeme, Ann and Marc. I’ve updated the existing database to incorporate the required additional fields and tables and I’ve created a skeleton structure for the new site. I also used this time to look into a more secure manner of running database queries in PHP – PHP Data Objects (PDO). It’s an interface that sits between PHP and the underlying database and allows prepared statements and stored procedures. It is very good at preventing SQL injection attacks and I intend to use this interface for all database queries in future.
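The key point is that with a prepared statement the user-supplied value is bound as data and never interpolated into the SQL string. A minimal sketch (the table and column names are invented for illustration):

```php
<?php
// Minimal PDO prepared-statement sketch. The query's placeholder is
// filled in by the driver at execute time, so user input can never
// alter the SQL itself - this is what blocks SQL injection.
// 'members', 'name', 'email' and 'surname' are hypothetical names.
function findMembersBySurname(PDO $pdo, string $surname): array {
    $stmt = $pdo->prepare(
        'SELECT name, email FROM members WHERE surname = :surname'
    );
    $stmt->execute([':surname' => $surname]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Even a malicious value such as `"x' OR '1'='1"` is simply treated as a (non-matching) surname rather than as SQL.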
I spent the remainder of the week getting back into the Open Corpus Workbench server that I am working on with Stevie Barrett in Celtic. My main aim this week was to get a large number of the Hansard texts Marc had given me uploaded into the corpus. As is often the case, this proved to be trickier than first anticipated. The Hansard texts have been tagged with Part of Speech, Lemma and Semantic tags, all set up nicely in tab-delimited text files which also have the <s> tags that the server needs. They also include a lot of XML tags containing metadata that can be used to provide limiting options in the restricted query. Unfortunately the <s> tags have been added to the existing XML files in a rather brutal manner – stuck in between tags, placed at the start of the file before the initial XML declaration, etc. This means the files are very far from being valid XML.
I was intending to develop an XSLT script that would reformat the texts for input, but XSLT requires well-formed XML input files, so that idea was a no-go. I decided instead to read the files into PHP and split them up by the <s> tag, processing the contents of each section in turn to extract the metadata we want to include and the actual text we want to be logged in the corpus. As the <s> tags were placed so arbitrarily it was very difficult to develop a script that caught all possible permutations. However, by the end of the week I had constructed a script that could successfully process all 428 text files that Marc had given me (and will hopefully be able to cope with the remaining data when I get it). Next week I will update the script to complete the saving of the extracted metadata in a suitable text file and I will then attempt the actual upload to the corpus.
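The core of the splitting approach can be sketched like this (a simplification – the real script has to handle many more permutations, and the tag and column details shown are illustrative only):

```php
<?php
// Hypothetical sketch: because the files are not well-formed XML, a
// regular expression splits the raw text wherever an <s ...> tag
// occurs, however oddly placed, and each chunk is then processed
// separately. The sample token/tag columns are invented.
function splitOnSentenceTags(string $raw): array {
    // Split on any <s> tag, with or without attributes; drop empty
    // chunks (e.g. when the tag sits at the very start of the file).
    $chunks = preg_split('/<s[^>]*>/', $raw, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('trim', $chunks);
}
```

Each resulting chunk can then be scanned for the metadata tags it contains, with the remaining tab-delimited lines passed through as corpus text.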
I’m afraid I have been unable to find the time this week to get started on the redevelopment of any of the STELLA applications. Once Hansard is out of the way next week I should hopefully have the time to get started on these in earnest.
Back from my week off this week, with lots to do. Whilst I was away Justin Livingstone of SCS got in touch with HATII about a project proposal related to David Livingstone. Thankfully my colleagues in HATII were able to provide Justin with some helpful advice before his deadline, and I contacted Justin to offer my services if he needs them in future.
Unfortunately it is this plugin that doesn’t work with Chrome and IE – see this test page: http://dev.abiss.gr/sarissa/test/testsarissa.html. Sarissa is needed because different browsers have different processors for working with XML and XSLT files, but although it set out to solve some of the incompatibility problems, it has introduced others. These days it’s generally considered a bit messy to rely on client-side browsers to process XSLT – far better to process the XML using XSLT on the server and then use AJAX to pull the transformed content into the page for display. This is what I set out to do with the advanced search page.
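On the server side this uses PHP's xsl extension (XSLTProcessor). A minimal sketch of the transform step, assuming the extension is enabled:

```php
<?php
// Sketch of a server-side XSLT transform in PHP: the XML produced by
// the search query is transformed on the server and the resulting
// fragment is returned to the browser via AJAX. Both documents are
// passed in as strings here for simplicity.
function transformFragment(string $xmlSource, string $xslSource): string {
    $xml = new DOMDocument();
    $xml->loadXML($xmlSource);

    $xsl = new DOMDocument();
    $xsl->loadXML($xslSource);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);

    // Returns the transformed output as a string, ready to echo back
    // to the AJAX caller.
    return $proc->transformToXml($xml);
}
```

Because the transform runs on one known processor (libxslt) rather than whichever engine the visitor's browser happens to have, the Sarissa-style cross-browser problems disappear entirely.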
I had been hoping to work some more on the Open Corpus Workbench server this week, plus make a start on the STELLA applications, but having got bogged down in the SCOTS advanced search, both of these tasks will have to wait until next week.
I did manage to do some other tasks this week, however. As mentioned earlier, I had a meeting with Wendy, Marc and Jane to discuss Corpora in general at the university. This was a useful meeting and it was especially interesting to hear about the sound corpora that Jane is working with and the software her team are using (something called LabCat). I’ll need to investigate this further.
Also this week I had a further meeting with Alison Wiggins regarding possible mobile and tablet versions of Bess of Hardwick. We had a bit of a brainstorming session and came up with some possible ideas for the tablet version that could form the basis of a bid. We also got in touch with Sheffield to ask about access to the server in order to develop a simple mobile interface to the existing site. Although it won’t be possible for me to get access to the server directly, we reached an agreement whereby I will develop the interface around static HTML pages here in Glasgow and then I’ll send my updated interface to Sheffield for inclusion.
My final task of the week was to migrate the ‘Editing Robert Burns for the 21st Century’ website to a more official-sounding URL. I haven’t quite managed to complete this task yet but I’ll get it all sorted on Monday next week.
I had a great meeting with the ‘Editing Burns for the 21st century’ people this week. We’d been trying to arrange a meet-up since the beginning of September but it was difficult to get a time that suited everyone until this week. I’m really looking forward to working on the project and I’ve already started making suggestions about their website and some possible ways in which web resources can be best exploited to throw the spotlight on their project outputs.
I also had another very productive corpus-focussed meeting with Stephen Barrett this week. We spent a couple of hours working through some of the issues we’d been encountering with the Open Corpus Workbench and made some real progress, specifically to do with character encoding issues (we’ve now upgraded to the most recent version of the CWB so we finally have proper UTF-8 support) and text-level metadata issues (it’s now possible to specify metadata about texts and for this to be used as the basis for limiting searches using the ‘restricted query’ option). I’ve started working with Marc’s Hansard texts and have so far managed to import one test text complete with one metadata category. This may seem unimpressive but a lot of the issues that will be encountered when importing the entire body of texts have been resolved when processing this one single text file so I feel like real progress has been made.
Also this week I completed my mock-ups for the App and web versions of the five STELLA applications that we have identified as the initial focus for redevelopment. I also had a meeting with Marc to discuss STELLA matters in general, which was very helpful too. I will begin working on the actual HTML5 based websites for these applications the week after next. Whilst making the mock-ups I encountered some rather odd behaviour with HTML5 Audio and IE9. Although IE9 fully supports the HTML5 Audio tag (which when used presents an audio player within the web page without any plug-in being required) I just couldn’t get the player to appear on a test file I was hosting on a University server, even though the exact same code worked perfectly on my desktop machine using the same browser. This was very frustrating and eventually I worked out that the discrepancy was being caused by IE9’s ‘Compatibility view’. For some odd reason IE9 on the standard staff desktop is set to view pages within the university domain using ‘compatibility view’, which basically emulates an older version of IE that doesn’t support HTML5! It’s most frustrating, and as yet the only fix I’ve found is to override the setting manually at the client end.
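For reference, the page-level override Microsoft documents for this is the X-UA-Compatible directive, which can also be sent as an HTTP response header from the server. I haven't yet confirmed whether it wins over the intranet Compatibility View setting on our standard desktops, but it may be worth trying:

```html
<!-- Ask IE to use its most recent rendering engine rather than
     Compatibility View; this can also be sent as an HTTP header
     (X-UA-Compatible: IE=edge) from the server. -->
<meta http-equiv="X-UA-Compatible" content="IE=edge">
```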
I also spoke again to Alison this week about the possible developments for Bess of Hardwick and we’re going to meet up the week after next to take this further. And I also met up with Ann, Marc and Graeme to discuss the redevelopment of the Digital Humanities Network website (http://www.digital-humanities.arts.gla.ac.uk/) that Graeme created a couple of years ago. We had a good discussion about the sorts of features that a revamp should incorporate and it was agreed that I would press ahead with the update, with Graeme providing some additional help as time allows. We’re hoping to have a new version ready to go by the middle of next month.
Also this week I finally got access to the SCOTS server so I will be able to look into the problems with the Advanced Search when using Chrome and IE the week after next too.
Note that I will be on holiday next week and will be back at work on Monday the 29th October.