I devoted the beginning of this week to corpus matters, continuing to work with the Hansard texts that I spent a couple of days looking into last week. By the end of last week I had managed to write a PHP script that could read all 428 sample Hansard texts and hopefully spit them out in a format that would be suitable for upload into our Open Corpus Workbench server.
I ran into a few problems when I came to upload these files on Monday morning. Firstly, I hadn’t wrapped each text in the necessary <text id=""> tags, something that was quickly rectified. I then had to deal with selecting all 400-odd files for import. The front-end lists all files in the upload area with a checkbox beside each, and no ‘select all’ option. Thankfully the Firefox developer’s toolbar has an option allowing you to automatically tick all checkboxes, but unfortunately the corpus front-end submits forms using the GET rather than the POST method, so all form variables are appended to the URL when the request is sent to the server. A URL with 400 filenames attached is too long for the server to process and results in an error, so it was back to the drawing board. Thankfully a solution presented itself fairly quickly: you don’t need a separate text file for every text in your corpus; you can bundle any number of texts together into one text file, separated by those handy <text> tags. A quick update of my PHP script later and I had one 2.5 MB text file rather than 428 tiny text files, and this was quickly and successfully imported into the corpus server, with part-of-speech, lemma and semantic tags all present and correct.
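To illustrate the bundling step, here is a minimal sketch in Python rather than the PHP I actually used. The function name `bundle_texts` and the sample IDs are made up for illustration, and it assumes each text body is already in the one-token-per-line format the corpus server expects.

```python
from xml.sax.saxutils import quoteattr


def bundle_texts(texts):
    """Join (text_id, body) pairs into one string for a single upload file.

    Each body is wrapped in its own <text id="..."> ... </text> element,
    so the corpus server can still tell the individual texts apart.
    """
    parts = []
    for text_id, body in texts:
        # quoteattr() adds the surrounding quotes and escapes special characters.
        opening = f"<text id={quoteattr(text_id)}>"
        # Trim only blank lines at the edges; tabs inside token lines are kept.
        parts.append(f"{opening}\n{body.strip(chr(10))}\n</text>\n")
    return "".join(parts)
```

Writing the returned string to a single file gives one upload instead of hundreds, which also sidesteps the over-long GET URL.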
After dealing with the texts came the issue of the metadata. Metadata is hugely important if you want to be able to restrict searches to particular texts, speakers, dates etc. For Hansard I identified nine metadata fields from the XML tags stored in the text files: Day, Decade, Description, Member, Member Title, Month, Section, URL, Year. I created another PHP script that read through the sample files, extracted the necessary information and created a tab-delimited text file with one row per input file and one column per metadata item. It wasn’t quite this simple though, as the script also had to create IDs for each distinct metadata value and use these IDs in the tab-delimited file rather than the values, as this is what the front-end expects. In the script I held the actual values in arrays so that I could then use these to insert both the values and their corresponding IDs directly into the underlying database after the metadata text file was uploaded.
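The ID mapping is easier to see in code. The following is a rough Python sketch of the idea, not the PHP script itself. It assumes the metadata has already been pulled out of the XML into one dictionary per text, that the first column of each row is the text ID, and that IDs are simply numbered from 1 in order of first appearance. The name `build_metadata_table` and the dictionary keys are hypothetical.

```python
def build_metadata_table(records, fields):
    """Turn per-text metadata into tab-delimited rows of IDs.

    records: list of dicts, one per text, each with a "text_id" key
             plus one key per metadata field (assumed input shape).
    fields:  metadata field names, in column order.

    Returns (rows, lookups), where rows are tab-delimited strings and
    lookups maps field -> {value: id} for later insertion into a database.
    """
    lookups = {field: {} for field in fields}
    rows = []
    for record in records:
        columns = [record["text_id"]]
        for field in fields:
            ids = lookups[field]
            value = record[field]
            if value not in ids:
                # First time this value is seen: give it the next free ID.
                ids[value] = len(ids) + 1
            columns.append(str(ids[value]))
        rows.append("\t".join(columns))
    return rows, lookups
```

Keeping the `lookups` mapping around is the important part: it is what later lets the IDs in the uploaded file be tied back to readable values.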
Metadata file upload took place in several stages:
- The tab-delimited text file with one row per text and one column per metadata element (represented as IDs rather than values) was uploaded.
- A form was filled in telling the system the name of each metadata column in the input file.
- At this point the metadata was searchable in the system, but only using IDs rather than actual values, e.g. you could limit your search by ‘Member’ but each member was listed as a number rather than an actual name.
- It is possible to use the front end to manually specify which IDs have which value, but as my test files had more than 200 distinct metadata values this would have taken too long.
- So instead I created a PHP script that inserted these values directly into the database.
- After doing the above, restricted queries worked, e.g. you can limit a search to speaker ‘Donald Dewar’ or topic ‘Housing (Scotland)’ or a year or decade (less interesting for the sample texts as they are all from one single day!)
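For the direct database insertion step, here is a hedged sketch in Python using SQLite as a stand-in for the real database. The table `metadata_values` and its columns are invented for illustration and are not the corpus front-end's actual schema. The input is assumed to be a mapping of field name to {value: ID}.

```python
import sqlite3


def store_lookups(conn, lookups):
    """Insert every (field, id, value) triple and return how many were stored.

    lookups: dict of field name -> {value: id} (assumed shape).
    The table below is hypothetical, not the real front-end schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata_values ("
        " field TEXT NOT NULL,"
        " value_id INTEGER NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (field, value_id))"
    )
    triples = [
        (field, value_id, value)
        for field, ids in lookups.items()
        for value, value_id in ids.items()
    ]
    # Placeholders keep values such as 'Housing (Scotland)' out of the SQL string.
    conn.executemany(
        "INSERT INTO metadata_values (field, value_id, value) VALUES (?, ?, ?)",
        triples,
    )
    conn.commit()
    return len(triples)
```

With a couple of hundred distinct values, one bulk insert like this is far quicker than typing each ID and value into a form by hand.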
After I finished the above I spent the majority of the rest of the week working on the new pages for the Digital Humanities Network. The redevelopment of these pages has taken a bit longer than I had anticipated, as I decided to take the opportunity to get to grips with PHP Data Objects (PDO, http://php.net/manual/en/book.pdo.php) as a more secure means of executing database queries. It has been a great opportunity to learn more about these, but it has meant I’ve needed to redevelop all of the existing scripts to use this new method, and also to spend more time than I would normally need getting the syntax of the queries right. By the end of the week I had just about completed every feature that was discussed at the Digital Humanities meeting a few weeks ago. I should be able to complete the pages next Monday and will email the relevant people for feedback after that.
I also spent a little time this week reading through some materials for the Burns project and thinking about how to implement some of the features they require.
I still haven’t managed to make a start on the STELLA apps but the path should now be clear to begin on these next week (although I know it’s not the first time I’ve said that!)