I divided my time this week primarily between three projects: REELS, The People’s Voice and the Mapping Metaphor follow-on project. For REELS I continued with the content management system. After completing the place-name element management systems last week I decided this week to begin to tackle the bigger issue of management scripts for place-names themselves. This included migrating parish details into the database from a spreadsheet that Eila had previously sent me and migrating the classification codes from the Fife place-name database. I began work on the script that will process the addition of a new place-name record, creating the form that project staff will fill in, including facilities to add any number of map sheet records.
I initially included facilities to associate place-name elements with this ‘add’ form, which proved to be rather complicated. A place-name may have any number of elements and these might already exist in our element database. I created an ‘autocomplete’ facility whereby a user starts to type an element and the system queries the database and brings back a list of possible matching items. This was complicated by the fact that elements have different languages, and the list that’s returned should be different depending on what language has been selected. There are also many fields that the user needs to complete for each element, more so if the element doesn’t already exist in the database. I began to realise that including all of this in one single form would be rather too overwhelming for users and decided instead to split the creation and management of place-names across multiple forms. The ‘Add’ page would allow the user to create the ‘core’ record, which wouldn’t include place-name elements and historical forms. These materials will instead be associated with the place-name via the ‘browse place-names’ table, with separate pages specifically for elements and historical forms. Hopefully this set-up will be straightforward to use.
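To give a rough idea of the language-aware autocomplete lookup, here is a minimal sketch in Python using an in-memory SQLite database. It is not the project's code: the `elements` table, its `name` and `language` columns, the `suggest_elements` function and the sample rows are all assumptions made purely for illustration.

```python
import sqlite3


def suggest_elements(conn, prefix, language, limit=10):
    """Return element names in the chosen language that start with `prefix`."""
    # Escape LIKE wildcards so that whatever the user typed is matched literally.
    escaped = (
        prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT name FROM elements "
        "WHERE language = ? AND name LIKE ? ESCAPE '\\' "
        "ORDER BY name LIMIT ?",
        (language, escaped + "%", limit),
    )
    return [name for (name,) in rows]


# Hypothetical sample data, only here to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elements (name TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO elements VALUES (?, ?)",
    [
        ("burn", "Scots"),
        ("burh", "Old English"),
        ("brae", "Scots"),
        ("baile", "Gaelic"),
    ],
)

suggestions = suggest_elements(conn, "bu", "Scots")
```

Passing the selected language as a query parameter means the same typed prefix brings back a different list for each language, which is the behaviour described above.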
For The People’s Voice project I had an email conversation with the RA Michael Shaw about the structure of the database. Michael had met with Catriona to discuss the documentation I had previously created relating to the database and the CSV template form. Michael had sent me some feedback and this week I created a second version of the database specification, the template form and the accompanying guidelines based on this feedback. I think we’re pretty much in agreement now on how to proceed and next week I hope to start on the content management system for the project.
For Metaphor in the Curriculum I continued with my work to port all of the visualisation views from relying on server-side data and processing to a fully client-side model instead. Last week I had completed the visualisation view and had begun on the tabular view. This week I managed to complete the tabular view, the card view and also the timeline view. Although that sentence was very quick to read, actually getting all of this done took some considerable time and effort, but it is great to get it all sorted, especially as I had some doubts earlier on as to whether it would even be possible. I still need to work on the interface, which I haven’t spent much time adapting for the App yet. I also managed to complete the textual ‘browse’ feature this week as well, using jQuery Mobile’s collapsible lists to produce an interface that I think works pretty well. I still haven’t tackled the search facilities yet, which is something I hope to start on next week.
In addition to this I attended a meeting with the Burns people, who are working towards publishing a new section on the website about song performance. We discussed where the section should go, how it should function and how the materials will be published. It was good to catch up with the team again. I also had a chat with David Shuttleton about making some updates to the Cullen online resource, which I am now responsible for. I spent a bit of time going through the systems and documentation and getting a feel for how it all fits together. I also made a couple of small tweaks to the Medical Humanities Network website to ensure that people who sign up have some connection to the University.
I was on holiday on Monday so it was a four-day week for me, which was nice but it meant I had rather a lot to try to squeeze in. I mainly worked on four projects this week, which I will run through in no particular order. On Wednesday I met with Michael Shaw, the RA on the ‘People’s Voice’ project. The project is getting to the stage now where they really need their research database in place so Michael and I met to talk about the sorts of data they need to record about the poems they are researching and what sort of collection method would be suitable. We agreed that having an online content management system and database would work well in most situations, but there would be times when researchers would not be able to guarantee internet access and would need to work offline. I suggested that I could create a CSV-based template file that could be filled in using Excel when an internet connection was unavailable. A simple ‘drag and drop’ script would then allow rows from this file to be integrated with the main database once a connection could be established. We talked through the structure of the data and following on from our meeting I created a first version of a database specification document, a CSV template file and a set of guidelines for filling in the template. Michael has already started conducting research at the Mitchell so will be able to use the template until the content management system is ready for use. My documents were sent on to the rest of the team for feedback. There will no doubt be a bit of toing and froing over the structure of the data, but once that has been sorted out I will be able to start work on the database.
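As a rough sketch of how rows from a filled-in template might be checked before being merged into the main database, here is a small Python example. The column names (`title`, `author`, `year`) and the rule that a row must have a title are assumptions for illustration only; the real template is defined by the project's specification document.

```python
import csv
import io

# Hypothetical template columns; the real ones come from the project spec.
REQUIRED_COLUMNS = ["title", "author", "year"]


def read_template(text):
    """Parse template CSV text into (valid_rows, errors)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("Template is missing columns: " + ", ".join(missing))

    rows, errors = [], []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        cleaned = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        if not cleaned["title"]:
            errors.append((row_no, "title is empty"))
        else:
            rows.append(cleaned)
    return rows, errors


sample = "title,author,year\nPoem A,Anon,1832\n,Someone,1840\n"
valid_rows, problems = read_template(sample)
```

Valid rows could then be inserted into the database while the problem rows are reported back to the researcher to fix in Excel.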
My second project this week was REELS. The project met last week to finalise the database specification, and following on from this I had created the underlying database for the project. This week I began to create the content management system that will sit on top of this database. As I’ve already created many such systems before it was pretty straightforward to set up the basic structure for the CMS, such as logins, layout templates etc. After that I focussed on the place-name element side of the system. We need facilities to create, list, edit and delete place-name elements and I managed to get all of these facilities set up with little difficulty. Carole had also suggested that the project reuse the elements from the Fife place-names project so I created a script that would migrate and clean up the element data from this project. The script stripped out any tags, trimmed white space and ensured that each distinct element was only recorded once. A total of 1752 elements have been transferred, with 4 languages represented, plus existing and proper names. REELS wants to record data for elements that was not present in the Fife data, such as part of speech, a URL and a description, so these fields are currently blank and will need to be filled in as the project proceeds.
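The clean-up steps described above (strip tags, trim white space, keep each distinct element once) can be sketched in Python as follows. This is an illustration rather than the actual migration script: the function names and sample rows are made up, and treating an element as distinct per name and language pair is an assumption.

```python
import re

# Matches anything that looks like an HTML/XML tag.
TAG_RE = re.compile(r"<[^>]+>")


def clean_element(raw):
    """Remove tags, then collapse and trim white space."""
    without_tags = TAG_RE.sub("", raw)
    return " ".join(without_tags.split())


def migrate_elements(rows):
    """Return cleaned (name, language) pairs with duplicates removed.

    Assumption: an element is 'distinct' per (name, language) pair.
    """
    seen = set()
    cleaned = []
    for name, language in rows:
        key = (clean_element(name), language.strip())
        if key[0] and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


# Hypothetical rows standing in for the source data.
source_rows = [
    ("<i>burn</i> ", "Scots"),
    ("burn", "Scots"),
    ("  <b>brae</b>", "Scots "),
    ("burn", "Old English"),
]
migrated = migrate_elements(source_rows)
```

Cleaning before comparing is what lets `<i>burn</i> ` and `burn` collapse into a single record.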
My third project was SciFiMedHums, for which I completed work on the facilities to allow users of the site to suggest new bibliographic items and for administrators to manage these submissions. Users need to register and log into the site in order to post suggested items and the submission form allows the user to enter an item title, select a type and medium and select one or more themes from the list of existing themes. There is also an additional ‘comments’ box where users can supply additional information, such as suggestions for new themes. I’ve updated the item database to include new columns for whether an item was submitted by a user of the site, what the status of the item is (approved, not approved or deleted), who submitted the item and their comments.
Upon submission the details are saved as a new item, but with ‘user submit’ set to ‘Y’ and ‘status’ set to ‘not approved’, and an email is sent to the project email address. The admin interface now has a new page called ‘List Pending Contributions’ which lists items that have been submitted by users but have not yet been approved. From this list an Admin user can view the submitted data and decide whether to add it to the main item list (set ‘status’ to ‘approved’) or to delete it (set ‘status’ to ‘deleted’).
If the status for an item is set to ‘deleted’ the item is removed from the list and won’t appear anywhere else in the system. If the status for an item is set to ‘approved’ the item will be removed from the list and will instead appear in the main ‘SFMH Bibliography’ list. This also generates a WordPress ‘post’ for the item, thus fully integrating the item with our custom post type. The feature is not ‘live’ yet, as I am awaiting feedback from Gavin before I do that.
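The approval workflow boils down to a small set of status changes, sketched below in Python. This is only a model of the logic: the `Item` class and function names are hypothetical, and the email notification and the WordPress post creation are left out.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    user_submit: str = "N"
    status: str = "approved"
    comments: str = ""


def submit(items, title, comments=""):
    """Store a user suggestion as a new, not yet approved item."""
    item = Item(title, user_submit="Y", status="not approved", comments=comments)
    items.append(item)
    return item


def pending_contributions(items):
    """Items shown on the 'List Pending Contributions' page."""
    return [i for i in items if i.user_submit == "Y" and i.status == "not approved"]


def bibliography(items):
    """Items shown in the main bibliography list."""
    return [i for i in items if i.status == "approved"]


def review(item, approve):
    """An admin either approves the item or marks it as deleted."""
    item.status = "approved" if approve else "deleted"


items = []
first = submit(items, "Suggested item one")
second = submit(items, "Suggested item two", comments="Maybe a new theme?")
review(first, approve=True)
review(second, approve=False)
```

Because a deleted item keeps its row but changes status, it drops out of both lists without the record itself being lost.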
My fourth project of the week was Metaphor in the Curriculum, the Mapping Metaphor follow-on project. The original Mapping Metaphor project has been nominated for ‘Best DH Data Visualisation’ in the DH Awards 2015. Voting is open to everyone so if you would like to vote for the project you can do so here: http://dhawards.org/dhawards2015/voting/
I managed to get the visualisation view of the data completed this week, including both aggregate and drilldown views and the metaphor cards for each. I also started work on the tabular view of the data, although I haven’t managed to get this working yet. There is still so much to do, but hopefully I will continue to make good progress with this next week.
It’s been another busy week, but I have to keep this report brief as I’m running short of time and I’m off next Monday. I came into work on Monday to find that the script I had left executing on the Grid to extract all of the Hansard data had finished working successfully! It left me with a nice pile of text files containing SQL insert statements, about 10GB of them. As we don’t currently have a server on which to store the data I instead started a script executing that runs each SQL insert command on my desktop PC and puts the data into a local MySQL database. Unfortunately it looks like it’s going to take a horribly long time to process the data. I’m putting the estimate at about 229 days.
My arithmetic skills are sometimes rather flaky so here’s how I’m working out the estimate. My script is performing about 2000 inserts a minute. There are about 1200 output files and based on the ones I’ve looked at they contain about 550,000 lines each. 550,000 x 1200 = 660,000,000 lines in total. This figure divided by 2000 gives the number of minutes it would take (330,000). Dividing this by 60 gives the number of hours (5,500). Dividing this by 24 gives the number of days (about 229). My previous estimate for doing all of the processing and uploading on my desktop PC was more than 2 years, so using the Grid has speeded things up enormously, but we’re going to need something more than my desktop PC to get all of the data into a usable form any time soon. Until we get a server for the database there’s not much more I can do.
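For anyone who wants to check the sums, the same estimate can be written out as a few lines of Python. The figures are the approximate ones quoted above, so the result is only as good as those rough counts.

```python
inserts_per_minute = 2000   # approximate rate on the desktop PC
output_files = 1200         # approximate number of output files
lines_per_file = 550_000    # approximate, based on the files looked at

total_lines = output_files * lines_per_file   # 660,000,000
minutes = total_lines / inserts_per_minute    # 330,000
hours = minutes / 60                          # 5,500
days = hours / 24                             # a little over 229

print(f"{total_lines:,} inserts, roughly {days:.0f} days")
```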
On Tuesday this week we had a REELS team meeting where we discussed some of the outstanding issues relating to the structure of the database (amongst other things). This was very useful and I think we all now have a clear idea of how the database will be structured and what it will be able to do. After the meeting I wrote up and distributed an updated version of my database specification document and I also worked with some map images to create a more pleasing interface for the project website (it’s not live yet though, so no URL). Later in the week I also created the first version of the database for the project, based on the specification document I’d written. Things are progressing rather nicely at this stage.
I spent a bit of time fixing some issues that had cropped up with other projects. The Medical Humanities Network people wanted a feature of the site tweaked a little bit, so I did this. I also fixed an issue with the lexeme upload facility of the Scots Corpus, which was running into some maximum form size limits. I had a funeral to attend on Thursday afternoon so I was away from work for that.
I worked on several projects this week. I continued to refine the database and content management system specification document for the REELS project. Last week I had sent an initial version out to the members of the team, who each responded with useful comments. I spent some time considering their comments and replying to each in turn. The structure of the database is shaping up pretty nicely now and I should have a mostly completed version of the specification document written next week before our next project meeting.
I also met with Gary Thoms of the SCOSYA project to discuss some unusual behaviour he had encountered with the data upload form I had created. Using the form Gary is able to drag and drop CSV files containing survey data, which then pass through some error checking and are uploaded. Rather strangely, some files were passing through the error checks but were uploading blank data, even though the files themselves appeared to be in the correct format and well structured. Even more strangely, when Gary emailed one of the files to me and I tried to upload it (without even opening the file) it uploaded successfully. We also worked out that if Gary opened the file and then saved it on his computer (without changing anything) the file also uploaded successfully. Helpfully, the offending CSV files don’t display with the correct CSV icon on Gary’s Macbook so it’s easy to identify them. There must be some kind of file encoding issue here, possibly caused by passing the file from Windows to Mac. We haven’t exactly got to the bottom of this, but at least we’ve figured out how to avoid it happening in future.
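One way to investigate a suspected encoding or line-ending difference is to compare the raw bytes of a file that fails with a copy that works. The Python sketch below reports whether a byte order mark is present and which line endings are used. It is only a diagnostic aid based on a guess about the cause, which we have not confirmed, and the function name is made up.

```python
import codecs


def describe_bytes(data: bytes) -> dict:
    """Report the BOM (if any) and a count of each line-ending style.

    The line-ending counts are only meaningful for ASCII-compatible
    encodings such as UTF-8, not for UTF-16 content.
    """
    if data.startswith(codecs.BOM_UTF8):
        bom = "utf-8"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        bom = "utf-16"
    else:
        bom = None

    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf   # carriage returns not followed by LF
    lone_lf = data.count(b"\n") - crlf   # line feeds not preceded by CR
    return {"bom": bom, "crlf": crlf, "cr": lone_cr, "lf": lone_lf}


# Two made-up byte strings with the same visible content.
cr_only = describe_bytes(b"a,b\r1,2\r")
with_bom = describe_bytes(codecs.BOM_UTF8 + b"a,b\r\n1,2\r\n")
```

Running this on a file read with `open(path, "rb").read()` before and after re-saving it would show whether the save step changes the bytes.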
On Friday I had a final project meeting for the Medical Humanities Network project. The meeting was really just to go over who will be responsible for what after the project officially ends, in order to ensure new content can continue to be added to the site. There shouldn’t really be too much for me to do, but I will help out when required. I also continued with some outstanding tasks for the SciFiMedHums project on Friday too. Gavin wants visitors to the site to be able to suggest new bibliographical items for the database and we’ve decided that asking them to fill out the entire form would be too cumbersome. Instead we will provide a slimmed down form (item title, medium, themes and comments) and upon submission an Editor will then be able to decide if the item should be added to the main system and if so manage this through the facilities I’ll develop. On Friday I figured out how the system will function and began implementing things on a test server I have access to. So far I’ve updated the database with the new fields that are required, added in the facilities to enable visitors to the site to log in and register and I’ve created the form that users will fill in. I still need to write the logic that will process the form and all of the scripts the editor will use to process things, which hopefully I’ll find time to tackle next week.
I continued to work with the Hansard data for the SAMUELS project this week as well. I managed to finish the shell script for processing one text file, which I had started work on last week. I managed to figure out how to process the base64 decoded chunk of data that featured line breaks, allowing me to extract and process an individual code / frequency pairing. I then figured out a way to write each line of data to an output text file. The script now takes one of the input text files that contain 5000 lines of base64 encoded code / frequency data and for each code / frequency pair it writes an SQL statement to a text file. I tested the script out and ensured that the resulting SQL statements worked with my database and after that I contacted Gareth Roy in Physics, who has been helping to guide me through the workings of the Grid. Gareth provided a great deal of invaluable help here, including setting up space on the Grid, writing a script that would send jobs for each text file to the nodes for processing, updating my shell script so that the output text file location could be specified and testing things out for me. I really couldn’t have got this done without his help. On Friday Gareth submitted an initial test batch of 5 jobs, and these were all processed successfully. As all was looking good I then submitted a further batch of jobs for scripts 6 to 200. These all completed successfully by late on Friday afternoon. Gareth then suggested I submit the remaining files to be processed over the weekend so I did. It’s all looking very promising indeed. The only possible downside is that as things currently stand we have no server on which to store the database for all of this data. This is why we’re outputting SQL statements in text files rather than writing directly to a database. 
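The real script is written for the Bash shell, but the per-line logic can be sketched more compactly in Python. Several details here are assumptions for illustration: that each input line holds two base64 sections separated by a tab, that the decoded second section has one whitespace-separated code and frequency per line, and the `frequencies` table with its column names. The first section is treated as a single identifier, whereas the real script splits it into several parts.

```python
import base64


def sql_quote(value):
    """Escape single quotes for use inside an SQL string literal."""
    return value.replace("'", "''")


def line_to_sql(line):
    """Turn one tab-separated, base64-encoded line into INSERT statements."""
    meta_b64, pairs_b64 = line.rstrip("\n").split("\t")
    meta = base64.b64decode(meta_b64).decode("utf-8")
    pairs = base64.b64decode(pairs_b64).decode("utf-8")

    statements = []
    # The decoded second section contains line breaks: one pair per line.
    for pair in pairs.splitlines():
        if not pair.strip():
            continue
        code, freq = pair.split()
        statements.append(
            "INSERT INTO frequencies (record, code, frequency) "
            f"VALUES ('{sql_quote(meta)}', '{sql_quote(code)}', {int(freq)});"
        )
    return statements


# Build a made-up input line in the assumed format.
example_line = (
    base64.b64encode(b"record-1").decode("ascii")
    + "\t"
    + base64.b64encode(b"AA.01 3\nBB.02 12\n").decode("ascii")
    + "\n"
)
example_sql = line_to_sql(example_line)
```

Writing each statement to an output file, one per line, gives text files that can later be run against a database once one is available.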
As there will likely be more than 100 million SQL insert statements to process we are probably going to face another bottleneck when we do actually have a database in which to house the data. I need to meet with Marc to discuss this issue.
I spent a fair amount of time this week working on the REELS project, which began last week. I set up a basic WordPress powered project website and got some network drive space set up and then on Wednesday we had a long meeting where we went over some of the technical aspects of the project. We discussed the structure of the project website and also the structure of the database that the project will require in order to record the required place-name data. I spent the best part of Thursday writing a specification document for the database and content management system which I sent to the rest of the project team for comment on Thursday evening. Next week I will update this document based on the team’s comments and will hopefully find the time to start working on the database itself.
I met with a PhD student this week to discuss online survey tools that might be suitable for the research that she was hoping to gather. I heard this week from Bryony Randall in English Literature that an AHRC proposal that I’d given her some technical advice on had been granted funding, which is great news. I had a brief meeting with the SCOSYA team this week too, mainly to discuss development of the project website. We’re still waiting on the domain being activated, but we’re also waiting for a designer to finish work on a logo for the project so we can’t do much about the interface for the project website until we get this anyway.
I also attended the ‘showcase’ session for the Digging into Data conference that was taking place at Glasgow this week. The showcase was an evening session where projects had stalls and could speak to attendees about their work. I was there with the Mapping Metaphor project, along with Wendy, Ellen and Rachael. We had some interesting and at times pretty in-depth discussions with some of the attendees and it was a good opportunity to see the sorts of outputs other projects have created with their data.
Before the event I went through the website to remind myself of how it all worked and managed to uncover a bug in the top-level visualisation: when you click on a category, yellow circles appear at the categories that the one you’ve clicked on has a connection to. These circles represent the number of metaphorical connections between the two categories. What I noticed was that the size of the circles was not taking into consideration the metaphor strength that had been selected, which was giving confusing results. E.g. if there are 14 connections but only one of these is ‘strong’ and you’ve selected to view only strong metaphors the circle size was still being based on 14 connections rather than one. Thankfully I managed to track down the cause of the error and I fixed it before the event.
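The essence of the fix is that the count used to size a circle has to apply the same strength filter as the rest of the view. Here is a minimal Python sketch of that idea, not the site's actual code; the data structure, function name and category labels are all made up.

```python
def connection_count(connections, source, target, strength=None):
    """Count connections between two categories.

    If `strength` is given, only connections of that strength are counted,
    so the circle size matches what the user has chosen to view.
    """
    pair = {source, target}
    return sum(
        1
        for c in connections
        if {c["source"], c["target"]} == pair
        and (strength is None or c["strength"] == strength)
    )


# Made-up data mirroring the example: 14 connections, only one of them strong.
connections = [
    {"source": "Category A", "target": "Category B", "strength": "weak"}
    for _ in range(13)
]
connections.append(
    {"source": "Category B", "target": "Category A", "strength": "strong"}
)

all_count = connection_count(connections, "Category A", "Category B")
strong_count = connection_count(
    connections, "Category A", "Category B", strength="strong"
)
```

With the filter applied, viewing only strong metaphors sizes the circle on one connection rather than fourteen.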
I also spent a little bit of time further investigating the problems with the Curious Travellers server, which for some reason is blocking external network connections. I was hoping to install a ‘captcha’ on the contact form to cut down on the amount of spam that was being submitted and the Contact Form 7 plugin has a facility to integrate Google’s ‘reCaptcha’ service. This looked like it was working very well, but for some reason when ‘reCaptcha’ was added to forms these forms failed to submit, instead giving error messages in a yellow box. The Contact Form 7 documentation suggests that a yellow box means the content has been marked as spam and therefore won’t send, but my message wasn’t spam. Removing ‘reCaptcha’ from the form allowed it to submit without any issue. I tried to find out what was causing this but have been unable to find an answer. I can only assume it is something to do with the server blocking external connections and somehow failing to receive a ‘message is not spam’ notification from the service. I think we’re going to have to look at moving the site to a different server unless Chris can figure out what’s different about the settings on the current one.
My final project this week was SAMUELS, for which I am continuing to work on the extraction of the Hansard data. Last week I figured out how to run a test job on the Grid and I split the gigantic Hansard text file into 5000 line chunks for processing. This week I started writing a shell script that will be able to process these chunks. The script needs to do the same tasks as my initial PHP script, but because of the setup of the Grid I need to write a script that will run directly in the Bash shell. I’ve never done much with shell scripting so it’s taken me some time to figure out how to write such a script. So far I have managed to write a script that takes a file as an input, goes through each line at a time, splits the line up into two sections based on the tab character, base64 decodes each section and then extracts the parts of the first section into variables. The second section is proving to be a little trickier as the decoded content includes line breaks which seem to be ignored. Once I’ve figured out how to work with the line breaks I should then be able to isolate each tag / frequency pair, write the necessary SQL insert statement and then write this to an output file. Hopefully I’ll get this sorted next week.