I worked on several projects this week. I continued to refine the database and content management system specification document for the REELS project. Last week I sent an initial version out to the members of the team, who each responded with useful comments. I spent some time considering their comments and replying to each in turn. The structure of the database is shaping up nicely now, and I should have a mostly complete version of the specification document written before our next project meeting next week.
I also met with Gary Thoms of the SCOSYA project to discuss some unusual behaviour he had encountered with the data upload form I had created. Using the form, Gary is able to drag and drop CSV files containing survey data, which then pass through some error checking and are uploaded. Rather strangely, some files were passing the error checks but were uploading blank data, even though the files themselves appeared to be correctly formatted and well structured. Even more strangely, when Gary emailed one of the files to me and I tried to upload it (without even opening the file), it uploaded successfully. We also worked out that if Gary opened the file and then saved it on his computer (without changing anything), the file uploaded successfully too. Helpfully, the offending CSV files don’t display with the correct CSV icon on Gary’s MacBook, so it’s easy to identify them. There must be some kind of file encoding issue here, possibly caused by passing the file from Windows to Mac. We haven’t exactly got to the bottom of this, but at least we’ve figured out how to avoid it happening in future.
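As a sketch of what might be going on (purely hypothetical, since we haven’t pinned down the actual encoding involved): if a file is saved as UTF-16 with a byte-order mark, a parser expecting plain UTF-8/ASCII can see NUL bytes between every character and extract nothing but empty fields, which would explain blank data passing validation. Sniffing the BOM before parsing would sidestep this. The function names below are invented for illustration.

```python
import csv
import io

# Byte-order marks and the codecs that handle them.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16"),
    (b"\xfe\xff", "utf-16"),
]

def decode_csv_bytes(raw: bytes) -> str:
    """Decode raw upload bytes, honouring any BOM present."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)
    return raw.decode("utf-8")  # assume UTF-8 when no BOM is present

def parse_csv_bytes(raw: bytes) -> list[list[str]]:
    """Normalise Windows/Mac line endings, then parse as CSV."""
    text = decode_csv_bytes(raw).replace("\r\n", "\n").replace("\r", "\n")
    return list(csv.reader(io.StringIO(text)))
```

This would also explain the other symptoms: emailing the file or re-saving it on the Mac can silently re-encode it as UTF-8, at which point the upload works.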
On Friday I had a final project meeting for the Medical Humanities Network project. The meeting was really just to go over who will be responsible for what after the project officially ends, in order to ensure new content can continue to be added to the site. There shouldn’t really be too much for me to do, but I will help out when required. I also continued with some outstanding tasks for the SciFiMedHums project on Friday. Gavin wants visitors to the site to be able to suggest new bibliographical items for the database, and we’ve decided that asking them to fill out the entire form would be too cumbersome. Instead we will provide a slimmed-down form (item title, medium, themes and comments), and upon submission an editor will be able to decide whether the item should be added to the main system and, if so, manage this through the facilities I’ll develop. On Friday I figured out how the system will function and began implementing things on a test server I have access to. So far I’ve updated the database with the new fields that are required, added the facilities that enable visitors to the site to register and log in, and created the form that users will fill in. I still need to write the logic that will process the form and all of the scripts the editor will use to process submissions, which hopefully I’ll find time to tackle next week.
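The intended workflow can be sketched roughly as follows (a hypothetical illustration only: the field names mirror the slimmed-down form described above, but the real implementation will live in the site’s own database and scripts):

```python
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    """A visitor-submitted item from the slimmed-down form."""
    title: str
    medium: str
    themes: list[str] = field(default_factory=list)
    comments: str = ""
    status: str = "pending"  # pending -> approved / rejected

def review(suggestion: Suggestion, approve: bool) -> Suggestion:
    """Editor decision: approved items go on to the full bibliographical record."""
    suggestion.status = "approved" if approve else "rejected"
    return suggestion
```

The point of the two-stage design is that visitors only ever touch the short form, while the full record is completed by an editor after approval.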
I continued to work with the Hansard data for the SAMUELS project this week as well. I managed to finish the shell script for processing a single text file, which I had started work on last week. I figured out how to process the base64-decoded chunk of data that featured line breaks, allowing me to extract and process an individual code / frequency pairing, and then worked out a way to write each line of data to an output text file. The script now takes one of the input text files, each of which contains 5000 lines of base64-encoded code / frequency data, and for each code / frequency pair it writes an SQL statement to a text file. I tested the script and ensured that the resulting SQL statements worked with my database. After that I contacted Gareth Roy in Physics, who has been helping to guide me through the workings of the Grid. Gareth provided a great deal of invaluable help here, including setting up space on the Grid, writing a script that would send jobs for each text file to the nodes for processing, updating my shell script so that the output text file location could be specified, and testing things out for me. I really couldn’t have got this done without his help. On Friday Gareth submitted an initial test batch of 5 jobs, and these were all processed successfully. As all was looking good, I then submitted a further batch of jobs for files 6 to 200. These all completed successfully by late on Friday afternoon. Gareth then suggested I submit the remaining files to be processed over the weekend, so I did. It’s all looking very promising indeed. The only possible downside is that, as things currently stand, we have no server on which to store the database for all of this data. This is why we’re outputting SQL statements in text files rather than writing directly to a database.
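To make the per-line processing concrete, here is a rough Python equivalent of what the script does for one input line (the real version is a shell script; the tab separator, table name and column names are my assumptions for illustration, and the values aren’t escaped as a production version would need them to be):

```python
import base64

def sql_for_line(encoded_line: str, line_id: int) -> list[str]:
    """Decode one base64 blob of code/frequency pairs into SQL INSERT statements.

    The decoded chunk is newline-separated, one 'code<TAB>frequency' pair
    per line, matching the line breaks described above.
    """
    decoded = base64.b64decode(encoded_line).decode("utf-8")
    statements = []
    for pair in decoded.splitlines():
        if not pair.strip():
            continue  # the decoded chunk can contain blank lines
        code, frequency = pair.split("\t")
        statements.append(
            "INSERT INTO frequencies (line_id, code, frequency) "
            f"VALUES ({line_id}, '{code}', {int(frequency)});"
        )
    return statements
```

On the Grid, each job would simply run this over all 5000 lines of its assigned file and append the statements to the job’s output text file.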
As there will likely be more than 100 million SQL insert statements to process, we are probably going to face another bottleneck when we do actually have a database in which to house the data. I need to meet with Marc to discuss this issue.
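One option worth raising with Marc (a sketch of a standard technique, not something we’ve decided on): batch the single-row inserts into multi-row INSERT statements, which most databases load far faster than millions of individual statements. Table and column names here are invented for illustration.

```python
def batch_inserts(rows: list[tuple], batch_size: int = 1000) -> list[str]:
    """Group (code, frequency) tuples into multi-row INSERT statements."""
    statements = []
    for start in range(0, len(rows), batch_size):
        values = ", ".join(
            f"('{code}', {freq})" for code, freq in rows[start:start + batch_size]
        )
        statements.append(
            f"INSERT INTO frequencies (code, frequency) VALUES {values};"
        )
    return statements
```

At 1000 rows per statement, 100 million inserts become roughly 100,000 statements, which is a much more tractable load.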