Week Beginning 25th July 2016

This week was another four-day week for me as I’d taken the Friday off.  I will also be off until Thursday next week.  I was involved in a lot of different projects and had a few meetings this week.  Wendy contacted me with a couple of queries regarding Mapping Metaphor.  One part of this was easy: adding some new downloadable material to the ‘Metaphoric’ website.  This involved updating the ZIP files and changing a JSON file to make the material findable in the ‘browse’ feature.  The other issue was a bit more troublesome.  In the Mapping Metaphor ‘browse’ facilities on the main site, the OE site and the ‘Metaphoric’ site, Carole had noticed that the number of metaphorical connections given for the top-level categories didn’t match the totals given for the level two categories within them.  E.g. the browse view gives the External World total as 13,115, but adding up the individual section totals comes to 17,828.

It took quite a bit of investigation to figure out what was causing this discrepancy, but I finally worked out how to make the totals consistent and applied the update to the main site, the OE site and the Metaphoric website (but not the app, as I’ll need to submit a new version to the stores to get the change implemented there).

There were inconsistencies in the totals at both the top level and level 2.  These were caused by metaphorical connections that stay within a category only being counted once (e.g. a connection from Category 1 to Category 2 counts as two ‘hits’, one for Category 1 and another for Category 2, but a connection from Category 1 to another Category 1 only counts as one ‘hit’).  This was also true for level 2 categories: e.g. 1A to 1B is a ‘hit’ for each category but 1A to another 1A is only one ‘hit’.

It could be argued that this is an acceptable way to count things, but on our browse page we have to go from the bottom up, as we display the number of metaphorical connections each Level 3 category is involved in.  Here’s another example:

2C has two categories, 2C01 and 2C02.  2C01 has 127 metaphorical connections and 2C02 has 141, making a total of 268 connections.  However, one of these connections is between 2C01 and 2C02, so in the Level 2 count (‘how many connections are there involving a 2C category in either cat1 or cat2?’) this connection was only being counted once, meaning the 2C total was only showing 267 connections instead of 268.

It could be argued that 2C does only have 267 metaphorical connections, but as our browse page shows the individual number of connections for each Level 3 category we need to include these ‘duplicates’ otherwise the numbers for levels 1 and 2 don’t match up.

Perhaps using the term ‘metaphorical connections’ on the browse page is misleading.  We only have a total of 15,301 ‘metaphorical connections’ in our database.  What we’re actually counting on the browse page is the number of times a category appears in a metaphorical connection, as either cat1, cat2 or both.  But at least the figures used are now consistent.
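To make the distinction concrete, here’s a quick Python sketch of the two counting methods.  The category codes and connections below are made up for illustration (the real data lives in the project’s MySQL database):

```python
# Each metaphorical connection links two categories (cat1, cat2).
# Invented sample data: one connection stays within the 2C group.
connections = [
    ("2C01", "2C02"),  # within-2C: one connection, but two 'hits'
    ("2C01", "3A01"),
    ("2C02", "1B02"),
]

def distinct_connections(conns, prefix):
    """Count each connection once if either end is in the group.
    This is the old behaviour that produced the lower totals."""
    return sum(1 for c1, c2 in conns
               if c1.startswith(prefix) or c2.startswith(prefix))

def appearances(conns, prefix):
    """Count every time a group category appears as cat1 or cat2,
    so a within-group connection contributes two 'hits'."""
    return sum((c1.startswith(prefix)) + (c2.startswith(prefix))
               for c1, c2 in conns)

# For '2C': three distinct connections, but four appearances,
# matching the sum of the per-category figures (2C01: 2, 2C02: 2).
```

Counting appearances rather than distinct connections is what makes the level 1 and level 2 totals agree with the per-category figures shown on the browse page.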

On Monday I had a meeting with Gary Thoms to discuss further developments of the Content Management System for the SCOSYA project.  We agreed that I would work on a number of different tasks for the CMS: adding a new field to the template and ensuring the file upload scripts can process this; adding a facility to manually enter a questionnaire into the CMS rather than uploading a spreadsheet; adding example sentences and ‘attributes’ to the questionnaire codes, with facilities in the CMS for these to be managed; and creating some new ‘browse’ facilities to access the data.  It was a very useful meeting and after writing up my notes from it I set to work on some of the tasks.  By the end of my working week I had updated the file upload template, the database and the pages for viewing and editing questionnaires in the CMS.  I had also created the database tables and fields necessary for holding information about example sentences and attributes, and I created the ‘add record’ facility.  There is still quite a lot to do here, and I’ll return to this after my little holiday.  I’ll also need to get started on the map interface for the data too – the actual ‘atlas’.

On Tuesday I had a meeting with Rob Maslen to discuss a new website he wants to set up to allow members of the university to contribute stories and articles involving fantasy literature.  We also discussed his existing website and some possible enhancements to this.  I’ll aim to get these things done over the summer.

Last week Marc had contacted me about a new batch of Historical Thesaurus data that had been sent to us by the OED people and I spent a bit of time this week looking at it.  The data is XML based and I managed to figure out how it all fits together, but as yet I’m having trouble seeing how it relates to our HT data.

For example, ‘The Universe (noun)’ in the OED data has an ID of 1628 and a ‘path’ of ‘01.01’, which looks like it should correspond to our hierarchical structure, but in our system ‘The Universe (noun)’ has the number ‘01.01.10 n’.  Also the words listed in the OED data for this category are different to ours.  We have the Old English words, which are not part of the OED data, but there are other differences too, e.g. the OED data has ‘creature’ but this is not in the HT data.  Dates are different too, e.g. in our data ‘World’ is ‘1390-’ while in the OED data it’s ‘?c1200’.

It doesn’t look to me like there is anything in the XML that links to our primary keys – at least not the ones in the online HT database.  The ID in the XML for ‘The Universe (noun)’ is 1628 but in our system the ID for this category is 5635.  The category with ID 1628 in our system is ‘Pool :: artificially confined water :: contrivance for impounding water :: weir :: place of’ which is rather different to ‘The Universe’!

I’ve also checked to see whether there might be an ID for each lexeme that is the same as our ‘HTID’ field (if there was then we could get to the category ID from this), but alas there doesn’t seem to be.  For example, the lexeme ‘world’ has a ‘refentry’ of ‘230262’, but this is the HTID for a completely different word in our system.  There are ‘GHT’ (Glasgow Historical Thesaurus) tags for each word but frustratingly an ID isn’t one of them – only the original lemma, dates and Roget category.  I hope aligning the data is going to be possible, as it’s looking more than a little tricky from my initial investigation.  I’m going to meet with Marc and Fraser later in the summer to look into this in more detail.
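If we do end up aligning via the GHT tags, the matching will presumably have to be done on the lemma and dates.  Here’s a very rough sketch of pulling those out of the XML; note that the element names (‘lexeme’, ‘ght’, ‘lemma’, ‘dates’) are my own placeholders for illustration, not the real OED schema:

```python
import xml.etree.ElementTree as ET

def ght_keys(xml_string):
    """Extract (lemma, dates) pairs from the GHT tags in the OED XML.
    These pairs could then be looked up against the HT database to
    recover our category and lexeme IDs."""
    root = ET.fromstring(xml_string)
    keys = []
    for lexeme in root.iter("lexeme"):
        ght = lexeme.find("ght")
        if ght is not None:
            keys.append((ght.findtext("lemma"), ght.findtext("dates")))
    return keys

# Invented fragment mirroring the 'world' example above.
sample = """<category><lexeme refentry="230262">
<ght><lemma>world</lemma><dates>1390-</dates></ght>
</lexeme></category>"""
```

Matching on lemma plus dates like this is fragile (spellings and date conventions differ between the two datasets), so it would only ever be a starting point for the alignment work with Marc and Fraser.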

On Wednesday I met with Rhona Brown from Scottish Literature to discuss a project of hers that is just starting and that I will be doing the technical work for.  The project is a small grant funded by the Royal Society of Edinburgh and its main focus is to create a digital edition of the Edinburgh Gazetteer, a short-lived but influential journal that was published in the 1790s.

The Mitchell has digitised the journal and this week I managed to see the images for the first time.  Our original plan was to run the images through OCR software in order to get some text that would be used behind the scenes for search purposes, with the images being the things the users will directly interact with.  However, now I’ve seen the images I’m not so sure this approach is going to work, as the print quality of the original materials is pretty poor.  I tried running one of the images through Tesseract, which is the OCR engine Google uses for its Google Books project, and the results are not at all promising.  Practically every word is wrong, although it looks like it has at least identified multiple columns – in places anyway.  However, this is just a first attempt and there are various things I can do to make the images more suitable and possibly to ‘train’ the OCR software too.  I will try other OCR software as well.  We are also going to produce an interactive map of various societies that emerged around this time, so I created an Excel template and some explanatory notes for Rhona to use to compile the information.  I also contacted Chris Fleet of the NLS Maps department about the possibility of reusing the base map from 1815 that he very kindly helped us to use for the Burns highland tour feature.  Chris got back to me very quickly to say this would be fine, which is great.
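One simple preprocessing step worth trying before the next Tesseract run is binarising the scans, so faded print becomes solid black on white.  Here’s a minimal pure-Python sketch of global thresholding on a greyscale pixel grid; a real pipeline would use an image library and probably adaptive thresholding and deskewing, with the cleaned image then handed to Tesseract, but the principle is the same:

```python
def binarise(pixels, threshold=128):
    """Turn a greyscale pixel grid (values 0-255) into pure black (0)
    and white (255).  Anything darker than the threshold becomes solid
    black, which can help OCR engines cope with faint 18th-century print."""
    return [[0 if value < threshold else 255 for value in row]
            for row in pixels]

# A toy 2x3 'scan': the faint marks (90 and 40) become solid black.
page = [[250, 90, 255],
        [40, 200, 130]]
clean = binarise(page)
```

Tuning the threshold per page (or per region) would likely matter a lot given how uneven the print quality of the Gazetteer images is.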

On Wednesday I also met with Frank Hopfgartner from HATII to discuss an idea he has had to visualise a corpus of German radio plays.  We discussed various visualisation options and technologies, the use of corpus software and topic modelling and hopefully some of this was useful to him.  I also spent some time this week chatting to Alison Wiggins via email about the project she is currently putting together.  I am going to write the Technical Plan for the proposal so we had a bit of a discussion about the various technical aspects and how things might work.  This is another thing that I will have to prioritise when I get back from my holidays.  It’s certainly been a busy few days.

Week Beginning 18th July 2016

Monday was a holiday this week so I returned to work on Tuesday, after being out of the office for most of the past three weeks on holidays and at the DH2016 conference.  A lot of the week was spent catching up with emails and finishing off conference-related things, such as writing last week’s lengthy blog post that summarised the conference parallel sessions I attended.  I also had to submit my travel expenses and get my remaining Zlotys changed back.  Other than that, the rest of my week was spent on a range of relatively small tasks.  I continued to work on the Hansard data extraction using the ScotGrid infrastructure.  By the end of the week the total number of rows extracted and inserted into the MySQL database stood at 123,636,915, and that’s with only 170 files out of over 1,200 processed.

I spent a little bit of time discussing the dreaded H27 issue for the Old English data of the Mapping Metaphor project.  Wendy and Ellen have been having a chat about this and it looks like they’ve come up with a plan to get the data sorted.  Carole is going to use the content management system I created for the project in order to add in the stage 5 data for the H27 categories.  Once this is in place I will then be able to extract this data and pass it over to Flora so she can integrate it with the rest of the data in her Access database.  Here’s hoping this strategy will work.

I also had a chat with Gary Thoms about the SCOSYA project and added some new codes to the project database for him.  We will be meeting next week to go over plans for the next stage of technical development for the project, but Gary wanted to check a few things out before this, such as whether it would be possible to allow the editors to create records directly through the system rather than uploading CSV files.

I also responded to a request for help from someone in the School of Social and Political Sciences about an interactive online teaching course she wanted to put together.  As I only really work within the School of Critical Studies I couldn’t get involved too much, but I suggested she speak to the University’s MOOC people, as a MOOC (Massive Open Online Course) seemed to be very similar to what she had in mind.  I also spent some time in an email conversation with Christine Ferguson and a technical person at Stirling University.  Christine has a project starting up and I was supposed to get the project website up and running over the summer.  However, Christine is starting a new post at Stirling and the project needs to move with her.  After a bit of toing and froing we managed to come up with a plan of action for setting up the website at Stirling, and that should be the end of my involvement with the project, all being well.

Ann Ferguson of Scottish Language Dictionaries contacted me whilst I was on holiday about doing some further work on the DSL website so I also spent a bit of time going through the materials she had sent me and getting back up to speed on the project.  There are a few outstanding tasks that we had intended to complete about 18 months ago that Ann would now like to see finalised so I replied to her about how we might go about this.

I also spoke to Rob Maslen about the student blog he is hoping to set up before next term.  I’m going to meet with him next week to figure out exactly what is required.  Finally, Marc sent me some new data for the Historical Thesaurus that has come from the OED people.  We’re going to have to figure out how best to integrate this over the next couple of months, and it will be really great to have the updated data.

Week Beginning 11th July 2016

I returned to work this week after being on holiday for the past two weeks.  I was only in the office for one day this week, though, as the rest of the time I was attending the DH2016 conference in Krakow.  Most of Monday was spent printing off materials for the conference, checking in for flights, figuring out where my hotel was and things like that.  I also read through the abstracts of the parallel sessions to try to decide which sessions I should attend.  The DH conference is a particularly large one and at various times there were up to nine parallel sessions, each consisting of up to five papers, so deciding what to attend was something of a mammoth undertaking – especially as there were several sessions happening at the same time that all appealed to me.

The parts of Monday that I didn’t spend preparing for the conference I spent working on materials for a proposal Jane Stuart-Smith was submitting at short notice. This project involves several other institutions and some non-SCS people within Glasgow and seems rather unwieldy, but certainly has potential.  I commented on the proposal documents and the materials were submitted in time for the call deadline on the Tuesday.

Tuesday for me was spent travelling to Krakow via Schiphol, along with a few other colleagues from Glasgow, namely Marc Alexander, Fraser Dallachy, Katie Lowe and Johanna Green.  The journey went pretty smoothly, although there was a delay of about 40 minutes when departing Schiphol.  We arrived in Krakow too late to attend the opening ceremony and reception, so after checking into our respective hotels we met up, explored the city and got some food.

Wednesday was the start of the main conference and at registration I received the customary goody bag, which for this event included a branded DH water bottle, t-shirt, pen and USB stick, amongst other things.  The first parallel session began at 9:30 and I attended the session on ‘Analysing and visualising networks’, which was a short paper session consisting of 5 papers. The first paper was about a project called Kinomatics (http://kinomatics.com) that is looking into diversity and reciprocity in the global flow of contemporary cinema.  The project is dealing with 330 million records, looking at every film screening in 48 countries – some 97,000 films and 33,000 venues.  The speaker pointed out that this is not ‘big’ data compared to some datasets but it is ‘big data’ in that it is ungraspable by conventional approaches.  The project is presenting its findings using some interesting visualisations, such as plotting screenings on a radar chart, with months in a year as spokes and number of screenings in each month plotted on these spokes.  The shape of the chart could then be compared for different films.  Data was also visualised spatially and using a stacked column chart to show the technical formats each film used.  Further visualisations demonstrated the flow of transfers in and out of countries and a network diagram showing the reciprocity of transfers between countries.  The speaker pointed out that to get a more accurate picture of the diversity of films in a country you need to look at their exposure – the actual number of screenings.  The project’s data comes from a company that provides the data to Google.  The data is usually deleted after a month but the project is archiving it and maintaining the dataset.

The second speaker discussed methods of analysing social structures that evolve over time.  Adding a temporal aspect to network diagrams to show the evolution of a network is a pretty interesting subject and the speaker chose as his subject marriage in the early Mormon church, which is a useful dataset due to the non-binary nature of some Mormon marriages and also because there are very good records available.  One interesting way the project visualised a marriage was to use a chord diagram, with husbands, wives and their children all round the edges, with individuals colour-coded based on gender and lines between individuals showing their relationships.  For example, a child had connecting lines to its mother and father if both were represented in the marriage.  New wives, husbands or children could then be easily added to the diagram with the space taken up by the other actors updating as required.  To represent the generational flow over time these individual chord diagrams were then placed within a flow diagram that represented time from left to right, with the option of zooming in on individual chord diagrams within this and viewing lines representing individual relationships between multiple chord diagrams.

The third paper looked at networks of confidentiality and secrecy in correspondences and again was interested in plotting relationships in networks across time, but this time also plotting the data spatially on a map of Europe, for example having nodes as locations where a letter was sent to / from and lines connecting the two, with different colours representing directionality and line thickness showing the number of letters.  This interface incorporated a time slider allowing the user to select a specific period of time to focus on – a start date slider and an end date slider so as to give greater flexibility.  The project used a system called ‘NodeGoat’ (https://nodegoat.net/) which was also used for another project later in the week.  It seems like an interesting tool, although it is proprietary and isn’t free to use or adapt.  Some example maps similar to the ones shown in this presentation can be found here: http://mnn.nodegoat.net/viewer

The fourth paper looked at record linkage when dealing with sparse historical data.  In historical data we have limited data and identifying people across different documents and linking these up can be tricky.  The paper looked at resolving name ambiguity by a combination of natural language processing, linking name mentions into a knowledgebase and looking at the context of the names within the documents.  For example, if there is a singer and a politician that have the same name the system should be able to work out by the context which person has been mentioned.  But with historical documents we have a lack of available knowledge. The project looked at early modern Venetian apprenticeship contracts – some 55,000 contracts over 200 years.  A subset of these was manually annotated for things like name, gender, age, profession and geography to see how likely it is that the features can be used to match individuals and various network diagrams were shown to demonstrate the effectiveness of certain pairings.  The project aims to then integrate the identification system into transcription software to automate the process.

The final paper of the session looked at mapping the activity of the League of Nations based on correspondence from between the two world wars. The project wanted to map out the relations between people on the committee based on the authors and receivers of letters and the topics of letters.  The speaker showed some very nice network diagrams showing correspondence connections, with the overall diagram containing 30,000 documents, 3,200 nodes and 26,500 connections.  The speaker demonstrated how the high-level graph could be drilled down into and individuals or groups could be highlighted.  The speaker mentioned ‘betweenness centrality’ as a mechanism for working out where a node should be positioned within the network diagram.  There were some interesting questions after the paper, for example how the temporal aspect could be incorporated and how people within the League of Nations were not static but changed their role.  The visualisations are also not currently dynamic.

For the second parallel session I attended a further session on ‘Analysing and visualising networks’, which again was a series of short papers.  The first paper that was shown was actually the second paper of the session due to some technical issues.  This paper gave an overview of various network graph techniques that the speaker had used with his dataset, which was Jacobite poetry.  He pointed out that network graphs look great and people are generally impressed by them but there is a danger that people don’t actually understand them.  The network diagrams that were shown were all produced using Gephi, which I will have to investigate again at some point.

The second paper (which should have been the first) was about methods for identifying the main characters in novels in order to represent these on network graphs with characters as nodes, their relationships as the edges and a weighting based on the strength of the relationship.  As readers we do this easily but to automatically create networks is challenging – computers need metrics, not just ‘feelings’ about which are the central characters.  The speaker discussed ‘coreference resolution’ as the means of knowing that people who are identified in different ways are actually the same person – for example when they are identified with pronouns.  An automated system also needs to work out where there are interactions between characters.  A crude way is to say that when two characters appear in the same paragraph then there is a direct communication between them.  A more complicated way is to extract direct speech and actually work out who talks to whom.  The speaker looked at whether the simple method might actually perform well enough.  His research discarded pronouns and performed ‘named entity recognition’ using summaries of 58 novels from Kindlers Literary Lexicon Online.  After running a series of experiments it would appear that the simpler methods work best.  There was an interesting question after the paper about how to determine what a character actually is.  If someone is merely mentioned are they a character?  If God is mentioned a lot is he a character?  The speaker said they got around this by merely focussing on the main characters.
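The ‘crude’ paragraph-based method the speaker described is simple enough to sketch in a few lines of Python.  The character names and text below are invented, and a real system would use named entity recognition rather than literal string matching, but it shows the basic idea:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(text, characters):
    """Crude character network: two characters share a weighted edge
    for every paragraph in which both of their names appear."""
    edges = Counter()
    for paragraph in text.split("\n\n"):
        present = sorted(name for name in characters if name in paragraph)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Invented three-paragraph 'novel' for illustration.
sample = ("Anna met Boris at the station.\n\n"
          "Boris wrote to Clara.\n\n"
          "Anna and Boris argued again.")
edges = cooccurrence_network(sample, ["Anna", "Boris", "Clara"])
```

The resulting edge weights could be fed straight into a tool like Gephi; the paper’s finding was that even this simple approach performs surprisingly well against more elaborate direct-speech extraction.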

The third paper showed how a researcher had developed a network analysis of the field of economic history in Australia over 20 years, and specifically how co-location and geographic proximity are important in the development of an academic field:  academics who work near each other tend to do more together and these connections grow stronger over multiple years.  The speaker looked at three different types of collaboration: co-publishing, contributions and sub-authorship, with the network data based on citations and much data gathering and weighting done manually.  The connections were mainly centred on three main cities and sub-authorship was the most geographically diverse.  The findings from the network showed that people moving between the cities has a significant impact in transferring and spreading ideas.

The fourth paper looked at building a network diagram of interactions between people as a means of exploring a corpus.  The speaker demonstrated some really nice interactive visualisations of people mentioned in the campaign speeches of JFK and Nixon (these were also used in another paper on a different day).  The underlying data was passed through the Stanford named entity processor to work out co-occurrence in order to build the relationships between people.  The data was then loaded into a very nice tool called ALCIDE, which can currently be accessed here: http://celct.fbk.eu:8080/Alcide_Demo/. The tool presents some very nice interactive visualisations including network diagrams built in D3 and heatmaps built in Leaflet, both of which also have a double-ended time slider to enable any particular range of time to be selected.  There are also full-text searches, name clouds, frequency graphs etc.  All of the interfaces are linked into the full text of the speeches too.  For the visualisation the speaker noted that HTML5 Canvas proved to be much faster than SVG, but that this was trickier to implement in D3.  I’ll have to look into this.  For the actual case study of JFK vs Nixon the study found that there were more people mentioned by JFK – 1,244 vs 486.  The speaker noted that there is still work to be done with identifying pronouns using natural language processing.

The final paper of the session looked at mapping out relationships between authors and identification of areas of knowledge within the DH community.  It used DH journals from the ADHO and Scopus in order to build co-authorship and co-citation network diagrams.  There appears to be little international collaboration in the field.

For the third and final parallel session of the day I attended the session on Crowdsourcing.  The first paper was about the experiences of the ‘Letters of 1916’ project (http://letters1916.maynoothuniversity.ie/), which was especially useful to hear about as there are a couple of potential crowdsourcing projects that I may be involved with at some point.  The speaker pointed out that the success of their project (as with all such projects) was based on engagement with users.  The project had around 15,000 registered users and users are able to upload their own images of letters in addition to transcribing existing letters.  The interface developed by the project uses Scripto (http://scripto.org/), which uses parts of MediaWiki to manage user contributions.  It is also integrated with the Omeka content management system, which I’ve been interested in using for some time.  Apparently the Transcribe Bentham project used the same kind of setup too.  Users are asked to perform light TEI coding (basically adding tags by pressing on buttons and filling in the blanks between tags).  The speaker stated that 71% of its users are women and that social media is important for engagement.  Mondays to Wednesdays were the best days for engagement and most transcription was done in the afternoon.  The speaker noted that people need to be interested in the subject, and that there were peaks of activity around important dates, such as the centenary of the Easter rising.  36% of users were from an HE background and most were over 55.  Publicity in ‘old media’ really helped to connect to such people rather than the use of social media.  Connections to existing organisations such as Volunteer Ireland also helped, as did participating in community events.  To keep people involved giving them feedback and descriptions of the content was important.  With regards to the TEI markup the advice from the project is to keep it simple and not to tell people it’s ‘hard’.  Users should be able to figure it out without relying on any documentation.  Some ‘superusers’ emerged who did a large amount of the transcriptions.  In terms of the workflow several people transcribed each page and all the changes were tracked, resulting in one final version being produced.

The second paper discussed a project that attempted to use crowdsourcing to create new resources from existing digital content, as found here: http://www.cvce.eu/en/epublications/mypublications.  Users search and find existing content and build up their own ‘story’ around it.  The project seems to have struggled to get the public interested and the speaker reported that almost all of the submissions so far have been by one individual, and that this individual is a retired academic who has a connection with the project.  The project is now working with this user in order to further develop the site, but it seems to me (and was raised by someone else in a question following the paper) that working more closely with an individual who already uses the site frequently isn’t going to help address why other users didn’t engage with the site and runs the risk of making the tool less appealing to regular users.  Having said this, the site does look nice and has the potential to be a useful platform.  Interestingly there is no moderation of user submissions as the project decided that this would not be a sustainable approach.  Instead they intend to take content down if it is reported, although none has been so far.

The third paper was about digital anthropology and how to engage local populations in order to document their knowledge.  The project targeted communities in the Alps to get local knowledge about avalanches and other dangers in order to compare this to the official knowledge.  The resulting website presented these maps on various layers, with embedded interviews and images.  Photographs were taken in the field with a GPS enabled camera so the photos were automatically ‘tagged’ for location.

The fourth paper was about tagging semantics in comics using crowdsourcing, for example indexing comic book panels and balloon types.  A prototype interface, based on the Zooniverse Scribe tool can be found here: http://dissimilitudes.lip6.fr:8182. The project will be using ‘comic book markup language’, which is TEI based.  Users have to mark things in images, such as characters and emotions and balloon types.  The annotations then go through an aggregator.  The text is already being extracted with OCR as it’s mostly good enough to extract automatically.

For the first parallel session of Day 2 I attended the session on scholarly editions.  This was mainly because I was interested in the first paper, which was about how the hierarchical model for marking up texts (e.g. TEI XML) might not be the best approach and that an alternative ‘graph model’ might be a better fit.  I was interested in this because there are some aspects of text that TEI just cannot handle very well, such as where things you need to tag don’t fit neatly within the nested structure required by XML.  Unfortunately the focus of the paper was more theoretical than practical and didn’t even really touch upon what the alternative ‘graph model’ might be, never mind how it might be used, so I was a little disappointed.  The second paper was about digital palaeography, but again this didn’t really have any practical element to it and so was of little use to me.

The third paper was about establishing a more standardised approach to creating digital scholarly editions.  The speaker was from an institution that hosts 29 digital editions, some of which were published a long time ago.  All but three of these are still operational, but this is only because someone is there to take care of them as they all use different technologies and approaches.  The approach they take these days is purely XML based – TEI text, existDB, XSLT and XQuery.  They use one standardised set of tools to do everything and don’t bring in new tools unless there is a case to be made for using them.  There are too many tools out there that can be used to achieve the same goal and it’s easier for sustainability and long-term maintenance if all developers can agree on one set of tools for each task.  This was very interesting to hear about as this kind of standardisation of approaches is something that is going on at Glasgow at the moment too.

The speaker mentioned how important documentation was, and how everything should be documented first and then kept up to date as a project progresses.  Ideally this documentation should be kept in a standardised format and should be accessible at a standard URL for future use.  The speaker also discussed packaging all of the materials, including the documentation, and stated that his institution uses the ‘expath’ packaging system.

Someone raised an interesting question: how do you experiment with new technologies if everything has been standardised?  The speaker stated that experimenting with technology is good, and represents the research element in a lot of DH work.  A new technology can be incorporated into the institution’s toolbox if it has a proven use that existing technology can’t provide, but a developer shouldn’t expect new tools to be included in the toolbox straight away.  The speaker also mentioned the TEI processing toolbox (http://showcases.exist-db.org/exist/apps/tei-simple/index.html?odd=ODDity.odd), which can straightforwardly create an online digital edition from TEI documents.  I should look into this a bit more.

For the second parallel session of the day I headed to the session on maps and space.  The first paper was about building and analysing ancient landscapes using GIS and 3D technologies.  This isn’t really an area I have any involvement with (at the moment, at least) but it was a very interesting talk about the technologies used to map out the Mayan site of Copan in Honduras.  Information about it can be found here: http://mayacitybuilder.org/.  The project created 3D building objects using 3D Studio Max and plotted out the city using something called CityEngine (http://www.esri.com/software/cityengine), which, although meant for modern cities, could also be used for ancient sites.  Everything was then built using the Unity engine, and the eventual aim is to convert this into WebGL for use through a web browser.

The second paper was about mapping data from a large corpus of text to see the places where certain topics relating to infant mortality appeared.  The underlying corpus was stored in a CQPWeb system and consisted of newspaper reports which were geoparsed using the Edinburgh Geoparser.  The project also used ‘density smoothing’ to turn individual points into broader areas.  The project encountered some OCR issues with the data, and it would appear that rather than fix these they adapted their query strings to incorporate the errors when querying the corpus.  The speaker showed concordances and collocations of the most common symptom and treatment terms per decade, and also plotted the data on maps.  Comparing the newspaper data with official figures for deaths suggested that the newspapers were writing about areas because they were newsworthy, and that this wasn’t necessarily related to the number of deaths.
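As I understand it, this kind of ‘density smoothing’ is essentially kernel density estimation: each geoparsed point contributes a little ‘bump’ of density, and the bumps sum into broader areas.  Here is a minimal sketch of the idea in Python, using a Gaussian kernel and made-up coordinates (not the project’s actual implementation):

```python
import math

def density_surface(points, grid_xs, grid_ys, bandwidth=1.0):
    """Smooth a set of (x, y) points into a density value at each
    grid cell by summing a Gaussian kernel centred on each point."""
    surface = {}
    for gx in grid_xs:
        for gy in grid_ys:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2 * bandwidth ** 2))
            surface[(gx, gy)] = total
    return surface

# Two clustered mentions near (0, 0) and one outlier at (5, 5):
points = [(0, 0), (0.5, 0.2), (5, 5)]
surface = density_surface(points, grid_xs=[0, 5], grid_ys=[0, 5])
# The cell near the cluster ends up with a higher density than the outlier's cell.
```

Increasing the bandwidth spreads each point’s contribution further, which is what merges nearby individual points into one broad area on the map.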

The third paper was an overview of current practices favoured by web-based geo-humanities projects.  This was a hugely useful paper for me, as I have been involved with mapping projects in the past and will be again in future, and it included a lot of useful advice on what to do and what not to do when making a map-based interface.  The speaker looked at about 350 geohumanities projects, and around 50 in detail.  These were projects that were connected in some way with the GeoHumanities special interest group of the ADHO.  The speaker looked at the data models and formats used by these projects, the sorts of cartographic representations they used and how they engaged with users.  Some interesting projects that were mentioned were Bomb Sight (http://bombsight.org); the Yellow Star Houses project (http://www.yellowstarhouses.org/); Georeferencer (http://www.georeferencer.com), a crowdsourcing site for georeferencing historical maps; the ‘Building Inspector’ crowdsourcing site (http://buildinginspector.nypl.org/) created by the New York Public Library, which allows the public to identify types of buildings on old maps; and the Old Maps Online resource (http://www.oldmapsonline.org/), which provides access to a massive collection of historical maps.

The speaker pointed out that making interactive tile-based maps from old maps means we lose a lot of the information in the original: the material around the edges, annotations and other such things.  She also pointed out that more than 80% of the resources looked at used point data, most often as a means of accessing further data via a pop-up, but that in most cases the use of map markers could be better.  Icons could be used that embed meaning, or different-sized icons could be used.  The problem with ‘pins’ on a map is that they can represent very different things at different scales: an exact point, or a whole city or state.  There are some big problems with certainty and precision, so an exact point is not necessarily the best approach.  A good example of using more ‘fuzzy’ points is a map showing the Slave Revolt in Jamaica in 1760-61 (http://revolt.axismaps.com/map/), which is also a really good example of plotting routes over time on a map.  Basically the speaker recommended not using an actual map ‘pin’ unless you know the exact location, which makes a lot of sense to me.  She also pointed out that having a dedicated URL for each map view is a very good idea when it comes to people sharing a map: the zoom level and coordinates are embedded in the page URL.  See the ‘Bomb Sight’ site for an example of this.
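The ‘dedicated map URLs’ idea boils down to serialising the current map state (zoom, latitude, longitude) into the URL so that a shared link reopens the same view.  In a real site this would be done in browser JavaScript as the map moves, but the round-trip can be sketched in Python; the ‘#zoom/lat/lng’ fragment format here is just an illustrative convention:

```python
def map_state_to_hash(zoom, lat, lng):
    """Serialise the current map view as a URL fragment, e.g. '#15/51.5074/-0.1278'."""
    return f"#{zoom}/{lat:.4f}/{lng:.4f}"

def hash_to_map_state(fragment):
    """Parse a '#zoom/lat/lng' fragment back into map state."""
    zoom, lat, lng = fragment.lstrip("#").split("/")
    return int(zoom), float(lat), float(lng)

# A shared link round-trips back to the same view:
shared = map_state_to_hash(15, 51.5074, -0.1278)
assert hash_to_map_state(shared) == (15, 51.5074, -0.1278)
```

Because the state lives in the URL fragment, pasting the link elsewhere restores the exact zoom level and position without any server-side work.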

For the third parallel session of the day I continued with Maps and Space.  The first paper was about a project that is mapping the role of women editors in Europe in the modern period.  This was the second project I saw that is using the ‘Nodegoat’ system to generate spatial network diagrams, in this case mapping out the networks of correspondence between women editors and others, with a time slider allowing the user to focus on a particular period.  The overall period the project is looking at is 1710-1920, and during this time the state boundaries of Europe shifted a lot, so the project chose to focus on cities rather than countries.  The project intends to migrate its data to a Drupal system for publication, although I’m not entirely sure why the interface they demonstrated couldn’t just be used instead.

The second paper of the session was about a project that is looking to provide context, via an ontology service, to places that are mentioned on historical maps of Finland.  The project covers 3 million places across 460 maps, and places can be added or edited by the public.  The third paper discussed a project looking at multilingual responses to famine in India.  It concerns early modern travel writings, which are marked up as TEI texts and may involve multiple languages.  The project data is stored in an eXist database, and keywords appearing in the texts, such as people, places and animals, are tagged.  The routes of the travel writers can be plotted on a map interface.  Placename data comes from the www.geonames.org service and the map interface is based on Google Maps.  Routes are plotted as dots, and there is also a timeline allowing the user to jump to a specific point in a tour.  A question that arose relating to the data is that the locations of places mentioned in the texts are often not definite but are instead interpretations.  How should this be shown on a map?  Another crowdsourcing map-based site was mentioned during the questions: https://www.oldweather.org, which is another Zooniverse site.

For the third day I chose to attend the Building and analysing corpora session for the first parallel session.  The first paper was about creating an EpiDoc corpus of inscriptions on stone in ancient Sicily.  This wasn’t directly relevant to anything I’m involved with, but EpiDoc is TEI based and it was useful to hear about it.  The project consists of 3238 inscriptions from 135 museums.  Data is presented via a Google Maps interface, with a table showing the actual data and faceted browsing facilities.  It also has an API to allow museums to access their own specific data and use it in their own systems.  The metadata is all stored in XML and there is a Zotero-based bibliographical database too.  Any changes to a record are recorded in its metadata, and controlled vocabularies, taken from the Eagle project on Europeana, are used to catalogue each inscription.  Records are logged by material and type, and history and provenance information is recorded as well.  The original data previously existed in an Access database and was exported from there.  Some of the data was very messy, for example dates or ranges of dates that were logged using codes that were often not used consistently.  Interpreted and diplomatic views of the inscriptions can be accessed, together with commentaries.  The XML data is stored in an eXist database, with some information, such as details about museums, stored in MySQL.  The data is exported as JSON and then parsed by PHP for display.

The second paper was about the Bentham Corpus, which is a well-known crowdsourced corpus.  Of the original 60,000 folios, 40,000 were untranscribed.  It took 50 years to transcribe the first 20,000, and crowdsourcing was used as a means of speeding this up.  The website launched 6 years ago and is based on MediaWiki.  Users can add simple tags representing TEI markup and a moderator checks each transcript.  There have been 16,000 transcriptions in 6 years and 514 people have transcribed something (which is rather fewer than I thought would have been involved).  The project has had 26 ‘super transcribers’, and an average of 56 transcripts were produced a week, resulting in over 5 million transcribed words.  The transcription website can be found here: http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham.  The project is now building search facilities for the corpus, including a search index, topic models, clustering of similar words and visualisations.  The project is using a tool called ‘CorText Manager’ to handle lexical extraction, clustering and visualisation.  Visualisations are also being generated using Gephi, and these are exported and made interactive using the Sigma.js library (something I really must look into a bit more).  Several visualisations and maps that the project has produced can be found here: http://apps.lattice.cnrs.fr/benthamdev/index.html

For the second parallel session of the day I attended the session on visualisations.  The first speaker talked about William Playfair, who I’d never heard of before but really should have been aware of, as he was one of the first people to visualise data and even invented a number of chart types, such as the bar chart and the pie chart.  The speaker’s project was about recreating some of Playfair’s original visualisations using D3, which was fascinating to hear about.  The second paper concerned visualising ontologies, specifically relating to the correspondence of the astronomer Clavius.  Terms found in the 330 letters were manually extracted and developed into an OWL-based ontology using the Protégé ontology editor.  This consisted of 106 classes in 4 hierarchical levels.  The project developed visualisations of the ontology to make analysis easier for both experts and non-experts.  The visualisations consist of node-link diagrams that are made to resemble hand-made diagrams.

The third paper was the second one I saw that analysed the JFK / Nixon campaign texts.  This one used something called ORATIO, which looked like a very nice tool, although unfortunately I can’t seem to find a link to it or any further information about it anywhere.  The project looked at people, places, concordances and affinity in the speeches.  These were plotted on a map and on a timeline, and the affinity view was rather nice: lines on the left and right representing JFK and Nixon, and bubbles in the middle representing terms or people, with the position of each bubble showing whether it had more of an affinity with one speaker or the other.

The fourth paper demonstrated a desktop-based text viewer that can ‘zoom into’ a text to show more detail at different levels, much as more features become visible as you zoom in on a Google Maps interface.  The idea appears to be to allow for scalable reading, from distant reading to close reading, although I’m not entirely sure how it could be used practically.  The final paper demonstrated some visualisations of networks of literary salons in Mexico City.  The project extracted data using a Python library called ‘Beautiful Soup’ and built visualisations using Gephi and the Sigma.js library.
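The extraction step mentioned above (pulling names out of web pages to build a network) can be sketched with Python’s standard-library html.parser rather than Beautiful Soup, which the project actually used; the HTML structure, class name and names below are invented for illustration:

```python
from html.parser import HTMLParser

class NameExtractor(HTMLParser):
    """Collect the text of every <span class="person"> element."""
    def __init__(self):
        super().__init__()
        self.in_person = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "span" and ("class", "person") in attrs:
            self.in_person = True

    def handle_data(self, data):
        if self.in_person:
            self.names.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_person = False

html = ('<p><span class="person">Ignacio Altamirano</span> hosted '
        '<span class="person">Manuel Payno</span> at the salon.</p>')
parser = NameExtractor()
parser.feed(html)
# parser.names now holds the two extracted names
```

The resulting name lists could then be turned into co-occurrence edges and loaded into Gephi for layout and analysis, which seems to be roughly the pipeline the project described.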

For the final parallel session of the final day of the conference I decided to stick with visualisations, and this was a fun session to end on.  It mostly concerned 3D visualisations, games and virtual tours, which are not really directly related to anything I do at the moment, but it was great to see all of the demonstrations.  The first speaker talked about an educational videogame he is creating about colonial Virginia and the slave trade.  The second speaker gave an overview of some of the more immersive technologies that are now available, such as VR and domes that have images projected onto them.  The third speaker discussed a project that is creating a 3D representation of an Ottoman insane asylum, while the fourth gave a demo of a WebGL-based interactive tour through a German cathedral that looked very nice.

So, that’s an overview of all of the parallel sessions I attended at DH2016!  It was an excellent conference and I feel like I’ve learned a lot, especially on the mapping and visualisation fronts.  In addition, it was great to visit Krakow as I’d never been.  It is a beautiful place and I would love to go back some day and do some more exploring.