Week Beginning 11th July 2016

I returned to work this week after being on holiday for the past two weeks.  I was only in the office for one day this week, though, as the rest of the time I was attending the DH2016 conference in Krakow.  Most of Monday was spent printing off materials for the conference, checking in for flights, figuring out where my hotel was and things like that.  I also read through the abstracts of the parallel sessions to try to decide which sessions I should attend.  The DH conference is a particularly large one and at various times there were up to nine parallel sessions, each consisting of up to five papers, so deciding what to attend was something of a mammoth undertaking – especially as there were several sessions happening at the same time that all appealed to me.

The parts of Monday that I didn’t spend preparing for the conference I spent working on materials for a proposal Jane Stuart-Smith was submitting at short notice. This project involves several other institutions and some non-SCS people within Glasgow and seems rather unwieldy, but certainly has potential.  I commented on the proposal documents and the materials were submitted in time for the call deadline on the Tuesday.

Tuesday for me was spent travelling to Krakow via Schiphol, along with a few other colleagues from Glasgow, namely Marc Alexander, Fraser Dallachy, Katie Lowe and Johanna Green.  The journey went pretty smoothly, although there was a delay of about 40 minutes when departing Schiphol.  We arrived in Krakow too late to attend the opening ceremony and reception, so after checking into our respective hotels we met up, explored the city and got some food.

Wednesday was the start of the main conference and at registration I received the customary goody bag, which for this event included a branded DH water bottle, t-shirt, pen and USB stick, amongst other things.  The first parallel session began at 9:30 and I attended the session on ‘Analysing and visualising networks’, which was a short paper session consisting of 5 papers. The first paper was about a project called Kinomatics (http://kinomatics.com) that is looking into diversity and reciprocity in the global flow of contemporary cinema.  The project is dealing with 330 million records, looking at every film screening in 48 countries – some 97000 films and 33000 venues.  The speaker pointed out that this is not ‘big’ data compared to some datasets but it is ‘big data’ in that it is ungraspable by conventional approaches.  The project is presenting its findings using some interesting visualisations, such as plotting screenings on a radar chart, with months in a year as spokes and number of screenings in each month plotted on these spokes.  The shape of the chart could then be compared for different films.  Data was also visualised spatially and using a stacked column chart to show the technical formats each film used.  Further visualisations demonstrated the flow of transfers in and out of countries and a network diagram showing the reciprocity of transfers between countries.  The speaker pointed out that to get a more accurate picture of the diversity of films in a country you need to look at their exposure – the actual number of screenings.  The project’s data comes from a company that provides the data to Google.  The data is usually deleted after a month but the project is archiving it and maintaining the dataset.
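
Out of curiosity I jotted down how such a radar chart might be constructed – just a minimal TypeScript sketch that turns a list of monthly screening counts into the points of an SVG polygon.  The figures are invented and this isn't the Kinomatics project's actual code or data format.

```typescript
// Sketch: turn 12 monthly screening counts into the points of an SVG
// radar-chart polygon. The counts below are invented illustration data,
// not Kinomatics figures.
const monthlyScreenings = [120, 95, 80, 60, 45, 30, 25, 40, 70, 110, 150, 160];

function radarPoints(counts: number[], radius: number, cx: number, cy: number): string {
  const max = Math.max(...counts);
  return counts
    .map((count, i) => {
      // One spoke per month, starting at 12 o'clock and going clockwise.
      const angle = (i / counts.length) * 2 * Math.PI - Math.PI / 2;
      const r = (count / max) * radius;
      return `${cx + r * Math.cos(angle)},${cy + r * Math.sin(angle)}`;
    })
    .join(" ");
}

// Drop the result into an SVG <polygon points="..."> to compare film 'shapes'.
console.log(radarPoints(monthlyScreenings, 100, 120, 120));
```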

The second speaker discussed methods of analysing social structures that evolve over time.  Adding a temporal aspect to network diagrams to show the evolution of a network is a pretty interesting subject and the speaker chose as his subject marriage in the early Mormon church, which is a useful dataset due to the non-binary nature of some Mormon marriages and also because there are very good records available.  One interesting way the project visualised a marriage was to use a chord diagram, with husbands, wives and their children all round the edges, with individuals colour-coded based on gender and lines between individuals showing their relationships.  For example, a child had connecting lines to its mother and father if both were represented in the marriage.  New wives, husbands or children could then be easily added to the diagram with the space taken up by the other actors updating as required.  To represent the generational flow over time these individual chord diagrams were then placed within a flow diagram that represented time from left to right, with the option of zooming in on individual chord diagrams within this and viewing lines representing individual relationships between multiple chord diagrams.

The third paper looked at networks of confidentiality and secrecy in correspondences and again was interested in plotting relationships in networks across time, but this time also plotting the data spatially on a map of Europe, for example having nodes as locations where a letter was sent to / from and lines connecting the two, with different colours representing directionality and line thickness showing the number of letters.  This interface incorporated a time slider allowing the user to select a specific period of time to focus on – a start date slider and an end date slider so as to give greater flexibility.  The project used a system called ‘NodeGoat’ (https://nodegoat.net/) which was also used for another project later in the week.  It seems like an interesting tool, although it is proprietary and isn’t free to use or adapt.  Some example maps similar to the ones shown in this presentation can be found here: http://mnn.nodegoat.net/viewer
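
As a note to self for future map projects, here is roughly how that sort of flow map could be put together with Leaflet, with line thickness scaled by the number of letters and colour showing direction.  The places, counts and element id are made up; this is just the general idea rather than how NodeGoat actually does it.

```typescript
import * as L from "leaflet";

// Sketch: plot letter flows between two cities, with line thickness scaled by
// the number of letters and colour showing direction relative to a focus
// correspondent. Places, counts and the "map" element id are invented.
const map = L.map("map").setView([50.0, 10.0], 4);
L.tileLayer("https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png", {
  attribution: "© OpenStreetMap contributors",
}).addTo(map);

type Flow = {
  from: [number, number];
  to: [number, number];
  letters: number;
  direction: "sent" | "received"; // relative to the focus correspondent
};

const flows: Flow[] = [
  { from: [52.37, 4.9], to: [48.86, 2.35], letters: 12, direction: "sent" },
  { from: [48.86, 2.35], to: [52.37, 4.9], letters: 3, direction: "received" },
];

for (const flow of flows) {
  // Thicker lines for more letters; blue for outgoing, red for incoming.
  L.polyline([flow.from, flow.to], {
    weight: 1 + flow.letters / 2,
    color: flow.direction === "sent" ? "#1f77b4" : "#d62728",
  })
    .addTo(map)
    .bindPopup(`${flow.letters} letters`);
  L.circleMarker(flow.from, { radius: 5 }).addTo(map);
  L.circleMarker(flow.to, { radius: 5 }).addTo(map);
}
```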

The fourth paper looked at record linkage when dealing with sparse historical data.  Historical records give us limited information, and identifying the same person across different documents and linking the mentions up can be tricky.  The paper looked at resolving name ambiguity through a combination of natural language processing, linking name mentions into a knowledge base and looking at the context of the names within the documents.  For example, if there is a singer and a politician who have the same name the system should be able to work out from the context which person has been mentioned.  With historical documents, however, there is far less background knowledge available.  The project looked at early modern Venetian apprenticeship contracts – some 55,000 contracts over 200 years.  A subset of these was manually annotated for things like name, gender, age, profession and geography to see how likely it is that these features can be used to match individuals, and various network diagrams were shown to demonstrate the effectiveness of certain pairings.  The project then aims to integrate the identification system into transcription software to automate the process.
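
The general approach is easy enough to sketch.  Below is a crude TypeScript illustration that scores whether two name mentions refer to the same person by combining name similarity with overlapping contextual features.  The fields, weights and example names are all invented, and the paper's actual model is considerably more sophisticated.

```typescript
// Sketch: a crude way to decide whether two mentions refer to the same
// person by combining name similarity with overlap in contextual features
// (profession, place, rough date). Fields, weights and names are invented.
type Mention = { name: string; profession?: string; parish?: string; year?: number };

function nameSimilarity(a: string, b: string): number {
  // Jaccard overlap of name tokens; a real system would use phonetic or
  // edit-distance measures to cope with spelling variation.
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const shared = [...ta].filter((t) => tb.has(t)).length;
  return shared / new Set([...ta, ...tb]).size;
}

function sameEntityScore(a: Mention, b: Mention): number {
  let score = nameSimilarity(a.name, b.name);
  if (a.profession && a.profession === b.profession) score += 0.3;
  if (a.parish && a.parish === b.parish) score += 0.2;
  if (a.year && b.year && Math.abs(a.year - b.year) <= 10) score += 0.1;
  return score;
}

const m1: Mention = { name: "Zuane Moro", profession: "weaver", parish: "S. Polo", year: 1592 };
const m2: Mention = { name: "Zuane Moro", profession: "weaver", parish: "S. Polo", year: 1598 };
console.log(sameEntityScore(m1, m2) > 1.0); // likely the same person
```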

The final paper of the session looked at mapping the activity of the League of Nations based on correspondence from the period between the two world wars.  The project wanted to map out the relations between people on the committee based on the authors and receivers of letters and the topics of letters.  The speaker showed some very nice network diagrams showing correspondence connections, with the overall diagram containing 30,000 documents, 3,200 nodes and 26,500 connections.  The speaker demonstrated how the high-level graph could be drilled down into and individuals or groups could be highlighted.  The speaker mentioned ‘betweenness centrality’ as a mechanism for working out where a node should be positioned within the network diagram.  There were some interesting questions after the paper, for example how the temporal aspect could be incorporated and how people within the League of Nations were not static but changed their role.  The visualisations are also not currently dynamic.
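
I hadn't come across betweenness centrality in any detail before, so here is a quick TypeScript sketch of Brandes' algorithm, the standard way of computing it, applied to a tiny made-up correspondence network.  In practice a tool like Gephi would calculate this for you.

```typescript
// Sketch: Brandes' algorithm for betweenness centrality on a small undirected
// graph. The correspondence network at the bottom is a made-up example.
type Graph = Map<string, string[]>;

function betweenness(graph: Graph): Map<string, number> {
  const cb = new Map<string, number>();
  for (const v of graph.keys()) cb.set(v, 0);

  for (const s of graph.keys()) {
    const stack: string[] = [];
    const pred = new Map<string, string[]>();
    const sigma = new Map<string, number>();
    const dist = new Map<string, number>();
    for (const v of graph.keys()) {
      pred.set(v, []);
      sigma.set(v, 0);
      dist.set(v, -1);
    }
    sigma.set(s, 1);
    dist.set(s, 0);

    // Breadth-first search counting shortest paths from s.
    const queue: string[] = [s];
    while (queue.length > 0) {
      const v = queue.shift()!;
      stack.push(v);
      for (const w of graph.get(v) ?? []) {
        if (dist.get(w)! < 0) {
          dist.set(w, dist.get(v)! + 1);
          queue.push(w);
        }
        if (dist.get(w) === dist.get(v)! + 1) {
          sigma.set(w, sigma.get(w)! + sigma.get(v)!);
          pred.get(w)!.push(v);
        }
      }
    }

    // Accumulate dependencies back up the BFS tree.
    const delta = new Map<string, number>();
    for (const v of graph.keys()) delta.set(v, 0);
    while (stack.length > 0) {
      const w = stack.pop()!;
      for (const v of pred.get(w)!) {
        delta.set(v, delta.get(v)! + (sigma.get(v)! / sigma.get(w)!) * (1 + delta.get(w)!));
      }
      if (w !== s) cb.set(w, cb.get(w)! + delta.get(w)!);
    }
  }
  // Undirected graph: each pair of endpoints was counted twice.
  for (const [v, score] of cb) cb.set(v, score / 2);
  return cb;
}

// Toy correspondence network: who exchanged letters with whom.
const g: Graph = new Map([
  ["secretariat", ["delegateA", "delegateB", "delegateC"]],
  ["delegateA", ["secretariat", "delegateB"]],
  ["delegateB", ["secretariat", "delegateA"]],
  ["delegateC", ["secretariat"]],
]);
console.log(betweenness(g)); // the secretariat node scores highest
```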

For the second parallel session I attended a further session on ‘Analysing and visualising networks’, which again was a series of short papers.  The first paper that was shown was actually the second paper of the session due to some technical issues.  This paper gave an overview of various network graph techniques that the speaker had used with his dataset, which was Jacobite poetry.  He pointed out that network graphs look great and people are generally impressed by them but there is a danger that people don’t actually understand them.  The network diagrams that were shown were all produced using Gephi, which I will have to investigate again at some point.

The second paper (which should have been the first) was about methods for identifying the main characters in novels in order to represent these on network graphs with characters as nodes, their relationships as the edges and a weighting based on the strength of the relationship.  As readers we do this easily but to automatically create networks is challenging – computers need metrics, not just ‘feelings’ about which are the central characters.  The speaker discussed ‘coreference resolution’ as the means of knowing that people who are identified in different ways are actually the same person – for example when they are identified with pronouns.  An automated system also needs to work out where there are interactions between characters.  A crude way is to say that when two characters appear in the same paragraph then there is a direct communication between them.  A more complicated way is to extract direct speech and actually work out who talks to whom.  The speaker looked at whether the simple method might actually perform well enough.  His research discarded pronouns and performed ‘named entity recognition’ using summaries of 58 novels from Kindlers Literary Lexicon Online.  After running a series of experiments it would appear that the simpler methods work best.  There was an interesting question after the paper about how to determine what a character actually is.  If someone is merely mentioned, are they a character?  If God is mentioned a lot, is he a character?  The speaker said they got around this by merely focussing on the main characters.
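
The ‘crude’ paragraph-based method is simple enough to sketch in a few lines of TypeScript: treat two characters as interacting whenever they are named in the same paragraph, and weight the edge by how often that happens.  The character list and text below are toy examples; in a real pipeline the names would come from named entity recognition.

```typescript
// Sketch of the "crude" co-occurrence method: two characters interact
// whenever they are named in the same paragraph, with edge weight equal to
// the number of shared paragraphs. The text and characters are toy examples.
function cooccurrenceEdges(text: string, characters: string[]): Map<string, number> {
  const edges = new Map<string, number>();
  for (const paragraph of text.split(/\n\s*\n/)) {
    const present = characters.filter((c) => paragraph.includes(c));
    for (let i = 0; i < present.length; i++) {
      for (let j = i + 1; j < present.length; j++) {
        const key = [present[i], present[j]].sort().join("|");
        edges.set(key, (edges.get(key) ?? 0) + 1);
      }
    }
  }
  return edges;
}

const novel = `Elizabeth spoke with Darcy at the ball.

Jane wrote to Elizabeth the next morning.`;
console.log(cooccurrenceEdges(novel, ["Elizabeth", "Darcy", "Jane"]));
// Map { "Darcy|Elizabeth" => 1, "Elizabeth|Jane" => 1 }
```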

The third paper showed how a researcher had developed a network analysis of the field of economic history in Australia over 20 years, and specifically how co-location and geographic proximity are important in the development of an academic field: academics who work near each other tend to do more together and these connections grow stronger over multiple years.  The speaker looked at three different types of collaboration: co-publishing, contributions and sub-authorship, with the network data based on citations and much of the data gathering and weighting done manually.  The connections were mainly centred on three main cities and sub-authorship was the most geographically diverse.  The findings from the network showed that people moving between the cities had a significant impact on transferring and spreading ideas.

The fourth paper looked at building a network diagram of interactions between people as a means of exploring a corpus.  The speaker demonstrated some really nice interactive visualisations of people mentioned in the campaign speeches of JFK and Nixon (these were also used in another paper on a different day).  The underlying data was passed through the Stanford named entity processor to work out co-occurrence in order to build the relationships between people.  The data was then loaded into a very nice tool called ALCIDE, which can currently be accessed here: http://celct.fbk.eu:8080/Alcide_Demo/. The tool presents some very nice interactive visualisations including network diagrams built in D3 and heatmaps built in Leaflet, both of which also have a double-ended time slider to enable any particular range of time to be selected.  There are also full-text searches, name clouds, frequency graphs etc.  All of the interfaces are linked into the full text of the speeches too.  For the visualisation the speaker noted that HTML5 Canvas proved to be much faster than SVG, but that this was more tricky to implement in D3.  I’ll have to look into this.  For the actual case study of JFK vs Nixon the study found that there were more people mentioned by JFK – 1244 vs 486.  The speaker noted that there is still work to be done with identifying pronouns using natural language processing.
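
The Canvas point is worth remembering, so here is a minimal sketch of the general technique: running a d3-force simulation but drawing the nodes and edges onto an HTML5 canvas on each tick rather than updating SVG elements.  The node and link data and the canvas id are placeholders; this is not ALCIDE's actual code.

```typescript
import * as d3 from "d3";

// Sketch: drawing a person co-occurrence network with d3-force but rendering
// to an HTML5 canvas instead of SVG, which the speaker found faster for
// large graphs. The tiny node/link lists and the "#graph" id are placeholders.
interface PersonNode extends d3.SimulationNodeDatum { id: string; }
interface MentionLink extends d3.SimulationLinkDatum<PersonNode> { weight: number; }

const nodes: PersonNode[] = [{ id: "Kennedy" }, { id: "Nixon" }, { id: "Eisenhower" }];
const links: MentionLink[] = [
  { source: "Kennedy", target: "Nixon", weight: 5 },
  { source: "Nixon", target: "Eisenhower", weight: 2 },
];

const canvas = document.querySelector<HTMLCanvasElement>("#graph")!;
const ctx = canvas.getContext("2d")!;

d3.forceSimulation(nodes)
  .force("link", d3.forceLink<PersonNode, MentionLink>(links).id((d) => d.id))
  .force("charge", d3.forceManyBody().strength(-100))
  .force("center", d3.forceCenter(canvas.width / 2, canvas.height / 2))
  .on("tick", draw);

function draw(): void {
  // Redraw the whole graph once per simulation tick.
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.strokeStyle = "#999";
  for (const link of links) {
    const s = link.source as PersonNode;
    const t = link.target as PersonNode;
    ctx.lineWidth = link.weight;
    ctx.beginPath();
    ctx.moveTo(s.x!, s.y!);
    ctx.lineTo(t.x!, t.y!);
    ctx.stroke();
  }
  ctx.fillStyle = "#1f77b4";
  for (const node of nodes) {
    ctx.beginPath();
    ctx.arc(node.x!, node.y!, 6, 0, 2 * Math.PI);
    ctx.fill();
  }
}
```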

The final paper of the session looked at mapping out relationships between authors and identification of areas of knowledge within the DH community.  It used DH journals from the ADHO and Scopus in order to build co-authorship and co-citation network diagrams.  There appears to be little international collaboration in the field.

For the third and final parallel session of the day I attended the session on Crowdsourcing.  The first paper was about the experiences of the ‘Letters of 1916’ project (http://letters1916.maynoothuniversity.ie/), which was especially useful to hear about as there are a couple of potential crowdsourcing projects that I may be involved with at some point.  The speaker pointed out that the success of their project (as with all such projects) was based on engagement with users.  The project had around 15,000 registered users and users are able to upload their own images of letters in addition to transcribing existing letters.  The interface developed by the project uses Scripto (http://scripto.org/), which uses parts of MediaWiki to manage user contributions.  It is also integrated with the Omeka content management system, which I’ve been interested in using for some time.  Apparently the Transcribe Bentham project used the same kind of setup too.  Users are asked to perform light TEI coding (basically adding tags by pressing buttons and filling in the blanks between tags).  The speaker stated that 71% of its users are women and that social media is important for engagement.  Mondays to Wednesdays were the best days for engagement and most transcription was done in the afternoon.  The speaker noted that people need to be interested in the subject, and that there were peaks of activity around important dates, such as the centenary of the Easter Rising.  36% of users were from an HE background and most were over 55.  Publicity in ‘old media’ really helped to connect to such people rather than the use of social media.  Connections to existing organisations such as Volunteer Ireland also helped, as did participating in community events.  To keep people involved, giving them feedback and descriptions of the content was important.  With regards to the TEI markup, the advice from the project is to keep it simple and not to tell people it’s ‘hard’.  Users should be able to figure it out without relying on any documentation.  Some ‘superusers’ emerged who did a large amount of the transcriptions.  In terms of the workflow, several people transcribed each page and all the changes were tracked, resulting in one final version being produced.
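
The ‘light TEI’ buttons are the sort of thing I may need to build for future crowdsourcing projects, so here is a bare-bones sketch of the idea: a button that wraps whatever text the transcriber has selected in a chosen TEI tag.  The element ids and tag list are invented and this is not how the Letters of 1916 interface is actually implemented.

```typescript
// Sketch: a "light TEI" button that wraps the transcriber's current selection
// in a textarea with a chosen tag. Element ids and the tag list are invented.
const editor = document.querySelector<HTMLTextAreaElement>("#transcription")!;

function wrapSelection(tag: string): void {
  const start = editor.selectionStart;
  const end = editor.selectionEnd;
  const selected = editor.value.slice(start, end);
  editor.value =
    editor.value.slice(0, start) + `<${tag}>${selected}</${tag}>` + editor.value.slice(end);
  // Put the caret just after the inserted closing tag.
  editor.selectionStart = editor.selectionEnd = end + tag.length * 2 + 5;
  editor.focus();
}

// One button per supported tag, e.g. <persName>, <placeName>, <date>.
for (const tag of ["persName", "placeName", "date"]) {
  document
    .querySelector<HTMLButtonElement>(`#btn-${tag}`)
    ?.addEventListener("click", () => wrapSelection(tag));
}
```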

The second paper discussed a project that attempted to use crowdsourcing to create new resources from existing digital content, as found here: http://www.cvce.eu/en/epublications/mypublications.  Users search and find existing content and build up their own ‘story’ around it.  The project seems to have struggled to get the public interested and the speaker reported that almost all of the submissions so far have been by one individual, and that this individual is a retired academic who has a connection with the project.  The project is now working with this user in order to further develop the site, but it seems to me (and was raised by someone else in a question following the paper) that working more closely with an individual who already uses the site frequently isn’t going to help address why other users didn’t engage with the site and runs the risk of making the tool less appealing to regular users.  Having said this, the site does look nice and has the potential to be a useful platform.  Interestingly there is no moderation of user submissions as the project decided that this would not be a sustainable approach.  Instead they intend to take content down if it is reported, although none has been so far.

The third paper was about digital anthropology and how to engage local populations in order to document their knowledge.  The project targeted communities in the Alps to get local knowledge about avalanches and other dangers in order to compare this to the official knowledge.  The resulting website presented these maps on various layers, with embedded interviews and images.  Photographs were taken in the field with a GPS enabled camera so the photos were automatically ‘tagged’ for location.

The fourth paper was about tagging semantics in comics using crowdsourcing, for example indexing comic book panels and balloon types.  A prototype interface, based on the Zooniverse Scribe tool, can be found here: http://dissimilitudes.lip6.fr:8182. The project will be using ‘comic book markup language’, which is TEI based.  Users have to mark things in images, such as characters, emotions and balloon types.  The annotations then go through an aggregator.  The text is already being extracted with OCR as it’s mostly good enough to extract automatically.

For the first parallel session of Day 2 I attended the session on scholarly editions.  This was mainly because I was interested in the first paper, which was about how the hierarchical model for marking up texts (e.g. TEI XML) might not be the best approach and how an alternative ‘graph model’ might be a better fit.  I was interested in this because there are some aspects of text that TEI just cannot handle very well, such as where things you need to tag don’t fit neatly within the nested structure required by XML.  Unfortunately the focus of the paper was more theoretical than practical and didn’t really touch upon what the alternative ‘graph model’ might be, never mind how it might be used, so I was a little disappointed.  The second paper was about digital palaeography, but again this didn’t really have any practical element to it and so was of little use to me.

The third paper was about establishing a more standardised approach to creating digital scholarly editions.  The speaker was from an institution that hosts 29 digital editions, some of which were published a long time ago.  All but three of these are still operational, but this is only because someone is there to take care of them as they all use different technologies and approaches.  The approach they take these days is purely XML based – TEI text, existDB, XSLT and XQuery.  They use one standardised set of tools to do everything and don’t bring in new tools unless there is a case to be made for using them.  There are too many tools out there that can be used to achieve the same goal and it’s easier for sustainability and long-term maintenance if all developers can agree on one set of tools for each task.  This was very interesting to hear about as this kind of standardisation of approaches is something that is going on at Glasgow at the moment too.

The speaker mentioned how important documentation was, and how everything should be documented first and then kept up to date as a project progresses.  Ideally this documentation should be kept in a standardised format and should be accessible at a standard URL for future use.  The speaker also discussed packaging all of the materials, including the documentation, and stated that his institution uses the ‘expath’ packaging system.

Someone raised the interesting question about how to experiment with new technologies if everything has been standardised.  The speaker stated that experimenting with technology is good, and represents the research in a lot of DH work.  New technology can be incorporated into the institution’s toolbox if it has a proven use that other technology can’t provide, but a developer shouldn’t expect new tools to just be included and part of the toolbox straight away.  The speaker also mentioned the TEI processing toolbox (http://showcases.exist-db.org/exist/apps/tei-simple/index.html?odd=ODDity.odd) that can straightforwardly create an online digital edition from TEI documents.  I should look into this a bit more.

For the second parallel session of the day I headed to the session on maps and space.  The first paper was about building and analysing ancient landscapes using GIS and 3D technologies.  This isn’t really an area I have any involvement with (at the moment, at least) but it was a very interesting talk about the technologies used to map out the Mayan site of Copan in Honduras.  Information about it can be found here: http://mayacitybuilder.org/.  The project created 3D building objects using 3D Studio Max and plotted out the city using something called CityEngine (http://www.esri.com/software/cityengine) which, although meant for modern cities, could also be used for ancient sites.  Everything was then built using the Unity Engine and the eventual aim is to convert this into WebGL for use through a web browser.

The second paper was about mapping data from a large corpus of text to see the places where certain topics relating to infant mortality appeared.  The underlying corpus was stored in a CQPWeb system and comprised newspaper reports which were geoparsed using the Edinburgh Geoparser.  The project also used ‘density smoothing’ to turn individual points into broader areas.  The project encountered some OCR issues with the data and it would appear that rather than fix these they adapted their query strings to incorporate the issues when querying the corpus.  The speaker used concordances and collocations to show the most common symptom and treatment terms per decade and also plotted the data on maps.  Comparing the data from newspapers to official figures for deaths seemed to show that the newspapers were writing about areas because they were newsworthy and this wasn’t necessarily related to the number of deaths.
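
I wasn't sure exactly what was meant by ‘density smoothing’, but the usual approach is kernel density estimation, which is straightforward to sketch: lay a grid over the area and sum a Gaussian bump centred on each geoparsed point.  The sketch below treats latitude and longitude as flat coordinates for simplicity and uses made-up points and bandwidth, so the project's own method may well differ.

```typescript
// Sketch: kernel density estimation over a grid, turning individual geoparsed
// points into a smooth surface. Treats lat/lon as planar coordinates, which
// is a crude simplification; points and bandwidth are arbitrary examples.
type Point = { lat: number; lon: number };

function densityGrid(points: Point[], gridSize: number, bandwidth: number): number[][] {
  const lats = points.map((p) => p.lat);
  const lons = points.map((p) => p.lon);
  const minLat = Math.min(...lats), maxLat = Math.max(...lats);
  const minLon = Math.min(...lons), maxLon = Math.max(...lons);

  const grid: number[][] = [];
  for (let i = 0; i < gridSize; i++) {
    const row: number[] = [];
    const lat = minLat + ((maxLat - minLat) * i) / (gridSize - 1);
    for (let j = 0; j < gridSize; j++) {
      const lon = minLon + ((maxLon - minLon) * j) / (gridSize - 1);
      // Sum a Gaussian bump centred on every point.
      let density = 0;
      for (const p of points) {
        const d2 = (p.lat - lat) ** 2 + (p.lon - lon) ** 2;
        density += Math.exp(-d2 / (2 * bandwidth * bandwidth));
      }
      row.push(density);
    }
    grid.push(row);
  }
  return grid;
}

// Toy set of geoparsed article locations; the grid could feed a heatmap layer.
const mentions: Point[] = [
  { lat: 51.5, lon: -0.12 },
  { lat: 51.51, lon: -0.1 },
  { lat: 53.48, lon: -2.24 },
];
console.log(densityGrid(mentions, 10, 0.2));
```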

The third paper was an overview of current practices favoured by web-based geo-humanities projects.  This was a hugely useful paper for me as I have been involved with mapping projects in the past and will be again in future, and the paper included a lot of useful advice on what to do and what not to do when making a map-based interface.  The speaker looked at about 350 geohumanities projects, and around 50 in detail.  These were projects that were connected in some way with the Geohumanities special interest group of the ADHO.  The speaker looked at the data models and formats used by these projects and also the sorts of cartographic representations that they used and how they engaged with users.  Some interesting projects that were mentioned were bomb sight (http://bombsight.org); the Yellow Star Houses project (http://www.yellowstarhouses.org/); http://www.georeferencer.com, a crowdsourcing site for georeferencing historical maps; the ‘Building Inspector’ crowdsourcing site (http://buildinginspector.nypl.org/) created by the New York Public Library, which allows the public to identify types of buildings on old maps; and the Old Maps Online resource (http://www.oldmapsonline.org/), which provides access to a massive collection of historical maps.

The speaker pointed out that making interactive tile based maps from old maps means we lose a lot of the information in the original – the stuff round the edges, annotations and other such things.  She also pointed out that more than 80% of the resources looked at used point data, most often used to access further data via a pop-up.  But in most cases the use of map markers could be better.  Icons could be used that embed meaning, or different sized icons could be used.  The problem with ‘pins’ on a map is that they can represent very different meanings at different scales – an exact point, or a whole city or state.  There are some big problems with certainty and precision so an exact point is not necessarily the best approach.  A good example of using more ‘fuzzy’ points is a map showing the Slave Revolt in Jamaica in 1760-61 (http://revolt.axismaps.com/map/).  This is a really good example of plotting routes over time on a map too. Basically the speaker recommended not using an actual map ‘pin’ unless you know the exact location, which makes a lot of sense to me.  She also pointed out that having dedicated map URLs for a map window is a very good idea for when it comes to people sharing a map.  This embeds the zoom level and coordinates in the page URL.  See the ’bomb sight’ site for an example of this.
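
The shareable-URL idea is easy to implement with Leaflet, so I made a note of the basic pattern: write the zoom level and centre coordinates into the page's hash fragment whenever the map moves, and restore them on load.  Plugins such as leaflet-hash already do this; the element id and starting view below are just illustrative.

```typescript
import * as L from "leaflet";

// Sketch: keep the zoom level and centre coordinates in the page URL so a map
// view can be shared, along the lines of the Bomb Sight site. The "map"
// element id and the starting view are invented for illustration.
const map = L.map("map").setView([55.86, -4.25], 13);
L.tileLayer("https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png").addTo(map);

// Write the current view into the hash whenever the user pans or zooms.
map.on("moveend", () => {
  const c = map.getCenter();
  window.location.hash = `#${map.getZoom()}/${c.lat.toFixed(5)}/${c.lng.toFixed(5)}`;
});

// Restore the view from a shared URL on page load.
const [zoom, lat, lng] = window.location.hash.slice(1).split("/").map(Number);
if ([zoom, lat, lng].every((n) => Number.isFinite(n))) {
  map.setView([lat, lng], zoom);
}
```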

For the third parallel session of the day I continued with Maps and Space.  The first paper was about a project that is mapping the role of women editors in Europe in the modern period.  This is the second project that I saw that is using the ‘Nodegoat’ system to generate spatial network diagrams, in this case mapping out the networks of correspondence between women editors and others with a time slider allowing the user to focus on a particular period of time.  The overall period the project is looking at is 1710-1920 and during this time the state boundaries of Europe shifted a lot so rather than countries the project chose to focus on cities.  The project intends to migrate its data to a Drupal system for publication, although I’m not entirely sure why the interface they demonstrated couldn’t just be used instead.

The second paper of the session was about a project that is looking to provide context to places that are mentioned on historical maps of Finland via an ontology service.  The project contains 3 million places over 460 maps and places can be added / edited by the public.  The third paper was a discussion of a project looking at multilingual responses to famine in India.  It concerns early modern travel writing, which is marked up as TEI text and may involve multiple languages.  The project data is stored in an exist database and keywords appearing in the texts, such as people, places and animals, are tagged.  The routes of the travel writers can be plotted on a map interface.  Placename data comes from the www.geonames.org service and the map interface is based on Google Maps.  Routes are plotted as dots and there is also a timeline allowing the user to jump to a specific point in a tour.  A question that arose relating to the data is that places mentioned in the texts are often not definite but instead their location is an interpretation.  How should this be shown on a map?  Another crowdsourcing map-based site was mentioned during the questions: https://www.oldweather.org, which is another Zooniverse site.

For the third day I chose to attend the Building and analysing corpora session for the first parallel session.  The first paper was about creating an EpiDoc Corpus of inscriptions on stone in ancient Sicily.  This wasn’t directly relevant to anything I’m involved with but EpiDoc is TEI based and it was useful to hear about it.  The project consists of 3238 inscriptions from 135 museums. Data is presented via a Google map interface with a table showing the actual data and faceted browsing facilities.  It also has an API to allow museums to access their own specific data and use this in their own systems.  The metadata is all stored in XML and there is a Zotero based bibliographical database too.  Any changes to the record are recorded in the metadata and controlled vocabularies are used to catalogue each inscription, with these being taken from the Eagle project on Europeana.  Records are logged by material and type and history and provenance information is recorded as well.  The original data previously existed in an Access database and was exported from there.  Some of the data was very messy, for example dates or ranges of dates that were logged using codes that were often not consistently used.  Interpreted and diplomatic views of the inscriptions can be accessed, together with commentaries.  The XML data is stored in an exist database, with some information such as about museums stored in MySQL.  The data is exported as JSON and then parsed by PHP for display.

The second paper was about the Bentham Corpus, which is a well known crowdsourced corpus.  Of the original 60,000 folios 40,000 were untranscribed.  It took 50 years to transcribe 20,000 and crowdsourcing was used as a means of speeding this up.  The website launched 6 years ago and is based on mediawiki.  Users can add in simple tags representing TEI markup and a moderator checks the transcript.  There have been 16,000 transcriptions in 6 years and 514 people have transcribed something (which is rather less than I thought would have been involved).  The project had 26 ‘super transcribers’ and an average of 56 transcripts were produced a week, resulting in over 5 million transcribed words.  The transcription website can be found here: http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham.  The project is now building search facilities for the corpus, including a search index, topic models, clustering of similar words and visualisations.  The project is using a tool called ‘cortext manager’ to handle lexical extraction, clustering and visualisation.  Visualisations are also being generated using Gephi, and these are exported and made interactive using the Sigma.js library (something I really must look into a bit more).  Several visualisations and maps that the project has produced can be found here: http://apps.lattice.cnrs.fr/benthamdev/index.html

For the second parallel session of the day I attended the session on visualisations.  The first speaker talked about William Playfair, who I’d never heard of before but really should have been aware of, as he was one of the first people to visualise data and even invented a number of chart types, such as the bar chart and the pie chart.  The speaker’s project was about recreating some of Playfair’s original visualisations using D3.  This was fascinating to hear about.  The second paper concerned visualising ontologies, specifically relating to the correspondence of the astronomer Clavius.  Terms found in the 330 letters were manually extracted and developed into an OWL based ontology using the Protégé ontology editor.  This consisted of 106 classes in 4 hierarchical levels.  The project developed visualisations of the ontology to make analysis easier for both experts and non-experts.  The visualisations consist of node-link diagrams that are made to resemble hand-made diagrams.
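
Going back to the Playfair paper, recreating one of his charts in D3 is a nice little exercise, so here is a minimal bar chart sketch in TypeScript.  The import figures are made up and this is not the project's actual code.

```typescript
import * as d3 from "d3";

// Sketch: a minimal D3 bar chart of the kind Playfair drew by hand, here with
// made-up import figures rather than the project's actual data.
const data = [
  { country: "Ireland", imports: 120 },
  { country: "Russia", imports: 95 },
  { country: "Portugal", imports: 60 },
];

const width = 400, height = 200, margin = 30;

const x = d3.scaleBand<string>()
  .domain(data.map((d) => d.country))
  .range([margin, width - margin])
  .padding(0.2);
const y = d3.scaleLinear()
  .domain([0, d3.max(data, (d) => d.imports)!])
  .range([height - margin, margin]);

const svg = d3.select("body").append("svg").attr("width", width).attr("height", height);

// One bar per country, scaled to the import figure.
svg.selectAll("rect")
  .data(data)
  .enter()
  .append("rect")
  .attr("x", (d) => x(d.country)!)
  .attr("y", (d) => y(d.imports))
  .attr("width", x.bandwidth())
  .attr("height", (d) => height - margin - y(d.imports))
  .attr("fill", "steelblue");

svg.append("g").attr("transform", `translate(0,${height - margin})`).call(d3.axisBottom(x));
svg.append("g").attr("transform", `translate(${margin},0)`).call(d3.axisLeft(y));
```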

The third paper was the second one I saw that analysed the JFK / Nixon campaign texts.  This one used something called ORATIO, which looked like a very nice tool.  Unfortunately I can’t seem to find any link to it or further information about it anywhere.  But the project looked at people, places, concordances and affinity in the speeches.  These were plotted on a map and on a timeline and the affinity view was rather nice – lines on the left and right representing JFK and Nixon and bubbles in the middle representing terms or people with their position showing whether it had more of an affinity to one or other speaker.

The fourth paper demonstrated a desktop based text viewer that could ‘zoom into’ a text to show more detail at different levels.  Much as with a Google Maps interface, more features become visible as you zoom in.  The idea appears to be to allow for scalable reading, from distant reading to close reading, although I’m not entirely sure how it could be used practically.  The final paper demonstrated some visualisations of networks of literary salons in Mexico City.  The project extracted data using a Python library called ‘Beautiful Soup’ and built visualisations using Gephi and the Sigma.js library.

For the final parallel session of the final day of the conference I decided to stick with visualisations, and this was a fun session to end the sessions with.  It mostly concerned 3D visualisations, games and virtual tours, which are not really directly related to anything I do at the moment, but it was great to see all of the demonstrations.  The first speaker talked about an educational videogame that he is creating about colonial Virginia and the slave trade.  The second speaker gave an overview of some of the more immersive technologies that are now available, such as VR and domes that have images projected onto them.  The third speaker discussed a specific project that is creating a 3D representation of an Ottoman insane asylum while the fourth gave a demo of a WebGL based interactive tour through a German cathedral that looked very nice.

So, that’s an overview of all of the parallel sessions I attended at DH2016!  It was an excellent conference and I feel like I’ve learned a lot, especially on the mapping and visualisation fronts.  In addition, it was great to visit Krakow as I’d never been.  It is a beautiful place and I would love to go back some day and do some more exploring.