
Month: July 2019
Week Beginning 22nd July 2019
I was on holiday last week and had quite a stack of things to do when I got back in on Monday. This included setting up a new project website for a student in Scottish Literature who had received some Carnegie funding for a project and preparing for an interview panel that I was on, with the interview taking place on Friday. I also responded to Alison Wiggins about the content management system I’d created for her Mary Queen of Scots letters project and had a discussion with someone in Central IT Services about the App and Play Store accounts that I’ve been managing for several years now. It’s looking like responsibility for this might be moving to IT Services, which I think makes a lot of sense. I also gave some advice to a PhD student about archiving and preserving her website and engaged in a long email discussion with Heather Pagan of the Anglo-Norman Dictionary about sorting out their data and possibly migrating it to a new system. On Wednesday I had a meeting with the SCOSYA team about further developments of the public atlas. We agreed on a few more requirements and discussed timescales for the completion of the work. They’re hoping to be able to engage in some user testing in the middle of September, so I need to try and get everything completed before then. I had hoped to start on some of this on Thursday, but I was struck down by a really nasty cold that I still haven’t shaken off, which made focussing on such tricky tasks as getting questionnaire areas to highlight when clicked on rather difficult.
I spent most of the rest of the week working for the DSL in various capacities. I’d put in a request to get Apache Solr installed on a new server so we could use it for free-text searching, and thankfully Arts IT Support agreed to do this. A lot of my week was spent preparing the data from both the ‘v2’ version of the DSL (the data outputted from the original API, but with full quotes and everything pre-generated rather than being created on the fly every time an entry is requested) and the ‘v3’ API (data taken from the editing server and outputted by a script written by Thomas Widmann) so that it could be indexed by Solr. Raymond from Arts IT Support set up an instance of Solr on a new server and I created scripts that went through all 90,000 DSL entries in both versions and generated full-text versions of the entries that stripped out the XML tags. For each set I created three versions – one that was the full text, one that was the full text without the quotations and one that was just the quotations. The script outputted this data in a format that Solr could work with and I sent this on to Raymond for indexing. The first test version I sent Raymond was just the full text, and Solr managed to index this without incident. However, the other views of the text required working with the XML a bit, and this appears to have introduced some issues with special characters that Solr doesn’t like. I’m still in the middle of sorting this out and will continue to look into it next week, but progress with the free-text searching is definitely being made and it looks like the new API will be able to offer the same level of functionality as the existing API. I also ensured I documented the process of generating all of the data, from the XML files outputted by the editing system through to preparing the full text for indexing by Solr, so next time we come to update the data we will know exactly what to do. This is much better than how things previously stood, as the original API is entirely a ‘black box’ with no documentation whatsoever as to how to update the data contained therein.
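Just to illustrate the sort of processing involved (rather than the actual scripts, which are specific to the DSL’s schema), here’s a rough Python sketch of generating the three text versions from an entry’s XML and packaging them as a Solr-style document. The `<cit>` element for quotations, the entry markup and the JSON output format are all assumptions made up for the example.

```python
# Minimal sketch: generate the three full-text versions of a dictionary entry
# for Solr indexing. The <cit> element for quotations and the output format
# are illustrative assumptions, not the real DSL schema.
import json
from lxml import etree

def text_versions(entry_xml):
    """Return (full text, text without quotations, quotations only)."""
    root = etree.fromstring(entry_xml)
    full = ' '.join(' '.join(root.itertext()).split())

    quote_words = []
    for cit in root.findall('.//cit'):           # assumed quotation element
        quote_words.extend(' '.join(cit.itertext()).split())
        cit.getparent().remove(cit)              # strip quotes for the second version

    no_quotes = ' '.join(' '.join(root.itertext()).split())
    return full, no_quotes, ' '.join(quote_words)

entry = '<entry id="snaw123"><def>Snow.</def><cit><q>The snaw lay deep.</q></cit></entry>'
full, no_quotes, quotes = text_versions(entry)

# One Solr document per entry, ready to send to an update handler
doc = {'id': 'snaw123', 'fulltext': full, 'fulltext_noquotes': no_quotes, 'quotes': quotes}
print(json.dumps([doc], indent=2))
```

It’s at this tag-stripping stage that special characters can easily sneak through in an unexpected form, which is presumably where the problems with the quotation versions are coming from.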
Also during this time I engaged in an email conversation about managing the dictionary entries and things like cross references with Ann Ferguson and the people who will be handling the new editor software for the dictionary, and helped to migrate the email part of the DSL domain to the control of the DSL’s IT people. We’re definitely making progress with sorting out the DSL’s systems, which is really great.
I’m going to be working for just three days over the next two weeks, and all of these days will be out of the office, so I’ll just need to see how much time I have to continue with the DSL tasks, especially as work for the SCOSYA project is getting rather urgent.
Week Beginning 8th July 2019
This week I attended the DH2019 conference (https://dh2019.adho.org/) in Utrecht, the first time I’ve attended the annual Digital Humanities conference since it was held in Krakow in 2016. It was a hugely useful conference for me, and it was especially good to attend some of the geospatial sessions and learn about how other people are producing map-based systems. As usual it was a very busy conference, with many hundreds of attendees and numerous parallel sessions on each of the three main days of the event. The conference venue was rather difficult to navigate around at first, with sessions being split over many different floors with inadequate signposting and a pretty useless map that didn’t even have half of the rooms noted on it. However, after some initial confusion I managed to get the hang of the layout.
I travelled out to Utrecht on the Monday afternoon, and did a bit of work and explored the city on Tuesday morning, before attending the opening ceremony and lecture on Tuesday afternoon. There were apparently 450 scholarly contributions to the conference, and of all the submitted material 35% of short papers, 40% of long papers and 55% of posters were accepted.
The first plenary was given by Francis Nyamnjoh of the University of Cape Town, who gave a very interesting talk about complexities, incompleteness and compositeness of being, and how these themes resonate with Africa and its culture. He talked about notions of incompleteness in the stories of Africa, citing ‘The Palm-Wine Drinkard’ by Amos Tutuola, which involves (amongst other things) a supernatural skull who becomes a composite being, borrowing body parts from others in order to ensnare a beautiful woman, and how the skull acknowledged his indebtedness to the people he borrowed parts from. The speaker pointed out that it’s the same with DH – we are composite and need to trace our debts and indebtedness in who we are and what we have come from.
The speaker also pointed out that technology is in many ways similar to the concept of ‘Juju’ in African culture – ‘magical’ devices that enhance our abilities and extend us with godlike qualities such as omnipotence. The speaker stated that we can gain a lot from a DH that is truly inclusive, and that we should see incompleteness as a normal way of being. We should see DH as more of a compositeness of being and understand that it’s not negative to be incomplete, and that DH should speak the language of conviviality and take in from all disciplines to make us better.
The following day was the first full day of the event. The first session I chose to attend was ‘Cultural Heritage, Art/ifacts and Institutions’, which comprised three papers. The first paper was ‘I-Media-Cities: A Digital Ecosystem Enriching A Searchable Treasure Trove Of Audio Visual Assets’, which presented an overview of a Horizon 2020 project about preserving and displaying the collections of 9 video archives across Europe. The presenter relied a bit too heavily on playing video clips rather than talking, which I’m not at all keen on, but the gist of the project is that they’ve built systems to automatically analyse film in order to detect things in it, such as people and buildings, and also do more complicated things such as detecting the gender of people and their jobs based on their clothes and context. The project’s platform is accessible at https://www.imediacities.eu/ and in addition to allowing the project team to both manually and automatically tag content, general users of the website can engage in such activities too.
The platform hosts 9000 images and 750 videos. When contents are uploaded the system automatically processes them to identify camera movements such as zoom and pan, segments videos into scenes, and identifies objects and elements shown. The object detection tool identifies people, means of transport and architectural elements (squares, fountains) and assigns confidence ratings by means of colour coding – e.g. lower than 50% is red.
The automated processing took almost 15,000 hours of computing time for about 10,000 items. The project discovered that manual annotations find more terms – about 2000 as opposed to 78 elements for automatic – but automatic processes made 422,000 annotations as opposed to 29,000 manual. The algorithms were very good at identifying certain things (e.g. ties are easy to spot as they are always found underneath a head) and the algorithms had a 70% reliability rate.
The tags were given 2 colours to differentiate between manual and automatic, and a further colour will be added for public tags. The project needed to undertake frame by frame shot revision using a tool for the manual correction of inaccuracies, and contents were also geotagged so they can be visualised on maps.
The project also created a virtual collection creator tool for simulating an exhibition in 3D with a first person view, generated using Blender and Blend4Web. It is very nice to be able to build your own collection and then explore it in a 3D gallery, but I’m not convinced that it serves much practical purpose. The project finished in April and the platform will be sustained, with more archives joining.
The second paper was ‘More Than Just CG: Storytelling And Mixed Techniques In An Educational Short Movie’, which turned out to be presented by the same presenter as the first paper. The presenter explained that computer graphics help to contextualise and explain different cultural environments. They can be used to show architectural reconstructions of the past – filling in the gaps – which helps people to understand things. They can also be used to bring in an emotional narration via storytelling and capture the audience’s attention. The speaker stated that live actors are still often preferred, mixed in with CG landscapes. 3D animated characters are still limited in museums due to high costs. For example, it took the project 3 months of work for one character and 6 months of work for another. However, CG is very effective, especially with younger audiences. The project produced a CG video for the Museo Delle Terre Nuove in Italy (http://www.museoterrenuove.it/). The museum wished to engage more with younger visitors so developed the CG video to broaden its appeal. The project produced a short animated educational video, mixing live shots taken by drone with CG and other techniques. It used ESRI CityEngine (https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview) to turn 2D GIS data into 3D models, added in 3D CG with Blender, and used CAD files and city plans. The speaker pointed out that there is a certain coldness related to procedural modelling, so they also used traditional techniques such as mixing in 2D drawings on 3D planes to render flowers and suchlike. They used mostly free textures that are available on the web, but made some of their own, such as ones for Tuscan bricks. The speaker noted that human figures were difficult, and that crowds and secondary characters were added in using the Crowdmaster addon for Blender. Dynamic birds and other such things were also added in using Blender. Google Earth was also used when zooming out, showing structures that are still visible today. It took 6 people 1 year to make a 10 minute clip, which used 15.3Gb of Blender files and 150,000 core hours to process on a supercomputer.
The final paper in the session was ‘Visualizing Networks of Artistic Ideas in History Paintings in the Seventeenth-Century Netherlands’, which looked at how an artist chooses subject and style and how this can be demonstrated through network analysis. The speaker pointed out that art historians are sceptical about network diagrams in their field, and that network diagrams had previously just focussed on social networks rather than the works themselves. The speaker’s project was instead about the work – the links between artists based on subject matter. It was based on RKDImages (https://rkd.nl/en/collections/explore) – a collection of 150,000 paintings tagged with the Iconclass iconographic notation system. The project created a network with 2 modes – artist and subject matter.
For example, there is a high level of separation between artists painting Diana bathing and Diana hunting – few do both during their lives. Bathing scenes tend to have lots of nudes while hunting scenes mostly involve the painting of animals, so different painting skills are required for these types. Using the data it is then possible to look at links between artists – those who painted both compared to those who only painted one – and then measure the network of ideas.
The speaker pointed out that ‘betweenness’, which measures a node’s role as a bridge between other nodes, is not relevant to this study as we’re just looking at subject matter. However, the resulting networks can be interesting – e.g. the size of the network or the network’s diversity. The degree of network connectivity can show which artists painted which things and how many. It’s possible to identify painters who painted lots of different things – trend setters who were then followed by other artists – or who just painted what was popular. The visualisations can also show clusters based on subject matter – groups of artists who paint the same things.
The speaker demonstrated how the network diagrams can be used to show change in Dutch art over time – showing how new subjects appear and existing ones drop off, and how most new subjects peak during a period of explosive growth. In 1575-1610 there are different techniques but similar subjects. In 1610-1650 the size of the network tripled. This is the pinnacle of the golden age, and the density of the network tripled as well. Four or five main painters, such as Rembrandt and de Witte, share the same level of connections in different cities. Then in 1650-1700 the size and density shrink. The speaker pointed out that there are also changes in subject matter over time – the network of iconography. From the first to the second period this rises from 35 to 75, and then in the final period shrinks to 57. However, the density and interactivity increase dramatically – everyone is painting everything. The speaker pointed out that there is also geographical variation, with Amsterdam being the biggest in the network, and many artists also appear in different groups. Artistic choice changes with the different stages too, but the choice of subject matter seems unrelated to the choice of style. The speaker pointed out that the study was restricted to history paintings as these are tagged very well in Iconclass – paintings that depict a story, such as Old Testament, New Testament and mythology scenes. The study needs to be broadened out to other types of painting. The speaker also pointed out that Iconclass is very detailed and few know how to use it well. Image recognition might work better to pick out any aspects of the image rather than just iconography. A difficulty with Iconclass is that different people can use slightly different Iconclass categories to tag the same sort of scene.
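Out of interest, this sort of two-mode (artist / subject) network is very easy to sketch with the networkx library in Python. The painters and subjects below are just invented placeholders rather than anything from RKDImages, so it’s only an illustration of the technique.

```python
# Sketch of a two-mode (artist / subject) network of the kind described,
# using invented example data rather than the RKDImages records.
import networkx as nx
from networkx.algorithms import bipartite

edges = [
    ('Rembrandt', 'Old Testament scene'),
    ('Rembrandt', 'Diana bathing'),
    ('de Witte', 'Old Testament scene'),
    ('Painter X', 'Diana hunting'),      # hypothetical painter
]

G = nx.Graph()
artists = {a for a, s in edges}
subjects = {s for a, s in edges}
G.add_nodes_from(artists, bipartite=0)
G.add_nodes_from(subjects, bipartite=1)
G.add_edges_from(edges)

# Network density and per-artist degree (how many subjects each artist painted)
print('density:', nx.density(G))
print('degrees:', {a: G.degree(a) for a in artists})

# Project onto artists: two artists are linked if they share a subject
artist_net = bipartite.weighted_projected_graph(G, artists)
print('shared-subject links:', list(artist_net.edges(data=True)))
```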
The next session I attended was ‘Space Territory GeoHumanities’ and the first paper was ‘Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922’. The project started a year ago and its goal was to extract addresses from directories published between 1839 and 1922. The directories are almanacs that list people in alphabetical order together with their activities and addresses. The almanacs were first published in the 1650s and have been published annually ever since. Using them it is possible to geographically place many other kinds of documents. The project digitised 83 years of almanacs totalling some 4 million addresses. The pages were OCRed and transformed into a database. It’s similar to a project undertaken in New York that digitised 40 million addresses, and also the Layers of London project which did 5 million addresses on 5,000 streets. The Paris project covers 35,000 streets and 27,000 pages of OCR. The pages are always structurally identical, with name, job and street number separated by commas and arranged into 3 columns on the page. The project used dhSegment (https://dhlab-epfl.github.io/dhSegment/) to split the images into columns and used Google Cloud Vision (https://cloud.google.com/vision/) to process the images and handle OCR.
The project did encounter some problems due to the books being large and the text being curved at the edges of the page. There were also a lot of special characters, text in blocks with borders and other non-standard text. The project experienced an error rate of between 5 and 15%. The separated data was split into a table with columns for names, addresses and jobs. Extracted data was passed through error checking processes to ensure the data matched certain patterns – e.g. the number of commas. Data was then visualised on vectorised maps, with given street names matched up with official names (e.g. Louvois became Rue de Louvois). As roads and places change all the time the project needed accurate road maps for each year, which were taken from https://alpage.huma-num.fr/. The project developed a tool to explore the data on maps for a specified year, presenting data as clusters that expand as the user zooms in. Users can then click through to the almanac page where the address appears. There is also a timeline view which shows how districts change over time – e.g. the number of cafes increasing or decreasing over the years. The project also created graphs of common activities over time – the most popular being wine sellers, grocers, doctors and tailors. Wine selling is the most popular by far (6%), with the next biggest being grocers on 2.5%. Graphs by decade show wine sellers massively increasing into the early 20th century. The almanac reflects how the professionals described themselves – so there are lots of different words for florists. Using the tool it’s possible to plot which occupations were becoming more popular or declining over time. It’s also possible to look at certain family names and their connections to professions.
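For illustration, the comma-based pattern check the speaker described might look something like the rough Python sketch below; the directory lines are invented and the real pipeline obviously does far more than this.

```python
# Sketch of the comma-based validation described for the OCRed directory
# entries; the lines themselves are invented examples.
import csv
import sys

lines = [
    'Dupont, marchand de vin, 12 rue de Louvois',
    'Martin, épicier, 4 rue Saint-Denis',
    'Garbled line without the expected structure',
]

writer = csv.writer(sys.stdout)
writer.writerow(['name', 'job', 'address'])

for line in lines:
    parts = [p.strip() for p in line.split(',')]
    if len(parts) != 3:                  # pattern check: expect exactly two commas
        print(f'REJECTED: {line}', file=sys.stderr)
        continue
    writer.writerow(parts)
```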
The next paper in the session was ‘Capturing the Geography of 1900s Britain as Text: Findings from the GB1900 Crowd-Sourced Gazetteer Project’, which was especially interesting to hear about as I’ve worked with the NLS maps people who were heavily involved in this project. The project used crowdsourcing to extract names from the 6” OS maps to make a gazetteer. The speaker pointed out how difficult it is to get funding for crowdsourcing projects, stating that this project didn’t receive any funding. Ideally such a project would use first edition maps, but the first edition had different map projections for different counties, so the project used the second edition instead. The maps are 1:10,560 scale and include streets, woods and farms, all named. The project followed on from the Cymru1900 project, which was funded by the Welsh government and used software created by Zooniverse. GB1900 was based on the same software but replaced the maps of Wales, which came from a commercial source, with a new mosaic for the whole of GB that was created by the NLS. The project just needed to tweak the interface a bit, launching in 2016 and completing in 2018. The speaker displayed a graph of the time it took to get content. For Wales contributions tailed off after a few months, and names were transcribed only once so there was no confirmation. GB1900 lasted twice as long and got second transcriptions. There were 1,234 registered contributors, even though the project had no money for publicity. The speaker stated that social media was used to get most people engaged. They promoted engagement through a leaderboard, with access to maps for top contributors. Most of the work was done by a few people, with the top 150 contributors doing the most. The project didn’t record how long people were logged in for, but tracked the time they spent doing transcriptions. The project estimated that 21,359 hours were spent, equalling around 577 full-time weeks. The project interviewed 5 contributors. There were a substantial number of women contributors and lots of retired people. One contributor did almost 300,000 transcriptions, totalling some 990 hours. With crowdsourcing projects people generally want to contribute to science, but with this project the desire was to contribute to historical and geographical knowledge. Contributors said that the process was enjoyable or even addictive. People often transcribed places that had some connection to them – family origins, family holidays, work. People weren’t just making an abstract contribution but had a personal connection to the places.
The resulting gazetteer included 2.5 million transcriptions, but many of these are not place-names, e.g. ‘F.P’ for footpath. If these are filtered out then there are about 1 million place-names. Compared to other open data gazetteers of Britain this is very large: Geonames has 63,000 names and the NGA has 32,444. The OS has a very detailed gazetteer – 450 million features – but it is not open data and is very expensive. The OS released about 250,000 names in 2010 and this has since been replaced by OS OpenNames in 2017, which features nearly 3 million entries, but 58% of these are postcodes and 31% are street names. DEEP was a place-names project funded by JISC from 2011-2013, covering a large amount of England and Wales, but its website http://placenames.org/ disappeared last July. The underlying data can be found on the EPNS website and also on the JISC website, where there are 66 files with almost 10 million lines of MADS XML and no documentation. The speaker extracted these and created a database. There are about 500,000 entries but 70% are field names, with no source and no coordinates. The speaker worked to match the GB1900 and DEEP data. The gazetteer data can be downloaded from http://visionofbritain.org.uk/data/#tabgb1900.
The final paper in the session was ‘Active Learning from Scratch in Diverse Humanities Textual Domains: Optimizing Annotation Efficiency for Language-Agnostic NER’. This paper concerned entity recognition in the spatial humanities – e.g. the extraction of place-names from text. The speaker has written a tool that can be used to automatically extract such data (see https://github.com/Alexerdmann/HER). The speaker pointed out that entity extraction is complex and expensive, especially in language varieties or literary texts where information is unstructured or semi-structured. The tool recognises and classifies entities and then links them to knowledgebases, then evaluates the recognition. The tool needed to be granular and customisable, low cost in terms of time, and reusable. It also needed to be free and open. It uses ‘active learning’ – training the classifier by intelligently picking the best examples to label. The tool uses CRFsuite (http://www.chokkan.org/software/crfsuite/) and needs Python 3 to run. You manually annotate entities in one text and the tool trains on this. It then runs through the corpus and analyses how it can increase the performance. It has been used on a few projects, including a project to extract thousands of proper names from medieval French texts. The tool is just a command-line tool for now – no GUI.
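I haven’t used the HER tool itself, but the underlying CRF approach is easy to prototype in Python with the sklearn-crfsuite wrapper around CRFsuite. The rough sketch below uses an invented two-sentence training set just to show the shape of the technique (token features in, BIO labels out) – it isn’t anything from the speaker’s tool.

```python
# Generic sketch of CRF-based named entity recognition (not the HER tool
# itself): token features in, BIO labels out. Training data is invented.
import sklearn_crfsuite

def features(sent, i):
    word = sent[i]
    return {
        'lower': word.lower(),
        'is_title': word.istitle(),
        'prev': sent[i - 1].lower() if i > 0 else '<s>',
        'next': sent[i + 1].lower() if i < len(sent) - 1 else '</s>',
    }

sents = [['He', 'rode', 'to', 'Paris'], ['She', 'left', 'Rouen', 'quickly']]
labels = [['O', 'O', 'O', 'B-PLACE'], ['O', 'O', 'B-PLACE', 'O']]

X = [[features(s, i) for i in range(len(s))] for s in sents]
y = labels

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)

test = ['They', 'sailed', 'to', 'Marseille']
print(crf.predict([[features(test, i) for i in range(len(test))]]))
```

The active learning part would then sit on top of something like this, repeatedly picking the unlabelled sentences the model is least sure about and asking for annotations of those first.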
For the next session I returned to ‘Cultural Heritage, Art/ifacts and Institutions’, this time a session of five short papers. The first was ‘A World of Immersive Experiences: The Scottish Heritage Partnership’, which is based in Information Studies at Glasgow. Unfortunately, it wasn’t really a proper presentation but was instead just a video, which is not really the best way to engage with a conference audience, and I didn’t really learn much from it. The second paper was a bit more interesting, however. The paper was ‘Atlantic Journeys through a Dutch City’s Past: Building a Mobile Platform for Urban Heritage’ and described an Android / iOS app that had been built to navigate through the history of the small Dutch city of Groningen. The app is guided by animated historical figures and tracks users as they wander around the city, to see what users find the most interesting. It was aimed at tourists, school children and locals and includes lots of history linked to the Dutch West India trade. It includes a golden-age map of the city with each separate house on it. Users can follow routes through the city which provide links to historical images and texts. It also includes an augmented reality feature allowing users to hold up their phone and see the historical images within the context of the modern city and even take selfies this way. It includes life stories of important historical figures including pirates, a very pleasing zoom and pan interface and even audio clips of 17th century poetry. The team are currently working on a follow-up app called ‘Atlantic Stories’.
The next paper was ‘Close-Up Cloud: Gaining A Sense Of Overview From Many Details’ and demonstrated an experimental visualisation interface for museum images. It was based on a collection of glass-plate negatives made around 1900 to document the inventory of the museum, not originally for public use. These have now become museum objects themselves. The team visited the museum and saw the images on a light table, and wanted to be able to transfer this experience to the digital and guide users through the details of the images. They didn’t want to merely translate the experience to the digital but to add more that can’t be experienced in the analogue. Museum staff had added text to the images and this was very valuable in identifying iconographic details, through which it was possible to map connections between objects. The team decided to cut the images into iconographic details in order to guide eyes to details. They developed an interface with 3 views. First of all there is a cloud view showing an overview of the collection’s iconography as thumbnails, rather like a ‘word cloud’ but for images – larger images have a higher frequency. This is an overview but is also a close viewing of the details of the collection. It features automatic image replacement, meaning every minute or so the images used to represent particular items are replaced, resulting in an almost infinite number of collages. For example the ‘angel’ concept rotates between different images that feature angels. Clicking on an image then loads the second view – ‘tag view’ – which presents a grid of all images where the selected concept is tagged. Hovering over one image highlights all of the images from the same object and clicking on an image loads the third view – ‘object view’ – which shows the full image, all tags applied to the image and all accompanying metadata. When viewing the full image it’s possible to hover over a tag to show the parts of the image that match – other parts fade away, which is a very nice feature. Feedback from use of the interface has been very positive, and it’s seen as a good way to engage lay-people as no prior knowledge is needed and it’s easy to browse through the collections. The website is still just a prototype so no link is available yet. It’s mostly Javascript based, and all areas of the images were manually tagged using the HTML canvas element. It also uses Iconclass for its controlled vocabulary.
The fourth paper in the session was ‘Creating Complex Digital Narratives for Participatory Cultural Heritage’ and was presented by a PhD student who is looking at cultural heritage tourism and how to incorporate interactive components, from audio guides to exhibits. Such components tend to have a ‘top down’ approach or one-way communication – created by experts for users – which can exclude certain social groups. There are digitised archives too – databases for memory institutions that are used to maintain their content but also to engage with the public. There is much narrative potential, and knowledge-building processes have been democratised by the digital, but what is the best method for creating stories of the past and incorporating user-generated content? The speaker suggested that the best approach would be to combine the top-down with user-generated content from the bottom up – ‘narrativizing cultural heritage’. The speaker compared this to pre-digital ‘choose your own adventure’ books, and discussed how interactive digital narratives of many types, from interactive exhibitions to video games, can be used to enable the user’s choices to change the storyline. We have maps, paintings, art, websites, blogs and books, and we have user-generated content, e.g. TripAdvisor, social media, blogs, oral histories, genealogy and ancestry. There are many perspectives, but where to start? The speaker suggested creating a ‘mothership’ interactive digital network to connect it all together, creating topic models from this multimodal content and generating ‘protostories’ in the hope of uncovering ‘lost’ stories and including multiple perspectives.
The final paper was ‘Sustaining the Musical Competitions Database: a TOSCA-based Approach to Application Preservation in the Digital Humanities’ and was more of a digital preservation paper than a DH paper. The speakers pointed out that all DH projects use software – web pages, APIs, etc. – that produce something that is generally left somewhere online. All websites age and can break, so how best to maintain these heterogeneous projects? The speakers suggested three approaches: getting someone to continue to maintain the resource via a service level agreement, migrating the content to another system, or archiving it. The speakers discussed the SLA approach, using the TOSCA standard, an OASIS standard described in XML as a means of managing different systems effectively (see https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=tosca). The speaker discussed the OpenTOSCA tool – a modelling tool that can be used to install and manage systems in a runtime environment – and gave as one example a musical compositions database. A system is broken down into separate components that can be split over multiple stacks and swapped out as required.
For the final session of the day I attended the ‘Language, Languages’ session, which featured our paper on the Historical Thesaurus. The first paper of the session was ‘A Predictive Approach to Semantic Change Modelling’. The speaker discussed how word meaning changes over time, but why and how is still an active research question, and a complex one as it involves phonetic, syntactic and semantic change. There are now computational models of linguistic change that look at large diachronic corpora. It’s possible to model the time dimension and train models on specific decades and then compare the models to see the change. It’s possible to empirically assess semantic change, e.g. the cosine distance between embedding vectors, and derive ‘laws’ of semantic change. The speaker investigated whether it’s possible to learn word embeddings by decade and predict what might happen in the next decade – not just representing what’s going on but predicting change too. The project used the Google Ngram corpus split into decades from 1800-1999 and used a neural network to predict change. It assesses prediction quality based on the current situation. For the top 100 words that feature a considerable semantic shift it has a 41% accuracy rate looking at neighbour words – words in context. However, if the system just looks at the most frequent words then accuracy is much better – for the top 1000 words the accuracy is 91%. However, most words don’t change meaning, so the stable words push the accuracy level up.
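The cosine-distance measure mentioned is simple enough to jot down; here’s a quick numpy sketch with made-up vectors standing in for a word’s embedding in two different decades (not real Google Ngram embeddings).

```python
# Cosine distance between a word's embedding in two decades (toy vectors,
# not real Google Ngram embeddings).
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vec_1900s = np.array([0.2, 0.9, 0.1])   # embedding of a word trained on the 1900s slice
vec_1990s = np.array([0.7, 0.3, 0.6])   # embedding of the same word for the 1990s slice

# A large distance suggests the word's contexts (and so its meaning) have shifted
print(round(cosine_distance(vec_1900s, vec_1990s), 3))
```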
The second paper in the session was ‘Applying Measures of Lexical Diversity to Classification of the Greek New Testament Editions’. This was a very technical paper and many of the details unfortunately went over my head. It was about lexical diversity in New Testament texts, using measures that relate to text length – Yule’s K, the Z constant. These were applied to the Greek New Testament, looking at varying words and punctuation. Ten editions of ten books of the New Testament were compared in order to investigate attribution. The speaker looked at the correlation between tokens and types and the punctuation ratio based on several algorithms, using decision tree and random forest techniques in Python.
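Yule’s K at least is easy to write down – it’s designed to be largely independent of text length – so here’s a small Python implementation of the standard formula, with a toy example rather than any of the Greek New Testament data.

```python
# Yule's K, a length-robust measure of lexical diversity:
# K = 10^4 * (sum_i i^2 * V(i) - N) / N^2
# where V(i) is the number of word types occurring exactly i times
# and N is the total number of tokens.
from collections import Counter

def yules_k(tokens):
    n = len(tokens)
    freqs = Counter(tokens)                       # type -> frequency
    vi = Counter(freqs.values())                  # frequency -> number of types
    s2 = sum(i * i * v for i, v in vi.items())
    return 10000 * (s2 - n) / (n * n)

text = 'in the beginning was the word and the word was with god'.split()
print(round(yules_k(text), 2))
```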
The third paper was ‘VinKo: Language Documentation Through Digital Crowdsourcing’. The speaker discussed a project that gathered oral linguistic data in Northern Italy, which was part of the AThEME project (https://www.atheme.eu/). It asked participants to speak into a microphone rather than writing down their linguistic varieties, and the entire project was crowdsourced. The project’s advice was to keep the questionnaires short and fun and to record not just what variety speakers have, but what village they’re from too. The project looked at Germanic varieties, e.g. Tyrolean in the north, and Germanic language ‘islands’ that are minorities, and how these developed differently from Tyrolean, as well as Romance varieties, the main one being Trentino. The project produced a map showing where people are from to demonstrate how pronunciation changes between areas. The first version of the data gathering tool is now offline, and the results (phonological and syntactical) will be published in October. The speaker pointed out that getting people to do their own recordings using their own smartphones resulted in variable quality – of about 200 questionnaires only about 100 were usable, but some of the others were still fine for morphology and syntax.
The fourth paper was ‘(Re)connecting Complex Lexical Data: Updating the Historical Thesaurus of English’, which was presented by Fraser. It all went fine and I won’t go into any detail about it here as I would just be repeating stuff that I’ve already spent far too long discussing in other posts. The final paper of the day was ‘A Transcription Portal for Oral History Research and Beyond’, which discussed a new online tool that is being prepared to help with the task of transcription. The speaker stated that interviews are an essential part of oral history and lots of recordings have been made over time. The task of transcription requires a lot of manual labour and a CLARIN workshop in 2017 investigated whether automatic speech recognition could help. The speaker identified 5 stages that such a tool would need to fit in with: digitising the material, transcribing with automatic speech recognition, correcting the transcription, time alignment (synchronising the audio and transcription), and adding metadata.
The target groups for such a tool are historians and linguists, but interviews are an essential building block of research in many areas beyond these – mental health experts, law scholars, social economists etc. The transcription portal handles the whole process: data upload, ASR, manual correction, forced alignment and metadata. It currently works for English, Dutch, Italian and German, and more information can be found at https://oralhistory.eu/.
The following morning I attended the session on ‘Society, Media, Politics, Engagement’ to start with, and the first paper was ‘Quantifying Complexity in Multimodal Media: Alan Moore and the “Density” of the Graphic Novel’. The speaker presented research that was part of a four-year project running until 2020, which has also resulted in image annotation software that was demonstrated on a poster. However, the current paper discussed stylometry, something that is more commonly associated with literary texts but here meaning ‘visual stylometry’, which first began in digital art history, for example selecting brush strokes in the top half of a painting to try and identify the painter, rather like the use of ‘stop words’ as a means of identifying an author. There have been other studies, such as frame-to-frame changes in visual activity in Hollywood cinema, excluding sound and text, or a thematic study of The Wire based on audio content via subtitles, looking at character speech. The current project is looking at the visual style of book-length graphic novels – looking at genre classification, where the technique worked well, and author identification, which worked less well.
The project compared the style of different genres of graphic novels based on visual styles, looking at density, by which they mean ‘complexity’, although these are not quite the same. They looked at density as dynamics of complexity and how this ebbs and flows throughout a text. The speaker mentioned the graphic novels Maus and Watchmen and how critical appreciation appears to relate to complexity, but are they really linked? The speaker mentioned the Bonn bibliography of comics research (http://www.bobc.uni-bonn.de/), which includes the number of citations mentioning comics. Maus and Watchmen get about a quarter of all citations.
The speaker’s project compiled a corpus of 260 graphic novels comprising 65,000 pages. From this 44 novels were chosen, due to OCR issues with the others. The speaker researched three visual measures (Shannon entropy, number of shapes and panel structure) and three textual measures (number of words, word length and normalised type-token ratio), and the research showed that graphic fantasy, including superhero comics, is the most ‘dense’, due to lively, complex visual layouts. However, if looking at textual measures alone then non-fiction is the most complex. The speaker stated that there appears to be a correlation between complexity and canonicity, and the number of citations shows a similar picture: canonical graphic novels show higher complexity. The speaker also mentioned new research involving training a neural network to identify balloons, captions and panels. This is working very well, but only for Western comics at the moment. Text recognition for comics is an ongoing challenge due to hand-drawn text: training text sample sizes are small, it is difficult to separate words out, and text and images interweave. The project used Tesseract 4, then tried Calamari OCR (https://github.com/Calamari-OCR/calamari). The Calamari results far exceeded Tesseract, but there were still issues with text line separation.
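Of the visual measures listed, Shannon entropy is the easiest to reproduce; here’s a short Python sketch of computing it over a greyscale page image with numpy and Pillow. This just illustrates the measure itself – the filename is a made-up placeholder and it’s not the project’s pipeline.

```python
# Shannon entropy of a greyscale page image, one of the 'visual density'
# measures mentioned (an illustration of the measure, not the project's code).
import numpy as np
from PIL import Image

def shannon_entropy(path):
    img = np.asarray(Image.open(path).convert('L'))     # greyscale pixel values 0-255
    counts = np.bincount(img.ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]                                         # ignore empty histogram bins
    return -np.sum(p * np.log2(p))                       # bits per pixel

print(shannon_entropy('page_001.png'))                   # hypothetical scanned page
```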
The second paper of the session was ‘Deep Watching: Towards New Methods of Analyzing Visual Media in Cultural Studies’, which discussed deep learning methods for analysing visual media. This approach is linked to Moretti’s distant reading – it automatically assesses features of the medium and undertakes quantitative analysis of it. The speaker stated that artificial neural networks have been around for about 6 years but are getting some interesting results now. ‘Distant viewing’ is a combination of basic methods, such as identifying blocks of colour, and deep learning methods, focusing more on the ‘deep’ methods, such as when autonomous driving identifies features or the automatic recognition of people from CCTV. The speaker discussed the reading of the human body as a sign, looking at cigarette cards of movie stars from the 30s and the differences in presenting male and female actors. The project used Facebook’s Detectron software (https://github.com/facebookresearch/Detectron) for pose estimation and OpenFace (https://cmusatyalab.github.io/openface/) for expressions. These tools enabled the researchers to get a frame for posture, showing the orientation of the face and the key points of the face, eyes, nose and mouth, with posture in 2D and the face in 3D. The project also used PixPlot from Yale’s DH Lab (http://dhlab.yale.edu/projects/pixplot/) as a means of displaying, navigating and zooming around images. The project was able to do things like extract smiling people and show clusters, using different colours for women and men. In terms of body posture the project didn’t find any major difference, which the researchers hadn’t expected. However, their assumptions about facial expressions were correct – with men frowning more often and closed-mouth smiling being most likely for women. Long flowing gowns were a challenge for the posture mapper, and beards in the 1920s-30s were confused with an open mouth.
The researchers also looked at representations of Stepan Bandera, a Ukrainian nationalist during the Second World War and a rather dubious figure. There has been an upsurge in the use of his image recently, during the Crimea crisis, where he’s seen either as a fascist or as a freedom fighter. The project analysed 400 YouTube videos about him, again using the Facebook Detectron framework, along with Mask R-CNN (https://github.com/matterport/Mask_RCNN), to automatically look for symbols. This tool can pick out symbols as well as faces, and picked out national symbols, fascist symbols and Russian symbols. The project manually created around 1,500 annotations on thousands of still-frame images, identifying flags and symbols in order to train the tool. The tool could then identify symbols in videos – for example identifying when symbols appear in order to find interesting parts of videos, or checking for co-occurrences – e.g. where the Ukrainian flag significantly co-occurs with a nationalist symbol.
The final paper in the session was ‘Migration and Biopolitics in Cultural Memory: Conceptual Modelling and Text Mining with Neural Word Embedding’, which looked at the representation of Irish and Jewish migrants in Britain in the 18th and 19th centuries. The project created a corpus of literature based on scanned books held at the BL and used techniques from AI to find similar terms when searching – e.g. if searching for ‘disease’ the system would come up with other alternative terms to search for. The project expected that immigrants would be associated with a fear of bringing disease, contagion etc. The project didn’t necessarily find this, but found some examples – Irish slums etc. Tuberculosis was termed a ‘Jewish disease’ at the time. The project created conceptual lexicons – seed terms, or key themes for the area of research. These led to concept mapping via neural word embedding. The system brought back all sorts of related search terms that might be useful to search for – other diseases we might not be familiar with now, for example.
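The query-expansion step described – finding the nearest neighbours of a seed term like ‘disease’ in the embedding space – looks roughly like this in gensim; the three ‘sentences’ below are invented stand-ins for the tokenised text of the scanned books.

```python
# Sketch of the query-expansion idea: train word embeddings on a corpus,
# then pull back the nearest neighbours of a seed term such as 'disease'.
# The corpus here is a placeholder, not the project's BL book data.
from gensim.models import Word2Vec

sentences = [
    ['the', 'slums', 'were', 'blamed', 'for', 'spreading', 'disease'],
    ['fear', 'of', 'contagion', 'followed', 'the', 'migrants'],
    ['typhus', 'and', 'cholera', 'swept', 'the', 'district'],
]  # in reality: tokenised sentences from the scanned books

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# Candidate search terms related to the seed term
for term, score in model.wv.most_similar('disease', topn=5):
    print(term, round(score, 3))
```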
The next session I attended was ‘Cultural Heritage, Art/ifacts and Institutions’, which was a short paper session, and the first paper was ‘Data in Museums: Digital Practices and Contemporary Heritage’. The speaker stated that ‘born digital’ data presents methodological challenges. There are many types of interactives that may need to be preserved, e.g. videogames in museums at MoMA in 2012, cartridges from the Atari dump in the Smithsonian, and representations of London in videogames. There is also social media data, which has challenges for storage and display – how to collect and display Tweets and how to talk about them as museum objects. There are also physical objects, e.g. Android devices showing games like Flappy Bird. This is all difficult to fit into existing museum practices – e.g. how do you loan objects that are digital? There are also object biographies – tracing the careers of museum things. For example, the Mona Lisa generates a huge amount of data, such as on museum websites, in aggregators, in Wikidata, and in external things like Wikipedia, online articles and social media. We should be preserving information on how an object is used and perceived throughout history, e.g. the Mona Lisa with a football shirt on – should it be French or Italian? How should museums store and visualise this information? It’s all ephemeral and we need to think about storing it now – e.g. MySpace was very popular but its data is mostly all gone now. Instagram or a Twitter feed can be like a modern visitor book – museums need to keep a record of this.
The second paper in the session was ‘OpenTermAlign: Interface Web d’alignements de vocabulaires archéologiques hétérogènes’ (a web interface for the alignment of heterogeneous archaeological vocabularies). This looked at the issue of data heterogeneity in a project that is part of ARIADNE Plus (https://ariadne-infrastructure.eu/). The starting point for the project was a cluster of archaeological databases (sites and objects) for scholarly use, looking at the digital challenges. The project chose to represent some aspects of the archaeological research as linked open data and produced a tool called Heterotoki as a means of editing and integrating the terminology (https://github.com/ponchio/heterotoki). This used OpenTermAlign (https://masa.hypotheses.org/opentermalign), SKOS and an RDF triplestore to bring data from many databases together and to allow users to discuss the semantics using the interface. There are currently 4 databases and 1,200 terms. It uses a French thesaurus containing 30,000 concepts, and they have also tested it using Iconclass. The tool includes a term-to-term concordance, e.g. if there is a definition in the source and in the target they can be matched up, even if they are different. The user interface allows the user to compare terms, record whether they are synonyms or different, and provide translations into other languages.
The third paper in the session was ‘Unthinking Rubens and Rembrandt: Counterfactual Analysis and Digital Art History’. The speaker stated that there is an over-representation of Rubens and Rembrandt in Dutch art history, and that most Dutch art is not like theirs, even from the Golden Age. Looking at WorldCat and Wikidata shows a massive focus on these two painters. The speaker wanted to look at painting as an industry in the 17th century and therefore needed to know about more painters. There were about 10,000 painters active in this period in the Low Countries. The ‘Ecartico’ website (http://www.vondel.humanities.uva.nl/ecartico/) was developed by the speaker and has information about all of these painters and their social networks. It provides facilities to visualise the social networks as network diagrams, for example showing connections between Rembrandt and other painters, collectors and dealers. This can be shown because there are huge amounts of documentation about Rembrandt – 30,000 books on him etc. – but we need more information about other painters too. It’s useful to remove Rembrandt and Rubens from the visualisations to focus more on the painters in the shadows.
The next paper was ‘Cooking Recipes of the Middle Ages: Corpus, Analysis, Visualization’, and was quite closely related to the second paper in the session, using the same technologies. The slides from the talk can be found at https://tinyurl.com/corema-dh2019. The project wanted to represent historical knowledge of food as linked open data. It looked at cooking recipes in 80 manuscripts – around 8000 recipes in Early High German, Middle French and Latin. The project wanted to trace the origin of the recipes and their relationships, showing for example the migration routes of recipes to test the idea that French cooking traditions heavily influenced German ones. The project produced a network diagram showing the relationships between recipes, and semantic web technologies were at the core of everything. RDF was used to record ingredients, dishes, instructions, tools, time, kitchen tips, and cultural and religious aspects, allowing the data to be interlinked and joined up to other RDF datasets. Wikidata and the food ontology (https://www.ebi.ac.uk/ols/ontologies/foodon) were useful starting points for the project, and it needed to be language independent as it was dealing with 3 languages. Lots of the data in these can be reused for historical recipes, with connections from cows to milk to beef etc. already present. As with the second paper, the project used Heterotoki and OpenRefine to align data – e.g. French and Austrian data.
The final paper in the session was ‘Lessons Learned in a Large-Scale Project to Digitize and Computationally Analyze Musical Scores’, and the speaker discussed some of the issues encountered when developing https://simssa.ca/ – a big project involving dozens of institutions that has funding from 2014 to 2021 and involves 125 researchers. They developed a single framework to automatically transform images of music into digital symbols using machine learning and statistical research. They made progress but also some missteps, and the speaker wanted to discuss some of these. For example, with dataset construction, combining digitised data from various sources can lead to erroneous conclusions – false patterns, or patterns obscured in datasets that don’t capture the information. Selection and balancing of the data are essential, as there can be problems if you don’t have the full range or if class distributions are uneven. There can also be encoding problems, leading to confusing data. The speaker discussed machine learning vs deep learning, and training models on huge datasets as opposed to training on hand-crafted, smaller datasets. Recently there has been an emphasis on deep learning as this has been hugely successful, but it has issues – it needs huge datasets and this can be a problem with historical data. Many deep learning systems are ‘black boxes’ and we get data without knowing how the classifier works, but it is important to know how the data has been classified or differentiated. The project used jSymbolic (http://jmir.sourceforge.net/jSymbolic.html) to extract 1500 feature values from musical scores.
For the next session I returned to ‘Space Territory GeoHumanities’, which was another short paper session. The first paper was ‘World-Historical Gazetteer’, which presented an overview of the project (http://whgazetteer.org/). It’s a linked open data resource and the project is running from 2017 to 2020. The first data is due to be released in a couple of weeks and the resource uses Mapbox / Leaflet for its map interface. It follows on from the Pelagios project, which works with the Pleiades gazetteer (see https://pleiades.stoa.org/docs/partners/pelagios and https://peripleo.pelagios.org/).
The project will allow users to follow both places and ‘traces’, with traces being records of historical entities of any kind for which the setting (location at a point in time) is of interest – people, events, works. The project is mapping and analysing the ‘where’ of historical sources. It uses the Linked Places format, which is taken from Pelagios, and will provide both a GUI and an API. Contributors will be allowed to publish data on the platform, which will be good for smaller projects. The Linked Places format is GeoJSON with ‘when’ added – timespans, periods, labels, durations. It also resolves to RDF. Contributed data records can be matched to existing records and linked through automatically – e.g. joining Byzantium and Istanbul.
When uploading a dataset users can start a reconciliation task against wikidata, augment records with geometry, pick out potential matches to existing records. The tool augments data without touching the uploaded records.
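From what I could gather, a contributed record is essentially a GeoJSON feature with a ‘when’ block attached; here’s a rough illustration written as a Python dict. The key names follow my notes from the talk and may well not match the published Linked Places specification exactly, and the place and dates are just examples.

```python
# Rough illustration of a GeoJSON feature extended with a 'when' block,
# as described in the talk. Key names follow my notes and may not match
# the published Linked Places specification exactly.
import json

feature = {
    'type': 'Feature',
    'properties': {'title': 'Byzantium'},
    'geometry': {'type': 'Point', 'coordinates': [28.98, 41.01]},
    'when': {
        'timespans': [
            {'start': {'in': '-0667'}, 'end': {'in': '0330'}}   # attested timespan
        ],
        'label': 'Byzantium before its refounding as Constantinople',
    },
}

print(json.dumps(feature, indent=2))
```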
The second paper was ‘Norse World – The Complexities Of Spatiality In East Norse Medieval Texts’. This project has a very pleasing map interface that presents data about the Norse perception of the world (see https://www.uu.se/en/research/infrastructure/norseworld). It is a 3 year project currently in its last year and is creating an online searchable index of foreign place-names in medieval Swedish and Danish texts. It uses MySQL and Leaflet, and offers an API with JSON, GeoJSON and JSON-LD outputs. Data can be downloaded as a CSV too. Data comes from Old Swedish and Old Danish texts from 1100-1515, looking at all types of text other than biblical texts. There are 215 sources, such as romances, travel stories and saints’ lives. It was all done by hand as OCR on the texts was not possible. The project extracted all ‘foreign names’, meaning anything outside of modern Sweden and Denmark. All variants of place-names are recorded, linking through to one modern English form – e.g. Egyptaland and Egyptus link through to Egypt. The project decided to just use modern borders and places, as borders and places change over time; also, a book might mention ‘Paris’ while having no clear idea where it actually was. Sometimes it was difficult to interpret or disambiguate place-names and so these were given levels of certainty: unknown, educated guess and commonly attested. For example, ‘Weronia’ could be ‘Verona’, but the context says it is a castle outside of Rome. Original forms are linked through to information on the work – description, source, materials, which edition, and links through to online versions. Non-name spatial references, e.g. adjectives like ‘German cloth’, are also tagged.
One really nice feature of the map is that it can present several searches at the same time, with colour coded results and the option to remove one or more result-sets. It’s a very nice facility and I might have to implement something similar in future.
The next paper was ‘Al-Ṯurayyā, the Gazetteer and the Geospatial Model of the Early Islamic World’, which demonstrated https://althurayya.github.io/, a very nice resource. It includes over 2000 toponyms and route sections, encoded in GeoJSON, with information about different names, languages, settlements and administrative classifications. Name variants are available in Arabic and English and are colour coded on the map. It’s possible to visualise settlements, routes, itineraries, networks etc. on the map, which also considers the connectivity between places and spatial relations. Routes are represented in GeoJSON as line strings and there are maps of connected nodes and clusters. The maps demonstrate networks as they were at the time, not modern boundaries, and different sizes of markers are used for different sizes of settlements. There is also a pathfinding feature, showing optimal paths between settlements according to different options – e.g. safety, a higher number of waystations, or the shortest route. You can specify an itinerary for a route, to plot travelogues for example, which might help pinpoint other places a writer might have stopped at but which were not mentioned. There is also a ‘network of reachability’ from a specified point – creating clusters of places reachable in a specified amount of time, which also shows the unreachable areas. This is colour coded depending on how many days it would take to reach the places. This can be used to show the influence of places, the spread of power, or to demonstrate the geographical limitations of a centre. All of the data comes from one single atlas – one snapshot of time. In future more data may be included.
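The pathfinding options map quite neatly onto weighted shortest-path searches over a route graph; here’s a small networkx sketch where changing the edge attribute used as the weight changes which route is ‘optimal’, plus a crude version of the ‘network of reachability’ idea. The places, travel times and risk scores are all invented.

```python
# Sketch of the route pathfinding idea: a graph of settlements with
# per-edge attributes, where changing the weight used changes which
# route is 'optimal'. Places, travel times and risk scores are invented.
import networkx as nx

G = nx.Graph()
G.add_edge('Baghdad', 'Kufa', days=3, risk=1)
G.add_edge('Kufa', 'Basra', days=6, risk=2)
G.add_edge('Baghdad', 'Wasit', days=4, risk=4)
G.add_edge('Wasit', 'Basra', days=3, risk=5)

shortest = nx.shortest_path(G, 'Baghdad', 'Basra', weight='days')
safest = nx.shortest_path(G, 'Baghdad', 'Basra', weight='risk')
print('shortest:', shortest)   # optimise travel time
print('safest:  ', safest)     # optimise for lower risk

# 'Network of reachability': all places within a given number of days
within_7 = nx.single_source_dijkstra_path_length(G, 'Baghdad', cutoff=7, weight='days')
print('reachable within 7 days:', within_7)
```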
The fourth paper of the session was ‘Climate Event Classification Based on Historical Meteorological Records and Its Presentation on A Spatio-Temporal Research Platform’, which was a project based in Taiwan that is tracing the impact of historical climate disasters. The project looked at meteorological records from the Qing Dynasty (1644-1795) and classified these records – looking at times, areas and event categories. The texts came from the East Asian historical climate database and comprised 36,000 records from 1647-1795. The data was preprocessed to make it suitable for machine reading, for example replacing dates with modern dates and removing place-names and punctuation. The text was transformed into vectors using word2vec (see https://skymind.ai/wiki/word2vec) and a visual, map-based interface was created that showed clusters for event types such as floods, winds and crop failures. The presenter showed the URL for the project but it was a very long link and unfortunately wasn’t on screen for long enough for me to make a note of it. The speaker demonstrated how the interface could be used to show extreme cold conditions during the period.
The final paper of the session was ‘How to Better Find Historical Photographs in an Archive – Geographic Driven Reverse Search for Photograph’. The speaker stated that there are huge numbers of historical photographs with important spatial information that can be useful for architecture, history etc., but often these can’t be used fully as the photographs are not well documented or assigned to specific geographical locations. Many depicted buildings no longer exist – e.g. due to WW2 – and photos can help to reconstruct historical landscapes. The project focussed on gaining spatial metadata for undocumented, unstructured photos. It used machine learning, then applied crowdsourcing to get more information and georeferencing, and then researchers can use the data to make 3D reconstructions of buildings and archive the data. For machine learning the project used a convolutional neural network trained on historical photographs. It achieved a 97% success rate in the classification of photos into landscapes, buildings, groups, portraits etc. The project created a website (https://photostruk.de/explore) where the photos are available, with a map interface built with OpenLayers and filter and sort options to bring back photos by category, time period etc. Using the map people can explore places that have now disappeared. Archaeological data is included and there’s a timeline to show how things change across the time period. The website was mainly designed for the local populace, to allow them to open photos and georeference them – not just the location of the photographer: they can also place object markers to show items in the photos and where on a map the camera was positioned. People can submit personal information to document the photos too, and can drag and drop markers on the photo and on the map to tag places in both, linked together. The team will then work on reconstructing buildings in 3D, although the website only launched last month so things are still at an early stage.
It was then time for the poster session. One especially useful poster I saw was about handwriting recognition. It mentioned a tool that’s available at https://transkribus.eu/Transkribus/ and discussed a handwritten text recognition (HTR) system (https://github.com/githubharald/SimpleHTR) built with TensorFlow (https://www.tensorflow.org/overview) that uses a neural network. I may have to investigate these tools further.
The next session was another plenary – the Busa lecture by Tito Orlandi titled ‘Reflections on the Development of Digital Humanities’. The speaker gave an overview of Digital Humanities (or ‘Humanities Computing Digital Humanities, HCDH’ as he called it) from its earliest days in the 1960s through to more recent times. He stated that in the early days humanities researchers were generally against the use of computers, but that now computers are ubiquitous, and there has perhaps been some regression in humanities research due to reliance on computers. The speaker stated that despite the ubiquity of computers we still need to involve a researcher for it to be DH – if it’s all automated it’s not humanities research and a DH scholar must be capable of building or verifying the computational tools they use.
At the start of the final day of the conference I attended the session on ‘Cultures Literatures and Texts’, and the first paper was ‘Operationalising Ambiguity; Mapping the Structural Forms of Comics’. The talk was given by a PhD student who presented his research on how comics produce meaning, perceiving patterns between signifiers and structures. The speaker looked at one graphic novel – ‘Fun Home’ by Alison Bechdel. It’s non-sequential, with a small number of core characters and complex text. The speaker discussed how a panel in a comic articulates time – a pregnant moment or a snapshot in time. The same conversation can be spread over as few or as many panels as an author wants; the author can ensure that a particular character speaks first in each panel, and can alter details of the background to include or exclude panels as required. The author may also include things in the background of panels that have significance later on, or ensure that the character in the role of teacher always speaks first in panels. The speaker marked up pictorial elements in the images to denote what each image signifies and to log the roles each character plays in a panel, and looked at the most frequent roles by scene. There weren’t really many technical aspects to this, with everything being marked up manually in a spreadsheet.
The next paper in the session was ‘Classification of Text-Types in German Novels’. The speaker looked at three different types of sentence – descriptive, narrative and argumentative – in order to ascertain the connection between literary subgenres and text types. The project started with the texts (bottom up) and worked out how they could be categorised, creating simple annotation guidelines that could be used by a non-expert. Sentences were tagged when they matched the description of a type – the description of a physical object, the representation of a chain of events, or the explanation and justification of an abstract idea. An annotation tool presented people with each sentence, with one sentence before and after for context, and enabled them to tag the sentence as one or more types; there was also a fourth option for ‘unclear’ or ‘unknown’. The dataset was German novels – a subset of the DROC corpus. A random sample of 30 novels was selected and for each 1% of continuous text was pulled out. This was then given to 3 annotators, giving 1173 instances. Three people were chosen in order to see how well the process works and how difficult the task is. The results showed that the annotators often didn’t agree. The most frequent option was narrative, and argumentative and unknown came up together a lot. The researchers derived two datasets – one where all annotators agreed, which cut the dataset down by half or more, to 830 instances, and another where a majority (2 out of 3) agreed. There were 1503 instances in this set, so close to the whole dataset.
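Deriving those two subsets from raw annotations is a small piece of logic; a minimal sketch (simplified to one label per annotator per sentence, with my own field names rather than the project’s):

```python
from collections import Counter

def agreement_subsets(annotations):
    """annotations: {sentence_id: [label_a, label_b, label_c]} from the three annotators."""
    full, majority = {}, {}
    for sid, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count == 3:
            full[sid] = label       # all three annotators agree
        if count >= 2:
            majority[sid] = label   # at least two out of three agree
    return full, majority

full, majority = agreement_subsets({
    1: ["narrative", "narrative", "narrative"],
    2: ["narrative", "argumentative", "unknown"],
    3: ["descriptive", "descriptive", "narrative"],
})
print(len(full), len(majority))   # 1 2
```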
The researchers then wanted to use machine learning to apply the same classification automatically. They looked at token-based features (e.g. part of speech), discourse and modal particles (indicator words), word vectors and clusters, GermaNet (the German WordNet) generalisations, and fastText vectors with a zeta score to give clusters that were visualised with different colours. Classification used two models: a support vector machine (traditional machine learning) and a deep learning approach using a recurrent neural network. The researchers did a parameter study on both models and the support vector machine worked very well. To evaluate the results the researchers looked at randomly sampled 10% sets, which demonstrated that the results were quite stable. The linear support vector machine had an accuracy of 86% while the neural network was less accurate, with an accuracy of 80% on the stripped-down dataset. The reason for this is that a neural network needs more training data – with too few data points it is hard for it to learn. The researchers concluded that non-linear models like neural networks are not the best choice if the dataset is small and the problem can be solved with a simpler linear model.
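A minimal sketch of this kind of set-up using scikit-learn (TF-IDF features stand in here for the richer feature sets the paper actually used, and the example sentences and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

sentences = [
    "The old house stood at the edge of the dark forest.",     # descriptive
    "He opened the door, stepped inside and lit the lamp.",    # narrative
    "One must always weigh duty against personal desire.",     # argumentative
    "The river glittered silver beneath the morning sun.",     # descriptive
    "She packed her bags and left the town that same night.",  # narrative
    "Freedom without responsibility is no freedom at all.",    # argumentative
]
labels = ["descriptive", "narrative", "argumentative",
          "descriptive", "narrative", "argumentative"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
# Cross-validation stands in for evaluating on randomly sampled held-out sets;
# with a real dataset you would use more folds than the cv=2 this toy corpus allows.
print(cross_val_score(clf, sentences, labels, cv=2).mean())
```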
The final paper in the session was ‘Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)’. The speaker noted that a common difficulty is that we often just use our own defaults when approaching a problem, and these often don’t rest on empirical evidence of what’s best. This is particularly true of cluster and network visualisations in stylometry (e.g. using Stylo), where clustering techniques haven’t been evaluated systematically – we don’t have data on whether these default views work best. The experiment set up test datasets to see how the clustering works, with two different tasks to solve and two different types of corpora. One set involved stylometric benchmark corpora in German, French and English, featuring 75 novels by 25 authors (3 novels per author); the algorithm had to sort the instances correctly into 25 clusters. The other set involved sorting into two correct clusters, using 17th-century French drama and Latin texts from the gold and silver ages. The paper involved a lot of calculations and algorithms that I’m not at all familiar with and a lot of what was said went over my head, but the conclusion seemed to be that for hierarchical clustering the ward linkage method worked best and cosine delta was the best distance measure. Networks tended to give fewer clusters than there should be; the Louvain method was the best, but it depends on the number of classes.
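For reference, hierarchical clustering of this sort is easy to experiment with in scipy. A minimal sketch, with random vectors standing in for the most-frequent-word frequencies a stylometric study would use; note that scipy’s ward method formally assumes Euclidean distances, so combining it with a cosine-based distance matrix, as cosine-delta approaches do, is only an approximation here:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((75, 500))                # stand-in for 75 novels x 500 word frequencies

dist = pdist(X, metric="cosine")         # pairwise cosine distances between novels
Z = linkage(dist, method="ward")         # ward linkage, the best-performing method in the paper
labels = fcluster(Z, t=25, criterion="maxclust")   # cut the tree into 25 clusters (one per author)
print(labels[:10])
```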
The next session I attended was on ‘Tools Interfaces and Infrastructures’ and the first paper was ‘Finding Visual Patterns in Artworks: An Interactive Search Engine to Detect Objects in Artistic Images’. The project studied pose and depiction in paintings. With a manual search we might find examples by looking at other artists or other paintings by the same artist, but the results are then based on the prior knowledge of the expert and will therefore be limited. Computer-aided searching can find more examples in other techniques and styles over time, but how can this be achieved? The project involved the Computer Vision Group at Heidelberg (https://hci.iwr.uni-heidelberg.de/compvis), who created a tool to identify perceptions, artistic relationships, motifs and their meaning and the development of form. The speaker pointed out that computational methods for understanding image content began with object detection – HoG (histogram of oriented gradients), which finds the gradients of edges, for example to identify a cartwheel based on its outline. This worked well for pronounced outlines but not much more. More recently deep learning methods have come into use, and this is what the current project used. What is a ‘feature’? It is an image region encoded numerically as a vector – a feature vector. Features contain information about colour, shape and form, but also content – e.g. whether it’s an eye, a nose etc. For this project computer scientists worked directly with art historians in the same room, a situation that worked very well; together they developed requirements and identified challenges. They produced a purely visual search with no reliance on metadata at all. The requirements were that a user must be able to search for objects, object parts and entire images, upload new images, work with existing ones and work with diverse datasets, and that the system must be fast and bring back identical results each time.
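To give a flavour of the older HoG approach mentioned here, this is roughly how a HoG feature vector is extracted with scikit-image (the sample image is just a library test image, not an artwork from the project):

```python
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())   # any greyscale image will do

# The descriptor summarises edge directions in small cells across the image.
features, hog_image = hog(image, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), visualize=True)
print(features.shape)   # one long feature vector describing the whole image
```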
The tool is split into an offline stage (image database, feature extraction, generation of regions, and indexing of the results in a database to increase speed) and an online stage (where users can select a region of interest with a bounding box, followed by feature extraction, a nearest-neighbour search, ranking of results by similarity score using cosine distance, retrieval, and then analysis by the user). The user interface is really very nice, allowing the user to select a region in a painting, tweak it, then find all images containing the same thing and select favourites from the results. For example, the user can drag a box round a medal round a person’s neck and the system finds all paintings featuring people with medals round their necks.
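The online nearest-neighbour step boils down to comparing one feature vector against many; a minimal sketch with numpy (the dimensions and data are invented):

```python
import numpy as np

def nearest_neighbours(query_vec, feature_matrix, k=10):
    """Return indices of the k indexed regions most similar to the query region."""
    q = query_vec / np.linalg.norm(query_vec)
    m = feature_matrix / np.linalg.norm(feature_matrix, axis=1, keepdims=True)
    similarities = m @ q                    # cosine similarity = dot product of unit vectors
    return np.argsort(-similarities)[:k]    # highest similarity (lowest cosine distance) first

index = np.random.rand(5000, 512)           # 5000 pre-extracted region features (offline stage)
query = np.random.rand(512)                 # feature vector for the user's bounding box
print(nearest_neighbours(query, index, k=5))
```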
Images are added and processed through a browser-based CMS, and the project has used several test datasets, including 1500 works of the Bruegel family (demonstrating a search for a ship); another example was looking for a hat in 1000 images of street art. The system can also be used to identify engravings, and with cuneiform on clay tablets it can find a particular symbol. To evaluate performance the project collected a dataset of 1001 images and manually annotated these – e.g. praying hands, cross – with an average of 210 annotations per category. These were then compared to the automatic identification of features.
The speaker pointed out that the tool has the potential to alter research in art history, as visual searches can be used by non-experts – you don’t need to know the terminology and results are not biased by prior knowledge. About 1000-3000 images are required to train the network. The team are currently working on scaling, as the computational requirements are high: above 10-20,000 images the computation costs are massive and searches take a long time, and the system is not yet able to be used by multiple users at the same time.
The next paper in the session was ‘Sublime Complexities and Extensive Possibilities: Strategies for Building an Academic Virtual Reality System’. This is a system called OVAL – the Oklahoma Virtual Academic Lab. It’s built using the Unity framework and is available on GitHub. The project wanted to build a VR system that can be used by very different disciplines – e.g. architects to move through things, biologists to zoom into things. VR is becoming more and more affordable, first with the Oculus Rift and now the Quest, which is wifi enabled so has no wires. The project built a measurement tool in VR (e.g. dragging boxes round a manuscript image) and created an interface that can display manuscripts in VR, apply different filters to the pages – infrared, RGB etc. – and compare them all by walking around them. Objects can be moved around in 3D and users can annotate objects too. The speaker pointed out that this can work well for transcription – see whole pages as big as a building, then transcribe using voice recognition on a phone, reading out the letters. They also built a collaborative, multi-user space – up to 100 people in the same ‘room’ in VR – and created an interface for trial techniques for law students, involving the reconstruction of crime scenes, which makes it much easier to process where things are in relation to each other.
The final paper of the session was ‘Mapping the Complexity of Ancient Codices in Databases: the Syntactical Model’. The speaker pointed out that codices have one or more layers of complexity. Scribes add extra texts, sometimes the later parts of a codex are older than the first part, and often different scribes work on the same codex. The history of a codex is therefore messy and the usual data description models for manuscripts have problems: they tend to just give a title and a date, but often one date is insufficient. For example, giving a date of 1595 when this only relates to the last addition is misleading, as it looks like it applies to the whole codex. Generally the information consists of title, shelf mark, list of contents, physical description, origin and bibliography. Traditional scholarly description groups everything together without considering the historical layers, which can give incorrect search results. A new way to describe manuscripts is proposed – a syntactical model. Most codices are made up of different layers, or ‘production units’, and the primary level of description should be the production unit. The website https://betamasaheft.eu/ uses this approach. The proposed syntactical model is XML based, using the TEI <msPart> element. The speaker demonstrated some interactive visualisations of production units, allowing people to click on units and view their details. A search using the new model finds all the texts rather than just relying on a single date for the whole manuscript.
The final parallel session of the conference that I attended was ‘Digital Humanities Theory and Methodology’, and the first paper was ‘Linking the TEI: Approaches, Limitations, Use Cases’. The paper was about the coexistence of TEI and linked open data (LOD) and how to get TEI to interact with RDF. There are 569 elements and 505 attributes in TEI P5, with further customisation available via an ODD file. The speaker stated that TEI is not exactly a standard, but something between a standard and a consensus – TEI compliance does not entail interoperability, as there are many different solutions to the same problem, and this is an issue when linking to RDF. RDF is a data model for the web of data. It is a multi-graph – source node, type of edge, target node, all with their own URIs. This is what’s known as a ‘triple’, and by chaining triples together we can make complex graphs. Why RDF? It’s a flexible and generic mechanism for creating cross references and makes it easy to integrate with other datasets. It facilitates reusability, sustainability and replicability, and you can build on existing LOD technology. It also enforces explicit machine-readable semantics – something TEI doesn’t do – and you can use URIs to refer to data. It has been adopted by many communities, e.g. the linguistic linked open data cloud, but TEI and LOD do not converge: it’s not possible to integrate RDF triples in a single TEI document that is both TEI and W3C compliant. You can build your own solution but nothing is formalised. There is a TEI-native way: TEI includes attributes that take URI arguments, such as ‘@ref’, and these can be used to point to RDF targets. However, using this method there is no representation of the source or predicate, and it’s difficult to include different interpretations – there is no easy way to represent alternative readings, provenance or uncertainty. Another way is using inline XML. This is TEI endorsed but is not LOD compliant. Existing TEI elements can be used, e.g. <graph>, <fs>, <link> and <relation>, but these can be used for other things too, so things can easily get confused. You can also end up using attributes in ways in which they were not intended, e.g. ‘active’ and ‘passive’ for ‘source’ and ‘target’, and with <relation> this is restricted to named entities. A third solution, which is W3C compliant but not endorsed by the TEI, is augmenting the XML with RDFa data structures. This requires several new attributes to be added into the host vocabulary, e.g. ‘@about’ for the subject URI. You can take a TEI fragment and add the attributes, and TEI plus RDFa can then be converted into HTML5 plus RDFa, which can be queried with an RDFa parser (e.g. PyRDFa).
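As a small illustration of what making such links explicit looks like on the RDF side (the URIs below are placeholders of my own, not examples from the paper), a triple connecting a TEI-encoded entity to an external authority can be built and serialised with the rdflib library:

```python
from rdflib import Graph, Namespace, URIRef

# Placeholder URIs: the subject stands for an entity pointed at by a TEI @ref attribute,
# the object for a record in some external linked-open-data authority.
tei_entity = URIRef("http://example.org/tei/mydocument.xml#person-042")
authority = URIRef("http://example.org/authority/persons/12345")
OWL = Namespace("http://www.w3.org/2002/07/owl#")

g = Graph()
g.add((tei_entity, OWL.sameAs, authority))   # subject, predicate, object – one triple

print(g.serialize(format="turtle"))          # rdflib 6+ returns the Turtle as a string
```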
The second paper in the session was ‘Towards Tool Criticism: Complementing Manual with Computational Literary Analyses’. Unfortunately the speaker had to pull out and the paper wasn’t given, but the slides are going to be put up on https://www.etrap.eu/ and can be downloaded here: https://www.etrap.eu/wp-content/uploads/2019/07/Dystopian_Novels_compressed.pdf
The final paper in the session was ‘Translating Networks’. This highly enjoyable paper traced the variety of ways in which network visualisation and analytics have been used across many DH projects. The speaker analysed the network diagrams from the past 3 DH conferences to compile interpretative practices – not looking to see what is right or wrong. The full paper has lots of nice diagrams and can be found here: https://halshs.archives-ouvertes.fr/halshs-02179024/document. The speaker pointed out that the way we read networks has changed over time. Originally the fewer the number of overlapping lines the better, and edge crossings were minimised. A ‘diagrammatic’ network is a diagram we can use to follow all the paths, while a topological graph is a structure to show patterns and clusters. Recent DH practice uses the visualisation in a self-sufficient way – ‘this is an image of my data’ – or as a means of helping to visualise an argument, highlight the structure of a text, compare layouts (e.g. over time), visualise communities, map a structure, compare densities or monitor modularity clustering by algorithms. A common technique is to assign nodes a position using a force-directed layout, resulting in clusters, but these can be difficult to interpret compared to manually positioning points. The speaker pointed out that graph metrics go back decades, for example Freeman (1979) on ‘centrality’. Networks can be used for statistical analysis, e.g. the size of the graph or the number of nodes and edges as a ranking tool, or for looking at density (the number of edges), the diameter of the entire network or the average path length. Networks can focus on connectedness, cluster detection and the global clustering coefficient, or can be used for local measures focussing on a single node, its neighbours and its position in the overall graph. For example, a literary network might count the number of times one character speaks to another. Betweenness centrality and closeness are also often used, and eigenvector centrality can be used to show the prestige or influence of a selected node. The local clustering coefficient can show participation in a group, or the loneliness of a node, while shortest-path tests can show the relations of pairs of nodes and ‘cliques’ can show individuals that all know each other. The speaker pointed out that no one approach is ‘right’ or ‘wrong’ and many of the different approaches can be combined or enriched by non-structural categories too. The way practices relate to theory can be problematic: visualisations are mediations and there is no ideal method. Network diagrams can differ massively depending on how you decide to generate the network – the same data can have very different layouts.
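Most of the measures listed here are a one-liner in a library such as networkx; a minimal sketch on an invented character-interaction network (the nodes, edges and weights are all made up):

```python
import networkx as nx

# Invented character network: edge weight = number of conversational exchanges.
G = nx.Graph()
G.add_weighted_edges_from([("Anna", "Bruno", 12), ("Anna", "Clara", 5),
                           ("Bruno", "Clara", 2), ("Clara", "Doris", 7)])

print(nx.betweenness_centrality(G))          # who sits on paths between others
print(nx.closeness_centrality(G))            # how near a character is to everyone else
print(nx.eigenvector_centrality(G, weight="weight"))   # 'prestige' of a node
print(nx.clustering(G))                      # local clustering coefficient per node
print(list(nx.find_cliques(G)))              # maximal cliques (groups who all know each other)
print(nx.shortest_path(G, "Anna", "Doris"))  # shortest path between a pair of nodes
```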
The final plenary of the conference was ‘Digital Humanities — Complexities of Sustainability’, given by Johanna Drucker of UCLA. The speaker discussed sustainability (a system in which resources are replenished at the same rate as they are consumed) and complexity (nonlinear, non-deterministic) and how and why these terms relate to Digital Humanities. The speaker pointed out that there was a time when people thought DH could ‘save’ the humanities, but that this was delusional – the humanities are still in danger. DH people have to be concerned with methods – not just computation or empirical sciences but cultural and historical knowledge. Sustainability is important at a number of scales. The speaker discussed the website http://artistsbooksonline.org/, which contains 300 fully digitised books and lots of metadata, but to keep it online it needs to be migrated, and its functionality is gradually diminishing. It’s not sustainable – there is a need to migrate forwards. Another example is ‘History of the Book’ (https://hob.gseis.ucla.edu/). Between the start of that project and its launch the library had demanded the website be based on Drupal and had then given up on Drupal. It was a similar story with the DH101 coursebook: this had a WordPress site but it had to be abandoned due to spam, so it now only exists as a PDF (http://dh101.humanities.ucla.edu/wp-content/uploads/2014/09/IntroductionToDigitalHumanities_Textbook.pdf).
Technologies all become obsolete. We use tools and platforms, but these are generally discipline agnostic – not developed specifically for the humanities; even those that were developed for DH can still be used for anything. So what constitutes humanities methods? It’s what we add: the ethics, the intellectual property, the communities – sustainable principles.
At an institutional level, what is sustainable? There are system dependencies and standards to be aware of in order to be integrated into library systems, for example. Sustainability needs ongoing partnerships, e.g. with libraries, and the continuity of projects is not always set by the researcher. Visibility is also important – how to expose the research we’ve done, when digital projects don’t appear in WorldCat or library catalogues. Programming languages also die – Pascal, Flash – and storage media have longevity issues too, such as bit rot. We also suffer from ‘upgrade addiction’. There is intellectual obsolescence as well – things go out of fashion, knowledge too, and research becomes obscure and difficult to find. Things also become ubiquitous – hypertext used to be very exciting, now it’s just everywhere. Things have their time and moment. The technology industry in general has a serious sustainability problem at a global scale: we can’t produce high-tech instruments without rare metals and minerals, which can involve child labour and environmental disasters.
Are humanities and computation in opposition? No. Computation is a formal, deterministic, repeatable, disambiguated system – a formal system is what computation requires. On the other side we have the generative and probabilistic: cultural artefacts, objects etc. We should be wary of seeing sustainability as a finished thing; it’s incomplete. We need to balance the costs – the moral, ethical and ecological costs of high tech – with the justification for the work that we do. It has to have value: political value, ethical value. There is a productive tension between computational certainty and the generative activity of critical engagement. This complexity is justifiable and sustainable.
After the conference I stayed on in the Netherlands for a further week on holiday, so there will be no report next week.
Week Beginning 1st July 2019
I’d taken Thursday and Friday off this week due to getting married on the Friday, so had to squeeze quite a lot into three days this week. I dealt with a few queries from people regarding several projects, including Thomas Clancy’s Iona proposal, the Galloway Glens project, SCOSYA, the Mary Queen of Scots letters project, and a couple of new projects for Gavin Miller, but I spent the majority of my time on the DSL, with a little bit of time on Historical Thesaurus duties.
For the DSL I fixed some layout issues with the bibliography on the live site and engaged in an email conversation about handling updates to the data. This also involved running the script Thomas Widmann had left to export all the recent data from the DSL’s in-house server, which we hoped would grab all updates made since the last export. Unfortunately it looks like the export is identical to the previous one, so something is not working quite right somewhere.
The bulk of my DSL time this week was spent continuing to develop the new API, looking for the first time at full-text searches. The existing API set up by Peter uses Apache Solr for full-text searches, but I wanted to explore handling these directly through the online database instead, as Arts IT Support are not very happy to support the set-up that the existing API currently uses (the API is powered by the Python-based Django framework, with full-text searches handled by Solr).
The full-text search actually requires three different versions of the entries, all with any XML tags removed: the full entry; the full entry without the quotations; the quotations only. My first task was to write a script to generate and store these different versions. So far I’ve focussed on just the first, and I wrote a script to strip all XML tags and store the resulting plain text in the database for each of the 89,000 or so entries. This took some time to run, as you might expect. I then set the field containing this text to be indexed as ‘fulltext’ in the database, which then allows full text queries to be executed on the field. Unfortunately my experiments with running queries on this field have been disappointing. The search is slow, the types of searching that are possible are surprisingly limited, and there is no way to bring back ‘snippets’ of results that show where the terms appear (at least not directly via a single database query).
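The tag-stripping step itself is conceptually simple; a minimal sketch of the sort of thing involved, using lxml and with made-up element and column names rather than the actual DSL schema:

```python
from lxml import etree

def strip_tags(entry_xml):
    """Return the plain-text content of an entry, with tags removed and whitespace collapsed."""
    root = etree.fromstring(entry_xml)
    return " ".join("".join(root.itertext()).split())

print(strip_tags("<entry><form>mekill</form> <sense>great, large</sense></entry>"))

# The stored plain text can then be indexed in MySQL with something along these lines:
# ALTER TABLE entries ADD FULLTEXT INDEX ft_plain (plain_text);
```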
It is not possible to use single-character wildcards (e.g. ‘m_kill’), nor an asterisk wildcard at the start of a search term (e.g. ‘*kill’). It also doesn’t ignore punctuation, so a search for ‘mekill’ will not find an entry that has ‘mekill.’ in it. Finally, it only indexes words that are over 3 characters in length, so a search for ‘boece iv’ ignores the ‘iv’ part. What it can do pretty well is Boolean searches (AND, OR and NOT), wildcard searching at the end of a term (e.g. ‘kill*’) and exact phrase searching (e.g. “fyftie aught”).
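For anyone unfamiliar with the syntax, the Boolean searches that do work are expressed in MySQL roughly like this (the table and column names are placeholders):

```python
# A MySQL full-text query in Boolean mode: '+' = required term, '-' = excluded term,
# a trailing '*' = wildcard, and double quotes = exact phrase.
query = """
    SELECT entry_id
    FROM entries
    WHERE MATCH(plain_text) AGAINST('+kill* -mekill "fyftie aught"' IN BOOLEAN MODE)
"""
# cursor.execute(query)   # run via any MySQL client library, e.g. mysqlclient or PyMySQL
```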
I would personally not want to replace the current full-text search with something that is slower and more limited in functionality, so I then started working with Solr, which is not something I’ve had much experience with (at least not for about 10 years, back when it was horribly flaky and difficult to ingest data into). It turns out that it is pretty easy to set up and work with these days, and I set up a test version on my PC, ingested a few sample entries and got to grips with the querying facilities. I think it would be much better to continue to use Solr for the full-text searching, but this does mean getting Arts IT Support to agree to host it on a new server. If we do get the go-ahead to install Solr on the server where the new API resides it should then be fairly straightforward to set up fields for the full text, the full text minus quotes and the quotes only, and to write a script that will generate data to populate these fields. I will document the process so that whenever we need to update the data in the online database we know how to update the data in Solr too. The full-text search would then function in the same way as the current full-text search in terms of Boolean and wildcard searches, but would also offer an exact phrase search, and it would return ‘snippets’ as the current full-text search does. All this does rather depend on whether we can get Solr onto the server, though.
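For my own reference, indexing and querying from Python can be done with the pysolr library; a rough sketch under the assumption of a core called ‘dsl’ and fields named ‘fulltext’ and ‘quotes’ (all of which are placeholders, not the final schema):

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/dsl", always_commit=True)

# Index a couple of stripped-text documents (placeholder IDs and text).
solr.add([
    {"id": "snd00001", "fulltext": "ane mekill stane", "quotes": "a mekill stane lay thar"},
    {"id": "dost00002", "fulltext": "fyftie aught men", "quotes": "fyftie aught men of armes"},
])

# Search with a trailing wildcard and ask Solr to return highlighted snippets.
results = solr.search("fulltext:mekill*", **{"hl": "true", "hl.fl": "fulltext"})
print(results.hits)
print(results.highlighting)
```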
For the Historical Thesaurus I checked through a new batch of data that the OED people had sent to Fraser this week, which now includes quotation dates and labels. It looks like it should be possible to grab this data and use it to replace the existing HT dates and labels, which is encouraging. I also updated a stats script I’d previously prepared to link through to all of the words that meet certain criteria, and worked on a new script to match HT and OED lexemes based on the search terms tables that we have (these split lexemes with brackets, slashes and other such things into multiple forms). I realised I hadn’t generated the search terms for the new OED lexeme data we had been given, so first of all I had to process this. The script took a long time to run as almost 1 million variants needed to be generated. Once they had been generated I could run my matching script, and the results are pretty promising, with 15,587 matches that could be ticked off (once checked). A match is only listed if the HT lexeme ID is linked to exactly one OED lexeme ID (and vice-versa). We currently have 85,454 unmatched OED lexemes in matched categories, so this is a fair chunk.
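The one-to-one matching rule boils down to keeping only pairs where each side has a single candidate; a minimal sketch of the logic (the function and field names are my own, not the actual script):

```python
from collections import defaultdict

def one_to_one_matches(candidate_pairs):
    """candidate_pairs: iterable of (ht_lexeme_id, oed_lexeme_id) linked via shared search terms."""
    ht_to_oed = defaultdict(set)
    oed_to_ht = defaultdict(set)
    for ht_id, oed_id in candidate_pairs:
        ht_to_oed[ht_id].add(oed_id)
        oed_to_ht[oed_id].add(ht_id)
    # Keep a pair only when each side points at exactly one lexeme on the other side.
    return [(ht_id, next(iter(oed_ids)))
            for ht_id, oed_ids in ht_to_oed.items()
            if len(oed_ids) == 1 and len(oed_to_ht[next(iter(oed_ids))]) == 1]

print(one_to_one_matches([(1, 10), (2, 20), (2, 21), (3, 20)]))   # [(1, 10)]
```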
I also spent a bit of time helping Fraser to get some up to date stats for our paper for the DH conference next week. I’m going to be at the conference all next week and then on holiday the week after.
Week Beginning 24th June 2019
I focussed on two new projects that were needing my input this week, as well as working on some more established projects and attending an event on Friday. One of the new projects was the pilot project for Matthew Sangster’s Books and Borrowing project, which I started working on last week. After some manual tweaking of the script I had created to work with the data, and some manual reworking of the data itself I managed to get 8138 rows out of 8199 uploaded, with most of the rows that failed to be uploaded being those where blank cells were merged together – almost all being rows denoting blank pages or other rows that didn’t contain an actual record. I considered fixing the rows that failed to upload but decided that as this is still just a test version of the data there wasn’t really much point in me spending time doing so, as I will be getting a new dataset from Matthew later on in the summer anyway.
I also created and executed a few scripts that will make the data more suitable for searching and browsing. This has included writing a script that takes the ‘lent’ and ‘returned’ dates and splits these up into separate day, month and year columns, converting the month to a numeric value too. This will make it easier to search by dates, or order results by dates (e.g. grouping by a specific month or searching within a range of time). Note that where the given date doesn’t exactly fit the pattern of ‘dd mmm yyyy’ the new date fields remain unpopulated – e.g. things like ‘[blank]’ or ‘[see next line]’. There are also some dates that don’t have a day, and a few typos (e.g. ’fab’ instead of ‘feb’) that I haven’t done anything about yet.
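The date-splitting logic is along these lines (a simplified sketch with made-up handling; the real script also has to cope with the quirks noted above):

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DATE_RE = re.compile(r"^(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})$", re.IGNORECASE)

def split_date(raw):
    """Split 'dd mmm yyyy' into (day, month, year); return (None, None, None) otherwise."""
    match = DATE_RE.match(raw.strip().lower())
    if not match:
        return None, None, None          # e.g. '[blank]', '[see next line]', missing day
    day, month, year = match.groups()
    return int(day), MONTHS.index(month) + 1, int(year)

print(split_date("12 Feb 1760"))   # (12, 2, 1760)
print(split_date("[blank]"))       # (None, None, None)
```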
I’ve also extracted the professor names from the ‘prof1’, ‘prof2’ and ‘prof3’ columns and have stored them as unique entries in a new ‘profs’ table. So, for example, ‘Dr Leechman’ only appears once in the table, even though he appears multiple times as either prof 1, 2 or 3 (225 times, in fact). There are 123 distinct profs in this table, although these will need further work as some are undoubtedly duplicates with slightly different forms. I’ve also created a joining table that joins each prof to each record. This matches up the record ID with the unique ID for each prof, and also stores whether the prof was listed as prof 1, 2 or 3 for each record, in case this is of any significance.
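The extraction itself is a simple normalisation step; roughly as follows, with hypothetical field names standing in for the actual columns:

```python
def build_prof_tables(records):
    """records: list of dicts with 'id', 'prof1', 'prof2', 'prof3' keys (hypothetical field names)."""
    profs, links = {}, []
    for rec in records:
        for position in (1, 2, 3):
            name = (rec.get(f"prof{position}") or "").strip()
            if not name:
                continue
            prof_id = profs.setdefault(name, len(profs) + 1)   # one unique ID per distinct name
            links.append((rec["id"], prof_id, position))        # record, prof, and whether prof 1/2/3
    return profs, links

profs, links = build_prof_tables([
    {"id": 1, "prof1": "Dr Leechman", "prof2": "", "prof3": ""},
    {"id": 2, "prof1": "Dr Reid", "prof2": "Dr Leechman", "prof3": ""},
])
print(profs)   # {'Dr Leechman': 1, 'Dr Reid': 2}
print(links)   # [(1, 1, 1), (2, 2, 1), (2, 1, 2)]
```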
Similarly, I’ve extracted the unique normalised names from the records and have stored these in a separate table. There are 862 unique student names, and a further linking table associates each of these with one or more records. I will also need to split the student names into forename and surname in order to generate a browse list of students. It might not be possible to do this fully automatically, as names like ‘Robert Stirling Junior’ and ‘Robert Stirling Senior’ would then end up with ‘Junior’ and ‘Senior’ as surnames. I guess a list of professors listed by surname will also be needed.
I have also processed the images of the 3 manuscripts that appear in the records (2, 3 and 6). This involved running a batch script to rotate the images of the manuscripts that are to be read as landscape rather than portrait and renaming all the files to remove spaces. I’ve also passed the images through the Zoomify tileset generator, so I now have tiles at various zoom levels for all of the images. I’m not going to be able to do any further work on this pilot project until the end of July, but it’s good to get some of the groundwork done.
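The batch rotation and renaming is the kind of thing Pillow handles in a few lines; a rough sketch (the folder names and rotation direction are assumptions, and the real script only rotated the manuscripts that needed it):

```python
from pathlib import Path
from PIL import Image

SRC = Path("manuscript-images")            # hypothetical folder of page scans
OUT = Path("manuscript-images-rotated")
OUT.mkdir(exist_ok=True)

for path in SRC.glob("*.jpg"):
    img = Image.open(path)
    rotated = img.rotate(-90, expand=True)             # turn pages to be read as landscape
    rotated.save(OUT / path.name.replace(" ", "_"))    # strip spaces from filenames while saving
```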
The second new project I worked on this week was Alison Wiggins’s Account Books project. Alison had sent me an Access database containing records relating to the letters of Mary, Queen of Scots and she wanted me to create an online database and content management system out of this, to enable several researchers to work on the data together. Alison wanted this to be ready to use in July, which meant I had to try and get the system up and running this week. I spent about two days importing the data and setting up the content management system, split across 13 related database tables. Facilities are now in place for Alison to create staff accounts and to add, edit, delete and associate information about archives, documents, editions, pages, people and places.
On Wednesday this week I had a further meeting with Marc and Fraser about the HT / OED data linking task. In preparation for the meeting I spent some time creating some queries that generated statistics about the data. There are 223,250 matched categories, and of these there are 161,378 where the number of HT and OED words is the same and 100% of the words match. There are 19,990 categories where the number of HT and OED words is the same but not all words match. There are 12,796 categories where the number of HT words is greater than the number of OED words and 100% of the OED words match, and 5,878 categories where the number of HT words is greater than the number of OED words and less than 100% of the OED words match. There are 16,077 categories where the number of HT words is less than the number of OED words and 100% of the HT words match, and in these categories there are 18,909 unmatched OED lexemes that are ‘revised’. Finally, there are 7,131 categories where the number of HT words is less than the number of OED words and less than 100% of the HT words match, and in these categories there are 19,733 unmatched OED lexemes that are ‘revised’. Hopefully these statistics will help when it comes to deciding which dates to adopt from the OED data.
On Friday I went down to Lancaster University for the Encyclopedia of Shakespeare’s Language Symposium. This was a launch event for the project and there were sessions on the new resources that the project is going to make available, specifically the Enhanced Shakespearean Corpus. This corpus consists of three parts. The ‘First Folio Plus’ is a corpus of the first folio of plays from 1623, plus a few extra plays. It has additional layers of tagging and annotation – tagging for part of speech, for example, and annotation such as social annotation (e.g. who was speaking to whom), gender and social ranking. Part of speech was tagged using an adapted version of the CLAWS tagger and spelling variation was regularised using VARD2.
The second part is a corpus of comparative plays. This is a collection of plays by other authors from the same time period, which allows the comparison of Shakespeare’s language to that of his contemporaries. The ‘first folio’ has 38 plays from 1589-1613 while the comparative plays corpus has 46 plays from 1584-1626. Both are just over 1 million words in size and look at similar genres (comedy, tragedy, history) and a similar mixture of verse and prose.
The third part is the EEBO-TCP segment, which is about 300 million words across 5700 texts – about a quarter of EEBO. It doesn’t include Shakespeare’s texts but includes texts from five broad domains (literary, religious, administrative, instructional and informational) and many genres, allowing researchers to tap into the meanings triggered in the minds of Elizabethan audiences.
The corpus uses CQPWeb, which was developed by Andrew Hardie at Lancaster, and who spoke about the resource at the event. As we use this software for some projects at Glasgow it was good to see it demonstrated and get some ideas as to how it is being used for this new project. There were also several short papers that demonstrated the sorts of research that can be undertaken using the new corpus and the software. It was an interesting event and I’m glad I attended it.
Also this week I engaged in several email discussions with DSL people about working with the DSL data, advised someone who is helping Thomas Clancy get his proposal together, scheduled an ArtsLab session on research data, provided the RNSN people with some information they need for some printed materials they’re preparing and spoke to Gerry McKeever about an interactive map he wants me to create for his project.