We made a big update to the content of the Anglo-Norman Dictionary this week, replacing all entries in the letter ‘U’, plus a number of other entries elsewhere in the dictionary. I’d documented the processes that I needed to follow during previous updates to the dictionary, so on the whole it was a straightforward process. However, previously I’d performed the update by downloading the database to my local PC, running all required processes there, then uploading the new version of the database to the server. The good thing with this approach is that the updates are applied on my local PC, so if anything goes wrong the live site is not affected. The downside is that the database for the AND is over 2GB in size and it’s not possible to upload data of this size via phpMyAdmin; instead I need to ask Arts IT Support to run the import at the command line. I wanted to be able to manage the whole process myself this time, so I investigated running the update on the server. I did of course first download the live database and run the update scripts on my local PC to ensure no issues would be encountered. However, the version of PHP that runs on the server is newer and stricter than the version running on my local PC, so I needed to make some modifications to the processing scripts that I usually run locally. With the modifications in place I was then able to run the scripts on the server and replace the letter ‘U’ entries on the live site. The update also required some entries elsewhere in the dictionary to be updated, a process handled via the dictionary’s online content management system: a zip file containing entries is uploaded and a script then processes it. Last week I had to replace the library for handling zip files and this was the first time the updated ‘upload entries’ script had been used. Unfortunately some errors were encountered, but I managed to sort these out and after that the update was complete.
I also checked to see how large the AND’s XML dataset was, which the editors need to know for some additional work they’re doing. I’d estimated it to be around 25-35MB whereas they’d estimated it to be around 2GB. I’d previously written a little script that exports XML data from the database and saves the entries as individual XML files, so I ran this again for the entire database. The resulting dataset of just over 60,000 files took up 139MB, although it occupies 180MB on disk, due to lots of small files having a storage overhead that one large file doesn’t have.
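The gap between the two figures comes down to filesystem allocation blocks: every file, however small, is rounded up to a whole number of blocks. Here's an illustrative Python sketch (the 4KB block size and uniform file size are assumptions for demonstration, not measurements of the actual dataset):

```python
# Illustrative sketch: many small files occupy more disk space than their
# combined byte size because each file is rounded up to whole allocation
# blocks. The 4KB block size and uniform file size are assumptions.
BLOCK = 4096

def size_on_disk(file_sizes, block=BLOCK):
    # round each file up to a whole number of blocks (ceiling division)
    return sum(-(-size // block) * block for size in file_sizes)

sizes = [2400] * 60000           # 60,000 small XML files of ~2.4KB each
total_bytes = sum(sizes)         # the data itself: 144,000,000 bytes
allocated = size_on_disk(sizes)  # what the filesystem actually reserves

assert allocated == 60000 * 4096  # one 4KB block per file
assert allocated > total_bytes
```

One large file containing the same data would instead waste at most one partial block in total.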
Also this week I had a video call with Michelle Anjirbag-Reeve, a researcher who is applying for ERC funding and would be based in the School of Critical Studies if the funding is successful. Her proposal sounds really interesting and I’ll be involved in creating a CMS, website and visualisations if it’s successful. Fingers crossed.
I also spent a day or so this week working for the Dictionaries of the Scots Language. Before my recent holiday I’d been working on a requirements document for new search facilities for dates and part of speech, plus search result filter options and new sparklines, and I completed a first draft of this document this week. It took quite a bit of thinking through; the document is rather long and the updates have implications for other parts of the site. I realised whilst writing the sections on filtering that I’ll need to change the way that headword searches function, as they do not currently use Solr while the filter options will rely on it, so I added a section explaining this. Also, whilst thinking through the sparklines I realised we might want to cluster the data to make continuous blocks rather than using the individual citation dates, and I included a discussion of this in the document too.
I also made a few more updates to the SpeechStar website this week and managed to find some time to return to working on the Books and Borrowing front-end. For this I implemented a first version of the ‘On this day’ feature, which I’ve currently added to the homepage of the dev site. What the feature does is pick out a random borrowing for the current day and display information about it, for example:
“On this day in 1829, Mr Robert Allan, a borrower at Advocates Library borrowed 2 volumes of Histoire de la Vie et de la Mort des deux illustres fréres Corneille et Jean de Witt. by Cornelis de Witt.”
The feature picks out and displays the borrower, the library, the number of volumes borrowed (if this is over 1), plus the title and the author of the book borrowed. The borrower, title and author are links that perform a search, while the library is a link to the library page. There is a ‘reload’ button in the bottom left and when you press it the area scrolls up and then scrolls down again with a new randomly selected borrowing from the day. There is also a link in the bottom right to view all of the borrowings for the current day. These are presented on a new page that also features options to select a different day and month, in case people want to see what was borrowed on their birthday, for example. Items on this page are listed in date order and then by library. It’s maybe not the most serious and academic of features, but I think it’s a nice addition and makes the data feel more alive.
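As a rough illustration of the selection logic (the record structure, field names and data here are all invented for demonstration, not the actual Books and Borrowing code), the idea looks something like this in Python:

```python
import random
from datetime import date

# Hypothetical borrowing records (invented data), each keyed by the day and
# month of the borrowing date so that any year can match "this day".
borrowings = [
    {"day": 28, "month": 7, "year": 1829,
     "borrower": "Mr Robert Allan", "library": "Advocates Library"},
    {"day": 28, "month": 7, "year": 1763,
     "borrower": "A. Student", "library": "Glasgow University Library"},
    {"day": 25, "month": 12, "year": 1790,
     "borrower": "B. Reader", "library": "Leighton Library"},
]

def on_this_day(records, today):
    """Pick one random borrowing made on today's day and month in any year."""
    matches = [r for r in records
               if r["day"] == today.day and r["month"] == today.month]
    return random.choice(matches) if matches else None

pick = on_this_day(borrowings, date(2023, 7, 28))
assert pick is not None and (pick["day"], pick["month"]) == (28, 7)
```

The ‘view all’ page would simply return the full `matches` list, sorted by date and then library, instead of a single random choice.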
I’m going to be on holiday again for the next two weeks so there won’t be another report from me until the week of the 14th of August.
After two weeks out of the office on holiday and at a conference I returned to regular work this week, although it was only a four-day week due to Monday being the Glasgow Fair public holiday. I spent most of Tuesday dealing with my expenses, figuring out what outstanding tasks I needed to get back to and catching up on emails and issues that had cropped up whilst I’d been away. I also spent some time writing up my rather lengthy report from the conference, which you can read in last week’s blog post.
One of the issues that cropped up is that the embedded Twitter feeds on sites such as https://anglo-norman.net/ and https://burnsc21.glasgow.ac.uk/ have stopped working (and at the time of writing are still broken), and only display ‘Nothing to see here – yet’. It looks like this is yet another occurrence of Twitter being a disaster zone these days and they’ve blocked embedded Twitter feeds. Information about the issue can be found here https://twittercommunity.com/t/again-list-widget-says-nothing-to-see-here-yet-if-logged-out/198782/205 and at the time of writing it would appear there is no work-around for this, and absolutely no official word from Twitter about the issue. This may be the end of embedded Twitter feeds and another nail in the coffin for Twitter if so.
I also fixed an issue with the Place-names of Iona project. The scripts in the content management system for managing place-name elements hadn’t been working properly since the migration to a new server, and I managed to fix this with a couple of updates to the code.
There was also an issue with the DSL’s advanced search that had been introduced when we migrated to a new Solr instance last month. After the migration the advanced search snippets were sometimes being joined together and were occasionally far too long, and I introduced a fix for this. Unfortunately the fix required the full result set to be queried in Solr and then filtered by source after the results were returned. This caused problems for searches exceeding our maximum returned results limit of 500: when a source was specified the displayed total referred to the unfiltered count, and results were capped at 500 before the source filter was applied. The outcome was that when a source dictionary was specified the total number of results was incorrect and the number of returned results was not 500 but however many of the first 500 belonged to the source dictionary in question. Thankfully I managed to sort this out and all should be behaving properly again now. Also for the DSL I investigated the entry https://dsl.ac.uk/entry/dost/depredatio(u)n which was giving an error message. This one took a bit of time to figure out, but in the end it was something simple: this is the only entry that has a closing bracket in its ‘slug’, and my script was stripping closing brackets from slugs before connecting to the API to retrieve data, so no matching entry was found and the XML file for this entry was empty. I’ve fixed this issue now.
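To illustrate why capping results before filtering by source gives wrong numbers, here is a minimal Python sketch with made-up data (the real search runs against Solr in PHP; the source names and counts are invented):

```python
# Made-up result set: 1,200 matches, every third one from source 'dost'.
results = [{"source": "dost" if i % 3 == 0 else "snd"} for i in range(1200)]
LIMIT = 500

# Buggy order: cap at 500 first, then filter by source afterwards.
buggy = [r for r in results[:LIMIT] if r["source"] == "dost"]

# Correct order: filter by source first, then apply the cap.
fixed = [r for r in results if r["source"] == "dost"][:LIMIT]

assert len(buggy) == 167   # only the 'dost' rows among the first 500
assert len(fixed) == 400   # all 400 'dost' rows, well within the cap
```

The buggy ordering both under-returns results and reports the unfiltered total, which matches the behaviour described above.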
An issue had also arisen with the Anglo-Norman Dictionary since a recent PHP upgrade on the server. In the content management system we have a proofreader feature, which allows an editor to upload a ZIP file containing any number of entry XML files. These are then extracted and formatted as they would be displayed on the public website, only with all entries in one long page. However, since the PHP upgrade the proofreader gave a blank page when a ZIP file was submitted. It was definitely not an issue with any one specific ZIP file, as I tested the proofreader with other files that should have worked and the result was the same. It turned out that the library I use to extract and read ZIP files in the proofreader script is not compatible with the new version of PHP that the server has been upgraded to. I created a simple test script that reads a ZIP file and it fails on the server but runs with no issues on my desktop PC, which runs an older version of PHP.
I then spent some time getting to grips with the replacement ZIP library, but unfortunately the server did not feature this library (or any library for processing zip files in PHP since the upgrade). I submitted a helpdesk query to ask for the library to be installed and thankfully this was completed a few hours later. I could then replace my old script with a new one and the proofreader appeared to work again. However, it became apparent that the new ZIP extraction script was cutting larger files off at a certain point, even though I’d set the script to check the file size of the archived file and grab all of the file up to that size. Updating the size to a much larger number had no effect either. It turns out that when a file is extracted using the new ZIP library it is returned as a ‘stream’, and a single read from a stream only returns one chunk of data. I was placing this in a variable thinking it was the full file, but that isn’t necessarily the case. Instead I’ve added an extra bit of code that iterates over all of the chunks of data in the stream and ensures they are all appended to the variable rather than only one part being added. With this update in place the proofreader now works, even with very large files.
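The proofreader itself is written in PHP, but the same chunk-at-a-time stream behaviour can be sketched with Python’s standard zipfile module (the file name and contents here are invented):

```python
import io
import zipfile

# Build a small in-memory ZIP containing one large XML entry (invented name).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("entry.xml", "<entry>" + "x" * 100000 + "</entry>")

# Reading the archived member gives a stream: a single read(n) returns only
# one chunk, so the chunks must be accumulated until the stream is exhausted.
with zipfile.ZipFile(buf) as zf:
    with zf.open("entry.xml") as stream:
        data = b""
        while True:
            chunk = stream.read(8192)  # next chunk of the stream
            if not chunk:
                break                  # stream exhausted
            data += chunk

# The loop recovers the whole member, not just the first chunk.
assert len(data) == len("<entry>") + 100000 + len("</entry>")
```

Taking only the first `read()` result would have reproduced the truncation bug: large members would be silently cut off at the chunk boundary.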
My final task for the week was to make a lot of updates to the Speech Star website (https://www.seeingspeech.ac.uk/speechstar/). This included changing the title of the website and the ‘site tabs’ found in the top right of all three associated sites, making tweaks to the IPA and ExtIPA charts, updating the site text and adding in two videos that had been missed out.
After spending a lovely week on holiday in Austria I attended the DH2023 conference in Graz this week (https://dh2023.adho.org/). As always it was a hugely interesting and useful conference and I learned a lot through all of the various sessions I attended. There were so many parallel sessions it was sometimes tricky to know which to attend and below is a brief overview of each of the papers I listened to. Some of the papers were given remotely, and generally these worked very well, although it was not quite the same as having the speaker there in person. The biggest issue was the question and answer session after the remote presentations, as all questions could only be responded to via text-based chat. There was a sizable delay between the question being asked and the text coming through and unfortunately it never really worked very well.
The first session I attended on Day 1 was a long paper session on Geospatial Methods. The first paper was about mapping Marco Polo’s travel journals. The speaker mentioned how travel literature covers various forms, such as travel memoirs and guidebooks, and stated that Marco Polo’s journals are one of the earliest known accounts, being written between 1271 and 1295. The journals have been translated into many languages and there are questions as to whether he really travelled to China or whether he just took other people’s stories. The journals create a mental journey that can then be traced on an atlas to create a map, allowing us to reconstruct and geographically trace the route. The speaker was interested in whether it would be possible to computationally extract, process and georeference entries in the journals to semi-automatically construct the route and how this could then augment the reader experience. The speaker used natural language processing to identify placenames using a gazetteer and machine learning in R. The result was a representation of a possible route taken by Marco Polo, and a visualisation of the landscape of the route was created.
The speaker used Henry Yule’s English translation of the journals from 1871, which was published in two volumes and which is freely available from Project Gutenberg. Manual annotations of the volumes were created by students for training purposes and a gazetteer was created based on the back matter in the book, which consisted of location names and modern names. The speaker tried using Flair (https://github.com/flairNLP/flair) but this struggled with historical references, and also tried motion event annotation: extracting motion verbs in the text using VerbNet (https://verbs.colorado.edu/verbnet/) and also FrameNet (https://framenet.icsi.berkeley.edu/). Attempts were made to identify when Marco Polo was the subject of the motion verbs in order to extract the origin and destination. This resulted in the extraction of 398 short sentences expressing the motion of Marco Polo for which coordinates could then be generated, giving 133 map locations, which were saved as a KML file and could be compared with existing map-based routes. Other information about the landscapes that was not mentioned in the journals could be noted, such as vegetation, temperatures and elevation. The speaker noted that in travel journals the actual travel on roads tends to be given sparse detail and this approach helps provide more information. The speaker created a digital elevation model as a GeoTIFF file and then used Blender and Rayshader to generate a 3D model of the route in R. The speaker discussed ‘cost surface analysis’ and ‘least cost path analysis’ and how these could give a more nuanced interpretation of possible routes in travel journals, for example calculating the optimum path through mountains.
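As a toy illustration of the gazetteer-matching idea (the placenames, coordinates and matching logic here are my own invention, far simpler than the speaker’s actual pipeline):

```python
# Toy gazetteer of placenames with coordinates (all invented for illustration).
gazetteer = {
    "Venice": (45.44, 12.33),
    "Kashgar": (39.47, 75.99),
    "Cambaluc": (39.90, 116.40),
}

def georeference(sentence, gaz):
    """Return (name, coordinates) for every gazetteer placename in the sentence."""
    return [(name, coords) for name, coords in gaz.items() if name in sentence]

hits = georeference("From Venice the traveller journeyed toward Cambaluc.", gazetteer)
assert sorted(name for name, _ in hits) == ["Cambaluc", "Venice"]
```

The real system additionally has to handle historical spellings and disambiguation, which is where the NER and motion-verb annotation come in.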
The second paper, ‘Link Visions Together: Visualizing Geographies of Late Qing and Republican China’, discussed the visualisation of the geographies of late Qing and Republican China. This has been an eight-year project that started with line survey maps and has developed into a complex GIS platform consisting of thousands of 1:50,000 scale maps, mainly of the Eastern regions of Chinese territory. The speaker discussed the platform that had been created to host the maps and how this could deal with both time and space. The platform is available at https://chmap.mpiwg-berlin.mpg.de/lgtu-new/ and contains maps of provinces that have been digitised and georeferenced. The project uses GDAL (https://gdal.org/index.html) and WMTS (https://www.ogc.org/standard/wmts/) and the resulting maps can be used in various platforms such as ArcGIS. The project connects to the Local Gazetteer Research Tools (https://www.mpiwg-berlin.mpg.de/research/projects/logart-local-gazetteers-research-tools) and links through from map points to associated gazetteer manuscript images. The maps also support spatial references in IIIF, enabling links between maps and other materials such as books and paintings.
The final paper of the session was ‘Mapping Memes in the Napoleonic Cadastre: Expanding Frontiers in Memetics’. This looked at the Cadastre of France, a massive survey of land properties that was used to work out things like taxes in Napoleonic France, beginning in 1807 and continuing into the 1880s. The project looked at how colours and shadows were used over time and how standards were set, for example using green shades for rivers. Most cadastres were released after the fall of the empire but standards were still followed. The speaker discussed ‘memes’ and how these are an elementary unit and replicator of culture. These can be tunes, ideas, catchphrases, fashions or ways of building arches, and the speaker discussed how memes propagate themselves. The cadastres are physical instances of cartographic practices and the speaker wished to pick out normalised features of the maps, fragmenting them into smaller areas that were easier to analyse. The speaker created samples of different types of lines, shadings and other such things to try and find coherences – picking out elements that looked alike based on things such as colour distribution, morphology, texture, line width, graphical load and orientation. The speaker then generated clusters of representations, enabling cadastres created by the same organisations to be identified. The speaker then discussed automatically classifying map features into four types: the built environment, non-built areas, roads and rivers. It was then possible to pick out squares featuring each of these in order to compare different settlements.
The second session I attended was a long-paper session on visualising text and the first paper was ‘Mapping Antiquity in Collaboration: The Digital Periegesis Project’. The speaker discussed non-modern ways of looking at space and stated that while maps are a part of contemporary culture they weren’t always so. Maps give us a different sense of space and distance and different ideas of how places are connected, with the speaker comparing a modern map of the Mediterranean with a view of it given in an old manuscript. The speaker focussed on Pausanias’s Periegesis, which was written around 180AD, and how the power of modern annotations can help to interpret old maps and to enable collaboration. The speaker pointed out, however, that annotation is not new – medieval manuscripts were often full of annotations, often taking up more space than the actual text. The speaker’s project is using TEI XML to mark up annotations and reuse them, using the Recogito tool from Pelagios (https://recogito.pelagios.org/) to semantically enrich texts online, for example adding in references to places, people and events. Using the tool these can then be aligned with gazetteer entries to produce linked data without requiring any technical knowledge, and maps of places mentioned in texts can be automatically generated. The speaker gave the example of how the first volume of the Periegesis was set in Attica but how many other places beyond this area were mentioned. The speaker discussed challenging the cartesian view of places but also the itinerary view – how connections and relations can be drawn between entities to show network visualisations.
The speaker’s project is using NodeGoat (https://nodegoat.net/) to generate network diagrams and discussed another ancient Greek project called Manto that also uses this platform for visualisations (see https://manto.unh.edu/viewer.p/60/2616/scenario/1/geo/) and discussed the power of linked data – how it’s possible to highlight a word in a text and see images related to it found in other resources, or related maps. The speaker also discussed Peripleo (https://github.com/britishlibrary/peripleo/blob/main/README.md) which can be used to map things related to texts.
The second speaker spoke about ‘The Dots and the Line. How to Visualize the Argumentative Structure of an Essay’. This looked at literary criticism and data visualisation, focussing in particular on non-fiction writing. The speaker investigated metaphors for argumentation in the essays of the Italian author Italo Calvino. Such metaphors are often spatial concepts, such as line of argument, progressive or circular, speech orientation, stance taking and lateral thinking. The speaker wished to compare two or more essays based on visual form – deleting the text and using diagrams to compare visual similarities and differences – visually transforming the structure of an argument to show the connections between different parts. The process needed to be transferrable and automatable and the speaker chose to focus on argumentative connectives, which define the relationships that logically structure the meanings of sentences and texts. The speaker used http://connective-lex.info which includes a catalogue of Italian connectives featuring 173 entries. The speaker looked at the different classes (contingency, comparison, expansion and temporal), grouping connectives found in 80 essays contained in two collections by the author into these classes.
The speaker used plain text versions of the essays and recorded the presence of the connectives in each essay. The speaker decided to focus on ‘dubitative text’ – text expressing doubt – and visualised this across the four classes. A line was used to represent the text, but this was made into a circle to save space. The line was thickened where dubitative text was found and different colours were used to represent the four classes. The speaker used https://observablehq.com/ and Figma (https://www.figma.com/) to generate the visualisations and as the length of each essay is normalised it is possible to compare multiple essays. For example, one essay has 15% dubitative text, of which 25% is contingency, 14% comparison, 35% expansion and 26% temporal. The visualisations are really very nice and the speaker stated that such an approach could be used to visualise other types of data in a text, such as dialogues or proper names.
The third paper was ‘Communication landscapes of the 19th Century: The speed, geographical coverage and content of news in the Rigasche Zeitung’ and it focussed on historical news networks, looking at the city of Riga in Latvia and how the 19th century press network operated there, specifically during the Crimean War. The speaker discussed attempts to identify the reuse of text in different forms, showing the long-term propagation of text over many months or years, but how this is computationally expensive, requires a lot of training and doesn’t work well across different languages. One such attempt was the Viral Texts Project (https://viraltexts.org/). The speaker wanted to identify alternative ways to map communication networks and looked instead at one specific newspaper, the Rigasche Zeitung. The speaker explained that in the 19th century Riga was a frontier region and a trading hub – the Easternmost outpost of the German merchant world and the Westernmost outpost of the Russian empire. The newspaper produced 18,499 issues from 1802 to 1888, with 289,704 articles available digitally, all OCRed and segmented into sections such as news and local news. Within the sections the text is divided into specific messages with dates and places in a heading. The speaker used regular expressions to capture these place-date headings, manually removed false positives and linked placenames together. The speaker stated that the 250 most common placenames covered 91% of the data, and also pointed out that there were calendar differences (Julian and Gregorian) that had to be taken into consideration. The speaker extracted 232,366 news items with geographical, temporal and semantic data that could then be plotted on a map over time. The speaker noted that the majority of items were based in Europe and more than half in the capitals, with German-language places mentioned the most.
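As an illustration of the heading-extraction technique (the heading format, sample text and regular expression here are my own guesses for demonstration, not the speaker’s actual pattern or data):

```python
import re

# Guessed heading shape: "Placename, 14. Juni" on its own line; the real
# headings and pattern will differ, this just demonstrates the technique.
heading = re.compile(
    r"^(?P<place>[A-ZÄÖÜ][\w\-]+),\s+(?P<day>\d{1,2})\.\s+(?P<month>\w+)",
    re.MULTILINE)

text = """Berlin, 14. Juni
Nachrichten aus der preussischen Hauptstadt folgen hier.
Riga, 20. Juni
Lokale Nachrichten folgen hier."""

# Each match yields a place-date pair that can then be normalised and mapped.
items = [m.groupdict() for m in heading.finditer(text)]
assert [i["place"] for i in items] == ["Berlin", "Riga"]
```

False positives (e.g. a body line that happens to start with a capitalised word and a comma) would still need the manual weeding the speaker described.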
It is possible to compare places mentioned in different years, for example how Berlin is not mentioned much in 1802 but is mentioned a lot by 1860, and also to compare the speed of information exchange over time. News from New York took two months to reach Riga at the start of the period but only a few days by the end. Visualisations can show the impact of technological developments such as the first telegraph connection and the first railway line. The speaker also discussed how the Napoleonic wars affected news from London due to continental blockades.
The speaker then discussed the Crimean war (1853-1856), which was fought between Russia and the Ottoman empire, the latter supported by Britain and France. It was the first ‘modern’ war – the first with war reporters, telegraph lines and media coverage, and in Russia the news was heavily censored. The speaker applied topic modelling to the whole corpus using the Top2vec topic modelling algorithm (https://github.com/ddangelov/Top2Vec), which identified 385 topics in total. The speaker picked out four that matched the Crimean war – war events, navy, recruitment and reinforcements – and looked at 2,702 news items, which demonstrated the rise in news items during the period. War news mainly came from Constantinople, Vienna, London, Paris and Copenhagen – all rival powers of Russia. Hardly any news came from the Russian side (just 12 general items were reported, representing 0.4% of items). Copenhagen was important because warships had to travel past it. This demonstrates that readers in Riga existed in the European news space and raises questions about Riga’s role in supplying the Russian capital (St. Petersburg at this time) with war information. The speaker pointed out that the structure of historical newspapers is well suited to network analysis and that the approach could be replicated across European historical newspapers to show the European / global news network.
The third session was a short-paper session on Stylometry. The first paper was ‘Short texts with fewer authors. Revisiting the boundaries of stylometry’. This looked at the minimum text length that is required for reliable authorship attribution in stylometry. The accepted minimum is about 5,000 words / tokens, but the speaker proposed that 2,000 can work. However, the number of candidate authors needs to be considered (the more possibilities, the more complex the task). The speaker ran tests using text chunks from 200 to 5,000 words using the Stylo R package, and the scripts can be found at https://github.com/SimoneRebora/stylometry_text_length. The speaker randomly created experimental setups for each corpus and the analysis was repeated 20 times for each configuration. An efficiency score was calculated as the proportion of correct attributions. The speaker noted that with two candidates 1,000 words give 75-80% efficiency and there is a plateau at 2,000 words.
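A quick sketch of the chunking step that such experiments rely on (my own illustration in Python, not the speaker’s Stylo-based R code; the word count is invented):

```python
def make_chunks(text, size):
    """Split a text into consecutive, non-overlapping chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

# An invented 5,500-word 'text': long enough for two full 2,000-word chunks,
# with the 1,500-word remainder discarded so partial chunks don't skew results.
text = " ".join(f"word{i}" for i in range(5500))
chunks = make_chunks(text, 2000)
assert len(chunks) == 2
assert all(len(c.split()) == 2000 for c in chunks)
```

Each chunk would then be attributed independently, and the efficiency score is simply the fraction of chunks attributed to the correct candidate author.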
The second paper was ‘Genre Identification and Network Analysis on Modern Chinese Prose Poetry’ and investigated the stylistic independence of Chinese prose poetry, looking at thousands of prose poems published during China’s Republican period (1911-1949). The aim was to differentiate ‘prose poems’ from ‘new poetry’. The speakers identified key stylistic features that play a role in this differentiation. They looked at 1,516 prose poems and 1,925 new poems, using rhythm annotation and tokens to extract ‘pauses’ based on N-grams. They randomly selected samples of 100 from each set to note differences in rhythmic features between prose and new poems. The results showed each genre has its own unique rhythmic patterns: new poetry follows a rhythmic pattern explained by pause patterns between and within sentences. They also looked at structural, semantic, lexical and syntactic features in addition to rhythmic ones. Visualisations were generated using Doc2Vec (https://radimrehurek.com/gensim/models/doc2vec.html).
The third paper was ‘Exploring genderlect markers in a corpus of Nineteenth century Spanish novels’. The speaker stated that there are not many sociolinguistic studies focussed on literary sources and questioned what sociolinguistic knowledge can be gained from literary works. The speaker used pyzeta (https://github.com/cligs/pyzeta) for text analysis, looking at distinct words of the female corpus – emotion words, family relations, body parts, interaction language, more adjectives, spatial words, professional domain, epistemic adverbs and adjectives. The speaker noted that ‘never’ and ‘always’ are more often used by women and that genre needs to be taken into consideration.
The fourth paper was ‘Tracing the invisible translator: stylistic differences in the Dutch translations of the oeuvre of Swedish author Henning Mankell’ and looked at translator style and visibility. The speaker noted that very little attention has been paid to the personal style of translators, as style is seen as a characteristic of the original text. The translator is hard to discover using stylometric methods and the speaker tried machine learning approaches using n-grams, but these revealed little about the features that distinguish different translators. The speaker used delta procedures, but these proved better at author attribution than at finding translators. The speaker then used zeta analysis as another possibility to investigate translators’ style, looking at Henning Mankell’s novels, including the Wallander books but also other novels, in translation from Swedish to Dutch. The speaker used 32 Swedish novels and 32 Dutch translations in 4 genres – crime, literary, children’s and non-fiction – plus some non-Mankell novels for comparison. The speaker used Bootstrap consensus analysis (see https://computationalstylistics.github.io/projects/bootstrap-networks/) to cluster the books. There did seem to be clusters by translator, but a similar clustering appears in the original Swedish, so any clustering was not down to the translator. Zeta analysis compared one translator to all others in one genre to identify the most distinctive features – a preference for certain words, or the use of old / archaic words. The speaker stated that zeta analysis can unveil individual style differences in translators, as can mean sentence length and lexical diversity, but that we also need to consider that the style of translators can change over time.
The final paper of the session was ‘On Burgundian (di)vine orators and other impostors: Stylometry of Late Medieval Rhetoricians’. The speakers stated that author names are often not given in the Middle Ages and that anonymity is a hallmark of heraldic literature. Authors often wrote using generic pseudonyms (regions, mottos, wordplays); for example many people identified as ‘golden fleece’. The speakers looked at anonymous poems by ‘Luxembourg the herald’ and others published under a different name to see whether they could be the same author. There are over 60 poems, mostly unpublished and very short, and there is a scarcity of edited texts. Difficulties include the fact that Middle French includes many exotic spellings and scribal variations; word segmentation is also an issue, as is the extensive use of ‘tres’ with adjectives and rare contractions. The speakers undertook annotation, lemmatisation and POS-tagging using SuperStyl (https://github.com/SupervisedStylometry/SuperStyl).
The final session I attended on Day 1 was the short-paper session Correspondence and Networks. The first paper was ‘Growing and Pruning the Republic of Letters: An Agent-Based Model to Build Letter Correspondence Networks’ and focussed on the use of computational methods to study correspondence networks using the Republic of Letters data in the Netherlands. Most data was concentrated in The Hague and involved a few people, and the speaker used it to create a benchmark to apply to other collections of letters. The speaker developed an ‘Agent Based Model’ (ABM) where agents are the people who send and receive letters, using an ‘ego-reinforcing random walk’ to determine who is likely to send letters to whom. This is based on reinforcement – when one letter is sent more are likely to follow. The output is a social network complete with topic modelling. The speaker discussed ‘growing and pruning’ the dataset: deleting letters at random and seeing what difference this makes to the model. The speaker stated that the network structure is resilient to deletions, with an ABM giving both flexibility and robustness. Experiments showed that longer ties are less robust (e.g. people further away geographically) and preserving letters from the top 5 people results in the model becoming more centralised. Heterogeneity is a key feature of ABM and the model can be used to understand what happens in a letter collection. The tools used to create the model are available at https://scientificcommunication.readthedocs.io/en/latest/.
The second paper was ‘Digital Prosopography and Global Irish Networks’ and discussed the Clericus project (https://clericus.ie/) which consists of a Neo4J graph database featuring around 25,000 biographical entries and 50,000 relations. The speaker stated that it began with migration networks containing large quantities of biographical information, much of it unstructured, and was also set up to preserve student records at Maynooth College.
The third paper also discussed a graph database and was titled ‘Towards a Dynamic Knowledge Graph of a Non-Western Book Tradition’. It focussed on the in-development Bibliotheca Arabica (https://khizana.saw-leipzig.de/) that contains more than 1,200 manuscripts and records of more than 100,000 people as linked open data. The project is integrating heterogeneous records with many different file formats and structures, often featuring different languages that are not marked, different alphabets and different directions of text. The speaker gave an overview of the project’s integration workflow, where data layer 1 is the raw data, layer 2 is the transformed data and layer 3 is the graph that is modelled. The speaker referred to ‘factoids’, which are individual statements, and discussed topic maps in the data model before discussing how authority and disputed authorship attributions were managed via data-driven identity management. The speaker pointed out that dealing programmatically with non-Western data is challenging, for example there can be very different styles of names and non-Western calendars are used. The speaker also noted that scalability is an issue with an ever-growing graph and that the project is using caching and a Lucene-based index.
The fourth paper was ‘correspSearch v2.2 – Search historical correspondence’ and concerned the https://correspsearch.net/en/home.html website that was first developed in 2014 to search historical letters. Personally I find ‘correspSearch’ to be a very clunky name – even ‘correspoSearch’ would have flowed more pleasingly. The resource aggregates metadata from historical letters across printed and digital editions, based on metadata that is harvested from providers’ websites. The project uses a TEI-based correspondence format called ‘CMIF’ that the speaker pronounced ‘Smif’ (see https://encoding-correspondence.bbaw.de/v1/CMIF.html). The project holds metadata from over 220,000 letters from 60 data providers and letters are accepted from anywhere, with no limits in time and space. The resource offers faceted searching including occupations and gender, and features a nice timespan visualisation, plus an interesting map-based search to retrieve letters in a given region (you can draw on the map) or pick predetermined areas. The speaker discussed how you can create and update your own CMIF files via a web interface that also validates the data. The project is being extended to store the people and places mentioned in the letters and the language of each letter, to offer full-text search, and to provide a SPARQL endpoint.
The final paper of the day was ‘Let Everything be of Use?: Data Issues in Exploring the Publications and Networks of the Members of the Fruitbearing Society in the VD17’ and it looked at the publications and networks of the Fruitbearing Society in the 17th century using data from the http://www.vd17.de/en/home project. This was an important cultural academy with nearly 900 members focussed on developing and promoting German as a language. Print was the main means through which members communicated and the project looked at how the volume and content of the data change over time. Different classes of members have been identified and the project is researching member publications over time. Publications have also been assigned to genres, but these are very general with ‘occasional literature’ being the most prominent.
I also attended the Poster Session and looked at all of the posters. One poster that particularly interested me was about EVT3 – ‘Edition Visualization Technology’ – which can automatically generate a digital edition from TEI text and images. I’d looked at this before when I was making my https://nme-digital-ode.glasgow.ac.uk interface several years ago and it was interesting to hear more about it. Further information can be found at https://visualizationtechnology.wordpress.com/.
I was also interested in a project that was collecting strike data from digitised historical newspapers. I might try to apply some of the processes identified in this poster to the Edinburgh Gazetteer data (https://edinburghgazetteer.glasgow.ac.uk/the-gazetteer/). These include the Eynollah document layout analysis tool (https://github.com/qurator-spk/eynollah) to segment the pages and the Calamari OCR tool (https://github.com/Calamari-OCR/calamari).
Day 2 of the conference began for me with a long-paper session on Digital Editions. The first paper was titled ‘Proto-editions: Historians and the “Something between digital image and digital scholarly edition”’. The speaker discussed how an image can be text and that it’s our understanding of the symbols that defines whether they are ‘text’ or ‘image’. The speaker discussed whether digital images can be classed as an ‘edition’ or a ‘scholarly edition’, stating that there are many arguments as to why a collection of images should not be treated as a scholarly edition. Text can have so many facets that it needs to be pluralistic and a digital edition of images has to have meaning. Websites that publish images and metadata and nothing else are not digital editions. The speaker made a distinction between a reproduction and an edition and also discussed ‘artificial editions’ generated by AI from images using HTR / OCR.
The second paper was ‘Scholarly Digital Editions: APIs and Reuse Scenarios’ and discussed reusing data from editions. The speaker stated that using and reusing data from editions is not new and has in fact been going on for centuries. With digital editions things change a bit as scholars both produce the code and interpret the documents. The speaker asked what editors can do to facilitate data reuse and mentioned different ways in which data can be accessed, from taking the whole dataset via web scraping or an accessible dump to a selection of the data using APIs or other endpoints such as SPARQL. The speaker noted that APIs have been around for a long time now and have been seen as ‘the future’ for a long time, and also stated that there are various generic APIs that can be applied to digital editions, including Canonical Text Services (CTS), OAI-PMH, the TEI Publisher API, the EventSearch API and TextAPI. Generic or specific APIs (i.e. an API made specifically for a project) offer data in formats such as TEI XML, TXT and CSV, and the endpoints a project offers need to be looked at – both their content and their structure. The speaker then discussed what to reuse and why, suggesting text analysis, incorporating texts into a larger corpus, gazetteers, citing edition data in dictionaries and combining editions. The speaker noted that when citing data from editions in historical dictionaries most still only cite printed editions, often due to technical reasons, for example the issue of stable referencing. References to Goethe’s writings were previously only from printed editions but editors are now beginning to think about linking to digital editions, enabling the linking through to the specific section of a page from the dictionary entry. The speaker stated that the DEAF and the Dictionary of Old Norse Prose are also looking into this.
The speaker concluded by stating that it is important to plan in advance for the reuse of data and to provide data in multiple formats from multiple access points. Including external persistent identifiers is a good thing, as is providing documentation.
The third paper in the session was titled ‘Digital Edition of Complete Tolstoy’s Heritage: OCR Crowd Sourcing Initiative, Literary Scholarship and User Scenarios’ and the speaker gave an overview of the 90-volume complete edition of Tolstoy’s works. The project had access to Tolstoy’s private library and notes, including photos and audio recordings, and the 90 volumes, published between 1928 and 1958, contained 46,820 pages of text. The project undertook an OCR initiative to digitise all 90 volumes and these were proofread via crowdsourcing. The project involved around 3,000 volunteers and resulted in HTML files that were very messy. Each volume was treated separately and there was no index. The Tolstoy digital initiative is producing a diplomatic digital edition, a digital archive and database, and a website for public access for non-academics. When asked about the success of the crowdsourcing initiative the speaker stated that the project had lots of publicity and motivations for volunteers, including free software. There were around 3,000 volunteers with 30-40 forming the core.
The second session I attended was also about Digital Editions, this time short papers. The first paper had the interesting title ‘“With the 5ame name and adrvocation of S.Juan there is another one, in the sámeprovince”- towards a digital edition of the historical-geographical dictionary of the Indies by Antonio de Alcedo’ and, as you might expect, looked at OCR errors and the use of language and how to identify place-names. The dictionary source text features five volumes and 3,000 pages containing a geographical dictionary / gazetteer of placenames. These are arranged from A to Z, not by coordinates. It is an important reference work that was regularly cited, with an English translation in 1812 and an atlas built using the data. The project performed OCR on the text, then undertook semi-automatic correction using regex patterns that managed to fix 85% of the errors, but also introduced a few new errors. The project performed structural analysis on the pages, identifying 15,000 entries. These were manually corrected using Scripto and Omeka S. Entries were annotated using CATMA (https://catma.de/) to identify concepts, toponyms and settlement types. The text was then converted into TEI P5 and entries needed to be disambiguated – for example there are more than 60 different ‘Santiago’ names in the data.
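To illustrate what semi-automatic regex correction of OCR output looks like in practice – the two patterns and the sample text below are my own invented examples of the kind of rules such a project might use, not the project’s actual rules:

```python
import re

# Hypothetical correction patterns for common OCR confusions
# (illustrative only, not taken from the actual edition project)
PATTERNS = [
    (re.compile(r"5(?=[a-z])"), "s"),     # '5ame'        -> 'same'
    (re.compile(r"rv(?=ocation)"), "v"),  # 'adrvocation' -> 'advocation'
]

def correct(text):
    """Apply each regex substitution in turn to the OCR output."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

fixed = correct("With the 5ame name and adrvocation of S.Juan")
```

The caveat the speaker mentioned applies here too: broad patterns like the first one can introduce new errors (it would also ‘correct’ a legitimate ‘5a’ in a reference number), which is why a manual correction pass followed.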
The second paper was titled ‘How to Be Non-Assertive in the ‘Assertive Edition’: Encoding Doubt in the Auden Musulin Papers’ and discussed digitally editing the Austrian papers of WH Auden (1907-1973), who lived and worked in Austria in his later years, from 1958-1973. The research was dual purpose – both into the poet’s text but also to gather historical / biographical information. The speaker stated that an ‘assertive edition’ addresses the needs of both textual scholars and historians. The project undertook in-depth markup of textual features in TEI XML and also recorded the ‘facts’ in rigorously structured data modelled in RDF from the TEI. The project used TEI to mark up facts using the <event> element, allowing for approximate timespans, noting communication type (e.g. telegram) and marking up people mentioned. People were recorded with IDs and the RDF data used CIDOC CRM (https://cidoc-crm.org/). The speakers noted that it is sometimes unclear who a person referred to is and this is handled with certainty tags. In RDF the project used CRMinf (https://www.cidoc-crm.org/crminf/home-4) to model uncertainty.
The third paper was ‘Jacob Bernoulli’s Reisbüchlein: an RDF-star-based Edition’. Jacob Bernoulli was a Swiss mathematician who travelled across Europe and kept journals, and the project focusses on his journals rather than any mathematical works. These include details about sightseeing, meeting people, excursions, giving lectures and teaching, including details about itineraries, accommodation and transportation. The data was gathered via typewritten transcripts that were fed through Transkribus (https://readcoop.eu/transkribus/) and converted to XML. The speaker created an RDF-star based edition, with RDF-star being an extension of RDF that allows for nested levels of RDF triples, with one triple being a subject or object of another. The project used the ‘trip’ ontology (https://enterpriseintegrationlab.github.io/icity/Trip/doc/index-en.html) which is an ontology for travel data in RDF-star. This allows journeys, subjourneys, stays, locations and accommodation to be encoded, including the cost of the trip relating to stays at particular accommodation. The RDF data can be queried to show the exact route of a journey and the speaker demonstrated some SPARQL queries that the speaker stated were ‘very simple’ but actually looked horribly complex. The speaker also identified people and locations via Named Entity Recognition through Python.
The fourth paper was titled ‘Building a digital edition from archived social media content’ and discussed a project that is part of the https://www.c21editions.org/ project. The speaker discussed born-digital content, such as Twitter feeds, and forming a digital edition featuring a selection curated by editors. The speaker asked what critical representation and historical documentation are when the data is born digital – there are many versions and no definitive original. Content is ephemeral and changes and can be accessed through different channels. For example, Instagram in a browser doesn’t show ‘likes’ but when accessed via the API it does. Editors need to be able to describe sources, layer multiple sources and different versions, and combine them, being as specific as necessary and as generic as possible. The project used an XML-based schema to account for different sources using <source> and <sourceDesc> and required elements for describing the post and its metadata plus the network of the post (likes, comments, views). The project used the Instagram posts of an Irish poet as a prototype, including photographs of landscapes and editorial additions to show how this relates to her poetry.
The final paper in the session was ‘The digital edition as a nexus of documents and data for historical research: the example of the Imperial Diet records of 1576’ which discussed the records of the Imperial Diet of 1576, whose delegates convened in Regensburg from all over the Holy Roman Empire, forming a pre-modern parliament to decide taxes, laws and the economy. Hundreds of scribes made thousands of documents to record this and while there have been previous printed editions the project is creating a digital edition in TEI XML, including transcriptions of the documents and the critical apparatus. It also includes information about some documents that were not selected for transcription (proto-editions). The project uses an ontology of pre-modern parliamentary communication, making it an ‘assertive edition’. The speaker discussed different date categories that were used, including registration dates for certain individuals, and dates documents were issued, presented or read. This enables researchers to look at chancery practices or postal routes, and logged days are recorded in the minutes so can be used, along with dates mentioned in the transcribed text. The project manually compiled lists of meetings (e.g. unofficial ones) to compare these to the logged dates and explored visual representations of dates. The data came from 43 archives and consisted of more than 10,000 documents, with 4,500 pages edited to form the core data. From this 10,000 dates were extracted and these were visualised on a timeline with date categories and different colours to show numbers. The project also developed a heatmap with weeks of the diet and days of the week, with colours showing the number of recordings in the minutes on each day.
The third session I attended was a long-paper session about Handwritten Text Recognition, which has come on a long way in the past few years. The first paper was ‘A speculative design for future handwritten text recognition: HTR use, and its impact on historical research and the digital record.’ The speaker stated that HTR has now been proven and works in a variety of formats, and discussed the social, ethical and technical issues that may affect research over the next ten years. The speaker pointed out that it used to be that dealing with handwritten text was costly and error prone, but there are now several established and successful HTR solutions that can be used, the most successful being Transkribus (https://readcoop.eu/transkribus/). The speaker also mentioned other HTR tools such as Monk, Cortex and Scriptorium but I was unable to find much online about these (Scriptorium is probably eScriptorium – https://traces6.paris.inria.fr). The speaker stated that HTR is already a product of machine learning and there will be further integration with advanced AI, for example Transkribus is already looking at integrating ChatGPT for correcting transcriptions. The speaker stated that we need to be critical and mindful of bias and discussed HTR and legal frameworks and how public bodies need to consider copyright and data usage. HTR will complicate the ownership of data and removing human moderators can be risky, for example sensitivity reviews may be required. The speaker also stated that HTR contributions need to be properly credited, including model training, editing and correcting. The speaker also discussed the environmental costs of HTR, with the process requiring a lot of energy.
The speaker stated that HTR can use as much energy in 24 hours as a transatlantic flight (however, I think a more sensible comparison would be to compare energy usage by HTR with the energy that would be used for a human to undertake the same work, including heating, lighting and computer power – I suspect HTR would actually work out more energy efficient). The speaker also discussed data ethics and bias and how processing handwritten sources at scale could bring in more ethical issues and inherent biases.
The second paper was titled ‘Handwritten text recognition applied to the manuscript production of the Carthusian Monastery of Herne in the Fourteenth Century’ and looked at the outputs from the Herne monastery south of Brussels between 1350-1400. The monastery was a hotspot of scribal practice in the Low Countries, producing the first prose Bible translation into a European vernacular language, including lengthy prologues giving commentaries. Outputs included numerous texts on teachings, including unique texts that were saved from being lost. The speaker stated that the texts do not feature self-attributions so we don’t know which scribe did what. The outputs from the monastery included 54 booklets and manuscripts featuring 5,500 folia and around 1.2 million words, and research shows 13 active scribes in the period. There was collaboration between scribes, who reviewed, corrected and copied each other’s work. As the monks had taken a vow of silence they communicated via marginalia, with the speaker stating that this was a bit like a Google Doc today. The speaker stated that there has been a lot of research into scribal identity but this has not been very successful. It relied on palaeographic and codicological features while the current project is shifting away from this and exploring linguistic features. The texts are available in a digital corpus with transcripts that stay close to the sources. 1,200 folia were manually transcribed, including every glyph and symbol, using MUFI medieval Unicode (https://mufi.info/q.php?p=mufi). After manual transcription the researchers could then train Transkribus, which processed the remaining text with an error rate of 2.7 erroneous characters per 100, tested on 133 folios. The speaker wanted to ascertain whether it was a good idea to train a model on material that contains so much variation and to what extent variation affects the accuracy of an HTR model. This was tested on two scribes.
A model trained specifically on Scribe A achieved an error rate of 2.56%, as opposed to 5.72% for the ‘grand’ model. If the model trained on Scribe A was then applied to the work of Scribe B the error rate rose to more than 8%. The speaker pointed out that the same scribe can have different writing modes: ‘high’ (the best), ‘medium’ (a bit more rushed) and ‘low’ (for writing short notes). The speaker trained Transkribus on 15,000 words of each mode. Training and testing on ‘high’ gave 2.64%, while training on ‘high’ and testing on ‘medium’ gave 6.41%. Training 50/50 on ‘high’ and ‘medium’ gave 4.23% when tested on ‘high’. The speaker then discussed the evaluation of HTR models, discussing the CER (character error rate) score. The speaker has created a tool called CERberus that can be used to evaluate models (https://github.com/WHaverals/CERberus).
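A CER score of this kind is essentially the Levenshtein edit distance between the HTR output and the ground-truth transcription, divided by the length of the ground truth. A minimal sketch in Python (my own illustration of the metric, not CERberus itself):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein edit distance between the
    ground-truth transcription and the HTR output, divided by the
    length of the ground truth (2.7 errors per 100 characters = 0.027)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0

# one substitution (v -> u) in a 24-character reference
rate = cer("in principio erat verbum", "in principio erat uerbum")
```

Because insertions and deletions count as well as substitutions, CER can exceed 1.0 on badly segmented lines, which is one reason dedicated tools like CERberus also report aligned views of the errors.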
The third paper was titled ‘Manu McFrench, from zero to hero: impact of using a generic handwriting model for smaller datasets’. The speaker discussed creating a generic handwriting recognition model for French, stating that transcription models are costly to produce because they require training examples. Pre-existing models and data can be used, and the speaker discussed the CREMMA project (https://github.com/HTR-United/cremma-medieval) to produce training data for French and Latin manuscripts from the 9th to the 21st centuries. The speaker’s project created a generic model so users didn’t have to start their transcriptions from zero. This reused existing datasets, such as HTR-United, linked to above, reusing existing public digital or printed editions and creating their own datasets. Existing editions needed to be diplomatic texts with no correction of text, no abbreviation resolution and no text normalisation, because otherwise it’s not possible to align the text with the source images. A few existing datasets could be used with or without adaptation and the images and text from these were run through eScriptorium. The project also got people to create their own handwriting extracts using crowdsourcing. Around 200 people were involved and they had to write out random pages from Wikipedia. Kraken (https://kraken.re/main/index.html) was used to train the model, with an accuracy of 90%. The trained model could then be applied to other texts.
The final session that I attended on the second day was Networks and Graphs, and the first paper was ‘Fragmentation and Disruption: Ranking Cut-Points in Social Networks, a Case Study on Epistolary Networks at the Court of Henry VIII’. The speaker discussed how, in network terms, Thomas Cromwell was a cut-point in the Tudor Court: when he was removed he cut off one part of the network from another. Such cut-points are nodes that, when removed, alter the topology of the network graph. The speaker investigated how to rank nodes and work out their importance in affecting the network, although as I’m not an expert in network theory most of the talk went completely over my head so I can’t really say any more about it.
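Cut-points (articulation points, in graph-theory terms) can be found with a standard depth-first search; the toy ‘court’ network below is my own invention to illustrate the idea, not data from the talk:

```python
def cut_points(graph):
    """Articulation points of an undirected graph (adjacency dict):
    nodes whose removal disconnects part of the network."""
    disc, low, points = {}, {}, set()
    timer = [0]

    def dfs(node, parent):
        disc[node] = low[node] = timer[0]
        timer[0] += 1
        children = 0
        for neighbour in graph[node]:
            if neighbour == parent:
                continue
            if neighbour in disc:
                low[node] = min(low[node], disc[neighbour])  # back edge
            else:
                children += 1
                dfs(neighbour, node)
                low[node] = min(low[node], low[neighbour])
                # non-root: a child subtree that cannot reach above 'node'
                if parent is not None and low[neighbour] >= disc[node]:
                    points.add(node)
        # the DFS root is a cut-point iff it has 2+ DFS children
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points

# Invented toy court network: Cromwell bridges two clusters
court = {
    "Henry": ["Cromwell", "Norfolk"],
    "Norfolk": ["Henry", "Cromwell"],
    "Cromwell": ["Henry", "Norfolk", "Wriothesley", "Sadler"],
    "Wriothesley": ["Cromwell", "Sadler"],
    "Sadler": ["Cromwell", "Wriothesley"],
}
bridges = cut_points(court)
```

Finding the cut-points is the easy part; ranking them by how badly their removal fragments the network, which is what the paper addressed, requires further measures beyond this sketch.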
The second paper was titled ‘Exploring the Evolution of Curatorial Diversity: a Methodological Framework with a Case Study of Book Reviews’ and discussed the analysis of a magazine called ‘Boletín Cultural y Bibliográfico’ and whether curatorial diversity could be seen over the years, looking at the level of difference in terms of the actors who participate (reviewers and authors) and also the cultural objects – books being reviewed and the reviews themselves. The BCB was published from 1958 onwards and had involved more indigenous communities over time. In terms of curatorial diversity the speaker was primarily looking at gender balance. In the first 20 years most content of the magazine was written by just 3 men with similar backgrounds. Over the years this expanded and included women. The speaker has generated a network scheme and selected independent (time scale) and dependent (dimensions of curatorial diversity) variables, but the research is still ongoing. The dataset is 3,382 book reviews from 1984 onwards. Before this the data is unreliable as authors were not properly attributed and many reviews were joined together (for example the data seems to include a 30-page review but it is really many individual ones). The research is looking at things such as reviewer gender, author gender, gender homophily, place of publication of the book reviewed, publisher, and topic of book. The speaker stated that many authors and reviewers are not well known and don’t even have Wikipedia pages. A huge gender disparity has existed over the years, with about 30% women reviewers at best and around the same for authors. The speaker stated that there is no trend of increased diversity over time. The speaker mentioned using a ‘Shannon index’ value from the field of biodiversity studies. In terms of publication place, more than 60% of books are published in Bogota and there has not been any change over the years.
Publishing is very centralised and there has actually been a reduction in diversity of place in recent years.
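The Shannon index borrowed from biodiversity studies is straightforward to compute; here is a small sketch with invented place-of-publication figures (the 70/20/10 split is my own illustration, not the project’s data):

```python
import math
from collections import Counter

def shannon_index(labels):
    """Shannon diversity index H = -sum(p_i * ln p_i), used here to
    measure, e.g., diversity of publication places. Higher = more diverse;
    0 means everything comes from a single class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Invented data: one city dominating, as with Bogota in the talk
places = ["Bogota"] * 7 + ["Medellin"] * 2 + ["Cali"]
h_skewed = shannon_index(places)
h_even = shannon_index(["Bogota", "Medellin", "Cali"])  # maximal for 3 classes
```

Tracking H year by year gives exactly the kind of diversity trend line the speaker described: a centralised publishing landscape keeps the index low even when the raw number of places grows.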
The final paper of the day was ‘Word2Vec-Based Literary Networks – Challenges and Opportunities’. The speaker discussed translating a literary work into a network and the challenges inherent in this, for example seeing the communication system as a whole and the lack of temporality, with all connections visible at once. The speaker stated that with literary networks we cannot see everything, for example what brings two characters together and why is difficult to represent, as are indirect connections and connections mediated by figures outside the novel. In plays a co-occurrence network can be established: if characters appear in the same scene they are said to be connected. The speaker gave the example of the network visualisations for drama that are available at https://dracor.org/. For novels the unit is the sentence, or direct communications between characters can be counted to generate conversational networks, but this is more difficult to extract automatically. The speaker explained how Word2Vec (e.g. https://www.tensorflow.org/tutorials/text/word2vec) can be used to identify hidden connections between words, e.g. milk and yoghurt, or king and queen, and wanted to investigate whether it could recognise links between literary characters in the same way. The speaker discovered that it can, generating a semantic network, but the results are biased by characters appearing in the same section of text. The speaker looked at the works of Amos Oz. There are 25 novels, available in an archive, and as he is a highly translated Israeli writer there are many English translations. For each translated novel, after pre-processing, the project automatically created a co-occurrence network, automatically created a Word2Vec semantic network and manually created a conversational network. The project generated network graphs showing semantic connections, but it wasn’t clear how the visualisations should be read.
The speaker wished to combine the semantic networks and co-occurrence network for each novel and to visualise these as a heatmap – the more red the stronger the semantic connection, the more blue the stronger the co-occurrence connection.
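The co-occurrence side of this is easy to sketch: two characters are connected, with a weighted edge, whenever they appear in the same unit (scene or sentence). The sentences, character names and crude substring matching below are my own invented illustration – a real pipeline would use proper named entity recognition:

```python
from itertools import combinations

def cooccurrence_network(units, characters):
    """Build a weighted character co-occurrence network: an edge between
    two characters counts the units (scenes/sentences) they share."""
    edges = {}
    for unit in units:
        present = [c for c in characters if c in unit]  # crude matching
        for a, b in combinations(sorted(present), 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Invented sentences standing in for a novel's text
sentences = [
    "Hannah spoke quietly to Michael",
    "Michael and Hannah walked to the kibbutz",
    "Fima sat alone",
    "Fima wrote to Hannah",
]
network = cooccurrence_network(sentences, ["Hannah", "Michael", "Fima"])
```

Overlaying edge weights from a network like this with Word2Vec similarity scores between character names is one simple way to build the combined heatmap view the speaker described.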
I began Day 3 of the conference with a long-paper session on Machine Learning, beginning with a paper titled ‘Transformer-Based Named Entity Recognition for Ancient Greek’. This paper focussed on a new workflow for Named Entity Recognition in Ancient Greek. NER has been developed for many modern languages but ancient languages lack the NLP infrastructure, making it difficult to extract and identify named entities. Ancient Greek doesn’t have as many resources as modern languages and there aren’t many authority lists and training datasets. The speaker noted previous pipelines such as the Classical Language Toolkit (http://cltk.org/), Digital Athenaeus (https://www.digitalathenaeus.org/) and the Herodotos project for Latin (https://github.com/Herodotos-Project/Herodotos-Project-Latin-NER-Tagger-Annotation). The current project is working on translation alignment in Ancient Greek and other languages like Persian, collecting training data for the alignment model. The alignment workflow includes tokenisation, embeddings extraction, a similarity matrix and alignment extraction using http://ugarit-aligner.com. The speaker stated that with this training model in place the project could then undertake NER using a training dataset of 24,600 Ancient Greek entities. The speaker also stated that further training with English translations and an English NER dataset increased precision. The model is good for single-token entities but not so good for multi-token ones. The project intends to improve this by using text from the New Testament to expand the dataset. The speaker stated that qualitative evaluation based on random verses showed that the rate of correct alignment and correct NER for Ancient Greek was 86%. The speaker concluded by stating that named entity classification is still a big challenge.
The second paper was ‘Using ECCO-BERT and the Historical Thesaurus of English to Explore Concepts and Agency in Historical Writing’ and discussed a new method to determine meanings in historical texts using a context-aware language model with a transformer architecture. As a case study the speaker looked at how the concept of ‘luxury’ changed during the 18th century from a previously corrupting force to something more complex. Luxury in the period is often either seen as a disease or as productive, with the latter increasing over the century. The project used BERT transformer models (https://huggingface.co/blog/bert-101) to detect shifts in historical language, with ECCO-BERT trained on ECCO (https://www.gale.com/intl/primary-sources/eighteenth-century-collections-online). The project combined data in the HTE with a context-sensitive language model to understand meanings in particular contexts. This rests on the assumption that a semantic category can be represented by the words in it – for example ill health represented by sickness. The project extracted senses from ECCO-BERT using senses from the HTE. They removed the seed word (luxury) and asked ECCO-BERT to predict the most likely word in the context, for example ‘wealth’, then retrieved the HTE categories for these, generating a sum of the probabilities of predicted words returned for each category. The project applied this to each instance of ‘luxury’ in ECCO, extracted the category scores for all instances of luxury, and the aggregate results show the top semantic categories. They identified two important ones, ‘usefulness’ and ‘a disease’ – two opposing views: corrupting vs productive. The speaker showed a bar chart that contained each year of the 18th century. This showed the occurrences of ‘disease’ were fairly flat but there was a marked increase for ‘usefulness’ in the latter half of the century, showing that the prevailing view of luxury became less negative over time.
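The aggregation step – summing the probabilities of predicted words per semantic category – can be sketched as follows. The predicted words, probabilities and category lookup here are invented for illustration; the real project used ECCO-BERT’s masked-word predictions and the actual HTE taxonomy:

```python
# Hypothetical model output: word -> probability for the masked slot
# where 'luxury' was removed from the sentence
predictions = {
    "wealth": 0.30,
    "commerce": 0.20,
    "disease": 0.15,
    "vice": 0.10,
}
# Hypothetical category lookup: word -> HTE-style semantic categories
hte_categories = {
    "wealth": ["usefulness"],
    "commerce": ["usefulness"],
    "disease": ["a disease"],
    "vice": ["a disease"],
}

def category_scores(predictions, lookup):
    """Sum predicted-word probabilities per semantic category, giving a
    score for how strongly each sense fits this occurrence of the word."""
    scores = {}
    for word, prob in predictions.items():
        for category in lookup.get(word, []):
            scores[category] = scores.get(category, 0.0) + prob
    return scores

scores = category_scores(predictions, hte_categories)
```

Repeating this for every occurrence of ‘luxury’ and aggregating the scores per year is what produces the trend lines the speaker showed for ‘usefulness’ versus ‘a disease’.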
The third paper in the session was ‘Machine Learning and Digital Classical Chinese Texts: Collaboration between the UC Computing Platform and Peking University’s Big-Data databases.’ This was introduced as being about discovering the history of food in China using a large (200 million character) corpus, but in reality most of the talk discussed the technical infrastructure that was used. This is the Nautilus platform (https://nationalresearchplatform.org/nautilus/), which is a distributed hyper-converged cluster that is free to use. It runs on Kubernetes (https://kubernetes.io/) with software deployed in containers. The infrastructure spans the globe – mostly in the US but some in the rest of the world – and currently has 1,228 installed GPUs, 20,000 CPUs and 95TB of space. It can deal with very large datasets and includes many collaboration tools. Jupyter notebooks (https://jupyter.org/) can also connect to the platform. The platform is mostly used by researchers in the sciences and the current Chinese project is the only humanities one running on it at the moment.
The second session I attended was a short-paper session on Machine Learning and AI, with the first paper titled ‘A Model of Heaven: Tracing Images of Holiness in a Collection of 63.000 Lantern Slides (1880-1940)’. The speaker explained that lantern slides are glass slides projected through a magic lantern, which is a projecting device. These were used for entertainment and education, and while the lantern was invented in the 17th century it was most popular in the 19th century. The travelogue genre was popular, covering both real and imaginary places, for example heaven, the bottom of the sea and fairy worlds. Lanterns were used by societal groups, for example for propaganda purposes or church services. The slides were heterogeneous and mixed media: photographs, black and white, colour, paintings etc. The speaker had access to a corpus of 63,000 slides, but looked at 3,339 of ‘the orient’ and representations of the Holy Land. From this the speaker picked a random sample of 1,000 slides and investigated the concept of ‘holiness’. Images were clustered according to visual similarity using PixPlot (https://dhlab.yale.edu/projects/pixplot/) but this wasn’t very successful, as many of the images had similar borders which the tool took to indicate similarity. The speaker then used CLIP (https://openai.com/research/clip) to cluster images based on the content of the slides; for example children’s slides are brightly coloured so are grouped together, and etchings are grouped together too, which is useful for religious representations. The speaker plotted concepts on a graph, for example ‘holy’ on one axis and ‘men’ on another. The speaker then demonstrated a tool that can show images returned for topics at https://rom1504.github.io/clip-retrieval/.
The second paper was titled ‘Towards a distant viewing of depicted materials in medieval paintings’ and it discussed the KIKI project (https://www.imareal.sbg.ac.at/en/imareal-projects/how-material-came-into-the-picture-kiki/) which is researching the depiction of materials in medieval paintings using image data from https://realonline.imareal.sbg.ac.at. The project aims to visualise the materials used in each picture, for example the number of objects made from marble, in order to identify clusters. The speaker noted that dry earth clods or craggy rocks appear in most images, and that textures are not always depicted in pictures or can be depicted incorrectly. The example given was of a man holding a saw where the saw itself featured wood grain. The speaker stated that wood textures can take on many forms. The project segmented and annotated images using CVAT (https://www.cvat.ai/) to find wood patches and extract them. A convolutional neural network (https://en.wikipedia.org/wiki/Convolutional_neural_network) was then used to rank extractions from paintings against real wood.
The third paper was ‘Using Multimodal Machine Learning to Distant View the Illustrated World of the Illustrated London News, 1842-1900’ and this project also uses CLIP, as with the previous paper. The speaker stated that the Illustrated London News was the first visual mass medium, featuring illustration rather than photography, and it was very successful in the Victorian world. It can be used to study British identity, the British empire, railways, the sewage system etc. Researchers have had to rely on close reading so far, but there are thousands of images available. The speaker wished to create a distant view of the images – looking for patterns across thousands of images. The problem is that models are trained on modern high-quality photographs and can only recognise a few classes of objects, e.g. a chair but not a carriage. The project used data from two issues per year, between 1842 and 1900, totalling 1,151 pages. The project used the Newspaper Navigator tool (https://github.com/LibraryOfCongress/newspaper-navigator) to extract images from the newspapers, and CLIP was then applied to them. This allows the images to be queried, for example searching for ‘horse’ or ‘steamship’ returns all matching images. It is also possible to start with an image query, for example passing the tool an image of a horse. The speaker noted that it’s also possible to find reprints of the same image, and users can visualise the images to show clusters, such as clusters of portraits or header images. The project is aiming to produce an open-access dataset with a visual search engine.
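The querying described above works because CLIP places text and images in one shared embedding space, so retrieval reduces to a cosine-similarity search. The following is a minimal sketch of that idea using invented placeholder vectors rather than real CLIP outputs:

```python
import numpy as np

def search(query_vec, image_vecs):
    """Rank images by cosine similarity to a query embedding.

    With CLIP, text and image embeddings share one vector space, so a
    text query like 'horse' can be compared directly against image
    embeddings. The vectors used here are toy placeholders.
    """
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ q                    # cosine similarity per image
    return np.argsort(sims)[::-1]     # indices, best match first

# Toy embeddings: three "images" in a 3-dimensional space.
images = np.array([[1.0, 0.1, 0.0],   # image 0: closest to the query
                   [0.0, 1.0, 0.2],   # image 1
                   [0.1, 0.0, 1.0]])  # image 2
query = np.array([0.9, 0.1, 0.1])     # stand-in for a text embedding
print(search(query, images))          # image 0 ranks first
```

The same function serves image-to-image queries too: passing an image embedding as `query_vec` finds visually similar images, which is how reprints of the same illustration can be surfaced.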
The fourth paper was titled ‘Probabilistic Modeling of Chronological Dates to Serve Machines and Scholars’. The speaker stated that dates are a fundamental part of our data and was interested in how dates can be inferred from data. The speaker looked at 500,000 diplomatic charter documents and extracted dates in different formats. I had a bit of trouble following this speaker so I can’t really say much more.
The final paper in the session was titled ‘They’re veGAN but they almost taste the same: generating simili-manuscripts with artificial intelligence’ and the speaker described generating artificial manuscripts. This is useful for data augmentation when training HTR (handwritten text recognition) and also for the automatic restoration of damaged manuscripts. The tool created uses a Generative Adversarial Network (see https://en.wikipedia.org/wiki/Generative_adversarial_network) and the ScrabbleGAN tool (https://towardsdatascience.com/scrabblegan-adversarial-generation-of-handwritten-text-images-628f8edcfeed) to create the images. The tool is trained on real HTR examples, with lines split at character level. These are fed into the generator, which learns to generate realistic images, with a discriminator checking how realistic each image looks for evaluation. The generated manuscripts can help HTR tools to be trained faster. The speaker gave the example of Armenian texts: text from chronicles was used to create fake chronicle images, which could then be used to train HTR before applying it to a bible text, resulting in a more than 10% improvement in accuracy. The project is also regenerating burnt manuscripts – manuscripts from Turin that were burnt in 1904.
The third and final parallel session of the conference that I attended was on Text Mining. I’d originally chosen this session because I was interested in the paper titled ‘Word Constellations – An Interactive Display of Distributed Semantics in the Gamergate Phenomenon’ but unfortunately the speaker didn’t turn up for the session, so we progressed straight to the second paper, titled ‘Computing Angel Names in Jewish Magic’. The project is using a collection of Jewish magical texts that started out as a fully manual project before migrating to digital tools. The corpus is the Cairo Genizah, which contains 300,000 manuscript fragments including 2,500 magical and mystical manuscript fragments. These were stored in the attic of a synagogue in Cairo and contain material from the 10th century to the modern period (late 19th century). The fragments contain instruction manuals for performing magic rituals and textual amulets on paper, parchment and cloth. They are written in Hebrew, Aramaic and Judaeo-Arabic (Arabic written in Hebrew script) and some Arabic. Some fragments are almost complete but others may only be half a page (for example containing holes). The speaker stated that the project is looking at the importance of angelic beings contained in the texts. The project has identified over 1,000 angel names in the fragments, many of which appear only once. These can sometimes be identified because they end in ‘el’ or ‘yah’ – Nuriel and Taftafiyah, for example – and others are well known, such as Gabriel and Raphael. In other cases they can be recognised by their context (e.g. ‘I adjure you, the holy angels…’). This has allowed the project to identify other angel names that might not otherwise be recognised. The speaker said that in addition some words are circled or overlined in the manuscripts and these are names of supernatural entities, and therefore angel names can sometimes be identified graphically. The speaker noted that all of this is simple for a human to do but more difficult for a computer.
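The suffix heuristic described above is easy to sketch in code: treat any word ending in ‘el’ or ‘yah’ as a candidate angel name, to be validated by a human afterwards. The transliterated word list below is invented purely for illustration (the real fragments are in Hebrew script):

```python
# Candidate detection by suffix, as described by the speaker.
# Matches are only candidates: false positives still need manual review.
ANGELIC_SUFFIXES = ('el', 'yah')

def candidate_angel_names(words):
    """Return words whose ending suggests an angelic name."""
    return [w for w in words if w.lower().endswith(ANGELIC_SUFFIXES)]

words = ['nuriel', 'taftafiyah', 'gabriel', 'amulet', 'ritual']
print(candidate_angel_names(words))   # ['nuriel', 'taftafiyah', 'gabriel']
```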
The project hasn’t attempted a graphical approach yet but they have established a database of angel names and determined some relationships between angels or angelic groups. The project wants to investigate whether there are consistent connections between the aims of the rituals and the angels invoked, for example certain angels being associated with healing, giving birth or finding treasure. The speaker noted that the languages are written right to left, which can cause problems for computers, and also that there are no vowels in the way Western languages use them, so words with the same spelling can have multiple meanings. Prefixes and suffixes are used: for example the prefix ‘le’ can be attached to mean ‘to’, so ‘leGabriel’ means ‘to Gabriel’, but other names legitimately begin with ‘le’, e.g. ‘lehaqiel’. The speaker stated that the issue of right-to-left languages for computers has mostly been solved now, using Python libraries such as BiDi (https://python-bidi.readthedocs.io/en/latest/), but identifying angel names requires manual validation and a contextual approach. The project developed a pipeline for the data but encountered unexpected difficulties. It is working with transcriptions made in the 1990s by highly learned scholars, but these were created as Word files without any attention to consistent metadata, and there was no clear distinction between introductions and the source material, so a student needed to split these up manually. Of the 2,500 manuscripts the project had about 1,000 left after this, consisting of around 228,000 words. They generated lists of angel names based on endings, with false positives stripped out manually. For magic recipes there were too few images for image recognition to work, and the project attempted Named Entity Recognition using spaCy (https://spacy.io/) but didn’t have much luck.
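The ‘le’ prefix ambiguity mentioned above is a nice illustration of why a lexicon and manual validation are needed: stripping the prefix blindly would mangle names that genuinely begin with ‘le’. A minimal sketch, using an invented lexicon of transliterated names (not the project’s actual data):

```python
# Resolving the 'le-' ('to') prefix ambiguity described above.
# KNOWN_NAMES is an invented illustrative lexicon.
KNOWN_NAMES = {'gabriel', 'raphael', 'nuriel', 'lehaqiel'}

def normalise(token):
    """Resolve a token to a known name, stripping 'le-' only if needed."""
    token = token.lower()
    if token in KNOWN_NAMES:                       # 'lehaqiel' is a name itself
        return token
    if token.startswith('le') and token[2:] in KNOWN_NAMES:
        return token[2:]                           # 'legabriel' -> 'gabriel'
    return None                                    # not recognised

print(normalise('legabriel'))   # gabriel
print(normalise('lehaqiel'))    # lehaqiel
```

The same lookup-first logic generalises to other prefixes and suffixes, but as the speaker noted, ambiguous cases still end up needing a human decision.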
In the end the project did everything manually, but now that the work is complete it is possible to do interesting things such as comparing the corpus with others (for example searching for angel names in the Sefer ha-Razim corpus for comparison), looking at other magic texts, deciphering mysterious acronyms found in the texts and looking at how the angel names have been used since, such as their presence on the internet. The project also created Venn diagrams showing the overlap of angel names in different corpora and created an acronym finder that runs through the Hebrew bible and tries to identify the biblical verses that correspond to each acronym.
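The acronym finder described above can be sketched very simply: take the first letter of each word in a verse and compare the result against the acronym. The verses below are invented English placeholders; the project’s actual tool works over the Hebrew bible:

```python
# Minimal sketch of an acronym finder: match an acronym against the
# word-initials of each verse. Verses here are illustrative only.
def initials(verse):
    """The first letter of each word, lower-cased and concatenated."""
    return ''.join(word[0] for word in verse.split()).lower()

def find_matches(acronym, verses):
    """Return verses whose word-initials spell out the acronym."""
    return [v for v in verses if initials(v) == acronym.lower()]

verses = ['In the beginning', 'And God said', 'Let there be light']
print(find_matches('itb', verses))   # ['In the beginning']
```

A real implementation would also need to handle partial-verse matches and the spelling variation discussed earlier, which is presumably where the manual checking comes back in.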
The final paper of the session, the day and the conference was ‘Word Clouds with Spatial Stable Word Positions across Multiple Text Witnesses’. The speaker stated that static word clouds show the supposedly most relevant words of a text according to some measure, while active word clouds allow you to (for example) pick two dates and see the most relevant words for that period displayed. The speaker discussed the creation of a tool called LERA (https://lera.uzi.uni-halle.de/) that features a timeline view and generates word clouds for the selected segments. The timeline can span the start and end of the text, enabling the selection of paragraphs, for example. Such a tool needs to process data quickly and must feature multiple windows with the top words in each. The speaker discussed the issue of the position and size of a selected word changing when the period is updated: a word may no longer be present in a different window, but the user has to look around to find this out. The LERA tool was developed to allow for spatial stability. Initial versions reserved a position for every word in a text, but this meant the areas were too large. Therefore the speaker created groups of words that never occur together, so these can share the same location: the group is placed once and the actual word displayed can be swapped out as required. The speaker showed an example text from Rumpelstiltskin featuring the words miller, daughter, king and gold. The speaker constructed a table with a line for each segment, inserting a 1 if the word occurs in the segment and a 0 if not. From this the tool can create groups of words that never appear in the same segment. The speaker stated that the layout uses the Wordle (the word cloud, not the word game) layout: a spiral with a place for each word, and the tool places a group in each location. It is then possible to get the tool to work with multiple texts by comparing the tabular outputs of each text.
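The grouping step described above can be sketched as a greedy pass over the occurrence table: each word carries the set of segments it appears in, and words whose sets never overlap can share one fixed slot in the cloud. The miniature table below is invented, loosely based on the Rumpelstiltskin example (the numbers are segment indices):

```python
# Invented occurrence data: which segments each word appears in.
occurrences = {
    'miller':   {0},
    'daughter': {0, 1},
    'king':     {1, 2},
    'gold':     {2, 3},
}

def group_disjoint(occ):
    """Greedily group words whose segment sets are pairwise disjoint.

    Words in one group never appear in the same segment, so they can
    all be assigned the same spatial position in the word cloud and
    swapped in and out as the user changes the selected segment.
    """
    groups = []   # each entry: (segments covered so far, words in group)
    for word, segs in occ.items():
        for covered, words in groups:
            if not covered & segs:       # no shared segment: reuse this slot
                covered |= segs
                words.append(word)
                break
        else:
            groups.append((set(segs), [word]))
    return [words for _, words in groups]

print(group_disjoint(occurrences))
# [['miller', 'king'], ['daughter', 'gold']]
```

Here ‘miller’ and ‘king’ never co-occur in a segment, so they can share one position on the spiral; likewise ‘daughter’ and ‘gold’. A greedy pass is only one possible strategy (finding the minimum number of groups is a graph-colouring problem), but it captures the idea of the spatially stable layout.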
After this session there was a final plenary session and the closing ceremony, but I didn’t attend these. It has been a useful conference, as always, and I’m really glad I attended. Next week I return to Glasgow and a more regular working week.
I continued to work on the front-end for the Books and Borrowing project this week, and have now completed the migration of the dev site to Bootstrap. There are still some aspects that I would like to tweak further, but on the whole the layout is now much improved on all screen sizes. On the page that displays a page from a library register I have updated the navbar so that it positions better on all screen sizes and I updated the image section so that it now has a dark grey background. Text and image panels now work better on all screen sizes and ‘Image view’ now has a larger image viewer.
On the library books page I’ve improved the layout of the ‘change the view’ feature, and when viewing books grouped by author each author now has a blue background to help spot where the divisions between authors are. The layout of the ‘top 100’ icon and the ESTC listing has also been improved:
I also updated the list of borrowers so these are now displayed in a grid with four per row to make better use of the available space:
In the library ‘Facts’ page the ‘summary’ section now has a narrative flow rather than a bullet-point list of figures. The ‘top 10’ lists now appear in columns with up to four per row, and the layout of each list has been improved. The borrower occupations summary is also more of a narrative flow and the two donut charts appear side by side (if there’s room). Similar improvements have been made to the summary text in the other sections:
In the site-wide ‘books’ page the ‘change view’ and ‘limit’ sections now appear side by side and the author view has the same blue backgrounds behind the author names. On the site-wide ‘borrowers’ page the ‘Limit the view’ section layout has been overhauled, giving it a two-column layout and ensuring it takes up much less space. It is still rather large, though; we could potentially hide it until the user chooses to open it. The list of borrowers is grid-based with up to five borrowers per row:
I’ve also massively overhauled the search forms in both the ‘simple’ and ‘advanced’ tabs. They now consist of multi-column displays grouped by data type. Tooltips are used where help information is included. Dotted lines are used to divide different types of data. Library registers with zero records are now excluded from the list:
There is still a fair amount to be done, including implementing book genre, sorting out the highlighting in search results, investigating some situations where the year bar chart in the search results doesn’t display properly, adding in ‘cite this…’ options, adding in an ‘On this day’ feature and a ‘download data as CSV’ feature, updating the API index page to add in information about licensing and to ensure all endpoints are listed with good examples, updating the API to ensure CSV output works properly for data that are arrays, integrating the dev site with the live site and ensuring everything still works (e.g. the chambers map), adding in a quick search option and navigation items as required, adding in a cache for the facts and figures so it loads more quickly, and updating the Solr index once work on the data is complete. But we’re getting there!
Also this week I began to write a requirements document for a new date search / filter feature for the DSL website, along with sparklines and POS. I didn’t have time to fully complete a first draft, but I did complete the sections about date searching and filtering, so I sent it on to the DSL team for feedback.
I had to spend a bit of time this week migrating older sites over to external hosting, including the Emblems websites and the Forbes site. For the latter I took the opportunity to upgrade the site as it was still relying on a Flash-based image viewer which would not work in any modern browser. I managed to switch this over to OpenLayers and now the site is working properly again for the first time in a few years (see https://forbes.gla.ac.uk/). I also set up a new website for an upcoming conference for Matthew Creasy. This took a little while to get right and will be launched in August.
For the Anglo-Norman Dictionary I had a lengthy email conversation with Geert about the new language tags and how these should be handled, especially in cases where the entry is a compound word. We managed to reach an agreement about how best to proceed with this and Geert is now going to add many more entries to the spreadsheet that should be given the language tags but did not previously include such a tag. I also spoke some more to Delphine about collaborative XML editors and did a bit more research into the options for these.
I’m on holiday next week and I’m attending the DH2023 conference the week after so I won’t be back in the office until after that.