Article by Madison Leeson: “Cultural heritage researchers often have to sift through a mountain of data related to the cultural items they study, including reports, museum records, news, and databases. These sources contain a significant amount of unstructured and semi-structured data, including ownership histories (‘provenance’), object descriptions, and timelines, which presents an opportunity to leverage automated systems. Recognising the scale and importance of the issue, researchers at the Italian Institute of Technology’s Centre for Cultural Heritage Technology have fine-tuned three natural language processing (NLP) models to distil key information from these unstructured texts. The work was carried out within the scope of the EU-funded RITHMS project, which has built a digital platform for law enforcement to trace illicit cultural goods using social network analysis (SNA). The research team aimed to fill a critical gap: how do we transform complex textual records into clean, structured, analysable data?
The paper introduces a streamlined pipeline to create custom, domain-specific datasets from textual heritage records, then trains and fine-tunes NLP models (built on spaCy) to perform named entity recognition (NER) on challenging inputs like provenance, museum registries, and records of stolen and missing art and artefacts. It evaluates zero-shot models such as GLiNER, and employs Meta’s Llama3 (8B) to bootstrap high-quality annotations, minimising the need for manual labelling. The result? Fine-tuned transformer models (especially on provenance data) significantly outperformed out-of-the-box models, highlighting the power of small, curated training sets in a specialised domain…(More)
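To make the zero-shot baseline concrete, here is a minimal sketch of how GLiNER can be queried with free-form entity labels and no task-specific training. The checkpoint name, label set, and sample provenance text are illustrative assumptions, not details taken from the paper.

```python
# Zero-shot NER with GLiNER: entity labels are supplied at inference time.
# pip install gliner
from gliner import GLiNER

# Illustrative checkpoint; the paper's exact model choice may differ.
model = GLiNER.from_pretrained("urchade/gliner_base")

# A provenance-style snippet and a hypothetical label set for this domain.
text = (
    "Acquired by Baron von Stumm in Vienna, 1912; sold at auction "
    "in London in 1954 and donated to the museum in 1971."
)
labels = ["person", "place", "date", "acquisition event"]

# predict_entities returns spans with text, label, offsets, and a score.
for ent in model.predict_entities(text, labels, threshold=0.4):
    print(f'{ent["text"]!r:30} -> {ent["label"]} ({ent["score"]:.2f})')
```

And here is a minimal sketch of the other half of the pipeline: turning bootstrapped annotations into spaCy training data. The paper does not specify the intermediate annotation format, so this assumes character-offset spans (the kind an LLM annotation step could emit) and converts them to spaCy's binary format for fine-tuning; the example text, labels, and file names are hypothetical.

```python
# Converting bootstrapped annotations (e.g. LLM-suggested character spans)
# into spaCy's DocBin format, ready for `spacy train`.
import spacy
from spacy.tokens import DocBin

# Hypothetical annotation output: (text, [(start, end, label), ...]).
bootstrapped = [
    ("Sold by Galerie Fischer, Lucerne, in June 1939.",
     [(8, 23, "ORG"), (25, 32, "GPE"), (37, 46, "DATE")]),
]

nlp = spacy.blank("en")
db = DocBin()
for text, spans in bootstrapped:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(s, e, label=lbl, alignment_mode="contract")
            for s, e, lbl in spans]
    doc.ents = [e for e in ents if e is not None]  # drop misaligned spans
    db.add(doc)
db.to_disk("train.spacy")

# Fine-tuning then uses spaCy's standard CLI, e.g.:
#   python -m spacy train config.cfg --paths.train train.spacy \
#       --paths.dev dev.spacy
```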
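The appeal of this split is that the expensive step (manual labelling) is replaced by LLM-suggested spans that a human need only review, while the final model remains a small, fast spaCy pipeline rather than an 8B-parameter model at inference time.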