Work Package 7 - Pre-processing, storage and data sharing

WP7 is organized around four main tasks.

1. Infrastructure for text storage

This work package will design an infrastructure that makes it easier to store, share, and collaborate on political texts. The infrastructure is explicitly meant to handle data that cannot be shared openly: data owners remain in direct control of their data, and collaboration is facilitated through trusted connections and non-consumptive research strategies.

2. Manual annotation interface

In collaboration with WP6, this WP will develop an online interface for annotating material to be used as validation or training data. Starting from existing open-source interfaces such as AmCAT and brat, we will make the changes needed to ensure that the annotations required by other WPs can be completed in the interface. The interface can be connected directly to the text storage system, but can also be used as standalone software.

3. Multilingual preprocessing

This work package will standardize and evaluate linguistic preprocessing tools for the target languages of the project. For most widely used languages, high-quality tools exist for linguistic tasks such as lemmatization, part-of-speech tagging, and coreference resolution. Applying such preprocessing makes subsequent analyses both more powerful and more robust, since they work at a higher semantic level of abstraction. However, standardization and easily accessible tooling are lacking, which makes it difficult for social science researchers to apply these techniques, especially in comparative projects. Starting from existing tools such as spaCy and UDPipe, and taking into account current initiatives in computational linguistics such as CLARIAH and CLARIN, this work package will assemble a set of high-quality preprocessing tools that can be easily called from languages such as R and Python without requiring specific linguistic expertise.
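
As a rough illustration of what "easily called without linguistic expertise" could mean in practice, the Python sketch below exposes a single annotate call that returns tokens in one tool-independent format and dispatches to a per-language backend. Every name in it (Token, register_backend, annotate, whitespace_backend) is hypothetical and not part of any WP7 deliverable. The whitespace backend is only a placeholder; a real backend would wrap a tool such as spaCy or UDPipe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Token:
    """One token in a tool-independent format (hypothetical schema)."""
    text: str
    lemma: str
    pos: str


# Maps a language code to a function that turns raw text into tokens.
Backend = Callable[[str], List[Token]]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(lang: str, backend: Backend) -> None:
    """Make a backend available under a language code such as 'en'."""
    _BACKENDS[lang] = backend


def annotate(text: str, lang: str) -> List[Token]:
    """Single entry point: callers never touch tool-specific APIs."""
    try:
        backend = _BACKENDS[lang]
    except KeyError:
        raise ValueError(f"no preprocessing backend registered for {lang!r}") from None
    return backend(text)


def whitespace_backend(text: str) -> List[Token]:
    """Placeholder backend: splits on whitespace, lowercases as a fake lemma.

    It does no real lemmatization or tagging; POS is always 'X' (unknown).
    """
    return [Token(text=w, lemma=w.lower(), pos="X") for w in text.split()]


register_backend("en", whitespace_backend)

tokens = annotate("Parties Debate Taxes", "en")
print([t.lemma for t in tokens])
```

With this design, a comparative project could switch languages by changing only the language code, while the downstream analysis keeps reading the same Token fields.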

4. Non-consumptive research

This WP will develop a toolkit for non-consumptive research, meaning that third parties have sufficient access to the underlying data to check or replicate studies and to conduct new research, but without getting direct access to data that cannot be publicly shared due to copyright, privacy, or other concerns. We will explore two approaches: a query interface where users can run queries and inspect snippets of the results; and ‘data capsules’ that allow users to design scripts on a non-sensitive subset of the data, which can then be run securely on the whole dataset (see e.g. Zeng et al., Cloud Computing Data Capsules for Non-Consumptive Use of Texts. ACM 2014).
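
The two approaches can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the planned toolkit: the corpus is an in-memory dictionary, all function names (search_snippets, public_sample, run_in_capsule, count_mentions) are hypothetical, and the output check is deliberately crude. A real system would also need authentication, sandboxed execution, and much stricter disclosure controls.

```python
import re
from typing import Callable, Dict, List

# Toy stand-in for a restricted corpus held by the data owner.
FULL_CORPUS: Dict[str, str] = {
    "doc1": "The minister said the new tax plan will help working families.",
    "doc2": "Opposition parties rejected the tax plan during the debate.",
    "doc3": "The debate on housing policy continues next week.",
}
# Subset cleared for open sharing, used to develop scripts.
PUBLIC_IDS = ["doc3"]


def search_snippets(query: str, window: int = 3) -> List[dict]:
    """Approach 1: return short keyword-in-context snippets, never full texts."""
    results = []
    for doc_id, text in FULL_CORPUS.items():
        words = text.split()
        for i, word in enumerate(words):
            # Compare on the bare word, ignoring punctuation and case.
            if re.sub(r"\W", "", word).lower() == query.lower():
                start, end = max(0, i - window), i + window + 1
                results.append({"doc": doc_id, "snippet": " ".join(words[start:end])})
    return results


def public_sample() -> Dict[str, str]:
    """Non-sensitive documents a researcher may read in full."""
    return {doc_id: FULL_CORPUS[doc_id] for doc_id in PUBLIC_IDS}


def run_in_capsule(script: Callable[[Dict[str, str]], Dict[str, int]]) -> Dict[str, int]:
    """Approach 2: run a script on the whole corpus, releasing only counts."""
    output = script(dict(FULL_CORPUS))
    # Crude output check: only string -> integer aggregates leave the capsule.
    if not all(isinstance(k, str) and isinstance(v, int) for k, v in output.items()):
        raise ValueError("capsule output must map strings to integers")
    return output


def count_mentions(corpus: Dict[str, str]) -> Dict[str, int]:
    """Example script: number of documents mentioning each term."""
    terms = ["tax", "debate"]
    return {t: sum(t in text.lower() for text in corpus.values()) for t in terms}


print(search_snippets("tax", window=2))
print(count_mentions(public_sample()))   # developed on the public subset
print(run_in_capsule(count_mentions))    # then run on the full corpus
```

In both cases the researcher sees only derived output (a few words of context, or aggregate counts), while the full texts stay with the data owner.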

Find out more about the team working on WP7!