Practical Workshop PHSS 2018: Annotation of Metadata and Inquiries on Big Corpora

Dear participants in the PHSS conference,

We are organizing a practical, hands-on workshop in natural language processing, focusing on computational lexicography and machine summarisation. It will be an interactive seminar in which participants work alongside us. The main activities address two questions: how metadata are annotated on the CoRoLa platform, and how to query the KorAP platform to find words, constructions, and occurrences in both written and spoken corpora, and to apply metadata filters.

Trainers: Dr. Anca-Diana Bibiri, Dr. Alex Moruz

Organizers: Faculty of Computer Sciences, UAIC, Department of Interdisciplinary Research in Social Sciences and Humanities

Language: Romanian

The workshop will be held on Thursday, the 24th of May 2018. It is open to those who register via email by the 18th of May 2018.

What is CoRoLa?

The Reference Corpus of Contemporary Romanian Language (CoRoLa), run by the “Mihai Drăgănescu” Research Institute for Artificial Intelligence in Bucharest and the Institute for Computer Science in Iași, is an electronic corpus, freely available online. It is intended for studies of contemporary language, for language processing, for building applications that use knowledge extracted from large corpora, for improving translation, and for teaching Romanian. CoRoLa includes both written and spoken language data. The text collection consists of publications from the Second World War to the present day, while the spoken collection includes only recent recordings.

The CoRoLa corpus includes two types of annotation: 1. metatextual (information about the text) – metadata; and 2. linguistic (phonetic, prosodic, morphological, phrasal, syntactic, semantic, pragmatic).

The metadata annotators (many of whom are volunteers) work under the guidance of a detailed Annotation Manual. The online platform developed at IIT-Iași (Romanian Academy, Institute for Computer Science – Iași) includes facilities for cleaning formatting, standardizing Romanian diacritics, eliminating hyphenation, visualizing statistics about the quantity of texts accumulated and their subdomains, and filling in metadata. However, many cleaning phases are still done manually: separating articles from periodicals into different files; removing headers, page numbers, figures, tables, text fragments in foreign languages, and excerpts from other authors; and annotating footers and end-notes (which it was decided to leave in the texts).
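To give a sense of what two of these automatic cleaning steps involve, here is a minimal Python sketch of diacritic standardization and line-end hyphenation removal. It is an illustration only, not the code of the CoRoLa platform: the function names are hypothetical, and the sketch assumes that "standardizing diacritics" means replacing the legacy cedilla letters (ş, ţ) with the standard Romanian comma-below letters (ș, ț).

```python
import re

# Legacy cedilla letters mapped to the standard Romanian comma-below letters.
# (Assumption: this is the kind of standardization meant; the platform's
# actual rules are not described in this announcement.)
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u0162": "\u021a",  # Ţ -> Ț
})

# A hyphen at the end of a line, followed on the next line by a lowercase
# letter, is treated as a typographic word break.
LINE_END_HYPHEN = re.compile(r"(\w)-\n\s*(?=[a-zăâîșț])")


def standardize_diacritics(text: str) -> str:
    """Replace cedilla variants of s and t with comma-below variants."""
    return text.translate(CEDILLA_TO_COMMA)


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break."""
    return LINE_END_HYPHEN.sub(r"\1", text)


def clean_text(text: str) -> str:
    """Hypothetical helper: apply both cleaning steps in sequence."""
    return join_hyphenated_lines(standardize_diacritics(text))
```

A simple rule like this cannot tell a typographic break from a genuine Romanian hyphen (as in "într-o") that happens to fall at the end of a line, which is one reason such steps still need manual checking.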

Dr. Anca-Diana Bibiri and Dr. Alex Moruz are active members of the Natural Language Processing (NLP) Group at the “Alexandru Ioan Cuza” University of Iași.



