Institute of the Czech National Corpus

Major Research Areas

The Institute’s principal research focus is to maintain and further develop the Czech National Corpus project (CNC). The Institute is primarily responsible for data collection and for user services aimed at both language professionals and the general public.

Data collection

The ICNC has a long tradition of collecting linguistic data for research on the Czech language, with particular attention to their quantity, quality, and variety. The ICNC collaborates with over 200 researchers and students (mainly for spoken and parallel data acquisition), 270 publishers (as text providers), and other similar research projects (see the list of national and international partners). The CNC focuses systematically on the following areas:

  • Contemporary written Czech: the SYN-series corpora, which map the language of the 20th and 21st centuries (esp. the last twenty years), constitute the flagship of the project. Texts are enriched with metadata, lemmatization, and morphological tagging. In addition to the representative and balanced corpora (consisting of fiction texts, professional literature, and newspapers) published every five years, the series also includes large corpora consisting of journalistic texts only. In 2013, the total size of the SYN series exceeded 2.2 billion words, placing it among the largest freely accessible traditional (as opposed to web-crawled) corpora in the world.
  • Contemporary spontaneous spoken Czech: the hallmark of the ORAL-series corpora is their careful design. They include solely authentic, spontaneous, informal, and dialogical speech (as opposed to the prepared, broadcast, or scripted texts that generally prevail in spoken corpora). Since 2008, our priorities have been demographically and regionally balanced data and sound-aligned transcripts. The total size of the ORAL series is 4.8 million words (according to our survey, it is one of the largest collections of spontaneous speech in the world). Recently, the new ORTOFON series was established, featuring two-layer transcription (orthographic and phonetic).
  • Multilingual parallel corpus: InterCorp is a large corpus of Czech texts aligned at the sentence level with translations to or from more than 30 languages. The core of the corpus consists of manually aligned and proofread fiction texts, supplemented by collections of automatically processed texts from various domains. InterCorp constitutes a unique data source for contrastive and translation studies of Czech and other languages. InterCorp users appreciate its unprecedented size (almost 1 billion words in total), variety of text types, and annotation.
  • Diachronic corpus of Czech: the DIAKORP corpus of historical Czech includes texts from the 14th century onwards. The current focus of DIAKORP is on the 19th century, with the long-term goal of creating a large monitor corpus covering the period from 1850 to the present and interconnecting the data with the SYN series.
  • Specialised linguistic data: the ICNC is also involved in the collection of language data for specific research purposes, e.g. DIALEKT (dialectal speech), CzeSL (texts written by non-native learners of Czech), DEAF (Czech texts written by the deaf), or Jerome (translated and non-translated Czech).

User services

The ICNC provides its users with the following services:

  • User access: specialised applications are available for efficient use of large corpora by our users. They are partly developed in-house (SyD, Morfio, KWords), partly based on third-party applications (e.g. the KonText concordancer). Plans for the future include applications for multi-word unit analysis and corpus query result evaluation with statistical methods.
  • Methodology of corpus linguistics: the ICNC is the only research centre in the country focusing systematically on developing the methodology of corpus linguistics. The design of new analytical applications is based on this strategic research.
  • User support: the CNC research portal features a User Forum (with questions and answers), a corpus linguistics Wiki, and a user manual. The portal also features a repository of CNC-based research papers following the open access policy (green road).
  • Consulting, education, and training: apart from maintaining the User Forum, the ICNC organizes workshops and academic training at various levels (undergraduate, graduate, and postgraduate), including its own PhD programme in Corpus Linguistics with a specialization in (but not limited to) Czech.
  • Hosting of corpora: currently, 27 corpora compiled at other institutions are hosted within the CNC; the hosting service includes final technical processing, quality checks, and public access with related services.
  • Analysis of user data: the ICNC also carries out custom analyses of data supplied by users, e.g. for sociological, neurological, and psychological research, using the text-processing techniques and methodology developed at the ICNC (see the interdisciplinary cooperation).
  • Providing data packages: on demand, data packages can be extracted from CNC corpora in various formats and processed according to users’ needs (e.g. for NLP, psychology, literary studies).

National and International Partner Institutions

The ICNC has been closely collaborating with a large number of Czech academic institutions: Institute of Theoretical and Computational Linguistics, Faculty of Arts, Charles University in Prague; Faculty of Informatics, Masaryk University in Brno; Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague; Faculty of Electrical Engineering, Czech Technical University in Prague; Brno University of Technology; University of West Bohemia in Plzeň; Technical University of Liberec; Institute of Czech Studies, Faculty of Arts, Charles University in Prague; Institute of Czech Language and the Theory of Communication, Faculty of Arts, Charles University in Prague; Institute of the Czech Language, Institute of Slavonic Studies, Institute of Czech Literature, the Academy of Sciences of the Czech Republic; Department of Czech Language, Faculty of Arts, Masaryk University in Brno; Faculty of Education, Masaryk University in Brno; Palacký University in Olomouc; University of Hradec Králové; Silesian University in Opava. The ICNC also actively maintains its contacts built over the last twenty years with a number of European institutions and researchers. The cooperation agreements include the following foreign institutions:

  1. “Orientale” University of Naples, Naples, Italy
  2. Polish Academy of Sciences (Institute of Computer Science, Department of Artificial Intelligence), Warsaw, Poland
  3. Slovak Academy of Sciences (Ľ. Štúr Institute of Linguistics), Bratislava, Slovak Republic
  4. Sorbian Institute (Department of Linguistics), Bautzen, Germany
  5. The Institute of German Language (IDS), Mannheim, Germany
  6. Tübingen University (Faculty of Humanities, Department of Slavonic Languages), Tübingen, Germany
  7. University of Amsterdam, Amsterdam, Netherlands
  8. University of Bern (Faculty of Philosophy and History, Institute of Slavic Languages and Literature), Bern, Switzerland
  9. University of Latvia (Institute of Mathematics and Computer Science), Riga, Latvia
  10. University of Regensburg (Faculty of Philosophy IV, Slavonic Department), Regensburg, Germany
  11. University of Warsaw, Institute of Western and Southern Slavic Studies, Warsaw, Poland

Active research cooperation has also been established with the Department of Slavic Languages at Brown University in the USA.

Interdisciplinary cooperation

The ICNC has also initiated and established interdisciplinary cooperation and expertise sharing in the following fields:

  • Education: The CNC web applications and data have been incorporated into primary and secondary school textbooks published by the Fraus Publishing House;
  • History: The ICNC participates in compiling a corpus of the Secret Police (StB) archives (an ongoing joint project with the Institute for the Study of Totalitarian Regimes);
  • Sociology: The ICNC took part in the compilation of a corpus of biographical narratives and carried out a differential analysis of three groups of speakers (communist officials, dissidents, and the general public) for the sociological research project “Institutions in Life Stories” (with the Faculty of Social Sciences, Charles University, 2010–2011);
  • Neurology and psychology: The ICNC took part in a comparison of reference language corpora with texts by people with mental disorders, which allows the detection of specific features of their language (a survey for the Institute of Psychology, Academy of Sciences).

Major Publications

In addition to the online publication of language corpora, the ICNC, in cooperation with the Lidové Noviny Publishing House, also publishes its own series, Studie z korpusové lingvistiky (Studies in Corpus Linguistics), with more than 20 volumes so far. Some of them are listed below among the 10 most important publications produced by the ICNC team in the 2009–2013 period:

  1. Bartoň, T., Cvrček, V., Čermák, F., Jelínek, T. & Petkevič, V. (2009). Statistiky češtiny (Statistical Data on Czech). Praha: Nakladatelství Lidové noviny & Ústav Českého národního korpusu. ISBN 978-80-7422-264-1.
  2. Cvrček, V. et al. (2010). Mluvnice současné češtiny (Grammar of Contemporary Czech). Praha: Karolinum. ISBN 978-80-246-1743-5.
  3. Cvrček, V. (2013). Kvantitativní analýza kontextu (Quantitative Analysis of Context). Praha: Nakladatelství Lidové noviny. ISBN 978-80-7106-594-4.
  4. Čermák, F., Cvrček, V. & Schmiedtová, V. et al. (2010). Slovník komunistické totality (Dictionary of the Communist Regime). Praha: Nakladatelství Lidové noviny. ISBN 978-80-7422-060-9.
  5. Čermák, F., Corness, P. & Klégr, A. (Eds). (2010). InterCorp: Exploring a Multilingual Corpus. Praha: Nakladatelství Lidové noviny. ISBN 978-80-7422-042-5.
  6. Čermák, F., Křen, M. et al. (2011). Frequency Dictionary of Czech: Core Vocabulary for Learners. London: Routledge. ISBN 978-0-415-57661-1.
  7. Čermáková, A. (2009). Valence českých substantiv (Valency of Czech Nouns). Praha: Nakladatelství Lidové noviny. ISBN 978-80-7106-426-8.
  8. Vondřička, P. (2014). Formalized contrastive lexical description: a framework for bilingual dictionaries. LINCOM Studies in Computational Linguistics. ISBN 978-3-86288-428-5.
  9. Čermák, F. (2014). Periferie jazyka – Slovník monokolokabilních slov (Periphery of Language: Dictionary of Monocollocable Words). Praha: Nakladatelství Lidové noviny. ISBN 978-80-7422-349-5.
  10. Křen, M. (2013). Odraz jazykových změn v synchronních korpusech (Reflection of Language Change in Synchronic Corpora). Praha: Nakladatelství Lidové noviny. ISBN 978-80-7422-265-8.

Major Conferences and Events

Our academic staff regularly participate in both domestic and international conferences, where they present the results of their work. Every two or three years, the Institute itself hosts a conference on corpus linguistics. The first conference, in 2007, was titled Čeština v mluveném korpusu (Czech in the Spoken Corpus). The 2009 conference, focused on the InterCorp parallel corpus, welcomed many international speakers and prompted a fruitful discussion on parallel corpora and contrastive research. In 2011, the conference Corpus Linguistics Prague 2011 covered all branches of our research, i.e. synchronic written language, spoken language, diachronic language, parallel corpora research, and issues relating to linguistic tagging. In 2014, the ICNC organized the conference Korpusová lingvistika Praha 2014 (Corpus Linguistics Prague 2014) as a celebration of its 20th anniversary. The ICNC also regularly organizes free workshops for language professionals (such as linguists, teachers, or translators) as well as for the general public (anyone interested in working with Czech language corpora and other language corpora published within the CNC).