«Tugan Tel» Tatar National Corpus

The «Tugan Tel» Tatar National Corpus is a linguistic resource for the modern literary Tatar language. The project is carried out within the framework of the State Program "Preservation, study and development of the official languages of the Republic of Tatarstan and other languages in the Republic of Tatarstan for 2014-2020". The Corpus is intended for a wide range of users: linguists, specialists in the Tatar language and culture, teachers of the Tatar language, cultural workers, and anyone interested in studying the Tatar language.

The Corpus contains 180,000,000 tokens (as of December 2018). It includes texts of different styles and genres (fiction, media texts, official documents, educational and scientific literature, etc.). The Corpus has a system of grammatical annotation designed to represent all existing grammatical word forms. The grammatical annotation of a Tatar word includes the part of speech of the word and a set of morphological features (parameters). Morphological annotation of the Corpus texts is carried out using a module for two-level morphological analysis of the Tatar language, implemented in the PC-KIMMO software tool. The search system of the Corpus supports searching for lexemes, word forms and individual grammatical parameters.
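As a rough illustration of the annotation and search model described above, the following minimal Python sketch stores each token with its word form, lexeme, part of speech and a set of morphological parameters, and filters tokens by any combination of these. It does not reflect the actual implementation or tagset of the Corpus: the Token class, the search function, and the tag labels (N, V, SG, PL, NOM, LOC, PRES, 3SG) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str             # word form as it appears in the text
    lemma: str            # lexeme (dictionary form)
    pos: str              # part of speech (hypothetical label)
    features: frozenset   # morphological parameters (hypothetical labels)


def search(tokens, lemma=None, form=None, features=()):
    """Return the tokens that match every criterion that was supplied."""
    wanted = frozenset(features)
    return [
        t for t in tokens
        if (lemma is None or t.lemma == lemma)
        and (form is None or t.form == form)
        and wanted <= t.features  # all requested parameters must be present
    ]


# A tiny hand-annotated sample; real annotation would come from the analyzer.
corpus = [
    Token("китап", "китап", "N", frozenset({"SG", "NOM"})),
    Token("китаплар", "китап", "N", frozenset({"PL", "NOM"})),
    Token("китапларда", "китап", "N", frozenset({"PL", "LOC"})),
    Token("бара", "бар", "V", frozenset({"PRES", "3SG"})),
]

by_lexeme = search(corpus, lemma="китап")      # all forms of one lexeme
by_form = search(corpus, form="бара")          # one exact word form
by_feature = search(corpus, features=["PL"])   # one grammatical parameter
```

The three calls at the end correspond to the three search modes named above: by lexeme, by word form, and by individual grammatical parameter. Criteria can also be combined, for example search(corpus, lemma="китап", features=["PL", "LOC"]).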

The project participants are researchers at the Institute of Applied Semiotics of the Tatarstan Academy of Sciences and professors at Kazan Federal University (D. Suleymanov, O. Nevzorova, R. Gilmullin, A. Gatiatullin, A. Galieva, B. Khakimov, D. Yakubova, R. Gataullin, D. Mukhamedshin, R. Bilalov), as well as students of Kazan Federal University.