Lunch at 12:30pm, talk at 1pm, in 148 Fitzpatrick

Title: Universal Dependencies Tatar for Code-Switching Detection

Abstract: This study introduces a new project to create a Universal Dependencies treebank for the Tatar language, named NMCTT. A significant feature of the corpus is that it includes code-switching (CS) information at the morpheme level, since Tatar texts contain intra-word CS between Tatar and Russian. We first provide an outline of Universal Dependencies and NMCTT. Then, to evaluate the merit of the CS annotation, we briefly report the results of a language identification task implemented with Conditional Random Fields that take into account POS tag information, which is readily available in treebanks in the CoNLL-U format. Experimenting on NMCTT and the Turkish-German CS treebank (SAGT), we demonstrate that the annotation scheme introduced in NMCTT can improve the performance of subword-level language identification. This annotation scheme for CS is not only applicable to any language with CS, but also suggests that morphosyntactic information can be employed in CS-related downstream tasks.

Bio: Chihiro Taguchi is a first-year PhD student in the Natural Language Processing laboratory led by Dr. David Chiang. His main research focuses on language technologies for the documentation of low-resource endangered languages. His research interests span computational, descriptive, and theoretical aspects of language, including syntax, semantics, morphology, and phonology.