Natural Language Processing at Notre Dame

Natural language processing (NLP) aims to enable computers to use human languages, so that people can, for example, interact with computers naturally, communicate with people with whom they share no common language, or access speech and text data at scales not otherwise possible. The NLP group at Notre Dame is interested in all aspects of NLP, with a focus on machine translation and connections with formal language theory.

The NLP group co-sponsors NL+, the Natural Language Processing Lunch Seminar.

Current Members

David Chiang

Associate professor

translation, syntax, formal language theory, programming languages
Darcey Masters

PhD student

generation from language models, probabilistic models
Stephen Bothwell

PhD student

NLP for classical languages, computational historical linguistics
Ken Sible

PhD student

retrieval-augmented translation
Aarohi Srivastava

PhD student

dialects, noisy text, zero-shot transfer
Chihiro Taguchi

PhD student

computational linguistics, language documentation, syntax
Andy Yang

PhD student

neural network expressivity, formal logic
Katsumi Ibaraki

PhD student

multilingual, multimodal NLP

Former Members

Brian DuSell (PhD 2023 → ETH Zürich)
Colin McDonald (BA 2023 → CMU)
Patrick Soga (BS 2022 → UVA)
Xing Jie Zhong (MS 2021 → Google)
Toan Q. Nguyen (PhD 2021 → Amazon → Zoom)
Justin DeBenedetto (PhD 2021 → asst. prof. Villanova)
Chan Hee (Luke) Song (BS 2020 → OSU)
Kenton Murray (PhD 2020 → JHU)
Antonios Anastasopoulos (PhD 2019 → postdoc CMU → asst. prof. GMU)
Arturo Argueta (PhD 2019 → Apple)
Tomer Levinboim (PhD 2017 → Google)
Xiang Zhou (summer intern 2017 → UNC)
Cindy Xinyi Wang (BS 2017 → PhD CMU → Google)
Ashish Vaswani (PhD 2014 at USC → USC ISI → Google Brain → Adept AI → Essential AI)

Projects

Expressivity of neural sequence models Relating neural sequence models to automata, grammars, circuits, and logics. Collaboration with Peter Cholak and Anand Pillay.

Retrieval-augmented neural machine translation Augmenting neural machine translation systems by retrieving and using data beyond parallel text. Collaboration with Meng Jiang. Sponsored by NSF.

Natural language (variety) processing Collaboration with Antonis Anastasopoulos (GMU) and Yulia Tsvetkov (UW). Sponsored by NSF.

Language documentation with an AI helper Collaboration with Antonis Anastasopoulos and Geraldine Walther (GMU). Sponsored by NSF.

Differentiable, probabilistic programming with recursive structured models Collaboration with Chung-chieh Shan (IU). Sponsored by NSF.

NLP on medieval texts Analysis of Latin texts and language modeling for OCR of Latin manuscripts. Collaborations with Walter Scheirer and Hildegund Müller. Sponsored by Notre Dame FRSP.

Recent Publications

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, and David Chiang. Probability distributions computed by hard-attention transformers. 2025. arXiv:2510.27118. PDF BibTeX

@misc{yang-etal-2025-probability,
    author = "Yang, Andy and Svete, Anej and Li, Jiaoda and Lin, Anthony Widjaja and Rawski, Jonathan and Cotterell, Ryan and Chiang, David",
    title = "Probability Distributions Computed by Hard-Attention Transformers",
    year = "2025",
    note = "{arXiv}:2510.27118",
    url = "https://arxiv.org/abs/2510.27118"
}

Yotaro Kubo, Richard Sproat, Chihiro Taguchi, and Llion Jones. Building tailored speech recognizers for Japanese speaking assessment. 2025. arXiv:2509.20655. PDF BibTeX

@misc{kubo-etal-2025-building,
    author = "Kubo, Yotaro and Sproat, Richard and Taguchi, Chihiro and Jones, Llion",
    title = "Building Tailored Speech Recognizers for {J}apanese Speaking Assessment",
    year = "2025",
    note = "{arXiv}:2509.20655",
    url = "https://arxiv.org/abs/2509.20655"
}

Andy Yang, Christopher Watson, Anton Xue, Satwik Bhattamishra, Jose Llarena, William Merrill, Emile Dos Santos Ferreira, Anej Svete, and David Chiang. The transformer cookbook. Transactions on Machine Learning Research, January 2026. To appear. PDF BibTeX

@article{yang-etal-2025-cookbook,
    author = "Yang, Andy and Watson, Christopher and Xue, Anton and Bhattamishra, Satwik and Llarena, Jose and Merrill, William and Dos Santos Ferreira, Emile and Svete, Anej and Chiang, David",
    title = "The Transformer Cookbook",
    journal = "Transactions on Machine Learning Research",
    month = "January",
    year = "2026",
    url = "https://openreview.net/forum?id=sPshCSvDrX",
    note = "To appear"
}

Katsumi Ibaraki and David Chiang. Frustratingly easy data augmentation for low-resource ASR. 2025. arXiv:2509.15373. PDF BibTeX

@misc{ibaraki-chiang-2025-frustratingly,
    author = "Ibaraki, Katsumi and Chiang, David",
    note = "{arXiv}:2509.15373",
    year = "2025",
    title = "Frustratingly Easy Data Augmentation for Low-Resource {ASR}",
    url = "https://arxiv.org/abs/2509.15373"
}

Chihiro Taguchi, Seng Mai, Keita Kurabe, Yusuke Sakai, Georgina Agyei, Soudabeh Eslami, and David Chiang. Languages still left behind: toward a better multilingual machine translation benchmark. In Proc. EMNLP, 20142–20154. 2025. doi:10.18653/v1/2025.emnlp-main.1018. PDF BibTeX

@inproceedings{taguchi-etal-2025-languages,
    author = "Taguchi, Chihiro and Mai, Seng and Kurabe, Keita and Sakai, Yusuke and Agyei, Georgina and Eslami, Soudabeh and Chiang, David",
    title = "Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark",
    year = "2025",
    booktitle = "Proc. EMNLP",
    url = "https://aclanthology.org/2025.emnlp-main.1018/",
    doi = "10.18653/v1/2025.emnlp-main.1018",
    pages = "20142--20154"
}

Chihiro Taguchi, Seiji Maekawa, and Nikita Bhutani. Efficient context selection for long-context QA: no tuning, no iteration, just adaptive-\(k\). In Proc. EMNLP, 20116–20141. 2025. doi:10.18653/v1/2025.emnlp-main.1017. PDF BibTeX

@inproceedings{taguchi-etal-2025-efficient,
    author = "Taguchi, Chihiro and Maekawa, Seiji and Bhutani, Nikita",
    title = "Efficient Context Selection for Long-Context {QA}: No Tuning, No Iteration, Just Adaptive-{$k$}",
    year = "2025",
    booktitle = "Proc. EMNLP",
    url = "https://aclanthology.org/2025.emnlp-main.1017/",
    doi = "10.18653/v1/2025.emnlp-main.1017",
    pages = "20116--20141"
}

Andy Yang, Michaël Cadilhac, and David Chiang. Knee-deep in C-RASP: a transformer depth hierarchy. In Proc. NeurIPS 38. 2025. To appear. PDF BibTeX

@inproceedings{yang+:2025,
    author = {Yang, Andy and Cadilhac, Micha{\"e}l and Chiang, David},
    title = "Knee-Deep in {C-RASP}: A Transformer Depth Hierarchy",
    booktitle = "Proc. NeurIPS 38",
    note = "To appear",
    url = "https://arxiv.org/abs/2506.16055",
    year = "2025"
}

Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. Simulating hard attention using soft attention. Transactions of the Association for Computational Linguistics, 2025. To appear. PDF BibTeX

@article{yang-etal-2025-softmax,
    author = "Yang, Andy and Strobl, Lena and Chiang, David and Angluin, Dana",
    title = "Simulating Hard Attention Using Soft Attention",
    year = "2025",
    journal = "Transactions of the Association for Computational Linguistics",
    xnote = "{arXiv}:2412.09925",
    note = "To appear",
    url = "https://arxiv.org/abs/2412.09925"
}

Aarohi Srivastava and David Chiang. We're calling an intervention: exploring fundamental hurdles in adapting language models to nonstandard text. In Proc. Workshop on Noisy and User-Generated Text. 2025. Best Paper Award. PDF BibTeX

@inproceedings{srivastava-chiang-2025,
    author = "Srivastava, Aarohi and Chiang, David",
    title = "We're Calling an Intervention: Exploring Fundamental Hurdles in Adapting Language Models to Nonstandard Text",
    booktitle = "Proc. Workshop on Noisy and User-Generated Text",
    year = "2025",
    url = "https://arxiv.org/abs/2404.07304"
}

Kenneth J. Sible and David Chiang. Using source-side confidence estimation for reliable translation into unfamiliar languages. 2025. arXiv:2503.23305v1. PDF BibTeX

@misc{sible-chiang-2025,
    author = "Sible, Kenneth J. and Chiang, David",
    title = "Using Source-Side Confidence Estimation for Reliable Translation into Unfamiliar Languages",
    year = "2025",
    url = "https://arxiv.org/abs/2503.23305v1",
    note = "arXiv:2503.23305v1"
}

Hiroyuki Deguchi, Go Kamoda, Yusuke Matsushita, Chihiro Taguchi, Kohei Suenaga, Masaki Waga, and Sho Yokoi. A soft and fast pattern matcher for billion-scale corpus searches. In Proc. ICLR. 2025. PDF BibTeX

@inproceedings{deguchi-etal-2025-soft,
    author = "Deguchi, Hiroyuki and Kamoda, Go and Matsushita, Yusuke and Taguchi, Chihiro and Suenaga, Kohei and Waga, Masaki and Yokoi, Sho",
    title = "A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches",
    booktitle = "Proc. ICLR",
    year = "2025",
    url = "https://openreview.net/forum?id=Q6PAnqYVpo"
}

Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, and Michael Hahn. A formal framework for understanding length generalization in transformers. In Proc. ICLR. 2025. PDF BibTeX

@inproceedings{huang-etal-2025-formal,
    author = "Huang, Xinting and Yang, Andy and Bhattamishra, Satwik and Sarrof, Yash and Krebs, Andreas and Zhou, Hattie and Nakkiran, Preetum and Hahn, Michael",
    title = "A Formal Framework for Understanding Length Generalization in Transformers",
    booktitle = "Proc. ICLR",
    year = "2025",
    url = "https://openreview.net/forum?id=U49N5V51rU"
}

David Chiang. Transformers in uniform TC\(^0\). Transactions on Machine Learning Research, January 2025. PDF BibTeX

@article{chiang:2025,
    author = "Chiang, David",
    title = "Transformers in Uniform {TC$^0$}",
    journal = "Transactions on Machine Learning Research",
    year = "2025",
    month = "January",
    url = "https://openreview.net/forum?id=ZA7D4nQuQF"
}

Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, and Ashish Sabharwal. Transformers as transducers. Transactions of the Association for Computational Linguistics, 13:200–219, 2025. doi:10.1162/tacl_a_00736. DOI BibTeX

@article{strobl-etal-2025-transducers,
    author = "Strobl, Lena and Angluin, Dana and Chiang, David and Rawski, Jonathan and Sabharwal, Ashish",
    title = "Transformers as Transducers",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "13",
    pages = "200--219",
    year = "2025",
    doi = "10.1162/tacl\_a\_00736"
}

Andy Yang, David Chiang, and Dana Angluin. Masked hard-attention transformers recognize exactly the star-free languages. In Proc. NeurIPS 37, 10202–10235. 2024. PDF BibTeX

@inproceedings{yang+:2024,
    author = "Yang, Andy and Chiang, David and Angluin, Dana",
    title = "Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages",
    booktitle = "Proc. NeurIPS 37",
    url = "https://proceedings.neurips.cc/paper_files/paper/2024/hash/13d7f172259b11b230cc5da8768abc5f-Abstract-Conference.html",
    year = "2024",
    pages = "10202--10235"
}

Kenneth J. Sible and David Chiang. Improving rare word translation with dictionaries and attention masking. In Proc. AMTA, 225–235. 2024. PDF BibTeX

@inproceedings{sible-chiang-2024-improving,
    author = "Sible, Kenneth J. and Chiang, David",
    title = "Improving Rare Word Translation With Dictionaries and Attention Masking",
    booktitle = "Proc. AMTA",
    year = "2024",
    url = "https://aclanthology.org/2024.amta-research.19/",
    pages = "225--235"
}

Andy Yang and David Chiang. Counting like transformers: compiling temporal counting logic into softmax transformers. In Proc. CoLM. 2024. PDF BibTeX

@inproceedings{yang-chiang-2024-counting,
    author = "Yang, Andy and Chiang, David",
    title = "Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers",
    booktitle = "Proc. CoLM",
    year = "2024",
    url = "https://openreview.net/forum?id=FmhPg4UJ9K"
}

Chihiro Taguchi and David Chiang. Language complexity and speech recognition accuracy: orthographic complexity hurts, phonological complexity doesn't. In Proc. ACL. 2024. Outstanding Paper Award and Senior Area Chair Award. PDF BibTeX

@inproceedings{taguchi-chiang-2024-complexity,
    author = "Taguchi, Chihiro and Chiang, David",
    title = "Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't",
    year = "2024",
    url = "https://aclanthology.org/2024.acl-long.827/",
    booktitle = "Proc. ACL"
}

Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava, Kabir Ahuja, David Chiang, Yulia Tsvetkov, and Antonios Anastasopoulos. DIALECTBENCH: a NLP benchmark for dialects, varieties, and closely-related languages. In Proc. ACL. 2024. Social Impact Award. PDF BibTeX

@inproceedings{faisal+:2024,
    author = "Faisal, Fahim and Ahia, Orevaoghene and Srivastava, Aarohi and Ahuja, Kabir and Chiang, David and Tsvetkov, Yulia and Anastasopoulos, Antonios",
    title = "{DIALECTBENCH}: A {NLP} Benchmark for Dialects, Varieties, and Closely-Related Languages",
    year = "2024",
    booktitle = "Proc. ACL",
    url = "https://aclanthology.org/2024.acl-long.777/"
}

Stephen Bothwell, Brian DuSell, David Chiang, and Brian Krostenko. PILA: a historical-linguistic dataset of Proto-Italic and Latin. In Proc. LREC-COLING, 12749–12760. 2024. PDF BibTeX

@inproceedings{bothwell+:2024,
    author = "Bothwell, Stephen and DuSell, Brian and Chiang, David and Krostenko, Brian",
    title = "{PILA}: A Historical-Linguistic Dataset of {P}roto-{I}talic and {L}atin",
    booktitle = "Proc. LREC-COLING",
    pages = "12749--12760",
    year = "2024",
    url = "https://aclanthology.org/2024.lrec-main.1116/"
}

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? A survey. Transactions of the Association for Computational Linguistics, 12:543–561, 2024. doi:10.1162/tacl_a_00663. DOI BibTeX

@article{strobl-etal-2024-survey,
    author = "Strobl, Lena and Merrill, William and Weiss, Gail and Chiang, David and Angluin, Dana",
    title = "What Formal Languages Can Transformers Express? {A} Survey",
    year = "2024",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "12",
    pages = "543--561",
    doi = "10.1162/tacl\_a\_00663"
}

Brian DuSell and David Chiang. Stack attention: improving the ability of transformers to model hierarchical patterns. In Proc. ICLR. 2024. Spotlight paper. PDF BibTeX

@inproceedings{dusell+chiang:2024attention,
    author = "DuSell, Brian and Chiang, David",
    title = "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns",
    year = "2024",
    booktitle = "Proc. ICLR",
    url = "https://openreview.net/pdf?id=XVhm3X8Fum"
}

Stephen Bothwell, Justin DeBenedetto, Theresa Crnkovich, Hildegund Müller, and David Chiang. Introducing rhetorical parallelism detection: a new task with datasets, metrics, and baselines. In Proc. EMNLP, 5007–5039. 2023. doi:10.18653/v1/2023.emnlp-main.305. PDF BibTeX

@inproceedings{bothwell+:2023,
    author = {Bothwell, Stephen and DeBenedetto, Justin and Crnkovich, Theresa and M{\"u}ller, Hildegund and Chiang, David},
    title = "Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines",
    booktitle = "Proc. EMNLP",
    year = "2023",
    url = "https://aclanthology.org/2023.emnlp-main.305",
    doi = "10.18653/v1/2023.emnlp-main.305",
    pages = "5007--5039"
}

Aarohi Srivastava and David Chiang. BERTwich: extending BERT's capabilities to model dialectal and noisy text. In Findings of ACL: EMNLP. 2023. PDF BibTeX

@inproceedings{srivastava+chiang:2023,
    author = "Srivastava, Aarohi and Chiang, David",
    title = "{BERTwich}: Extending {BERT}'s Capabilities to Model Dialectal and Noisy Text",
    booktitle = "Findings of ACL: EMNLP",
    year = "2023",
    url = "https://aclanthology.org/2023.findings-emnlp.1037/"
}

Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, and David Chiang. Universal automatic phonetic transcription into the International Phonetic Alphabet. In Proc. INTERSPEECH. 2023. doi:10.21437/Interspeech.2023-2584. PDF BibTeX

@inproceedings{taguchi+:2023,
    author = "Taguchi, Chihiro and Sakai, Yusuke and Haghani, Parisa and Chiang, David",
    title = "Universal Automatic Phonetic Transcription into the {I}nternational {P}honetic {A}lphabet",
    booktitle = "Proc. INTERSPEECH",
    year = "2023",
    url = "https://arxiv.org/abs/2308.03917",
    doi = "10.21437/Interspeech.2023-2584"
}

David Chiang, Colin McDonald, and Chung-chieh Shan. Exact recursive probabilistic programming. PACMPL, 2023. doi:10.1145/3586050. PDF BibTeX

@article{chiang+mcdonald+shan:2023,
    author = "Chiang, David and McDonald, Colin and Shan, Chung{-}chieh",
    title = "Exact Recursive Probabilistic Programming",
    journal = "PACMPL",
    volume = "7",
    number = "OOPSLA1",
    article = "98",
    xmonth = "April",
    url = "https://dl.acm.org/doi/10.1145/3586050",
    year = "2023",
    doi = "10.1145/3586050"
}

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, and others. Beyond the Imitation Game: quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. PDF BibTeX

@article{srivastava+:2023,
    author = "Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek and Shoeb, Abu Awal Md and Abid, Abubakar and Fisch, Adam and Brown, Adam R. and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri{\`a} and others",
    title = "Beyond the {I}mitation {G}ame: Quantifying and extrapolating the capabilities of language models",
    journal = "Transactions on Machine Learning Research",
    year = "2023",
    url = "https://openreview.net/pdf?id=uyTL5Bvosj"
}

Language and Computation at Notre Dame

Research

People

Meng Jiang: summarization and generation
Toby Li: human-computer interaction
Walter Scheirer: digital humanities and handwriting recognition
John Lalor (ITAO): NLP and biomedical informatics

Courses

CSE 40657/60657, Natural Language Processing, Prof. David Chiang
CSE 40982, Interactive Dialogue Systems, Prof. Collin McMillan
ITAO 40250, Unstructured Data Analytics, Prof. John Lalor
AL 20301, Introduction to Linguistics, Prof. Hana Kang