Natural language processing (NLP) aims to enable computers to use human languages, so that people can, for example, interact with computers naturally, communicate with others with whom they share no common language, or access speech or text data at scales not otherwise possible. The NLP group at Notre Dame is interested in all aspects of NLP, with a focus on machine translation and connections with formal language theory.
The NLP group co-sponsors NL+, the Natural Language Processing Lunch Seminar.
Current Members
Former Members
- Brian DuSell (PhD 2023 → ETH Zürich)
- Colin McDonald (BA 2023 → CMU)
- Patrick Soga (BS 2022 → UVA)
- Xing Jie Zhong (MS 2021 → Google)
- Toan Q. Nguyen (PhD 2021 → Amazon → Zoom)
- Justin DeBenedetto (PhD 2021 → asst. prof. Villanova)
- Chan Hee (Luke) Song (BS 2020 → OSU)
- Kenton Murray (PhD 2020 → JHU)
- Antonios Anastasopoulos (PhD 2019 → postdoc CMU → asst. prof. GMU)
- Arturo Argueta (PhD 2019 → Apple)
- Tomer Levinboim (PhD 2017 → Google)
- Xiang Zhou (summer intern 2017 → UNC)
- Cindy Xinyi Wang (BS 2017 → PhD CMU → Google)
- Ashish Vaswani (PhD 2014 at USC → USC ISI → Google Brain → Adept AI → Essential AI)
Projects
Expressivity of neural sequence models: Relating neural sequence models to automata, grammars, circuits, and logics. Collaboration with Peter Cholak and Anand Pillay.
Retrieval-augmented neural machine translation: Augmenting neural machine translation systems by retrieving and using data beyond parallel text. Collaboration with Meng Jiang. Sponsored by NSF.
Natural language (variety) processing: Collaboration with Antonis Anastasopoulos (GMU) and Yulia Tsvetkov (UW). Sponsored by NSF.
Language documentation with an AI helper: Collaboration with Antonis Anastasopoulos and Geraldine Walther (GMU). Sponsored by NSF.
Differentiable, probabilistic programming with recursive structured models: Collaboration with Chung-chieh Shan (IU). Sponsored by NSF.
NLP on medieval texts: Analysis of Latin texts and language modeling for OCR of Latin manuscripts. Collaborations with Walter Scheirer and Hildegund Müller. Sponsored by Notre Dame FRSP.
Recent Publications
Andy Yang, Lena Strobl, David Chiang, and Dana Angluin.
Simulating hard attention using soft attention.
arXiv:2412.09925.
Andy Yang, David Chiang, and Dana Angluin.
Masked hard-attention transformers recognize exactly the star-free languages.
In Proc. NeurIPS. 2024.
To appear.
Ken Sible and David Chiang.
Improving rare word translation with dictionaries and attention masking.
In Proc. AMTA. 2024.
Andy Yang and David Chiang.
Counting like transformers: compiling temporal counting logic into softmax transformers.
In Proc. CoLM. 2024.
Aarohi Srivastava and David Chiang.
We're calling an intervention: taking a closer look at language model adaptation to different types of linguistic variation.
2024.
arXiv:2404.07304.
Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, and Ashish Sabharwal.
Transformers as transducers.
Transactions of the Association for Computational Linguistics, 2024.
To appear.
Tatsuya Aoyama, Chihiro Taguchi, and Nathan Schneider.
J-SNACS: adposition and case supersenses for Japanese joshi.
In Proc. LREC-COLING, 9604–9614. 2024.
Chihiro Taguchi and David Chiang.
Language complexity and speech recognition accuracy: orthographic complexity hurts, phonological complexity doesn't.
In Proc. ACL. 2024.
Outstanding Paper Award and Senior Area Chair Award.
Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava, Kabir Ahuja, David Chiang, Yulia Tsvetkov, and Antonios Anastasopoulos.
DIALECTBENCH: a NLP benchmark for dialects, varieties, and closely-related languages.
In Proc. ACL. 2024.
Social Impact Award.
Stephen Bothwell, Brian DuSell, David Chiang, and Brian Krostenko.
PILA: a historical-linguistic dataset of Proto-Italic and Latin.
In Proc. LREC-COLING, 12749–12760. 2024.
Chihiro Taguchi, Jefferson Saransig, Dayana Velásquez, and David Chiang.
KILLKAN: the automatic speech recognition dataset for Kichwa with morphosyntactic information.
In Proc. LREC-COLING, 9753–9763. 2024.
Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin.
What formal languages can transformers express? A survey.
Transactions of the Association for Computational Linguistics, 12:543–561, 2024.
doi:10.1162/tacl_a_00663.
Brian DuSell and David Chiang.
Stack attention: improving the ability of transformers to model hierarchical patterns.
In Proc. ICLR. 2024.
Spotlight paper.
Stephen Bothwell, Justin DeBenedetto, Theresa Crnkovich, Hildegund Müller, and David Chiang.
Introducing rhetorical parallelism detection: a new task with datasets, metrics, and baselines.
In Proc. EMNLP, 5007–5039. 2023.
doi:10.18653/v1/2023.emnlp-main.305.
Aarohi Srivastava and David Chiang.
BERTwich: extending BERT's capabilities to model dialectal and noisy text.
In Findings of ACL: EMNLP. 2023.
Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, and David Chiang.
Universal automatic phonetic transcription into the International Phonetic Alphabet.
In Proc. INTERSPEECH. 2023.
doi:10.21437/Interspeech.2023-2584.
David Chiang, Peter Cholak, and Anand Pillay.
Tighter bounds on the expressivity of transformer encoders.
In Proc. ICML, 5544–5562. 2023.
Aarohi Srivastava and David Chiang.
Fine-tuning BERT with character-level noise for zero-shot transfer to dialects and closely-related languages.
In Proc. Workshop on NLP for Similar Languages, Varieties and Dialects. 2023.
David Chiang, Colin McDonald, and Chung-chieh Shan.
Exact recursive probabilistic programming.
PACMPL, 2023.
doi:10.1145/3586050.
David Chiang, Alexander M. Rush, and Boaz Barak.
Named tensor notation.
Transactions on Machine Learning Research, January 2023.
Alexandra Butoi, Brian DuSell, Tim Vieira, Ryan Cotterell, and David Chiang.
Algorithms for weighted pushdown automata.
In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proc. EMNLP, 9669–9680. 2022.
doi:10.18653/v1/2022.emnlp-main.656.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, and others.
Beyond the Imitation Game: quantifying and extrapolating the capabilities of language models.
Transactions on Machine Learning Research, 2023.
David Chiang and Peter Cholak.
Overcoming a theoretical limitation of self-attention.
In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proc. ACL, volume 1, 7654–7664. 2022.
doi:10.18653/v1/2022.acl-long.527.
Language and Computation at Notre Dame
People
- Meng Jiang: summarization and generation
- Toby Li: human-computer interaction
- Walter Scheirer: digital humanities and handwriting recognition
- John Lalor (ITAO): NLP and biomedical informatics
Courses
- CSE 40657/60657, Natural Language Processing, Prof. David Chiang
- CSE 40982, Interactive Dialogue Systems, Prof. Collin McMillan
- ITAO 40250, Unstructured Data Analytics, Prof. John Lalor
- AL 20301, Introduction to Linguistics, Prof. Hana Kang