Title: Inducing Complex Instructions
Abstract: Instruction following has emerged as an effective alternative to fine-tuning large language models (LLMs), as it is interpretable, requires fewer resources, and demands less expertise than fine-tuning. In practice, however, many tasks require complex instructions made up of many smaller atomic instructions, which are difficult to write for users with limited prompt engineering experience. Further, the more complex the task, the more trial and error is needed to accommodate corner cases. Existing approaches to this problem use Instruction Induction, where input-output pairs for a given task are provided with the goal of learning the instruction that maps from inputs to outputs. However, existing approaches struggle significantly to induce effective instructions for complex tasks. In this presentation, I cover the existing approaches to Instruction Induction/Prompt Optimization, introduce my proposed approach to address their limitations, and explore the exciting downstream datasets that can be created for complex instruction induction.
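As a rough sketch of the induction loop the abstract describes (propose candidate instructions from demonstrations, then keep the one that best reproduces the outputs), something like the following could be written; `propose_llm` and `execute_llm` are hypothetical stand-ins for real LLM calls, not the speaker's actual implementation:

```python
def induce_instruction(pairs, propose_llm, execute_llm, n_candidates=4):
    """Propose candidate instructions from input-output pairs; keep the best.

    pairs: list of (input, expected_output) demonstrations.
    propose_llm(prompt) -> str: samples one candidate instruction.
    execute_llm(instruction, x) -> str: applies an instruction to an input.
    """
    demos = "\n".join(f"Input: {x} Output: {y}" for x, y in pairs)
    prompt = ("I gave a friend an instruction. Based on these examples, "
              f"what was the instruction?\n{demos}")
    candidates = [propose_llm(prompt) for _ in range(n_candidates)]

    def accuracy(instruction):
        # Score a candidate by how many demonstrations it reproduces.
        return sum(execute_llm(instruction, x) == y for x, y in pairs) / len(pairs)

    return max(candidates, key=accuracy)
```

The hard case the talk targets is when no single short candidate covers all the atomic sub-instructions, so simple best-of-n selection like this plateaus.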
Bio: Noah Ziems is a second year PhD student in the Department of Computer Science and Engineering at the University of Notre Dame, and is a member of Dr. Meng Jiang’s DM2 lab. His research is focused on instruction induction/prompt engineering, large language models, and information retrieval.
Title: Language Identification for Under-Resourced Languages
Abstract: Knowing the language of an input text/audio is a necessary first step for using almost any NLP tool such as taggers, parsers, or translation systems. This is called “language identification (LangID)” and is a well-studied problem, sometimes even considered solved. However, even now, state-of-the-art systems cannot accurately identify most of the world’s 7000 languages, which demonstrates scope for improvement in LangID modeling and design. In this talk, we will present our misprediction-based hierarchical model (LIMIT) that can improve predictions from LangID systems in post-processing. Our technique reduces error by 40-55% on various benchmarks and can be useful in reducing the needle-in-a-haystack nature of low-resource data. We will also discuss our follow-up work that addresses heavy script-reliance in current LangID systems, a flaw that, in our view, hurts low-resource languages the most, and how it might be beneficial to adopt script-agnostic LangID, especially for Indian languages.
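One way to picture misprediction-based hierarchical post-processing, as the abstract summarizes it: when a base LangID model emits a label it is known to confuse with related languages, a specialized child classifier re-decides among that confusable set. This is a minimal sketch under that assumption, not the actual LIMIT architecture:

```python
def hierarchical_langid(text, base_model, experts):
    """Post-process a base LangID prediction with per-label expert classifiers.

    base_model(text) -> str: coarse language label.
    experts: maps labels that are frequently mispredicted to a classifier
             that disambiguates among the confusable set.
    """
    label = base_model(text)
    if label in experts:
        # The base label is known to absorb related low-resource languages;
        # let the expert make the final call.
        return experts[label](text)
    return label
```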
Bio: Milind Agarwal is a second-year Ph.D. student in the NLP group at George Mason University, advised by Dr. Antonios Anastasopoulos. His research interests are centered around foundational problems such as language identification and scalable resource creation using OCR. He is currently working on developing techniques to better identify extremely under-resourced languages in the wild, and using these to better extract existing data online.
Title: Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning
Abstract: As the domain of large language models (LLMs) continues to evolve, tailoring these models to meet individual user preferences has emerged as a critical endeavor. This presentation will explore advances in LLM personalization, introducing our contribution: One PEFT Per User (OPPU). OPPU introduces a novel approach to store and capture user-specific behavior patterns and preferences within personal PEFT modules. OPPU enhances the LLMs’ capability to capture the nuances of user behavior, especially in complex contexts, and empowers users with model ownership, effectively mitigating customization constraints. Experiments on the LaMP benchmark across seven varied tasks showcase OPPU’s superiority over conventional prompt-based personalization methods. We will explore the mechanisms behind OPPU’s success, including its adaptability to shifts in user behavior, its effectiveness across a spectrum of user engagement levels, its robustness in handling diverse user history data, and the versatility of OPPU with different PEFT strategies.
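The core "one PEFT module per user" idea can be illustrated with a toy model: a frozen base is shared by everyone, and each user owns a small parameter delta that is applied at inference time. The class and field names below are illustrative, not OPPU's actual code:

```python
class PersonalizedLM:
    def __init__(self, base_weights):
        self.base = dict(base_weights)   # frozen backbone, shared by all users
        self.user_peft = {}              # user_id -> small per-user delta dict

    def add_user(self, user_id, delta):
        # The delta is tiny relative to the base, and the user owns it,
        # which is what enables portability and model ownership.
        self.user_peft[user_id] = dict(delta)

    def weights_for(self, user_id):
        """Base weights plus this user's personal adapter (if any)."""
        w = dict(self.base)
        for name, d in self.user_peft.get(user_id, {}).items():
            w[name] = w[name] + d
        return w
```

Swapping adapters rather than prompts is what distinguishes this from prompt-based personalization: the user signal lives in parameters, not in the context window.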
Bio: Zhaoxuan Tan is a first-year CSE PhD student at Notre Dame, advised by Prof. Meng Jiang. His research interest lies at the intersection of NLP and data mining, with a focus on user modeling. He is currently working on personalizing large language models. He has published multiple papers in top-notch AI conferences, such as NeurIPS, EMNLP, ACL, WWW, AAAI, and SIGIR.
Title: Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis
Abstract: After a large language model (LLM) is deployed on edge devices, it is desirable for these devices to learn from user-generated conversation data to generate user-specific and personalized responses in real-time. However, user-generated data usually contains sensitive and private information, and uploading such data to the cloud for annotation is not preferred, if not prohibited. While it is possible to obtain annotation locally by directly asking users to provide preferred responses, such annotations have to be sparse so as not to affect user experience. In addition, the storage of edge devices is usually too limited to enable large-scale fine-tuning with full user-generated data. It remains an open question how to enable on-device LLM personalization, considering sparse annotation and limited on-device storage. In this paper, we propose a novel framework to select and store the most representative data online in a self-supervised way. Such data has a small memory footprint and allows infrequent requests of user annotations for further fine-tuning. To enhance fine-tuning quality, multiple semantically similar pairs of question texts and expected responses are generated using the LLM. Our experiments show that the proposed framework achieves the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines. To the best of our knowledge, this is the first on-device LLM personalization framework.
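The storage-constrained online selection problem can be sketched with a fixed-capacity buffer: each incoming conversation is scored by a label-free (self-supervised) criterion, and only the top-scoring samples are kept for later annotation. The scoring function here is a stand-in; the paper's actual representativeness criterion may differ:

```python
import heapq

def update_buffer(buffer, item, score, capacity):
    """Keep the `capacity` highest-scoring items seen so far.

    buffer is a min-heap of (score, item); the lowest-scoring sample is
    always the first to be evicted when the device runs out of room.
    """
    heapq.heappush(buffer, (score(item), item))
    if len(buffer) > capacity:
        heapq.heappop(buffer)  # drop the least useful sample
    return buffer
```

Because the buffer is small and fixed, annotation requests stay infrequent, matching the sparse-annotation constraint in the abstract.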
Bio: Ruiyang Qin is a second-year PhD student in Computer Science and Engineering (CSE) at the University of Notre Dame, working in Dr. Yiyu Shi’s Sustainable Computing Laboratory. He earned both his Bachelor of Science (BS) and Master of Science (MS) degrees in Computer Science from Georgia Tech. His research focuses on enabling personalized deep learning models on edge devices, with a particular interest in their practical applications in healthcare. Ruiyang has published papers in prestigious conferences including the Design Automation Conference (DAC), the Asia and South Pacific Design Automation Conference (ASP-DAC), and the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), showcasing his significant contributions to the field.
Title: TOFU: A Task of Fictitious Unlearning for LLMs
Abstract: Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.
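The benchmark's data layout, as described above, is easy to make concrete: 200 synthetic author profiles of 20 QA pairs each, partitioned into a forget set and a retain set. The field names and the 10% forget fraction below are illustrative; only the counts come from the abstract:

```python
def split_forget_set(profiles, forget_fraction=0.1):
    """Partition author profiles into (forget, retain) sets."""
    n_forget = int(len(profiles) * forget_fraction)
    return profiles[:n_forget], profiles[n_forget:]

# 200 synthetic authors, each with 20 question-answer pairs (per the abstract).
profiles = [
    {"author": f"author_{i}",
     "qa": [(f"question_{j}", f"answer_{j}") for j in range(20)]}
    for i in range(200)
]
forget_set, retain_set = split_forget_set(profiles)
```

Because the authors are fictitious, a model's only source of knowledge about the forget set is the fine-tuning data itself, which is what makes unlearning measurable.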
Bio: Zheyuan (Frank) Liu is a first-year PhD student in Dr. Jiang’s DM2 Lab at Notre Dame. His research interests are centered around LLM security, LLM privacy, and responsible AI. He is currently working on developing a novel framework to efficiently unlearn harmfulness from large language models.
Title: Cross-Lingual Biases and Cultural Understanding in LLMs
Abstract: Human biases are ubiquitous but not uniform: disparities exist across linguistic, cultural, and societal borders. As a large body of recent literature suggests, language models (LMs) trained on human data can reflect and often amplify the effects of these social biases. However, most existing studies on bias are heavily skewed towards Western and European languages. In our work (EMNLP’23), we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies and yielding interesting findings about LM bias. We additionally enhance this data with culturally relevant information for each language, capturing local contexts on a global scale. Further, to encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more. We also briefly discuss an extension of this work that explores cultural understanding in LLMs and how that is linked with language.
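For readers unfamiliar with WEAT, the test statistic being scaled to 24 languages is standard: each target word w gets an association score s(w, A, B), its mean cosine similarity to attribute set A minus that to attribute set B, and the statistic sums these over the two target sets X and Y. A minimal implementation over raw embedding vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def association(w, A, B):
    """s(w, A, B): mean similarity to A minus mean similarity to B."""
    return (sum(cosine(w, a) for a in A) / len(A)
            - sum(cosine(w, b) for b in B) / len(B))

def weat_statistic(X, Y, A, B):
    """Differential association of target sets X, Y with attributes A, B."""
    return (sum(association(x, A, B) for x in X)
            - sum(association(y, A, B) for y in Y))
```

The cross-lingual contribution lies in the translated and culturally adapted word sets, not in the statistic itself, which is unchanged across languages.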
Bio: Anjishnu Mukherjee is a second-year Ph.D. student in the NLP group at George Mason University, advised by Dr. Antonios Anastasopoulos. His research interests are centered around cross-lingual understanding of culture and related social biases. He is currently working on developing metrics for nuanced measurements of cultural differences and ways to mitigate them in language models.
Title: On the Expressivity of Transformer Encoders
Abstract: Transformers have gained prominence in natural language processing (NLP), both in direct applications like machine translation and in pretrained models like BERT and GPT. Lately, empirical work has noted significant limits, heuristics, and perplexing behavior in transformer models. Thus, formal investigation into these models’ theoretical properties can provide valuable insight into what these models can and can’t do. Today, we explore recent developments towards understanding the formal expressivity of transformers.
Bio: Andy J Yang is a first-year PhD student in the NLP lab at Notre Dame, advised by David Chiang. He is interested in linguistics, model theory, machine learning, and their intersections. He hopes theoretical insights will enable researchers and engineers to reliably create helpful language-processing systems.
Title: Building an Automatic Speech Recognition Dataset and Model for an Extremely Low-Resource Language
Abstract: Modern NLP research and applications have centered around a handful of high-resource languages, while more than 99% of the world’s languages are seldom or never mentioned. Ideally, NLP should include all human languages equally, insofar as NLP encompasses natural languages as a scientific field. The biggest obstacle to this ideal is the lack of readily available data, commonly called being “low-resource” in the NLP literature, due to the small number of speakers, absence of orthography, technological underdevelopment, sociopolitical marginalization, and language endangerment and extinction. In this presentation, we introduce our effort to develop an Automatic Speech Recognition (ASR) model for Kichwa, an endangered language spoken in Ecuador, in collaboration with the speaker community. We created the first Kichwa ASR dataset by collecting spoken Kichwa data online and annotating it with transcriptions. Our experiments show that a model trained on only four hours of audio data with a pretrained multilingual model (Wav2Vec2-XLSR-53) performs well, with a 3.25% character error rate. This grassroots project exemplifies the inclusion of marginalized languages in NLP for accelerated language documentation and revitalization.
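The 3.25% figure above is a character error rate (CER): the Levenshtein edit distance between the hypothesis and reference transcriptions, at the character level, divided by the reference length. A self-contained implementation of the metric (not the project's own evaluation code):

```python
def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Single-row dynamic program for Levenshtein distance.
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i           # prev = cell (i-1, j-1)
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            prev, dist[j] = dist[j], min(
                dist[j] + 1,                 # deletion
                dist[j - 1] + 1,             # insertion
                prev + cost,                 # substitution (or match)
            )
    return dist[n] / m
```

CER is the natural metric here rather than word error rate, since agglutinative morphology like Kichwa's makes word-level comparisons unforgiving.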
Bio: Chihiro Taguchi is a second-year Ph.D. student in the NLP group, advised by Dr. David Chiang. His research interests broadly include language sciences, in particular both text-based and speech-based NLP and theoretical linguistics. He is currently working on the project “Language Documentation with an AI Helper”. He studied the Kichwa language at Notre Dame in Spring 2023 and visited Ecuador for five weeks to further study Kichwa.
Title: Structured Generation and Exploration of Design Space with Large Language Models for Human-AI Co-Creation
Abstract: Thanks to their generative capabilities, large language models (LLMs) have become an invaluable tool for creative processes. These models have the capacity to produce hundreds or thousands of visual and textual outputs, offering abundant inspiration for creative endeavors. But are we harnessing their full potential? We argue that current interaction paradigms fall short, guiding users towards rapid convergence on a limited set of ideas rather than empowering them to explore the vast latent design space in generative models. To address this limitation, we propose a framework that facilitates the structured generation of a design space in which users can seamlessly explore, evaluate, and synthesize a multitude of responses. We demonstrate the feasibility and usefulness of this framework through the design and development of an interactive system, Luminate, and a user study with 8 professional writers. Our work advances how we interact with LLMs for creative tasks, introducing a way to harness the creative potential of LLMs.
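One way to read "structured generation of a design space": named dimensions (e.g., tone, point of view) with possible values span a space, and each point in the cross-product becomes one prompt variant to sample from the LLM. This is an illustrative sketch of that reading, not the actual Luminate implementation:

```python
from itertools import product

def design_space_prompts(base_prompt, dimensions):
    """Yield one prompt per point in the design space.

    dimensions: dict mapping a dimension name to its list of values,
                e.g. {"tone": ["dark", "playful"], "pov": ["first person"]}.
    """
    names = sorted(dimensions)
    for combo in product(*(dimensions[n] for n in names)):
        spec = ", ".join(f"{n}: {v}" for n, v in zip(names, combo))
        yield f"{base_prompt} ({spec})"
```

Laying responses out along explicit dimensions like this is what lets users browse the space systematically instead of converging on the first few samples.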
Bio: Meng Chen is a senior undergraduate studying Computer Science and Philosophy at the University of Notre Dame, where he is advised by Dr. Toby Jia-jun Li. Meng is interested in human-computer interaction and cognitive science. He is currently working on developing interfaces that reduce the cost of disambiguation and communication between humans and AI.
Title: BERTwich: Extending BERT’s Capabilities to Model Dialectal and Noisy Text
Abstract: Real-world NLP applications often deal with nonstandard text (e.g., dialectal, informal, or misspelled text). However, language models like BERT deteriorate in the face of dialect variation or noise. How do we push BERT’s modeling capabilities to encompass nonstandard text? Fine-tuning helps, but it is designed for specializing a model to a task and does not seem to bring about the deeper, more pervasive changes needed to adapt a model to nonstandard language. In this paper, we introduce the novel idea of sandwiching BERT’s encoder stack between additional encoder layers trained to perform masked language modeling on noisy text. We find that our approach, paired with recent work on including character-level noise in fine-tuning data, can promote zero-shot transfer to dialectal text, as well as reduce the distance in the embedding space between words and their noisy counterparts.
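The "sandwich" structure the abstract describes, new encoder layers placed before and after the pretrained stack, can be shown schematically. Layers here are just callables composed in order; this is a structural illustration, not the actual BERTwich code:

```python
def sandwich(pre_layers, bert_layers, post_layers):
    """Compose: new bottom layers -> pretrained BERT stack -> new top layers.

    The added outer layers are trained with masked language modeling on
    noisy text, while the middle stack keeps its pretrained knowledge.
    """
    layers = list(pre_layers) + list(bert_layers) + list(post_layers)

    def forward(x):
        for layer in layers:
            x = layer(x)
        return x

    return forward
```

Placing trainable layers on both sides of the stack is what distinguishes this from ordinary fine-tuning, which only nudges the existing layers.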
Bio: Aarohi Srivastava is a third-year PhD student in the NLP group, advised by David Chiang. She is particularly interested in language modeling in nonstandard and low-resource settings.