Knowledge Base Population (KBP) aims to extract structured knowledge from unstructured data and use it to populate a knowledge base. The state-of-the-art KBP paradigm has three steps: (1) expert linguists define a schema specifying “what to link/fill in,” such as entity and slot types, for a specific data collection based on the needs of potential users and stakeholders, and write annotation guidelines for each type in the schema; (2) human annotators follow the guidelines to annotate a certain number of documents (a typical size is 500 documents); (3) researchers write heuristic rules, or design features and train supervised learning models, from these manually annotated data. This paradigm is not fully automatic because the first two steps require humans in the loop. Both steps are very expensive, yet such a predefined schema can cover only a limited number of types. In addition, traditional KBP systems depend heavily on linguistic resources tuned to the predefined schema, so they scale and port poorly to a new language, domain, or genre. We propose a new “Liberal” KBP paradigm that combines the merits of traditional KBP/Information Extraction (IE) (high quality and fine granularity) with those of Open IE (high scalability). A Liberal KBP system can simultaneously discover a domain-rich schema and extract information units with fine-grained types. It operates from a “cold start” (or with minimal supervision from existing knowledge bases or schemas) and can be adapted to any domain, genre, or language without any human-annotated data. The only input to a Liberal KBP system is an arbitrary corpus, without any supervision, restriction, or prior knowledge of its size, topic, or domain. The output also includes a schema discovered from the input corpus itself. The schema contains a flexible hierarchy of information unit types at multiple levels of granularity.
Following the general principle of effectively leveraging both corpus statistics and linguistic knowledge, Liberal KBP combines symbolic semantics with knowledge discovered from distributional semantics using unsupervised learning. Experiments on multiple low-resource languages, multiple domains, and multiple genres demonstrate that Liberal KBP can discover new and more fine-grained schemas than both traditional KBP/IE and Open IE, and can construct high-quality knowledge graphs for a new language/domain/genre overnight. Finally, I will present a detailed quantitative and qualitative analysis of the remaining challenges and sketch future research directions for KBP.
Heng Ji is the Edward P. Hamilton Development Chair Associate Professor in the Computer Science Department of Rensselaer Polytechnic Institute. She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing and its connections with Data Mining, Network Science, Social Cognitive Science, Security and Vision. She received the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, Google Research Awards in 2009 and 2014, a Sloan Junior Faculty Award in 2012, IBM Watson Faculty Awards in 2012 and 2014, and the "Best of SDM2013" and "Best of ICDM2013" paper awards. She has been coordinating the NIST TAC Knowledge Base Population task since 2010. She served as the Information Extraction area chair for NAACL2012, ACL2013, EMNLP2013, NLPCC2014, EMNLP2015, WWW2015, NAACL2016 and ACL2016, the vice Program Committee Chair for IEEE/WIC/ACM WI2013 and CCL2015, the Financial Chair of IJCAI2016 and the Program Committee Chair of NLPCC2015. Her research is funded by the U.S. government (NSF, ARL, DARPA, AFRL and DHS) and industry (Google, Disney, IBM and Bosch).
Speech recognition has improved dramatically in recent years, but spoken dialog systems that we want to converse with have not materialized. When people converse, their words are less important than the goals they pursue in speaking to each other. Goals and intentions, however, are much more difficult to interpret than words. I argue that it is not necessary for machines to engage in the same kinds of inference that humans use in conversation. Rather, machines can become more proficient dialog partners by capitalizing on the differences between humans and machines. This talk presents two experiments that illustrate gains in human-dialog performance based on a machine's ability to inspect what it knows. The first experiment demonstrates that a dialog system can have very successful conversations despite very poor speech recognition. The second demonstrates that efficient dialog strategies, meaning knowledge about how to achieve dialog goals, can be derived directly from the structured knowledge a dialog system has access to.
Rebecca J. Passonneau is the Director of the Center for Computational Learning Systems (CCLS) at Columbia University, and a Senior Research Scientist. Her research addresses how the same combination of words has different meanings in different contexts, for both written and spoken language. This question is a particularly challenging object of study, because context has many dimensions. Her recent work investigates content assessment for summarization and reading and writing skills, meaning and action in human-machine dialogue, data mining that links textual and non-textual sources, and word sense annotation and representation. She received her Ph.D. from the Department of Linguistics at the University of Chicago in 1985, and worked at several industry and academic research labs before joining CCLS in 2006. She has over 100 publications in journals and refereed conference proceedings, and has been Principal Investigator or co-Principal Investigator on 17 sponsored projects with funding from 11 government and corporate sources.
The popularity of Twitter and other social media platforms has yielded vast amounts of "conversational" text in electronic form. Researchers have tried to use this data to predict flu outbreaks, elections, movie revenues, public approval ratings, consumer confidence indicators, and more. All of these applications require some level of natural language processing on the raw text from Twitter.
In this talk I'll describe my research on entry-level NLP for Twitter, including the design of a part-of-speech tag set, its annotation, and an automatic tagger. The tagger produces state-of-the-art results on several datasets of conversational text by using word representations trained on a large set of unlabeled tweets. I'll also briefly discuss our work on using tweets to predict which team to bet on in NFL games, and highlight results on other social media datasets.
This is joint work with several collaborators at Carnegie Mellon University.
Kevin Gimpel is a research assistant professor at the Toyota Technological Institute at Chicago. He received his PhD in 2012 from the Language Technologies Institute at Carnegie Mellon, where he was advised by Noah Smith. His research focuses on natural language processing, with applications such as machine translation, syntactic analysis of social media, and text-driven forecasting of real-world events. He also works on machine learning motivated by NLP, including approximate inference for structured prediction and learning criteria for supervised and unsupervised learning. He was an organizer of the 2013 and 2014 Midwest Speech and Language Days workshops.
Texts propagate through many social networks and provide evidence for their structure. We describe and evaluate efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. While these methods are widely applicable, we will focus on two case studies: analyzing the culture of free reprinting in the nineteenth-century United States and the development of bills into legislation in the U.S. Congress. Using these divergent case studies, we evaluate both the efficiency of the approximate local text reuse detection methods and the accuracy of the results. These techniques allow us to explore how ideas spread, which ideas spread, and which subgroups shared ideas.
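The abstract does not say which algorithms the talk uses. As a rough illustration only, one common first step in local text reuse detection is to index word n-grams ("shingles") and keep document pairs that share several of them as candidates for closer alignment. The sketch below assumes that approach; the function names, the 5-word window, and the `min_shared` threshold are all hypothetical choices, not details from the talk.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidate_pairs(docs, n=5, min_shared=2):
    """Count shared n-grams per document pair.

    docs maps a document id to its text. Only pairs sharing at least
    min_shared n-grams are returned (illustrative threshold).
    """
    index = defaultdict(set)  # n-gram -> ids of documents containing it
    for doc_id, text in docs.items():
        for gram in shingles(text, n):
            index[gram].add(doc_id)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared n-grams
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair: count for pair, count in shared.items() if count >= min_shared}


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river",
    "b": "yesterday the quick brown fox jumps over the lazy dog again",
    "c": "completely unrelated text about congress and bills passed today",
}
pairs = candidate_pairs(docs)
print(pairs)
```

In a real collection, very frequent n-grams would produce too many pairs, so a practical system would likely filter or sample them before pairing; that refinement is left out here.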
David Smith is an assistant professor of computer science at Northeastern University and a founding member of the NULab for Texts, Maps, and Networks, Northeastern's center for digital humanities and computational social science. Before getting his PhD in CS, he worked in humanities computing—i.e., digital humanities before it was cool—at the Perseus Digital Library Project.
Natural language processing enables computers to learn how to interact with humans in their language (like Apple Siri), to translate between human languages (like Google Translate), and to extract information from large amounts of language data (like IBM Watson). To do these things, they need a mathematical-computational framework for structuring and interpreting language data. This talk will be in two parts: the first will be a preview of a course to be offered in Spring 2015 on natural language processing, and the second will be on a current area of research.
(1) In 1956, linguist Noam Chomsky identified a hierarchy of automata/grammars that might serve as models of how the human mind processes language. Since then, this hierarchy has been deeply studied in theoretical computer science (it is the subject of CSE 30151 here), but it is also incredibly useful in natural language technologies. I will give a tour of how the first two levels in the hierarchy, finite-state automata and context-free grammars, for strings and trees, respectively, form the basis for a wide array of applications in natural language processing.
(2) Although computational models of meaning (or semantics) are quite well studied, models for learning semantics from data are still relatively immature. What is needed is a set of tools, parallel to the tools we already have for strings and trees, for processing graphs. I will present some results from our ongoing search for a formal framework that would support such a set of tools. I am also seeking feedback on how such tools might be useful in other areas of network science outside of natural language.
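To make the first level of the hierarchy in part (1) concrete, here is a minimal sketch of a deterministic finite-state automaton over strings of part-of-speech tags. It accepts the pattern "optional determiner, any number of adjectives, one or more nouns". The tag names, state names, and the pattern itself are assumptions chosen for illustration and are not taken from the talk.

```python
# Hypothetical transition table for the tag pattern DET? ADJ* NOUN+
TRANSITIONS = {
    ("start", "DET"): "pre_noun",
    ("start", "ADJ"): "pre_noun",
    ("start", "NOUN"): "noun",
    ("pre_noun", "ADJ"): "pre_noun",
    ("pre_noun", "NOUN"): "noun",
    ("noun", "NOUN"): "noun",
}
ACCEPTING = {"noun"}


def accepts(tags):
    """Return True if the tag sequence drives the automaton to an accepting state."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


print(accepts(["DET", "ADJ", "NOUN"]))  # True
print(accepts(["ADJ", "DET", "NOUN"]))  # False
```

The automaton keeps only a single current state, which is why it can handle flat patterns like this one but not nested structure; that limitation is what motivates moving up to context-free grammars for trees.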