NLP+CSS 201: Beyond the basics
This website hosts the upcoming tutorial series on advanced NLP methods for computational social science scholars.
Every few weeks, experts in computational social science will present a new NLP method and lead participants in an interactive exploration of the method with code and sample text data. If you are a graduate student or researcher with some introductory knowledge of NLP (e.g., you have learned text analysis from SICSS) and want to “level up”, come join us!
- Introduction: the facilitators introduce their method and the code/data associated with the method.
- Interaction: the participants break out into small groups to test out the method using the provided code. The code includes several spots for experimenting with the method, which can help the participants better understand the benefits and limitations of the method. The participants may also bring their own data for analysis if desired.
- Conclusion: the facilitators bring all participants back together to share their experiences with the method and final thoughts on possible improvements to it.
Tutorials will last one hour and we encourage participants to join live. However, we will also make the recordings and code publicly available afterwards.
- 10/13 (3PM EST/UTC-7): Comparing Word Embedding Models, with Connor Gilroy and Sandeep Soni
- We’ll extend the basic use of word embedding models by fitting multiple models on a social science corpus (using gensim’s word2vec implementation), then aligning and comparing those models. This method can be used to explore group variation and temporal change. We’ll also discuss some tradeoffs and possible extensions of this approach.
- Pre-reading: Introductory tutorial on word embeddings
- Recorded video, Colab notebook (code and slides)
- 10/27: Extracting Information from Documents, with Andrew Halterman
- This workshop provides an introduction to information extraction for social science: techniques for identifying specific words, phrases, or pieces of information contained within documents. It focuses on two common techniques, named entity recognition and dependency parsing, and shows how they can provide useful descriptive data about the civil war in Syria. The workshop uses the Python library spaCy, but no previous experience is needed beyond familiarity with Python.
- Pre-reading: n/a
- 11/10: Controlling for Text in Causal Inference with Double Machine Learning, with Emaad Manzoor
- Establishing causal relationships is a fundamental goal of scientific research. Text plays an increasingly important role in the study of causal relationships across domains, especially for observational (non-experimental) data. Specifically, text can serve as a valuable “control” to eliminate the effects of variables that threaten the validity of the causal inference process. But how does one control for text, an unstructured and nebulous quantity? In this tutorial, we will learn about bias from confounding, discuss the motivation for using text as a proxy for confounders, apply a “double machine learning” framework that uses text to remove confounding bias, and compare this framework with non-causal text dimensionality reduction alternatives such as topic modeling.
- Pre-reading: Survey of text as a confounder (Keith et al. 2020), Application with text in causal inference (Manzoor et al. 2020), Explanation of Double Machine Learning (slides by Chris Felton)
- 11/24: Beyond the Bag Of Words: Text Analysis with Contextualized Topic Models, with Silvia Terragni
- Most topic models still use Bag-Of-Words (BoW) document representations as input. These representations, though, disregard the syntactic and semantic relationships among the words in a document, the two main linguistic avenues to coherent text. Recently, pre-trained contextualized embeddings, which map a sentence to a vector representation, have enabled exciting new results in several NLP tasks. Contextualized Topic Models (CTM) combine contextualized embeddings with neural topic models to increase the quality of the topics. Moreover, using multilingual embeddings allows the model to learn topics in one language and predict them for documents in unseen languages, thus addressing the task of zero-shot cross-lingual topic modeling.
- Pre-reading (optional): “Pretraining is a Hot Topic”, “Cross-lingual contextualized topic models”
- 12/8: BERT for Computational Social Scientists, with Maria Antoniak
- What is BERT? How do you use it? What kinds of computational social science projects would BERT be most useful for? Join for a conceptual overview of this popular natural language processing (NLP) model as well as a hands-on, code-based tutorial that demonstrates how to train and fine-tune a BERT model using HuggingFace’s popular Python library.
- Pre-reading: n/a
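
For a flavor of the "aligning and comparing" step described in the word embeddings session, here is a minimal sketch using orthogonal Procrustes alignment. It is not the facilitators' notebook code, and the choice of Procrustes is an assumption about how the alignment might be done. The two embedding matrices are simulated stand-ins (random vectors, with one word deliberately changed) for the vectors you would pull from two separately trained word2vec models over a shared vocabulary. All names are hypothetical.

```python
import numpy as np

def procrustes_align(source, target):
    """Rotate `source` onto `target` with the orthogonal matrix R
    that minimizes ||source @ R - target|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two matrices."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)

# Simulated data (assumption): rows are words in a shared vocabulary.
rng = np.random.default_rng(0)
n_words, n_dims = 500, 16
emb_a = rng.normal(size=(n_words, n_dims))

# Model B lives in an arbitrarily rotated space, as happens when two
# embedding models are trained independently.
random_rotation, _ = np.linalg.qr(rng.normal(size=(n_dims, n_dims)))
emb_b = emb_a @ random_rotation

# Pretend one word is used very differently in the second corpus.
changed_index = 7
emb_b[changed_index] = rng.normal(size=n_dims)

aligned_a = procrustes_align(emb_a, emb_b)
similarity = rowwise_cosine(aligned_a, emb_b)
print("lowest-similarity word index:", int(np.argmin(similarity)))
```

After alignment, words whose vectors still disagree across the two models are candidates for group variation or temporal change. With real models, the rows would be the vectors of the words that both vocabularies share.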
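
The "double machine learning" idea from the causal inference session can be sketched in the same spirit. The example below is a simplified partialling-out estimator with cross-fitting, run on simulated data. It is an illustration under stated assumptions, not the method as taught in the tutorial: the text features are random numbers standing in for document representations (e.g., standardized word counts), ridge regression is a stand-in for whatever learner one would actually use, and the true effect of 2.0 is set by hand.

```python
import numpy as np

def ridge_fit_predict(x_train, y_train, x_test, penalty=1.0):
    """Closed-form ridge regression; returns predictions for x_test."""
    gram = x_train.T @ x_train + penalty * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ y_train)
    return x_test @ weights

def double_ml_effect(text_features, treatment, outcome, n_folds=2, seed=0):
    """Residualize treatment and outcome on the text features using
    out-of-fold predictions, then regress residual on residual."""
    n = len(outcome)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    t_resid = np.empty(n)
    y_resid = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        t_resid[test] = treatment[test] - ridge_fit_predict(
            text_features[train], treatment[train], text_features[test])
        y_resid[test] = outcome[test] - ridge_fit_predict(
            text_features[train], outcome[train], text_features[test])
    return (t_resid @ y_resid) / (t_resid @ t_resid)

# Simulated data (assumption): text features confound treatment and outcome.
rng = np.random.default_rng(1)
n_docs, n_features, true_effect = 4000, 20, 2.0
text_features = rng.normal(size=(n_docs, n_features))
to_treatment = rng.uniform(0.5, 1.0, size=n_features)
to_outcome = rng.uniform(0.5, 1.0, size=n_features)
treatment = text_features @ to_treatment + rng.normal(size=n_docs)
outcome = (true_effect * treatment + text_features @ to_outcome
           + rng.normal(size=n_docs))

# Naive estimate: regress outcome on treatment, ignoring the text.
t_centered = treatment - treatment.mean()
naive_estimate = (t_centered @ (outcome - outcome.mean())) / (t_centered @ t_centered)
dml_estimate = double_ml_effect(text_features, treatment, outcome)
print(f"naive: {naive_estimate:.2f}, double ML: {dml_estimate:.2f}")
```

In this simulation the naive estimate absorbs the confounding that flows through the text features, while the residual-on-residual estimate should land much closer to the effect built into the data.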
This tutorial series is organized by:
- Ian Stewart: post-doctoral fellow at University of Michigan; researches personalization and writing interventions for social applications
- Katie Keith: post-doctoral researcher at AI2 and incoming Assistant Professor at Williams College (Fall 2022); researches causal inference with text and general text-based computational social science applications
We will send out the tutorial video link to our mailing list a few days before the tutorial starts. If you want to join the mailing list, subscribe here.
We are deeply grateful for financial assistance from a Social Science Research Council (SSRC)/Summer Institutes in Computational Social Science (SICSS) Research Grant.
Website theme adapted from Bulma Clean Theme.