NLP+CSS 201 Tutorials

Tutorials for advanced natural language processing methods designed for computational social science research.


This website hosts the completed tutorial series for advanced NLP methods, conducted from fall 2021 through spring 2022. The series is intended for computational social science scholars with some introductory knowledge of text analysis (e.g., those who have learned it through SICSS). Watch past tutorials on our YouTube channel.

Paper

The co-organizers summarized the planning and outcome of the NLP+CSS tutorials in a paper, which distills 5 key principles and a variety of lessons learned to guide future tutorial series in the space of “ML+X” research. The paper was published at the Teaching for NLP workshop at KONVENS 2023, and can be accessed here.

Archive

See below for links to materials for previous tutorials.

Tutorial Description Leader Links
Comparing Word Embedding Models We’ll demonstrate how to extend word embedding models by fitting multiple models on a social science corpus (using gensim’s word2vec implementation), then aligning and comparing those models. This method is used to explore group variation and temporal change. We’ll discuss some tradeoffs and possible extensions of this approach. Connor Gilroy, Sandeep Soni Code; Video
Extracting Information from Documents This workshop provides an introduction to information extraction for social science: techniques for identifying specific words, phrases, or pieces of information contained within documents. It focuses on two common techniques, named entity recognition and dependency parsing, and shows how they can provide useful descriptive data about the civil war in Syria. The workshop uses the Python library spaCy, but no previous experience is needed beyond familiarity with Python. Andrew Halterman Code; Video
Controlling for Text in Causal Inference with Double Machine Learning Establishing causal relationships is a fundamental goal of scientific research. Text plays an increasingly important role in the study of causal relationships across domains, especially for observational (non-experimental) data. Specifically, text can serve as a valuable “control” to eliminate the effects of variables that threaten the validity of the causal inference process. But how does one control for text, an unstructured and nebulous quantity? In this tutorial, we will learn about bias from confounding and the motivation for using text as a proxy for confounders, apply a “double machine learning” framework that uses text to remove confounding bias, and compare this framework with non-causal text dimensionality-reduction alternatives such as topic modeling. Emaad Manzoor Code; Video; Slides
Beyond the Bag Of Words: Text Analysis with Contextualized Topic Models Most topic models still use Bag-Of-Words (BoW) document representations as input. These representations, though, disregard the syntactic and semantic relationships among the words in a document, the two main linguistic avenues to coherent text. Recently, pre-trained contextualized embeddings have enabled exciting new results in several NLP tasks, mapping a sentence to a vector representation. Contextualized Topic Models (CTM) combine contextualized embeddings with neural topic models to increase the quality of the topics. Moreover, using multilingual embeddings allows the model to learn topics in one language and predict them for documents in unseen languages, thus addressing the task of zero-shot cross-lingual topic modeling. Silvia Terragni Code; Video
BERT for Computational Social Scientists What is BERT? How do you use it? What kinds of computational social science projects would BERT be most useful for? Join us for a conceptual overview of this popular natural language processing (NLP) model as well as a hands-on, code-based tutorial that demonstrates how to train and fine-tune a BERT model using HuggingFace’s popular Python library. Maria Antoniak Code; Video; Slides
Moving from words to phrases when doing NLP Most people starting out with NLP think of text in terms of single-word units called “unigrams.” But many concepts in documents can’t be represented by single words. For instance, the single words “New” and “York” can’t really represent the concept “New York.” In this tutorial, you’ll get hands-on practice using the phrasemachine package and the Phrase-BERT model to 1) extract multi-word expressions from a corpus of U.S. Supreme Court arguments and 2) use such phrases for downstream analysis tasks, such as analyzing the use of phrases among different groups or describing latent topics from a corpus. Abe Handler, Shufan Wang Code1; Code2; Slides1; Slides2; Video
Analyzing Conversations in Python Using ConvoKit ConvoKit is a Python toolkit for analyzing conversational data. It implements a number of conversational analysis methods and algorithms spanning from classical NLP techniques to the latest cutting edge, and also offers a database of conversational corpora in a standardized format. This tutorial will walk through an example of how to use ConvoKit, starting from loading a conversational corpus and building up to running several analyses and visualizations. Jonathan Chang Code; Video
Preprocessing Social Media Text 🤔 hmm howwww should we think about our #NLProc preprocessing pipeline when it comes to informal TEXT written by social media users?!? In this tutorial, we’ll discuss some interesting features of social media text data and how we can think about handling them when doing computational text analyses. We will introduce some Python libraries and code that you can use to process text and give you a chance to experiment with some real data from platforms like Twitter and Reddit. Steve Wilson Code; Video
Aggregated Classification Pipelines: Propagating Probabilistic Assumptions from Start to Finish NLP has helped massively scale up previously small-scale content analyses. Many social scientists train NLP classifiers and then measure social constructs (e.g., sentiment) for millions of unlabeled documents, which are then used as variables in downstream causal analyses. However, there are many points where one can make hard (non-probabilistic) or soft (probabilistic) assumptions in pipelines that use text classifiers: (a) adjudicating training labels from multiple annotators, (b) training supervised classifiers, and (c) aggregating individual-level classifications at inference time. In practice, propagating these hard versus soft choices down the pipeline can dramatically change the values of final social measurements. In this tutorial, we will walk through data and Python code of a real-world social science research pipeline that uses NLP classifiers to infer many users’ aggregate “moral outrage” expression on Twitter. Along the way, we will quantify the sensitivity of our pipeline to these hard versus soft choices. Katherine Keith Code; Video; Slides
Estimating causal effects of aspects of language with noisy proxies Does the politeness of an email or a complaint affect how quickly someone responds to it? This question requires a causal inference: how quickly would someone have responded to an email had it not been polite? With observational data, causal inference requires ruling out all the other reasons why polite emails might be correlated with fast responses. To complicate matters, aspects of language such as politeness are not labeled in observed datasets. Instead, we typically use lexicons or trained classifiers to predict these properties for each text, creating a (probably noisy) proxy of the linguistic aspect of interest. In this talk, I’ll first review the challenges of causal inference from observational data. Then, I’ll use the motivating example of politeness and response times to highlight the specific challenges to causal inference introduced by working with text and noisy proxies. Next, I’ll introduce recent results that establish assumptions and a methodology under which valid causal inference is possible. Finally, I’ll demonstrate this methodology: we’ll use semi-synthetic data and adapt a text representation method to recover causal effect estimates. Dhanya Sridhar Code; Video; Slides
Processing Code-mixed Text Code-mixing, i.e., the mixing of two or more languages in a single utterance or conversation, is an extremely common phenomenon in multilingual societies. It is amply present in user-generated text, especially in social media. Therefore, CSS research that handles such text requires processing code-mixed language; there are also interesting CSS and socio-linguistic questions around the phenomenon of code-mixing itself. In this tutorial, we will equip you with some basic tools and techniques for processing code-mixed text, starting with hands-on experiments with word-level language identification, all the way up to methods for building code-mixed text classifiers using massively multilingual language models. Monojit Choudhury, Sanad Rizvi Code; Video; Slides
Word Embeddings for Descriptive Corpus Analysis: Digging Deeper into Analogies, Polysemy, and Stability Word embeddings such as word2vec have recently garnered attention as potentially useful tools for analysis in social science. They promise an unsupervised method to quantify the connotations of words, and compare these across time or different subgroups. However, when training or using word embeddings, researchers may find that they don’t work as well as expected, or produce unreplicable results. We focus on three subtle issues in their use that could result in misleading observations: (1) indiscriminate use of analogical reasoning, which has been shown to underperform on many types of analogies; (2) the surprising prevalence of polysemous words and distributional similarity of antonyms, both leading to counterintuitive results; and (3) instability in nearest-neighbor distances caused by sensitivity to noise in the training process. Through demonstrations, we will learn how to detect, understand, and most importantly mitigate the effects of these issues. Neha Kennard Code; Video; Slides
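The alignment step in the word embedding comparison tutorial above can be sketched with orthogonal Procrustes. This is a minimal numpy illustration, not the tutorial's own code: in practice the two matrices would come from gensim word2vec models, restricted to a shared vocabulary in a shared row order.

```python
import numpy as np

def align_embeddings(base, other):
    """Rotate `other` into the coordinate space of `base` (orthogonal Procrustes).

    Both matrices have one row per word over the same vocabulary, in the same
    order. Returns the rotated copy of `other`; after alignment, a word's two
    vectors are directly comparable across models.
    """
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)

def drift(base, aligned, row):
    """Cosine similarity of one word's vector across the two aligned models."""
    a, b = base[row], aligned[row]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

A word whose drift score is low relative to other words is a candidate for meaning variation between the two groups or time periods the models were fit on.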
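The phrases tutorial above uses phrasemachine and Phrase-BERT; as a hand-rolled stand-in (not from the tutorial's materials), a standard PMI collocation score shows the basic idea of surfacing adjacent word pairs that co-occur more often than chance:

```python
from collections import Counter
from math import log

def score_bigrams(tokens):
    """Score adjacent word pairs by pointwise mutual information (PMI):
    PMI(x, y) = log[ P(x, y) / (P(x) * P(y)) ].
    High-scoring pairs are multi-word expression candidates ("new york").
    Note that PMI overweights rare pairs; real pipelines also filter by count.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        pair: log((count / n_bi) /
                  ((unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni)))
        for pair, count in bigrams.items()
    }
```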
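The hard-versus-soft aggregation choice from the classification-pipeline tutorial above can be made concrete in a few lines. This is an illustrative sketch with invented data, not the tutorial's code:

```python
import numpy as np

def prevalence(probs, threshold=0.5, soft=True):
    """Estimate what fraction of documents express a construct.

    soft=True  averages the classifier's probabilities directly;
    soft=False first thresholds each document into a hard 0/1 label.
    The two estimates can diverge, and the gap propagates into any
    downstream analysis that uses the aggregate as a variable.
    """
    probs = np.asarray(probs, dtype=float)
    return probs.mean() if soft else (probs >= threshold).mean()

predictions = [0.1, 0.2, 0.45, 0.55, 0.6]  # toy per-document probabilities
```

On these toy predictions the soft estimate is 0.38 and the hard estimate is 0.40; with the skewed score distributions common in real data, the gap can be much larger.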
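Word-level language identification, the first hands-on step in the code-mixing tutorial above, can be caricatured with dictionary lookup. The word lists here are invented for illustration; real systems use trained taggers over character n-grams and context:

```python
# Illustrative word lists only; not from the tutorial's materials.
ENGLISH = {"i", "love", "this", "movie", "so", "much"}
HINDI = {"yaar", "bahut", "accha", "hai"}  # romanized Hindi

def tag_words(utterance):
    """Assign a coarse language tag to each token of a code-mixed utterance."""
    tags = []
    for word in utterance.lower().split():
        if word in ENGLISH:
            tags.append((word, "en"))
        elif word in HINDI:
            tags.append((word, "hi"))
        else:
            tags.append((word, "other"))
    return tags

# tag_words("I love this movie yaar") tags the first four words "en"
# and "yaar" as "hi" -- an English-Hindi code-mixed utterance.
```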

Hosts

This tutorial series was organized by:


  • Ian Stewart: senior scientist at Pacific Northwest National Laboratory (Fall 2022); researches personalization and writing interventions for social applications.


  • Katie Keith: Assistant Professor at Williams College (Fall 2022); researches causal inference with text and text-based social data science.

Acknowledgments

We are deeply grateful for financial assistance from a Social Science Research Council (SSRC)/Summer Institutes in Computational Social Science (SICSS) Research Grant.