NLP+CSS 201 Tutorials

Tutorials for advanced natural language processing methods designed for computational social science research.


This website hosts the upcoming tutorial series on advanced NLP methods for computational social science scholars.

Every few weeks, we host experts in the field of computational social science who present a new NLP method and lead participants in an interactive exploration of that method with code and sample text data. If you are a graduate student or researcher with some introductory knowledge of NLP (e.g. from learning text analysis at SICSS) and want to “level up”, come join us!

Watch past tutorials on our YouTube channel.

Tutorial format

  • Introduction: the facilitators introduce their method and the code/data associated with the method.
  • Interaction: the participants break out into small groups to try the method using the provided code. The code includes several spots for experimenting with the method, which helps participants better understand its benefits and limitations. Participants may also bring their own data for analysis if desired.
  • Conclusion: the facilitators bring all participants back together to collect their experiences with the method and to share final thoughts on possible improvements to it.

Tutorials will last one hour, and we encourage participants to join live. However, we will also make the recordings and code publicly available afterwards.

Logistics for joining tutorials live

We will send the tutorial video link to our mailing list a few days before each tutorial starts. To join the mailing list, subscribe here.


  • February-May 2022: topics include dialogue, phrase extraction, analytical uncertainty, causal inference, multilingual NLP, and social media preprocessing.


See below for links to materials for previous tutorials.

  • Comparing Word Embedding Models
    Leaders: Connor Gilroy, Sandeep Soni. Code: link. Video: link. Slides: N/A.
    We’ll demonstrate an extension of the use of word embedding models by fitting multiple models on a social science corpus (using gensim’s word2vec implementation), then aligning and comparing those models. This method can be used to explore group variation and temporal change. We’ll discuss some tradeoffs and possible extensions of this approach.

  • Extracting Information from Documents
    Leader: Andrew Halterman. Code: link. Video: link. Slides: N/A.
    This workshop provides an introduction to information extraction for social science: techniques for identifying specific words, phrases, or pieces of information contained within documents. It focuses on two common techniques, named entity recognition and dependency parsing, and shows how they can provide useful descriptive data about the civil war in Syria. The workshop uses the Python library spaCy, but no previous experience is needed beyond familiarity with Python.

  • Controlling for Text in Causal Inference with Double Machine Learning
    Leader: Emaad Manzoor. Code: link. Video: link. Slides: link.
    Establishing causal relationships is a fundamental goal of scientific research. Text plays an increasingly important role in the study of causal relationships across domains, especially for observational (non-experimental) data. Specifically, text can serve as a valuable “control” to eliminate the effects of variables that threaten the validity of the causal inference process. But how does one control for text, an unstructured and nebulous quantity? In this tutorial, we will learn about confounding bias and the motivation for using text as a proxy for confounders, apply a “double machine learning” framework that uses text to remove confounding bias, and compare this framework with non-causal text dimensionality reduction alternatives such as topic modeling.

  • Beyond the Bag of Words: Text Analysis with Contextualized Topic Models
    Leader: Silvia Terragni. Code: link. Video: link. Slides: N/A.
    Most topic models still use bag-of-words (BoW) document representations as input. These representations, though, disregard the syntactic and semantic relationships among the words in a document, the two main linguistic avenues to coherent text. Recently, pre-trained contextualized embeddings have enabled exciting new results in several NLP tasks by mapping a sentence to a vector representation. Contextualized Topic Models (CTM) combine contextualized embeddings with neural topic models to increase the quality of the topics. Moreover, using multilingual embeddings allows the model to learn topics in one language and predict them for documents in unseen languages, addressing the task of zero-shot cross-lingual topic modeling.

  • BERT for Computational Social Scientists
    Leader: Maria Antoniak. Code: link. Video: link. Slides: link.
    What is BERT? How do you use it? What kinds of computational social science projects would BERT be most useful for? Join for a conceptual overview of this popular natural language processing (NLP) model, as well as a hands-on, code-based tutorial that demonstrates how to train and fine-tune a BERT model using HuggingFace’s popular Python library.
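As a taste of the word-embedding tutorial: two word2vec models trained separately (e.g. with gensim) sit in arbitrary rotations of the same space, so they are typically aligned with orthogonal Procrustes before their vectors can be compared. Here is a minimal numpy sketch of that alignment step, assuming you have already extracted both embedding matrices over a shared, identically ordered vocabulary (the function name is illustrative, not taken from the tutorial code):

```python
import numpy as np

def align_embeddings(base, other):
    """Rotate `other` onto `base` with orthogonal Procrustes.

    Both matrices have shape (vocab_size, dim) and their rows must
    correspond to the same words in the same order. The best rotation
    R = U V^T comes from the SVD U S V^T of other^T base; applying it
    makes the two spaces directly comparable, word by word.
    """
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)
```

After alignment, per-word cosine distances between the two matrices indicate which words shifted most between groups or time periods.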
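The core recipe of the double machine learning tutorial can be sketched on synthetic data: cross-fit nuisance models for treatment and outcome, residualize both, then regress residual on residual. In this sketch, plain least squares stands in for the text-based nuisance models, and all data and variable names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 20

# Synthetic stand-in for a text representation (e.g. topic proportions).
X = rng.normal(size=(n, p))
confounder = X[:, 0] + 0.5 * X[:, 1]          # signal hidden in the "text"
treatment = confounder + rng.normal(size=n)   # treatment depends on confounder
true_effect = 2.0
outcome = true_effect * treatment + confounder + rng.normal(size=n)

def fit_predict(features, target, features_new):
    """Nuisance model: fit OLS on one fold, predict on the held-out fold."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return features_new @ coef

# Cross-fitting: residualize each half using models fit on the other half.
idx = rng.permutation(n)
fold_a, fold_b = idx[: n // 2], idx[n // 2 :]
res_t, res_y = np.empty(n), np.empty(n)
for fit, held in ((fold_a, fold_b), (fold_b, fold_a)):
    res_t[held] = treatment[held] - fit_predict(X[fit], treatment[fit], X[held])
    res_y[held] = outcome[held] - fit_predict(X[fit], outcome[fit], X[held])

# Final stage: the slope of outcome residuals on treatment residuals
# is the debiased estimate of the treatment effect.
effect_hat = (res_t @ res_y) / (res_t @ res_t)
```

With real data, X would be a learned text representation and the OLS step would be replaced by a flexible learner; the cross-fitting keeps overfitting in the nuisance models from biasing the final effect estimate.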


This tutorial series is organized by:

  • Ian Stewart: post-doctoral fellow at the University of Michigan; researches personalization and writing interventions for social applications.
  • Katie Keith: post-doctoral researcher at AI2 and incoming Assistant Professor at Williams College (Fall 2022); researches causal inference with text and text-based social data science.


We are deeply grateful for financial assistance from a Social Science Research Council (SSRC)/Summer Institutes in Computational Social Science (SICSS) Research Grant.