Empirical Research Methods for CS
Spring 2026 — COS 598D
Establishing causal claims and evaluating systems and models are at the core of
modern Computer Science research. This course provides a rigorous introduction
to empirical methods, combining foundational theory on causal inference with
practical applications. Topics include causal diagrams, experimental and
quasi-experimental designs, regression analysis, and benchmarks.
Evaluation is based on programming assignments, active class
participation, and a final paper in which students develop and motivate a
research project proposal.
Table of Contents
- Basic Information
- Schedule
- About the course
- Detailed Schedule
2. Schedule
3. About the course
This course has three central components: 1) reading activities, 2) (guest) lectures, 3) homeworks and project.
3.1. Reading Activities
Most weeks will have an assigned reading. I expect you to read these and come prepared to have an informal discussion on the topic at the end of each class.
There is no assignment associated with the readings — I just expect you to do it.
Assigned readings when guest speakers come will help familiarize students with the topics that will be covered
3.2. (Guest) Lectures
Each week, I will give an expository lecture. In the first few weeks, this will take most of the course time.
After spring break, we will have a series of guest speakers, experts doing cutting-edge, relevant research.
In this case, half our class will be an expository lecture by yours truly, and the other half will be a presentation by the guest speaker.
Students are expected meaningfully engage with the lectures, and in particular with the guest speakers.
3.3. Homework and Project
There will be two homeworks and a project in the course. The first two homeworks will be jupyter notebooks that you must complete individually.
The third project will be to write a project “proposal,” using what you learned in the lectures. This can be done in groups or individually.
You are encouraged to mix this project with some research you are already doing, or are already interested in doing.
3.3.1 Homework 1
- Description: In a first homework, we will review concepts associated with the design and the analysis of experiments in a machine learning setting.
- Release date: Feb 12, 2026.
- Due date: Feb 26, 2026.
- Deliverables: A filled Jupyter notebook.
3.3.2 Homework 2
- Description: In a second homework, we will provide some nut-and-bolts details about running regressions.
- Release date: Feb 26, 2026.
- Due date: Mar 19, 2026.
- Deliverables: A filled Jupyter notebook.
3.3.3 Project
- Description:
- The final project is to write a research project proposal for an empirical study, using the tools and concepts from the course.
- From a paper perspective, think of this as writing the paper up to the Methods section.
- You may also obtain and present preliminary results, but results are not required and will not be graded.
- This project can be done individually or in groups (up to 3 students).
- You are encouraged to use this as an opportunity to advance your own research interests.
- Project types: Your proposal must describe one of the following:
- Experimental study (preferred when feasible):
Design an intervention or randomized experiment and specify:
1) treatment/control conditions (and what is randomized)
2) outcome(s) and measurement strategy
3) identification assumptions (why this design yields a causal claim)
4) power considerations or sample size planning (even if approximate)
5) analysis plan (estimand and statistical model).
- Observational / quasi-experimental study: Propose a study that estimates a causal effect using observational data, and specify:
1) research question + causal estimand (what effect are you estimating?)
2) data source(s) and unit of analysis
3) identification strategy (e.g., matching, diff-in-diff)
4) threats to validity and planned robustness checks
5) analysis plan (model + sensitivity checks).
- Benchmarking/evaluation project: Propose a benchmark or evaluation framework for a system/model, and specify:
1) what is being evaluated, and why existing evaluations are insufficient
2) dataset/task construction plan (including sampling and labeling)
3) metric design (and why it measures what you claim it measures)
4) baselines and ablations
5) expected failure modes and what conclusions would/wouldn’t be justified.
- Release date: Mar 19, 2026
- Due date: Apr 23, 2026
- Deliverables:
- 15-minute in-class presentation during the final session (Apr 23).
Your talk should communicate the motivation, the design/identification strategy, and the analysis plan clearly enough that someone else could implement it.
- Final report (project proposal) as a PDF.
- Length: ~6 pages (single column) .
- Format: any clear academic style (LaTeX encouraged).
- Evaluation criteria (what I will grade):
- Clarity and concreteness of the design
- Correct use of course concepts
- Thoughtfulness about threats/failure modes
- Quality of the analysis plan
- Communication quality
3.4 Grading
- 30% In-class active participation.
- 35% Homeworks
- 35% Project:
- 15% Presentation.
- 20% Final report.
3.5. Expectations
- I expect you to:
- Attend and actively participate in class.
- Be respectful and collegial to your classmates and guests.
Complete readings early and submit responses on time to help discussants.
Present your work when the time comes and serve as a discussant when necessary.
- Deadlines. Deadlines exist to help the class run smoothly. However, if you have any extenuating circumstances, please contact me about whether and how you can receive an extension. You must be proactive in letting me know so that we can plan together and others are not disrupted.
- A note on diversity and respectful conduct. This course welcomes all students of all backgrounds. You should expect and demand to be treated by your classmates and myself respectfully. If any incident challenges this commitment to a supportive, diverse, inclusive, and equitable environment, please let me know so the issue can be addressed.
- Disability, Religious, and family accommodations. If you have any questions about disability or religious accommodations, please refer to university policies. Feel free also to contact me for any reason.
- Academic integrity. We will follow the University’s rules and responsibilities guide.
4. Detailed Schedule
Jan 29
- Course Intro:
First, we will frame the course around why empirical reasoning has become central to modern CS: the field’s shift from theory and system building to observation, experiments, and evaluation in the wild.
Second, we will argue that some things are missing from the style of empiricism CS relies on.
Finally, we will talk about class logistics.
Feb 05
- Potential Outcomes:
We introduce the potential outcomes framework: estimands (ATE, ATT), counterfactual reasoning, and what it means to “identify” a causal effect. We will connect these ideas to evaluation of ML models.
- Reading: Wallach et al. “Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge,” ICML 2025.
This piece motivates why evaluation in modern AI systems is often not just about accuracy—it requires careful thinking about constructs, measurement, and causal interpretation.
Feb 12
- Experiments:
We cover randomized experiments: randomization, compliance, treatment effects, interference, power, and common pitfalls (multiple testing, researcher degrees of freedom).
- Reading: Hochlehnert et al. “A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility,” COLM 2025.
This paper illustrates how empirical claims can be fragile when evaluation pipelines are underspecified or non-reproducible, reinforcing why experimental discipline (pre-specification, careful measurement, robust reporting) matters in ML research.
Feb 19
- Thinking with DAGs: We introduce causal DAGs as a language for assumptions: confounding, backdoor paths, colliders, mediation, and what it means to “control for” a variable. The emphasis is on reasoning about identification before doing statistics.
We also introduce matching and propensity scores as practical ways to construct comparable treatment and control groups.
- Reading: Causal Inference: The Mixtape (Chapter 3).
This chapter provides the core toolkit for translating causal questions into graphical assumptions, and it sets up the logic we’ll need for regression adjustment and “good vs. bad controls” in the next lectures.
Feb 26
- Regression #1:
We introduce linear regression from first principles (least squares) and use it as a tool for estimating average treatment effects. We build intuition for what it means to “control for” variables via partialling-out (Frisch–Waugh–Lovell), and extend the basic model with dummy variables and interactions to capture group differences and heterogeneous effects.
- Reading: Cinelli et al. “A Crash Course in Good and Bad Controls,” Journal of Sociological Methods and Research 2024.
This reading makes the “controls” question precise: which variables you should control for depends on the causal structure. It directly builds on DAGs and helps explain why many regressions silently rely on incorrect assumptions.
Mar 05
- Regression #2:
We focus on using regression responsibly in empirical work: heteroskedasticity and robust standard errors, basic model diagnostics, and best practices for reporting and interpreting results. We also introduce fixed effects as a practical way to control for unobserved, time-invariant differences across units and briefly discuss the world beyond linear regression.
- Reading: The Effect (Chapter 13).
This chapter connects your earlier causal identification ideas (potential outcomes + DAGs/backdoors) to the practical regression workflows people actually use. It discusses many of the intricacies of using regression in practice.
Mar 19
- Benchmarks:
We discuss benchmarks as scientific instruments: dataset construction, metric choice, distribution shift, and why repeated evaluation can create illusory progress. We’ll connect benchmark design to concepts like measurement validity and generalization.
- Reading: Recht et al. “Do ImageNet Classifiers Generalize to ImageNet?,” PMLR, 2019.
By recreating ImageNet-style test sets, the authors show that performance can shift meaningfully even when the intended task is “the same,” highlighting how benchmark scores can quietly bake in dataset-specific quirks and overstate how well models generalize. At the same time, the paper suggests that benchmark progress is not pure illusion: models that perform better on the original test set generally also perform better on the replicated test sets, even if absolute accuracy drops.
Mar 26
- Quasi-experiments: We cover causal inference without randomized experiments. We focus on two designs: regression discontinuity (RD), and difference in differences. The focus will be on how to articulate an estimand, justify identification assumptions, and run the core diagnostics that make quasi-experimental claims credible.
- Reading: Horta Ribeiro et al. “Automated Content Moderation Increases Adherence to Community Guidelines,” TheWebConf, 2023.
This paper is a concrete, large-scale example of quasi-experimental thinking “in the wild” in a CS-y setting.
Apr 02
- Causal ML: We discuss how modern machine learning can be used inside causal inference pipelines: flexible models can learn nuisance components, while carefully constructed estimators still target a well-defined causal estimand with valid uncertainty. The emphasis is on the “separation of concerns”: ML helps estimation, but identification still comes from assumptions and design.
- Reading: Statistical Inference on Predictive and Causal Effects in Modern Nonlinear Regression Models (Chapter 9).
This chapter is a principled introduction to double / debiased machine learning (DML): it explains why naively plugging ML predictions into causal estimators can introduce bias, and how Neyman-orthogonal moment equations plus cross-fitting (sample splitting) can deliver approximately unbiased estimates and valid inference even with high-dimensional, nonlinear nuisance models.
Apr 09
- Labelling with LLMs: We study LLM-based labeling as a measurement pipeline: when it works, when it fails, and how annotation choices (model, prompting, temperature, aggregation) can systematically shape downstream conclusions. We’ll connect this back to measurement validity and researcher degrees of freedom.
- Reading: Baumann et al., “Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation” arXiv.
This paper argues that seemingly innocuous labeling decisions can flip statistical conclusions (“LLM hacking”), quantifying the risks of false positives/negatives in studies that rely on LLM annotations and motivating stricter validation and robustness practices.
Apr 16
Apr 23
© 2025 The Trustees of Princeton University