Carnegie Mellon University

Statistics/Machine Learning Joint Ph.D. Degree

This unique program blends the power of statistics with cutting-edge machine learning, preparing students to tackle complex challenges at the intersection of both fields.

By engaging in interdisciplinary research and coursework across two dynamic departments, students gain a rich, dual perspective that equips them to drive innovation in data science. With access to leading experts, state-of-the-art resources, and diverse research opportunities, this program offers a rare opportunity to push the boundaries of statistical and machine learning knowledge. Graduates leave with the skills and expertise to lead in academia, industry, or beyond.

The Path to the Ph.D.

How to Apply to Stat/ML

Students interested in the Joint Statistics/Machine Learning degree must complete the Ph.D. Core Requirements as well as the Joint Ph.D. requirements from the Machine Learning Department.

Stat/ML Joint Program Requirements

Note: The Data Analysis Exam is not required for this joint Ph.D. program. However, it is necessary for obtaining the M.S. in Statistics.

Year 1

Fall

  • 36-705 - Intermediate Statistics
  • 36-707 - Applied Regression
  • 36-750 - Statistical Computing
  • 36-699 - Statistical Immigration
Spring

  • 36-709 - Advanced Statistical Theory I
  • 36-708 - Statistical Machine Learning
  • 36-757 - Advanced Data Analysis I

Year 2

Fall

  • 10-715 - Advanced Introduction to Machine Learning
  • 36-758 - Advanced Data Analysis II
Spring

  • 10-716 - Advanced Machine Learning
  • ML required elective (e.g., 10-725 - Convex Optimization)
  • Research

Year 3

Prepare and deliver your thesis proposal.

Year 4 and beyond

Dedicated to dissertation research.

Program Requirements and Course Descriptions

Year 1 - Fall

Students are introduced to the faculty and their interests, the field of statistics, and the facilities at Carnegie Mellon. Each faculty member gives at least one elementary lecture on some topic of his or her choice. In the past, topics have included: the field of statistics and its history, large-scale sample surveys, survival analysis, subjective probability, time series, robustness, multivariate analysis, psychiatric statistics, experimental design, consulting, decision-making, probability models, statistics and the law, and comparative inference. Students are also given information about the libraries at Carnegie Mellon and current bibliographic tools. In addition, students are instructed in the use of the Departmental and University computational facilities and available statistical program packages.

This course covers the basics of statistics. We will first provide a quick introduction to probability theory, and then cover fundamental topics in mathematical statistics such as point estimation, hypothesis testing, asymptotic theory, and Bayesian inference. If time permits, we will also cover more advanced and useful topics including nonparametric inference, regression and classification. Prerequisites: one- and two-variable calculus and matrix algebra.

This course covers the fundamentals of theoretical statistics. Topics include: probability inequalities, point and interval estimation, minimax theory, hypothesis testing, data reduction, convergence concepts, Bayesian inference, nonparametric statistics, bootstrap resampling, VC dimension, prediction and model selection.

This course covers the basic principles of causality. Foundations of linear regression, including theory, computation, diagnostics, and generalized linear models. Extensions to nonparametric regression, including splines, kernel regression, and generalized additive models. Discussion of tools to compare statistical models, including hypothesis tests, cross-validation, and bootstrapping. Topics in nonparametric regression and machine learning as time permits, such as regression trees, boosting, and random forests. Emphasis on writing data analysis reports that answer substantive scientific questions with appropriate statistical tools. Students will be equipped with the tools needed to explore a substantive scientific question with data, translate scientific questions into statistical questions, compare different modeling approaches rigorously, and write up their results clearly.

A detailed introduction to elements of computing relating to statistical modeling, targeted to PhD students and masters students in Statistics & Data Science. Topics include important data structures and algorithms; numerical methods; databases; parallelism and concurrency; and coding practices, program design, and testing. Multiple programming languages will be supported (e.g., C, R, Python, etc.). Those with no previous programming experience are welcome but will be required to learn the basics of at least one language via self-study.

Year 1 - Spring

This course focuses on statistical methods for machine learning, a decades-old topic in statistics that now has a life of its own, intersecting with many other fields. While the core focus of this course is methodology (algorithms), the course will have some amount of formalization and rigor (theory/derivation/proof), and some amount of interacting with data (simulated and real). However, the primary way in which this course complements related courses in other departments is its joint ABCDE focus on (A) Algorithm design principles, (B) Bias-variance thinking, (C) Computational considerations, (D) Data analysis, and (E) Explainability and interpretability.

This is a core Ph.D. course in theoretical statistics. The class will cover a selection of modern topics in mathematical statistics, focusing on high-dimensional parametric models and non-parametric models. The main goal of the course is to provide students with adequate theoretical background and mathematical tools to read and understand the current statistical literature on high-dimensional models. Topics will include: concentration inequalities, covariance estimation, principal component analysis, penalized linear regression, maximal inequalities for empirical processes, Rademacher and Gaussian complexities, non-parametric regression, and minimax theory. This is the first part of a two-semester sequence.

Advanced Data Analysis (ADA) is a Ph.D. level seminar on advanced methods in statistics, including computationally intensive smoothing, classification, variable selection and simulation techniques. During 36-757, you work with the seminar instructor to identify an ADA project for yourself. The ADA project is an extended project in applied statistics, done in collaboration with an investigator from outside the Department, under the guidance of a faculty committee, culminating in a publishable paper that is presented orally and in writing in 36-758.

Year 2 - Fall

Machine Learning is the primary pillar that Artificial Intelligence is built upon. This course is designed for Ph.D. students whose primary field of study is machine learning, and who intend to make machine learning methodological research a main focus of their thesis. It will give students a thorough grounding in the algorithms, mathematics, theories, and insights needed to do in-depth research and applications in machine learning. The topics of this course will in part parallel those covered in the general PhD-level machine learning course (10-701), but with a greater emphasis on depth in theory. Students entering the class are expected to have a pre-existing strong working knowledge of linear algebra, probability, statistics, and algorithms. The course will also involve programming in Python.

Advanced Data Analysis (ADA) is a Ph.D. level seminar on advanced methods in statistics, including computationally intensive smoothing, classification, variable selection and simulation techniques. In 36-758, you carry out the ADA project identified during 36-757: an extended project in applied statistics, done in collaboration with an investigator from outside the Department, under the guidance of a faculty committee, culminating in a publishable paper that is presented orally and in writing.

Year 2 - Spring

This course is for students who have already taken introductory courses in machine learning and statistics, and who are interested in deeper theoretical foundations of machine learning, as well as advanced methods and frameworks used in modern machine learning. The course goals are to: (1) understand statistical and computational considerations in machine learning methods; (2) develop the skill of devising computationally efficient and yet statistically rigorous algorithms for solving machine learning problems; (3) understand the science of modern statistical analysis; and (4) develop the skill of quantifying the statistical performance of any new machine learning method.

Nearly every problem in machine learning can be formulated as the optimization of some function, possibly under some set of constraints. This universal reduction may seem to suggest that such optimization tasks are intractable. Fortunately, many real world problems have special structure, such as convexity, smoothness, separability, etc., which allow us to formulate optimization problems that can often be solved efficiently. This course is designed to give a graduate-level student a thorough grounding in the formulation of optimization problems that exploit such structure, and in efficient solution methods for these problems. The main focus is on the formulation and solution of convex optimization problems, though we will discuss some recent advances in nonconvex optimization. These general concepts will also be illustrated through applications in machine learning and statistics. Students entering the class should have a pre-existing working knowledge of algorithms, though the class has been designed to allow students with a strong numerate background to catch up and fully participate. Though not required, having taken 10-701 or an equivalent machine learning or statistical modeling class is strongly encouraged, as we will use applications in machine learning and statistics to demonstrate the concepts we cover in class. Students will work on an extensive optimization-based project throughout the semester.