Stat 544 - Fall 2006
Categorical Data I

 




Instructor. Joe Schafer, Associate Professor of Statistics. My primary office is at The Methodology Center, located downtown at 204 E. Calder Way, Suite 400. My phone is 863-9795; email jls@stat.psu.edu. I also have a desk at the Department of Statistics, 422A Thomas Building, 863-8677.

Instructor's office hours. Monday and Wednesday, 2-3:30 pm in 422A Thomas, or by appointment at The Methodology Center.

Text. The "required" text for this course is Categorical Data Analysis by Alan Agresti (Second edition, 2002; Wiley). It may be possible to survive without purchasing this textbook, but it is the most comprehensive general reference on categorical data and is worth owning. Lecture material, readings and assignments will be drawn from portions of Agresti. Another very useful book is Generalized Linear Models by McCullagh and Nelder (Second edition, 1989; Chapman and Hall). You do not have to purchase McCullagh and Nelder, but it's a classic, and portions of our treatment of GLIMs will be drawn from it.

Content. In this course we will loosely follow the book by Agresti (2002), emphasizing parametric modeling of categorical data: properties of the multinomial distribution, goodness-of-fit measures, logistic regression, polychotomous regression, and loglinear models. Additional topics not in Agresti will be added near the end of the course. These topics, which pertain to the analysis of longitudinal and clustered data, include generalized estimating equations (GEE) and ML methods for generalized linear mixed models.

Theory versus application. In Stat 544 we will not analyze lots of data sets using canned statistical routines in SAS or other packages. Instead, we will learn the basic principles of categorical data analysis and the theory underlying the most popular categorical-data models, at roughly the same level of detail found in Agresti (2002). Data examples will be used when they help to illustrate theoretical principles.

Many people drive cars without knowing how they work. Similarly, many data analysts learn how to run and interpret logistic regression with minimal understanding of the underlying theory and principles. This course will enable you to "look under the hood" of logistic regression and other categorical data techniques, so that you can diagnose problems and perform repairs when necessary. It's also excellent preparation for those who want to do methodological research in categorical data analysis.

Prerequisites. Stat 544 was designed for graduate students in statistics, but students from other disciplines are welcome to attend if they are adequately prepared. Calculus-based probability and mathematical statistics (at roughly the 513-514 level) will be used throughout. Measure-theoretic probability will not be used. We will often draw analogies between categorical-data models and conventional linear models, so students are also expected to be familiar with regression and analysis of variance.

Computing. We will use R and SAS to illustrate specific techniques and to work through data examples. Students will occasionally be required to write their own code, e.g. to implement a simple Newton-Raphson algorithm, to perform a simulation, or to plot the contours of a loglikelihood function. R is recommended for these programming tasks, but other languages (Fortran, Gauss, Matlab, Minitab macros, etc.) may also be used.
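To give a sense of the scale of these programming tasks, here is a minimal sketch of a Newton-Raphson fit for a two-parameter logistic regression with grouped binomial data. It is written in Python purely for illustration (R is the recommended language), and the function name fit_logistic, its arguments and the small data set in the usage note are hypothetical, not taken from the course materials.

```python
import math


def fit_logistic(x, y, n, tol=1e-10, max_iter=50):
    """Newton-Raphson for the model logit(pi_i) = b0 + b1 * x_i.

    x: covariate values, y: success counts, n: numbers of trials.
    Returns the ML estimates (b0, b1). Illustrative sketch only:
    no step-halving and no check for separation or a singular
    information matrix.
    """
    b0, b1 = 0.0, 0.0  # starting values
    for _ in range(max_iter):
        u0 = u1 = 0.0          # score vector
        i00 = i01 = i11 = 0.0  # information matrix (symmetric 2 x 2)
        for xi, yi, ni in zip(x, y, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - ni * p          # observed minus fitted count
            w = ni * p * (1.0 - p)   # binomial variance
            u0 += r
            u1 += r * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        # Newton step: solve the 2 x 2 system (information) * delta = score
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1
```

For example, fit_logistic([0.0, 1.0], [2, 6], [10, 10]) fits a saturated model (two covariate values, two parameters), so the fitted probabilities must equal the observed proportions 0.2 and 0.6. The estimates should therefore be close to b0 = log(0.25) and b1 = log(6), which makes a convenient check on the code.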

Grading. Final grades will be based on written homework and class participation. A reasonable amount of collaboration on homework assignments is allowed, but each student must turn in written answers that reflect his or her own understanding of the material. In the initial weeks of the course, we will discuss the possibility of devoting the final two weeks to student presentations on topics of their own choosing. This will depend on the class size, student interest and the judgment of the instructor.

Tentative schedule (subject to change).

  • Review of likelihood theory
  • The multinomial, product-multinomial and Poisson distributions; properties of ML estimates under saturated and restricted multinomial models; goodness-of-fit testing using deviance and Pearson's X^2
  • Two- and three-way tables and measures of association; sampling designs; approximate and exact tests for independence, conditional independence, symmetry and marginal homogeneity
  • Logistic regression; ML estimation by Newton-Raphson; tests and intervals; goodness-of-fit; diagnostics; overdispersion
  • Generalized linear models
  • Poisson regression
  • Loglinear models for multiway tables
  • Multinomial regression models for nominal and ordinal responses
  • Methods for longitudinal and clustered data

Note to students with disabilities. It is Penn State's policy not to discriminate against qualified students with documented disabilities in its educational programs. If you have a disability-related need for modifications in this course, contact your instructor and the Office for Disability Services (located in 116 Boucke Building) or the Disability Contact Liaison at your Penn State location. Instructors should be notified as early in the semester as possible. You may refer to the Nondiscrimination Policy in the Student Guide to University Policies and Rules 1997.