Welcome to the web-site for


Bioinformatics II, Spring 2006


Lectures: Tue, Thur 2.30-3.45pm, 365 Willard (location subject to change)



Francesca Chiaromonte, Statistics, chiaro@stat.psu.edu, 505 Wartik, ph 5-7075.
Naomi Altman, Statistics, naomi@stat.psu.edu, 312 Thomas, ph 5-3791.


Office Hours:

F. Chiaromonte, Wed 1.30-3.30pm (or by appointment)

N. Altman, Wed 1.30-3.30pm (2.30-3.30 reserved for Bioinformatics II students), Fri 3.00-4.00pm.

Syllabus | Questions | Groups





Class on Tue Mar 14: F. Chiaromonte will reach University Park airport at 2.00pm, and hopefully make it to class at 2.30pm. BUT THERE COULD BE DELAYS.


Class on Tue Mar 28: this class is cancelled (F. Chiaromonte at ENAR conference)


Final projects: each group should get in touch with F. Chiaromonte and N. Altman to set up its final project. The plan is to have all projects well defined by Tue Apr 11.


PRESENTATIONS for the final projects on Apr 25 and 27 will be held during regular class time, but in Wartik 501 (central room of the 5th floor) instead of our usual Willard location.



Link to notes and materials for Naomi Altman's lectures.




A statistical Roadmap.


More on preprocessing MA data: Missing value imputation, other preliminary transformations, filtering.


Reading assignment: a paper that illustrates well how people think about missing value imputation for microarray data is Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6):520-525. Consider the procedure this paper proposes for evaluating various imputation methods, and refer to the concepts of missing completely at random, missing at random, and missing not at random (as introduced in the lecture). Do you see any problems with the proposed evaluation procedure?
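As a rough aid for thinking about this kind of evaluation, here is a minimal sketch of a "hide known values, impute, compare" experiment. It is not the paper's code: the synthetic matrix, the function names row_mean_impute and evaluate_imputation, the use of row-mean imputation, and the normalization of the error are all assumptions made for illustration. Note how the entries to hide are chosen.

```python
import numpy as np

def row_mean_impute(x):
    """Fill each NaN with the mean of the observed values in its row (gene)."""
    filled = x.copy()
    row_means = np.nanmean(x, axis=1)
    rows, cols = np.where(np.isnan(x))
    filled[rows, cols] = row_means[rows]
    return filled

def evaluate_imputation(complete, impute, frac_missing, rng):
    """Hide a random fraction of entries, impute them, return a normalized RMSE."""
    # Entries are hidden independently of their values (completely at random).
    mask = rng.random(complete.shape) < frac_missing
    holed = complete.copy()
    holed[mask] = np.nan
    imputed = impute(holed)
    err = imputed[mask] - complete[mask]
    # Normalizing by the mean absolute hidden value is an illustrative choice.
    return np.sqrt(np.mean(err ** 2)) / np.mean(np.abs(complete[mask]))

# Hypothetical complete data: 200 genes x 12 arrays with a gene-specific level.
rng = np.random.default_rng(0)
gene_effect = rng.normal(size=(200, 1))
complete = gene_effect + 0.5 * rng.normal(size=(200, 12))

nrmse = evaluate_imputation(complete, row_mean_impute, 0.1, rng)
print(f"normalized RMSE with 10% of entries hidden: {nrmse:.3f}")
```

Any imputation function that takes a matrix with NaN entries and returns a filled matrix can be passed in place of row_mean_impute.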


Basic patterns in data and (unsupervised) dimension reduction: Principal Components Analysis.

More details on principal components can be found in Multivariate Analysis text books (see list below).


Reading assignment: two papers introducing these techniques to the analysis of MA data are: N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff (2000), Fundamental patterns underlying gene expression profiles: Simplicity from complexity, PNAS 97: 8409-8414; O. Alter, P.O. Brown, and D. Botstein (2000), Singular value decomposition for genome-wide expression data processing and modeling, PNAS 97: 10101-10106.
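For orientation, principal components can be computed from the singular value decomposition of the column-centered data matrix. The sketch below uses a small synthetic genes-by-arrays matrix; the data, the variable names, and the choice to center columns (arrays) rather than rows (genes) are assumptions for illustration and are not taken from the papers above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 genes x 8 arrays, one shared pattern plus noise.
pattern = np.sin(np.linspace(0, 2 * np.pi, 8))
loadings = rng.normal(size=(100, 1))
x = loadings * pattern + 0.1 * rng.normal(size=(100, 8))

# Center each column so the SVD describes variation around the column means.
xc = x - x.mean(axis=0)

# Thin SVD: xc = u @ diag(s) @ vt; the rows of vt are the principal directions.
u, s, vt = np.linalg.svd(xc, full_matrices=False)

scores = u * s                           # component scores for each gene
var_explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per component

print("fraction of variance, first three components:", var_explained[:3])
```

Because the synthetic data are built from a single pattern, the first component should carry most of the variance; real expression data will usually spread it over more components.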


Data analysis assignment: Instructions; data set yeast_cycle.txt; basic concepts for Resampling and Permuting. Work in groups and prepare a brief write-up of your results. HAND IN ON THUR MAR 23, AT THE END OF CLASS.
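As a generic illustration of the permutation idea (not the procedure in the assignment instructions, which are not reproduced here), the sketch below tests whether two groups of values differ in mean by repeatedly shuffling the group labels. The function name permutation_pvalue, the simulated groups, and the choice of test statistic are assumptions for illustration.

```python
import numpy as np

def permutation_pvalue(a, b, n_perm, rng):
    """Two-sided permutation p-value for a difference in group means."""
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffling the pooled values is equivalent to shuffling group labels.
        perm = rng.permutation(pooled)
        stat = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        if stat >= observed:
            count += 1
    # Adding 1 counts the observed labeling itself, so the p-value is never 0.
    return (count + 1) / (n_perm + 1)

# Hypothetical data: two groups of 20 values whose true means differ by 2.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=0.0, size=20)
group_b = rng.normal(loc=2.0, size=20)

p_value = permutation_pvalue(group_a, group_b, 2000, rng)
print(f"permutation p-value: {p_value:.4f}")
```

The same skeleton applies to other statistics: replace the difference in means with whatever quantity is of interest and keep the shuffling step unchanged.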


Characteristic patterns in data and (unsupervised) classification: Cluster Analysis.

Clustering and dimension reduction: An example.

Evaluating cluster solutions: How many clusters?

More details on cluster analysis can be found in Multivariate Analysis text books (see list below).


Reading assignment: Several papers have addressed the issue of selecting the number of clusters in MA data analysis. Some of the concepts presented in class can be found in: Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. Proceedings of PSB 2002; Tibshirani R, Walther G, Hastie T (2001) Estimating the Number of Clusters in a Dataset via the Gap Statistic, J. Royal Stat. Soc. B 63, 411-423 (as tech report); Dudoit S, Fridlyand J (2002) A prediction based resampling method for estimating the number of clusters in a data set. Genome Biology 3:research0036.1-0036.21.

Data analysis assignment: Instructions; data set yeast_shock.txt. (yeast_shock.xls also contains short descriptions for the genes). Work in groups and prepare a brief write-up of your results. HAND IN ON THUR APR 13 AT THE END OF CLASS.


More on gene clustering


Reading assignment: the following two papers describe more issues and techniques for gene clustering with MA data. They will be useful in working on the final projects. Bryan (2003) Problems in gene clustering based on gene expression data, Journal of Multivariate Analysis 90, 44-66 (see also references therein; related papers by the same author). Heyer, Kruglyak, and Yooseph (1999) Exploring Expression Data: Identification and Analysis of Coexpressed Genes, Genome Research 9(11) 1106-1115.


Working with a response, and supervised dimension reduction: Linear Discriminant Analysis, Sliced Inverse Regression and the large-p-small-n issue.




Each group should prepare a presentation lasting approximately 20-25 minutes. All group members should be involved in describing the work (i.e. take turns speaking) and be ready to answer questions. A hard copy of the presentation (or, if you prefer, an extended description of what you did) should be handed in to F. Chiaromonte right before your presentation. If you want a pdf file of your presentation to be posted on the class web-site, email it to F. Chiaromonte the evening before your presentation date.


Presentations on Tue April 25, 2.30-3.55pm

Presentations on Thur April 27, 2.30-3.55pm


An (evolving) list of useful links (in addition to the ones given for specific lectures)

T. Speed's group | G. Churchill's group | Stanford Stat's group | W. Li's bibliographic reference list

Info on multiple imputation methods


An (evolving) list of Reference Books

Statistical Analysis of Microarray Data:

Statistical Analysis of Gene Expression Microarray Data. Speed (ed.). Chapman & Hall.

The Analysis of Gene Expression Data: Methods and Software. Parmigiani, Garrett, Irizarry and Zeger (eds). Springer NY.

Analyzing Microarray Gene Expression Data. McLachlan, Do and Ambroise. Wiley NY.

Statistics for Microarrays. Wit and McClure. Wiley NY.

General Statistics:

Probability and Statistical Inference (5th ed). Hogg and Tanis. Prentice Hall.

R and S-plus:

Data Analysis and Graphics Using R. Maindonald and Braun. Cambridge Univ. Press.

Introductory Statistics with R. Dalgaard. Springer-Verlag.

Programming with Data, a Guide to the S Language. Chambers. Springer-Verlag.

S programming. Venables and Ripley. Springer-Verlag.

Modern Applied Statistics with S (4th ed). Venables and Ripley. Springer-Verlag.

Computational Statistics:

An Introduction to the Bootstrap. Efron and Tibshirani. Chapman & Hall CRC.

Permutation Tests (2nd ed). Good. Springer-Verlag.

Regression Methods (and related topics):

Applied Regression Including Computing and Graphics. Cook and Weisberg. Wiley NY.

Applied Regression Analysis. Draper and Smith. Wiley NY.

Multivariate Analysis:

Methods for Statistical Data Analysis of Multivariate Observations (2nd ed). Gnanadesikan. Wiley NY.

Multivariate Observations. Seber. Wiley NY.

Clustering Algorithms. Hartigan. Wiley NY.

Self Organizing Maps (2nd ed). Kohonen. Springer-Verlag.

Finding Groups in Data: An Introduction to Cluster Analysis. Kaufman and Rousseeuw. Wiley NY.


All Penn State and Eberly College of Science policies regarding academic integrity apply to this course. Learn more at the following site.