Welcome to the web-site for
Bioinformatics II, Spring 2006
Lectures: Tue, Thur 2.30-3.45pm, 365 Willard (location subject to change)
Instructors:
Francesca Chiaromonte, Statistics, chiaro@stat.psu.edu, 505 Wartik, ph 5-7075.
Naomi Altman, Statistics, naomi@stat.psu.edu, 312 Thomas, ph 5-3791.
Office Hours:
F. Chiaromonte, Wed 1.30-3.30pm (or by appointment)
N. Altman, Wed 1.30-3.30pm (2.30-3.30 reserved for Bioinformatics II students), Fri 3.00-4.00pm.
ANNOUNCEMENTS
Class on Tue Mar 14: F. Chiaromonte will reach University Park airport at 2.00pm, and hopefully make it to class at 2.30pm. BUT THERE COULD BE DELAYS.
Class on Tue Mar 28: this class is cancelled (F. Chiaromonte at ENAR conference)
Final projects: each group should be in touch with F. Chiaromonte and N. Altman to set up final projects. We should plan on having a good hold on them by Tue Apr 11.
PRESENTATIONS for the final projects on Apr 25 and 27 will be held during regular class time, but in Wartik 501 (central room of the 5th floor) instead of our usual Willard location.
Link to notes and materials for Naomi Altman's lectures.
LECTURES and ASSIGNMENTS
More on preprocessing MA data: Missing value imputation, other preliminary transformations, filtering.
Reading assignment: a paper that gives a good instance of how people think about missing value imputation for microarray data is Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6):520-525. Consider the procedure proposed to evaluate various imputation methods in this paper, and refer to the concepts of missing completely at random, at random, not at random (as introduced in the lecture). Do you see any problems with the proposed evaluation procedure?
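As a starting point for thinking about the question above, here is a minimal sketch of an evaluation of this general kind: hide entries of a complete data matrix, impute them, and compare the imputed values with the known ones through a root mean squared error. Everything in it is illustrative and hypothetical: the function names, the simulated data, and the use of row-mean imputation as a stand-in for the methods compared in the paper. Note that the sketch hides entries completely at random, which is an assumption of the sketch itself.

```python
import math
import random


def row_mean_impute(matrix):
    """Replace None entries with the mean of the observed values in the same row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Fall back to 0.0 if every entry in the row is missing.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def evaluate_imputation(complete, frac_missing, impute, rng):
    """Hide a fraction of entries completely at random, impute, return the RMSE."""
    n_rows, n_cols = len(complete), len(complete[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_mask = max(1, int(frac_missing * len(cells)))
    masked = rng.sample(cells, n_mask)

    # Copy the matrix and blank out the selected cells.
    holed = [list(row) for row in complete]
    for i, j in masked:
        holed[i][j] = None

    imputed = impute(holed)
    sq_err = [(imputed[i][j] - complete[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(sq_err) / len(sq_err))


# Simulated "genes x arrays" matrix: 20 rows, 8 columns, row means differ by group.
rng = random.Random(0)
complete = [
    [rng.gauss(mu, 1.0) for _ in range(8)]
    for mu in (0.0, 2.0, 4.0, 6.0, 8.0)
    for _ in range(4)
]
rmse = evaluate_imputation(complete, 0.1, row_mean_impute, rng)
print("RMSE with 10% of entries hidden:", round(rmse, 3))
```

Changing how `masked` is chosen (for instance, hiding low values more often) is one way to explore how the evaluation behaves when values are not missing completely at random.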
Basic patterns in data and (unsupervised) dimension reduction: Principal Components Analysis.
More details on principal components can be found in Multivariate Analysis text books (see list below).
Reading assignment: two papers introducing these techniques to the analysis of MA data are: N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff (2000), Fundamental patterns underlying gene expression profiles: Simplicity from complexity, PNAS 97: 8409-8414; O. Alter, P.O. Brown, and D. Botstein (2000), Singular value decomposition for genome-wide expression data processing and modeling, PNAS 97: 10101-10106.
Data analysis assignment: Instructions; data set yeast_cycle.txt; basic concepts for Resampling and Permuting. Work in groups and prepare a brief write-up of your results. HAND IN ON THUR MAR 23, AT THE END OF CLASS.
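For readers who want to see what a principal component is computationally, the following is a minimal sketch under simplifying assumptions, not the procedure required by the assignment. It finds the first principal component of a small, made-up data set by power iteration on the sample covariance matrix, and reports the fraction of total variance it explains. The function name and the data are hypothetical; in practice one would use a statistical package.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, variance, fraction of total variance) of the first PC."""
    n, p = len(rows), len(rows[0])

    # Center each variable (column) at its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the starting vector is not orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Variance along v (Rayleigh quotient) and its share of the total variance.
    variance = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)
    )
    total = sum(cov[a][a] for a in range(p))
    return v, variance, variance / total


# Made-up data: the second variable is roughly twice the first.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, variance, fraction = first_principal_component(rows)
print("direction:", [round(x, 3) for x in direction])
print("fraction of variance explained:", round(fraction, 4))
```

Here the leading direction is close to (1, 2) rescaled to unit length, and it accounts for nearly all of the variance, which is the sense in which a few components can summarize many correlated measurements.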
Characteristic patterns in data and (unsupervised) classification: Cluster Analysis.
Clustering and dimension reduction: An example.
Evaluating cluster solutions: How many clusters?
More details on cluster analysis can be found in Multivariate Analysis text books (see list below).
Reading assignment: Several papers have addressed the issue of selecting the number of clusters in MA data analysis. Some of the concepts presented in class can be found in: Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. Proceedings of PSB 2002; Tibshirani R, Walther G, Hastie T (2001) Estimating the Number of Clusters in a Dataset via the Gap Statistic, J. Royal Stat. Soc. B 63, 411-423 (as tech report); Dudoit S, Fridlyand J (2002) A prediction based resampling method for estimating the number of clusters in a data set. Genome Biology 3:research0036.1-0036.21.
Data analysis assignment: Instructions; data set yeast_shock.txt. (yeast_shock.xls also contains short descriptions for the genes). Work in groups and prepare a brief write-up of your results. HAND IN ON THUR APR 13 AT THE END OF CLASS.
Reading assignment: the following two papers describe more issues and techniques for gene clustering with MA data. They will be useful in working on the final projects. Bryan (2003) Problems in gene clustering based on gene expression data, Journal of Multivariate Analysis 90, 44-66 (see also references therein; related papers by the same author). Heyer, Kruglyak, and Yooseph (1999) Exploring Expression Data: Identification and Analysis of Coexpressed Genes, Genome Research 9(11) 1106-1115.
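To make the "how many clusters?" question concrete, here is a simplified sketch in the spirit of the gap statistic: for each candidate number of clusters k, the log within-cluster sum of squares of the data is compared with its average over reference data sets drawn uniformly over the range of the data. This is an illustration under strong assumptions, not the full published procedure: it uses one-dimensional simulated data, a basic k-means with random restarts, and it omits the standard-error rule for choosing k. All function names are hypothetical.

```python
import math
import random


def kmeans_wss(values, k, rng, n_iter=30, n_restarts=5):
    """Best within-cluster sum of squares over several k-means runs (1-D data)."""
    best = float("inf")
    for _ in range(n_restarts):
        centers = rng.sample(values, k)
        groups = []
        for _ in range(n_iter):
            # Assign each value to its nearest center, then update the centers.
            groups = [[] for _ in range(k)]
            for v in values:
                nearest = min(range(k), key=lambda c: abs(v - centers[c]))
                groups[nearest].append(v)
            centers = [
                sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)
            ]
        wss = sum((v - centers[c]) ** 2 for c, g in enumerate(groups) for v in g)
        best = min(best, wss)
    return best


def gap_statistic(values, k_max, rng, n_ref=20):
    """Gap(k) = mean log WSS on uniform reference data minus log WSS on the data."""
    lo, hi = min(values), max(values)
    gaps = {}
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(values, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            ref = [rng.uniform(lo, hi) for _ in values]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        gaps[k] = sum(ref_logs) / n_ref - log_w
    return gaps


# Simulated data with two well-separated groups.
rng = random.Random(1)
data = [rng.gauss(0.0, 0.5) for _ in range(20)] + [
    rng.gauss(10.0, 0.5) for _ in range(20)
]
gaps = gap_statistic(data, k_max=3, rng=rng)
for k in sorted(gaps):
    print(k, round(gaps[k], 3))
```

With two clearly separated groups, the gap at k = 2 should exceed the gap at k = 1, reflecting that two clusters fit the data far better than they fit structureless reference data.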
Working with a response, and supervised dimension reduction: Linear Discriminant Analysis, Sliced Inverse Regression and the large-p-small-n issue.
FINAL PROJECTS:
Each group should prepare a presentation lasting approximately 20-25 minutes. All group members should be involved in describing the work (i.e. take turns in speaking) and be ready to answer questions. A hard copy of the presentation (or, if you prefer, an extended description of what you did) should be handed in to F. Chiaromonte right before your presentation. If you want a pdf file of your presentation to be posted on the class web-site, email it to F. Chiaromonte the evening before your presentation date.
Presentations on Tue April 25, 2.30-3.55pm
Group A: Holly Preston, Bing Han, Ritendra Datta. Effects of exercise on gene expression in bone tissue of mice. The data consists of 4 replicates for each of two regimes (exercise, no exercise). The group will investigate differentially expressed genes between the two conditions. Suggestions: also perform a cluster analysis of the differentially expressed genes in a two dimensional space (one dimension for each regime; note that replicate values for each gene in each regime will need to be summarized). Investigate how the clustering results are affected by varying the stringency used in identifying differentially expressed genes. Note: since there are replicates, in addition to other approaches, cluster solutions can be evaluated by bootstrapping replicates (see Bryan, 2003, and references therein).
Group B: Aakrosh Ratan, Rong Liu, Yo Kelkar. Relationships between length of genic microsatellites and transcription levels in testis and ovaries. Data on microsatellite length and other important covariates and annotations will be augmented with publicly available data on gene expression in testis and ovaries - this data comprises replicates. The group will perform a regression analysis to explain genic microsatellite length as a function of these variables. Suggestions: also perform a cluster analysis of the genes hosting microsatellites in a two dimensional space (testis, ovaries; note that replicate values for each gene in each tissue will need to be summarized). Investigate how the clustering relates to the microsatellite characterization of the genes (e.g. number of microsatellites hosted, whether they are in exons, etc.). Note: since there are replicates, in addition to other approaches, cluster solutions can be evaluated by bootstrapping replicates (see Bryan, 2003, and references therein).
Group D: Erika Kvikstad, Svitlana Tyekucheva, Hsu Chi-Hao. Indel length and frequency as a function of genomic features and transcription levels. Based on data for (1) insertions and deletions (indels) identified using human-chimpanzee alignments; (2) genome features for 3 Mb windows across the human genome, including: human diversity, human-chimpanzee divergence, human recombination rate, and (possibly) G-C content (Hellmann et al, 2005); (3) human gene expression across 79 tissues including testis and ovaries (Su et al, 2004), we will investigate the following. Can we predict the length of indel events based on genomic features? Are indel events preferred or reduced in regions of highly expressed genes? We will use sliced inverse regression to produce linear combinations of predictors to use for regression analysis to explain the variability of our data. The unit of our analysis will be the 3 Mb window. Our responses will include average length of indels per window, and frequency of indels per window. Our predictors will include the above genome features, expression strength measured as the mean expression per gene across all tissues, and mean expression per gene in germline tissues (testis and ovaries only).
Presentations on Thur April 27, 2.30-3.55pm
Group C: Song Yang, Madhukar Dasika, Lu Zhang. Supervised Classification of Leukaemia types.
Group E: Chungoo Park, Gil Tae Song, Min Kyung Kim. Gene expression divergence of duplicate genes in the human genome. Using human expression data in 79 tissues (from U133A and GNF1H Affymetrix arrays, normalized by global median scaling; Su et al., 2004), investigate whether genes in families (which are formed by sequence similarity) are similar in terms of expression profiles. Approaches to this investigation include (1) computing expression profile distances for all pairs of genes in a family, and assessing whether these distances are significantly smaller than would be expected for a random collection of genes of the same size as the family; (2) clustering genes using expression profiles, and checking whether these clusters are consistent with gene families, i.e. whether duplicate genes tend to reside in the same expression clusters. We expect that a certain number of duplicate genes will show different expression profiles even when their sequence similarity is high; to investigate possible reasons for this, we will check whether cis-regulatory regions are duplicated along with genes, and the density of transposable elements in these regions.
Additional presentation: Melissa Wilson. A mechanism for genome evolution: Retention of duplicated genes by silencing. It is hypothesized that genome growth can occur by allowing a duplicated gene copy to remain in the genome through a silencing mechanism based on nucleotide identity, rather than be deleted or pseudo-genized. The evidence for this would come in the form of a footprint between duplicated genes in which there is a significant depletion of long stretches of homology, while overall sequence similarity remains high. The initial analysis is done on duplicate genes found in Arabidopsis thaliana. These gene pairs are identified and screened using BLAST and PAML. Analyses of these pairs include percent identity, age, distance and a unique p-value analysis that assesses the significance of the observed lengths of exact homology between the two genes against an expected null distribution.
An (evolving) list of useful links (in addition to the ones given for specific lectures)
T. Speed's group | G. Churchill's group | Stanford Stat's group | W. Li's bibliographic reference list | Info on multiple imputation methods
An (evolving) list of Reference Books
Statistical Analysis of Microarray Data:
Statistical Analysis of Gene Expression Microarray Data. Speed (ed.). Chapman & Hall.
The Analysis of Gene Expression Data: Methods and Software. Parmigiani, Garrett, Irizarry and Zeger (eds). Springer NY.
Analyzing Microarray Gene Expression Data. McLachlan, Do and Ambroise. Wiley NY.
Statistics for Microarrays. Wit and McClure. Wiley NY.
General Statistics:
Probability and Statistical Inference (5th ed). Hogg and Tanis. Prentice Hall.
R and S-plus:
Data Analysis and Graphics Using R. Maindonald and Braun. Cambridge Univ. Press.
Introductory Statistics with R. Dalgaard. Springer-Verlag.
Programming with Data, a Guide to the S Language. Chambers. Springer-Verlag.
S programming. Venables and Ripley. Springer-Verlag.
Modern Applied Statistics with S (4th ed). Venables and Ripley. Springer-Verlag.
Computational Statistics:
An Introduction to the Bootstrap. Efron and Tibshirani. Chapman & Hall CRC.
Permutation Tests (2nd ed). Good. Springer-Verlag.
Regression Methods (and related topics):
Applied Regression Including Computing and Graphics. Cook and Weisberg. Wiley NY.
Applied Regression Analysis. Draper and Smith. Wiley NY.
Multivariate Analysis:
Methods for Statistical Data Analysis of Multivariate Observations (2nd ed). Gnanadesikan. Wiley NY.
Multivariate Observations. Seber. Wiley NY.
Clustering Algorithms. Hartigan. Wiley NY.
Self Organizing Maps (2nd ed). Kohonen. Springer-Verlag.
Finding Groups in Data: An Introduction to Cluster Analysis. Kaufman and Rousseeuw. Wiley NY.