A Markov chain Monte Carlo example

Adapted from notes by Murali Haran, Dept. of Statistics, Penn State University

Introduction

Monte Carlo works as follows. Suppose we want to estimate the expectation of a function g(X) with respect to a probability distribution f. We denote this desired quantity m = E_f[g(X)]. Given i.i.d. draws x_1, ..., x_n from f, the Monte Carlo estimate of m is the sample average (g(x_1) + ... + g(x_n))/n.

A toy example to calculate P(-1 < X < 0) when X is a Normal(0,1) random variable:

xs <- rnorm(10000)                   # simulate 10,000 draws from N(0,1)
xcount <- sum((xs > -1) & (xs < 0))  # count number of draws between -1 and 0
xcount/10000                         # Monte Carlo estimate of the probability
pnorm(0) - pnorm(-1)                 # compare it to the exact answer: (cdf at 0) - (cdf at -1)

Importance sampling is powerful in a number of situations, including:

- When expectations with respect to several different distributions (say f_1, ..., f_p) are of interest. All of these expectations can, in principle, be estimated using just a single set of samples!
- When rare event probabilities are of interest, so that ordinary Monte Carlo would require a huge number of samples for accurate estimates. In such cases, selecting the importance distribution q appropriately can produce much more accurate estimates with far fewer samples.

xs <- rnorm(10000)            # simulate 10,000 draws from N(0,1)
mean(xs > 4)                  # ordinary Monte Carlo estimate of P(X > 4)
1 - pnorm(4)                  # compare it to the exact answer
ys <- rexp(10000) + 4         # importance sampling with q(x) = exp(-(x-4)) for x > 4
mean(dnorm(ys)/exp(4-ys))     # a much more precise estimator!

Discussions of importance sampling in astronomical Bayesian computation appear in papers by Lewis & Bridle and Trotta for cosmological parameter estimation, and by Ford for extrasolar planet modeling.

R has random number generators for most standard distributions and there are many more general algorithms (such as rejection sampling) for producing independent and identically distributed (i.i.d.) draws from f. Another very general approach for producing non-i.i.d. draws (approximately) from f is the Metropolis-Hastings algorithm.
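As a concrete illustration of rejection sampling, here is a minimal sketch. The target and proposal (a Beta(2,2) target with a Uniform(0,1) proposal) are illustrative choices, not from the text; the envelope constant M must satisfy f(x) <= M*q(x) for all x, and here the maximum of dbeta(x, 2, 2) is 1.5, attained at x = 0.5.

```r
# Rejection sampling sketch: draw from a Beta(2,2) target using a
# Uniform(0,1) proposal with envelope constant M = 1.5.
set.seed(1)
rejection_sample <- function(n, M = 1.5) {
  out <- numeric(n)
  i <- 1
  while (i <= n) {
    x <- runif(1)                        # propose from q = Uniform(0,1)
    if (runif(1) < dbeta(x, 2, 2) / M) { # accept with probability f(x)/(M*q(x))
      out[i] <- x
      i <- i + 1
    }
  }
  out
}

draws <- rejection_sample(5000)
mean(draws)    # close to the Beta(2,2) mean of 1/2
```

Accepted draws are exact i.i.d. samples from the target; the price is that a fraction (M - 1)/M of the proposals is discarded, so a tight envelope constant matters.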

Problem and model description

Our example uses a dataset from the Chandra Orion Ultradeep Project (COUP). This is a time series of X-ray emission from a flaring young star in the Orion Nebula Cluster. More information on this dataset is available at: CASt Chandra Flares data set . The raw data, which arrives approximately according to a Poisson process, gives the individual photon arrival times (in seconds) and their energies (in keV). The processed data we consider here is obtained by grouping the events into evenly-spaced time bins (10,000 seconds width).

We assume that the Poisson process is piecewise homogeneous with a single change point. Our goal for this data analysis is to identify the change point and estimate the intensities of the Poisson process before and after the change point.

We describe a Bayesian model for this change point problem (Carlin and Louis, 2000). Let Y_i denote the photon count in the i-th time bin, i = 1, ..., n. We assume Y_i ~ Poisson(theta) for i = 1, ..., k and Y_i ~ Poisson(lambda) for i = k+1, ..., n, where k is the unknown change point. The intensities theta and lambda are given Gamma priors with hyperparameters b1 and b2, which in turn receive Inverse Gamma priors, and k is given a discrete uniform prior. The full model is spelled out in changePointExample.pdf.
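To build intuition for this model, we can simulate a toy dataset from it. The values of n, k, theta, and lambda below are illustrative stand-ins, not estimates from the COUP data.

```r
# Simulate from the piecewise-homogeneous Poisson model: counts have mean
# theta before the change point k and mean lambda after it.
set.seed(42)
n <- 20                    # illustrative number of time bins
k <- 10                    # illustrative change point
theta <- 25; lambda <- 5   # illustrative pre- and post-change intensities
Ysim <- c(rpois(k, theta), rpois(n - k, lambda))
ts.plot(Ysim, main = "Simulated change point data")
```

A time series plot of Ysim shows the same qualitative feature we look for in the real data: a visible drop in the counts at the change point.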

We first read in the data:

chptdat = read.table("http://www.stat.psu.edu/~dhunter/astrostatistics/mcmc/COUP551_rates.dat",skip=1)

We can begin with a simple time series plot as exploratory analysis.

Y = chptdat[,2]   # store data in Y
ts.plot(Y, main="Time series plot of change point data")

The plot suggests that the change point may be around 10.

Setting up the MCMC algorithm

Our goal is to simulate multiple draws from the posterior distribution, which is a multidimensional distribution known only up to a (normalizing) constant. From this multidimensional distribution, we can derive the conditional distribution of each of the individual parameters (one dimension at a time). These derivations, along with a description of the Metropolis-Hastings algorithm, are given in changePointExample.pdf.

Programming an MCMC algorithm in R

We will need an editor for our program. For instance, we can use Emacs (with ESS, Emacs Speaks Statistics, installed). Alternatively, R comes with its own editor, which is good enough for our purposes.

Please save the code from MCMC template in R into a file named MCMCchpt.R and open it in the editor. To load the program from the file MCMCchpt.R we use the "source" command. (It may be helpful to type

setwd("V:/")

to set the default directory to the place where you save your files.)

source("MCMCchpt.R")   # with the appropriate file path

We can now run the MCMC algorithm:

mchain <- mhsampler(dat=Y, MCMCiterations=5000)   # call the function with appropriate arguments

Alternatively, we can run the function with the value of k held fixed at 10 for every iteration:

mchain2 <- mhsampler(dat=Y, MCMCiterations=5000, kfixed=TRUE)

MCMC output analysis

Now that we have output from our sampler, we can treat these samples as data from which we can estimate quantities of interest. For instance, to estimate the expectation of the marginal distribution of a particular parameter, we simply average all the draws for that parameter. For example, to obtain an estimate of E(theta):

mean(mchain[1,])   # mean of the first row (the theta draws)

To get estimates of the means for all parameters:

apply(mchain, 1, mean)    # compute means by row (for all parameters at once)
apply(mchain, 1, median)  # compute medians by row (for all parameters at once)

To obtain an estimate of the entire posterior distribution:

plot(density(mchain[1,]), main="smoothed density plot for theta posterior")
plot(density(mchain[2,]), main="smoothed density plot for lambda posterior")
hist(mchain[3,], main="histogram for k posterior")

To find the (posterior) probability that lambda is greater than 10:

sum(mchain[2,] > 10)/length(mchain[2,])

You can run similar analyses on the mchain2 object, though in that case all of the k values (in the 3rd row) are fixed at 10. You can also study how your estimate of the posterior expectation of k changes with each iteration.

install.packages("batchmeans")   # we need this package for the estvssamp function
library(batchmeans)              # load the package (different from installing it)
estvssamp(mchain[3,])

We would also like to assess whether our Markov chain is moving around the posterior quickly enough to produce good estimates (this property is often called 'good mixing'). While this is in general difficult to do rigorously, autocorrelation plots provide a useful rough diagnostic:

acf(mchain[1,], main="acf plot for theta")
acf(mchain[2,], main="acf plot for lambda")
acf(mchain[3,], main="acf plot for k")
acf(mchain[4,], main="acf plot for b1")
acf(mchain[5,], main="acf plot for b2")

If the samples are heavily autocorrelated, we should rethink our sampling scheme or, at the very least, run the chain for much longer. Note that the autocorrelations are negligible for all parameters except k, which is heavily autocorrelated. This is easily resolved for this example since the sampler is fast (we can simply run the chain much longer). In problems where producing additional samples is more time consuming, such as complicated high-dimensional problems, improving the sampler's 'mixing' can be much more critical.

Why are there such strong autocorrelations for k? The acceptance rate for k proposals (printed out with each MCMC run) is well below 10%, which suggests that the k values remain stagnant more than 90% of the time. A better proposal for the Metropolis-Hastings update of a parameter can help improve acceptance rates, which often, in turn, reduces autocorrelations. Try another proposal for k and see how it affects the autocorrelations. In complicated problems, carefully constructed proposals can have a major impact on the efficiency of the MCMC algorithm.
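As a hypothetical illustration of "another proposal for k", here is a symmetric random-walk proposal over a small window around the current value. Everything below (the toy data, the function names logpost_k and update_k, the window w) is a stand-in written for this sketch, not code from MCMCchpt.R, and theta and lambda are held fixed so the update for k can be shown in isolation.

```r
# Symmetric random-walk Metropolis update for the discrete change point k:
# propose k' uniformly from {k-w, ..., k+w} \ {k}, reject if out of range,
# otherwise accept with probability min(1, posterior ratio).
set.seed(1)
n <- 20
Y <- c(rpois(10, 25), rpois(10, 5))   # toy data with a change at k = 10
theta <- 25; lambda <- 5              # held fixed for this sketch

logpost_k <- function(k) {            # log conditional posterior of k (flat prior)
  sum(dpois(Y[1:k], theta, log = TRUE)) +
    sum(dpois(Y[(k + 1):n], lambda, log = TRUE))
}

update_k <- function(k, w = 2) {
  kprop <- k + sample(c(-w:-1, 1:w), 1)       # symmetric proposal
  if (kprop < 1 || kprop > n - 1) return(k)   # out of range: reject
  if (log(runif(1)) < logpost_k(kprop) - logpost_k(k)) kprop else k
}

kchain <- numeric(1000); kchain[1] <- 5
for (i in 2:1000) kchain[i] <- update_k(kchain[i - 1])
mean(kchain == 10)   # fraction of iterations at the true change point
```

Because the proposal is symmetric, the Hastings ratio reduces to the posterior ratio. Comparing acf(kchain) for different window widths w shows the usual trade-off: very local proposals accept often but move slowly, while very wide ones behave like the stagnant proposal we started with.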

Sometimes much is made about running the chain for a certain number of "burnin" iterations before any sampling is done. In theory, longer burnin periods will cause the chain to "forget" its starting value so that the influence of this value will be lessened. While this sounds good, in practice burnin is probably not necessary as long as the initial value is plausible; an amusing rant about this is given by Charlie Geyer, one of the world's authorities on MCMC. Unless you are starting multiple chains and always using the same starting value, you can probably skip the burnin.

Assessing accuracy and determining chain length

There are two important issues to consider when we have draws from an MCMC algorithm: (1) how do we assess the accuracy of our estimates based on the sample (i.e., how do we compute Monte Carlo standard errors?), and (2) how long do we run the chain before we feel confident that our results are reasonably accurate?

Regarding (1): Computing standard errors for a Monte Carlo estimate from an i.i.d. (classical Monte Carlo) sampler is easy, as in the toy example estimating P(-1 < X < 0) when X is a Normal(0,1) random variable. Simply compute the sample standard deviation of the g(x_i) values and divide it by the square root of the sample size.
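For the toy example, the i.i.d. Monte Carlo standard error computation looks like this (the variable names are illustrative):

```r
# Monte Carlo standard error for an i.i.d. sampler: the estimate is a
# sample mean of indicator values g(x_i), so its standard error is the
# sample standard deviation of the g(x_i) divided by sqrt(n).
set.seed(123)
xs <- rnorm(10000)
g <- (xs > -1) & (xs < 0)   # indicator of the event -1 < X < 0
mean(g)                     # Monte Carlo estimate of P(-1 < X < 0)
sd(g) / sqrt(length(g))     # its Monte Carlo standard error
```

For MCMC draws this simple formula understates the error, because successive draws are autocorrelated; that is exactly the problem the batch means method below addresses.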

There are many ways to compute Monte Carlo standard errors. One way to do this is by using the consistent batch means method, for which a function "bm" is available in the "batchmeans" package used above. Using batch means, we can calculate standard error estimates for each of the five parameter estimates:

bm(mchain[1,])
bm(mchain[2,])
bm(mchain[3,])
bm(mchain[4,])
bm(mchain[5,])

Are these standard errors acceptable?
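For intuition, here is a minimal hand-rolled sketch of the batch means idea; it is not the batchmeans package's exact implementation (which chooses batch sizes more carefully), and arima.sim merely supplies an autocorrelated toy chain to try it on.

```r
# Batch means sketch: split the chain into batches of length ~sqrt(n) and
# use the variability of the batch means to estimate the Monte Carlo
# standard error of the overall mean.
bm_simple <- function(vals) {
  n <- length(vals)
  b <- floor(sqrt(n))          # batch size; sqrt(n) is a common default
  a <- floor(n / b)            # number of batches
  batch_means <- sapply(seq_len(a),
                        function(i) mean(vals[((i - 1) * b + 1):(i * b)]))
  list(est = mean(vals),
       se  = sd(batch_means) / sqrt(a))
}

set.seed(7)
x <- arima.sim(list(ar = 0.5), n = 10000)   # autocorrelated toy chain
bm_simple(x)$se          # batch means standard error
sd(x) / sqrt(length(x))  # naive i.i.d. formula, too small here
```

Because batch means of a well-mixing chain are roughly independent, their spread captures the autocorrelation that the naive i.i.d. formula ignores; for this AR(1) chain the batch means standard error is noticeably larger than the naive one.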

There is a vast literature on different proposals for dealing with the latter issue (how long to run the chain) but they are all heuristics at best. The links at the bottom of this page (see section titled "Some resources") provide references to learn more about suggested solutions. One method that is fairly simple, theoretically justified in some cases and seems to work reasonably well in practice is as follows: run the MCMC algorithm and periodically compute Monte Carlo standard errors. Once the Monte Carlo standard errors are below some (user-defined) threshold, stop the simulation.

Often MCMC users do not run their simulations long enough. For complicated problems run lengths in the millions (or more) are typically suggested (although this may not always be feasible). For our example, try running the MCMC algorithm again, this time for a million iterations:

mchain3 <- mhsampler(dat=Y, MCMCiterations=1e+6)

You can now obtain estimates of the posterior distribution of the parameters as before and compute the new Monte Carlo standard errors. Note whether the estimates and corresponding standard errors have changed relative to the previous, shorter run.

Making changes to the model

If we were to change the prior distributions on some of the individual parameters, only relatively minor changes may be needed in the program. For instance, if the Inverse Gamma prior on b1 (or b2) were replaced by a different prior, only the code implementing the corresponding update step in the sampler would need to change.

An obvious modification to this model would be to allow for more than one change point. A more sophisticated model that may be useful in many change point problems is one where the number of change points is itself treated as unknown. In this case the dimension of the parameter space varies, and methods such as reversible jump MCMC are needed.

Some resources

The coda and boa packages in R implement many well-known output analysis techniques. Charlie Geyer's mcmc package in R is another free resource. There is also MCMC software from the popular BUGS/WinBUGS and OpenBUGS projects.

In addition to deciding how long to run the sampler and how to compute Monte Carlo standard error, there are many possibilities for choosing how to update the parameters and more sophisticated methods used to make the Markov chain move around the posterior distribution efficiently. The literature on such methods is vast. Murali Haran has provided a list of references as a starting point.