A Markov chain Monte Carlo example
Summer School in Astrostatistics, Center for Astrostatistics, Penn State University
Adapted from notes by Murali Haran, Dept. of Statistics, Penn State University
We describe a model that is easy to specify but requires samples
from a relatively complicated distribution for which classical Monte
Carlo sampling methods are impractical. We describe how to implement a
Markov chain Monte Carlo (MCMC) algorithm for this example.
The purpose of this is twofold: first, to illustrate how MCMC
algorithms are easy to implement (at least in principle) in situations
where classical Monte Carlo methods do not work, and second, to provide
a glimpse of practical MCMC implementation issues.
Monte Carlo methods are a collection of techniques that use
pseudo-random (computer simulated) values to approximate solutions to
mathematical problems. In this tutorial, we will focus on using Monte
Carlo for Bayesian inference. In particular, we will use it for the
evaluation of expectations with respect to a probability
distribution. Monte Carlo methods can also be used for a variety of
other purposes, including estimating maxima or minima of functions (as
in likelihood-based inference) but we will not discuss these here.
Monte Carlo works as follows: Suppose we want to estimate an
expectation of a function g(x) with respect to the probability
distribution f. We denote this desired quantity m = E_f[g(X)].
Often, m is analytically intractable (the integration or
summation required is too complicated). A Monte Carlo estimate of m is
obtained by simulating N pseudo-random values from the distribution f,
say X1,X2,..,XN and simply taking the
average of g(X1),g(X2),..,g(XN) to
estimate m. As N (number of samples) gets large, the estimate
converges to the true expectation m.
A toy example to calculate P(-1 < X < 0) when X is a Normal(0,1) random variable:
xs <- rnorm(10000) # simulate 10,000 draws from N(0,1)
xcount <- sum((xs>-1) & (xs<0)) # count number of draws between -1 and 0
xcount/10000 # Monte Carlo estimate of probability
pnorm(0) - pnorm(-1) # Compare it to R answer (cdf at 0) - (cdf at -1)
Importance sampling: Another powerful technique for
estimating expectations is importance sampling, where we produce
draws from a different distribution, say q, and compute a specific
weighted average of these draws to obtain estimates of expectations
with respect to f. In this case, a Monte Carlo estimate of m is
obtained by simulating N pseudo-random values from the distribution
q, say Y1,Y2,..,YN and simply
taking the average of g(Y1)w1,
g(Y2)w2, ..., g(YN)wN
to estimate m, where w1, w2, ..., wN
are weights obtained as follows: wi =
f(Yi)/q(Yi). As N (the number of samples) gets
large, the estimate converges to the true expectation m. Often, when
normalizing constants for f or q are unknown, and for numerical
stability, the weights are 'normalized' by dividing the above
weights by the sum of all weights (sum over
w1, ..., wN).
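As an illustrative sketch of this self-normalized weighting (the choice of g, the proposal q = Normal(1,2), and the sample size are made up for demonstration, not part of the tutorial's examples):

```r
# Illustrative self-normalized importance sampling: estimate m = E_f[g(X)]
# with g(x) = x^2 and f = N(0,1), but drawing from q = N(1,2) instead.
set.seed(1)
N  <- 1e5
ys <- rnorm(N, mean = 1, sd = 2)               # draws Y1,...,YN from q
w  <- dnorm(ys) / dnorm(ys, mean = 1, sd = 2)  # weights wi = f(Yi)/q(Yi)
sum(ys^2 * w) / sum(w)                         # normalized-weight estimate; true value is E(X^2) = 1
```

Because the weights are divided by their sum, the same code works even if f and q are known only up to normalizing constants.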
Importance sampling is powerful in a number of situations, including:
- When expectations with respect to several different
distributions (say f1,..,fp) are of
interest. All these expectations can, in principle, be estimated by
using just a single set of samples!
- When rare event
probabilities are of interest, so ordinary Monte Carlo would take a
huge number of samples for accurate estimates. In such cases,
selecting q appropriately can produce much more accurate estimates
with far fewer samples.
A toy example to calculate P(X > 4) when X is a Normal(0,1) random variable:
xs <- rnorm(10000) # simulate 10,000 draws from N(0,1)
mean(xs>4) # Estimate probability empirically
1-pnorm(4) # Compare it to R answer
ys <- rexp(10000)+4 # Try importance sampling with q(x)=exp(-(x-4)) for x>4.
mean(dnorm(ys)/exp(4-ys)) # Much more precise estimator!
Discussions of importance sampling in astronomical Bayesian
computation appear in papers by Lewis & Bridle
and Trotta for
cosmological parameter estimation and Ford for
extrasolar planet modeling.
R has random number generators for most standard distributions and
there are many more general algorithms (such as rejection sampling)
for producing independent and identically distributed (i.i.d.) draws
from f. Another very general approach for producing non-i.i.d. draws
(approximately) from f is the Metropolis-Hastings algorithm.
Markov chain Monte Carlo : For complicated distributions,
producing pseudo-random i.i.d. draws from f is often infeasible. In
such cases, the Metropolis-Hastings algorithm is used to produce a
Markov chain say X1,X2,..,XN where
the Xi's are
dependent draws that are approximately from the
desired distribution. As before, the average of
g(X1), g(X2), ..., g(XN) is an estimate
that converges to m as N gets large. The
Metropolis-Hastings algorithm is very general and hence very
useful. In the following example we will see how it can be used for
inference for a model/problem where it would otherwise be impossible
to compute desired expectations.
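As a minimal sketch of the Metropolis-Hastings idea (the target, proposal scale, and chain length below are illustrative choices, not the tutorial's change point example):

```r
# Illustrative random-walk Metropolis sampler: the target density is known
# only up to a constant (here exp(-x^2/2), i.e. N(0,1) unnormalized).
set.seed(42)
logtarget <- function(x) -x^2 / 2
N <- 50000
x <- numeric(N)
x[1] <- 0                                # starting value
for (i in 2:N) {
  prop <- x[i - 1] + rnorm(1, sd = 2)    # symmetric random-walk proposal
  logratio <- logtarget(prop) - logtarget(x[i - 1])
  if (log(runif(1)) < logratio) x[i] <- prop else x[i] <- x[i - 1]
}
mean(x)  # dependent draws still average to approximately E(X) = 0
sd(x)    # and their spread approximates the target's standard deviation of 1
```

Note that only ratios of the target density appear, so the unknown normalizing constant cancels.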
Problem and model description
First, a five minute review of Bayesian inference:
We begin by specifying a probability model for our data Y by assuming
it is generated from some distribution h(theta;Y), where theta is a
set of parameters for that distribution. This is written Y~h(theta).
We want to infer theta from the fixed, observed dataset Y. First,
consider likelihood inference. We find a value of theta where the
likelihood L(theta;Y) (which is obtained from the probability
distribution h(theta;Y)) is maximized; this is the maximum likelihood
estimate (MLE) for theta. Now consider Bayesian inference. We assume
a prior distribution for theta, p(theta), based on our previous
knowledge. This prior may be based on astrophysical insights (e.g. no
source can have negative brightness), past astronomical observation
(e.g. stars have masses between 0.08-150 solar masses), and/or
statistical considerations (e.g. uniform or Jeffreys priors) when it
is difficult to obtain good prior information. Inference is based on
the posterior distribution Pi(theta | Y) which is proportional to the
product of the likelihood and the prior. It is only proportional
to this product because in reality Bayes' theorem requires that we
write down a denominator (the integral of the product of the
likelihood and prior over the parameter space). Fortunately, Markov
chain Monte Carlo algorithms avoid computation of this denominator
while still producing samples from the posterior Pi(theta | Y). Note
that the MCMC methods discussed here are often associated with
Bayesian computation, but are really independent methods which can be
used for a variety of challenging numerical problems. Essentially, any
time samples from a complicated distribution are needed, MCMC may be useful.
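As a small numeric check of the "posterior is proportional to likelihood times prior" statement, here is a hypothetical conjugate example (the counts and prior settings are made up) where the exact posterior is known, so we can verify that normalizing the product recovers it:

```r
# Hypothetical conjugate check: Poisson(theta) counts with a Gamma(a, rate b)
# prior, so the exact posterior is Gamma(a + sum(y), rate b + n).
y <- c(4, 6, 3, 5, 7)        # made-up Poisson counts
a <- 2; b <- 1               # made-up prior shape and rate
grid   <- seq(0.01, 15, by = 0.01)
unnorm <- sapply(grid, function(th) prod(dpois(y, th)) * dgamma(th, a, rate = b))
post   <- unnorm / (sum(unnorm) * 0.01)  # normalize numerically on the grid
exact  <- dgamma(grid, a + sum(y), rate = b + length(y))
max(abs(post - exact))       # tiny: the normalized product matches the exact posterior
```

In this conjugate case the denominator is computable; MCMC is for the many models where it is not.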
Our example uses a dataset
from the Chandra Orion Ultradeep Project (COUP).
This is a time series of X-ray emission from a flaring young star in the Orion
Nebula Cluster. More information on this dataset is available at:
CASt Chandra Flares data set .
The raw data, which arrives approximately according to a Poisson
process, gives the individual photon arrival times (in seconds) and
their energies (in keV). The processed data we consider here is
obtained by grouping the events into evenly-spaced time bins (10,000 sec each).
We assume that the Poisson process is piecewise homogeneous with a
single change point.
Our goal for this data analysis is to identify the change point and
estimate the intensities of the Poisson process before and after the change point.
We describe a Bayesian model for this change point problem (Carlin
and Louis, 2000). Let Yt be the number of occurrences of some event at
time t. The process is observed for times 1 through n and we assume
that there is a change at time k, i.e., after time k, the event counts
are significantly different (higher or lower than before). The
mathematical description of the model is provided in changePointExample.pdf.
While this is a simple model, it is adequate for illustrating
some basic principles for constructing an MCMC algorithm.
We first read in the data:
chptdat = read.table("http://www.stat.psu.edu/~dhunter/astrostatistics/mcmc/COUP551_rates.dat",skip=1)
Note: This data set is just a convenient subset of the actual data
set (see reference above).
We can begin with a simple time series plot as exploratory data analysis:
Y=chptdat[,2] # store data in Y
ts.plot(Y,main="Time series plot of change point data")
The plot suggests that the change point may be around 10.
Setting up the MCMC algorithm
Our goal is to simulate
multiple draws from the posterior distribution which is a
multidimensional distribution known only up to a (normalizing)
constant. From this multidimensional distribution, we can easily
derive the conditional distribution of each of the individual
parameters (one dimension at a time). This is described, along with
a description of the Metropolis-Hastings algorithm in changePointExample.pdf.
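The full sampler is provided in the template below, and its details are in changePointExample.pdf. Purely as an illustrative sketch of the update pattern (this is not the template's model: here we put fixed Gamma(0.5, rate 1) priors on theta and lambda instead of using the hyperparameters b1 and b2, use a uniform prior on k, and run on simulated data with a known change point):

```r
set.seed(7)
# simulated data with a hypothetical change point at t = 30
Ysim <- c(rpois(30, 10), rpois(70, 3))

mhsketch <- function(dat, niter, a = 0.5, b = 1) {
  n <- length(dat)
  chain <- matrix(NA, nrow = 3, ncol = niter,
                  dimnames = list(c("theta", "lambda", "k"), NULL))
  theta <- 1; lambda <- 1; k <- floor(n / 2)   # starting values
  loglik <- function(k, th, la)
    sum(dpois(dat[1:k], th, log = TRUE)) +
    sum(dpois(dat[(k + 1):n], la, log = TRUE))
  for (i in 1:niter) {
    # conjugate (Gibbs) updates for the two Poisson rates
    theta  <- rgamma(1, a + sum(dat[1:k]),       rate = b + k)
    lambda <- rgamma(1, a + sum(dat[(k + 1):n]), rate = b + n - k)
    # random-walk Metropolis update for k (uniform prior on 1..n-1)
    kprop <- k + sample(c(-1, 1), 1)
    if (kprop >= 1 && kprop <= n - 1 &&
        log(runif(1)) < loglik(kprop, theta, lambda) - loglik(k, theta, lambda))
      k <- kprop
    chain[, i] <- c(theta, lambda, k)
  }
  chain
}

out <- mhsketch(Ysim, 5000)
apply(out, 1, median)   # posterior medians should land near 10, 3 and 30
```

The real mhsampler in the template tracks five parameters (theta, lambda, k, b1, b2); this sketch only shows the alternation of conjugate Gibbs updates with a Metropolis-Hastings update.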
Programming an MCMC algorithm in R
We will need an editor for our program. For instance, we can use
emacs (with ESS, or emacs speaks statistics, installed).
Alternatively, R comes with its own editor, which is good enough
for our purposes.
Please save code from
MCMC template in R into a file and open this
file using the editor. Save this file as MCMCchpt.R.
To load the program from the file MCMCchpt.R we use the "source" command.
(It may be helpful to use the setwd() command
to set the default directory to the place where you can save your files.)
source("MCMCchpt.R") # with appropriate filepathname
We can now run the MCMC algorithm:
mchain <- mhsampler(dat=Y, MCMCiterations=5000) # call the function with appropriate arguments
Alternatively, we can run the function and keep the value of k fixed at
10 for each iteration:
mchain2 <- mhsampler(dat=Y, MCMCiterations=5000, kfixed=TRUE)
MCMC output analysis
Now that we have output from our sampler, we can treat these
samples as data from which we can estimate quantities of
interest. For instance, to estimate the expectation of a marginal
distribution for a particular parameter, we would simply average all
draws for that parameter to obtain an estimate of E(theta):
mean(mchain[1,]) # obtain mean of first row (thetas)
To get estimates for means for all parameters:
apply(mchain,1,mean) # compute means by row (for all parameters at once)
apply(mchain,1,median) # compute medians by row (for all parameters at once)
To obtain an estimate of the entire posterior distribution:
plot(density(mchain[1,]),main="smoothed density plot for theta posterior")
plot(density(mchain[2,]),main="smoothed density plot for lambda posterior")
hist(mchain[3,],main="histogram for k posterior")
To find the (posterior) probability that lambda is greater than 10:
mean(mchain[2,]>10) # proportion of lambda draws above 10
You can run similar types of analyses on the mchain2 object, though in
that case all of the k values (in the 3rd row) are fixed at 10.
You can also study how your estimate for the expectation of the posterior
distribution for k changes with each iteration.
install.packages("batchmeans") # We need this package for the estvssamp function
library(batchmeans) # Load the package (different from installing it)
We would like to assess whether our Markov chain is moving around
quickly enough to produce good estimates (this property is often
called 'good mixing'). While this is in general difficult to do,
estimates of the autocorrelation in the samples are an informal but
useful check. To obtain sample autocorrelations we use the acf plot function:
acf(mchain[1,],main="acf plot for theta")
acf(mchain[2,],main="acf plot for lambda")
acf(mchain[3,],main="acf plot for k")
acf(mchain[4,],main="acf plot for b1")
acf(mchain[5,],main="acf plot for b2")
If the samples are heavily autocorrelated we should rethink our
sampling scheme or, at the very least, run the chain for much
longer. Note that the autocorrelations are negligible for all
parameters except k which is heavily autocorrelated. This is easily
resolved for this example since the sampler is fast (we can run the
chain much longer very easily). In problems where producing additional
samples is more time consuming, such as complicated high dimensional
problems, improving the sampler 'mixing' can be much more critical.
Why are there such strong autocorrelations for k? The acceptance rate
for k proposals (printed out with each MCMC run) is well below 10%,
which suggests that k values are stagnant more than 90% of the time. A
better proposal for the Metropolis-Hastings update of a parameter can
help improve acceptance rates which often, in turn, reduces
autocorrelations. Try another proposal for k and see how it affects
autocorrelations. In complicated problems, carefully constructed
proposals can have a major impact on the efficiency of the MCMC algorithm.
How do we choose starting values? In general, any value we
believe to be reasonable under the posterior distribution will
suffice. You can experiment with different starting values
(our function above uses starting values of 1 for all parameters except
k, which is started at n/2).
Sometimes much is made about running the chain for a certain number of
"burnin" iterations before any sampling is done. In theory, longer burnin
periods will cause the chain to "forget" its starting value so that the
influence of this value will be lessened. While this sounds good, in
practice burnin is probably not necessary as long as the initial value is
plausible; an amusing
rant about this
is given by Charlie Geyer, one of the world's authorities on MCMC.
Unless you are starting multiple chains and always using the same starting
value, you can probably skip the burnin.
Assessing accuracy and determining chain length
There are two
important issues to consider when we have draws from an MCMC
algorithm: (1) how do we assess the accuracy of our estimates based
on the sample (how do we compute Monte Carlo standard errors?) (2)
how long do we run the chain before we feel confident that our
results are reasonably accurate?
Regarding (1): Computing standard errors for a Monte Carlo estimate for
an i.i.d. (classical Monte Carlo) sampler is easy, as shown for the
toy example on estimating P(-1 < X < 0) when X is a Normal(0,1) random
variable. Simply obtain the sample standard deviation of the g(xi)
values and divide by square root of n (the number of samples). Since
Markov chains produce dependent draws, computing precise Monte Carlo
standard errors for such samplers is a very difficult problem in
general. For (2): Draws produced by classical Monte Carlo methods,
such as rejection sampling, are from the correct (exact)
distribution. The MCMC algorithm produces draws that are only
asymptotically from the correct distribution. All the draws we see
after a finite number of iterations are therefore only approximately
from the correct distribution. Determining how long we have to run
the chain before we feel sufficiently confident that the MCMC
algorithm has produced reasonably accurate draws from the
distribution is therefore a very difficult problem. Most rigorous
solutions are too specific or tailored towards relatively simple
situations while more general approaches tend to be heuristic.
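For the earlier toy example of estimating P(-1 < X < 0), this i.i.d. standard error computation looks like:

```r
# Monte Carlo standard error for the i.i.d. toy estimate of P(-1 < X < 0):
set.seed(3)
xs <- rnorm(10000)
g  <- (xs > -1) & (xs < 0)    # the g(xi) values (indicator draws)
mean(g)                       # the Monte Carlo estimate itself
sd(g) / sqrt(length(g))       # its Monte Carlo standard error
```

For dependent MCMC draws this naive formula understates the error, which is why methods like batch means are needed.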
There are many ways to compute Monte Carlo standard errors.
One way to do this is the consistent batch means method, for which the
function "bm" is available in the "batchmeans" package loaded earlier.
Using batch means, we can calculate standard error estimates for each
of the five parameters:
apply(mchain, 1, function(x) bm(x)$se) # Monte Carlo standard error by row
Are these standard errors acceptable?
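The "bm" function computes this for us, but the consistent batch means estimator is simple enough to sketch by hand (here applied to i.i.d. draws standing in for one row of MCMC output, so the answer can be checked against the i.i.d. formula):

```r
# Consistent batch means by hand: split the chain into batches of length
# roughly sqrt(n), then use the spread of the batch means.
cbm_se <- function(x) {
  n <- length(x)
  b <- floor(sqrt(n))     # batch length
  a <- floor(n / b)       # number of batches
  mus <- sapply(1:a, function(j) mean(x[((j - 1) * b + 1):(j * b)]))
  sqrt(b * var(mus) / n)  # batch means estimate of the Monte Carlo standard error
}
set.seed(5)
chain <- rnorm(10000)     # i.i.d. draws standing in for one row of mchain
cbm_se(chain)             # close to the i.i.d. answer 1/sqrt(10000) = 0.01
```

Unlike the naive sd/sqrt(n) formula, the batch means estimate remains valid for the autocorrelated draws an MCMC sampler produces.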
There is a vast literature on different proposals for dealing with the
latter issue (how long to run the chain) but they are all heuristics
at best. The links at the bottom of this page (see section titled
"Some resources") provide references to learn more about suggested
solutions. One method that is fairly simple, theoretically justified
in some cases and seems to work reasonably well in practice is as
follows: run the MCMC algorithm and periodically
compute Monte Carlo standard errors. Once the Monte Carlo standard
errors are below some (user-defined) threshold, stop the simulation.
Often MCMC users do not run their simulations long enough. For
complicated problems run lengths in the millions (or more) are
typically suggested (although this may not always be feasible). For
our example, try running the MCMC algorithm again, this time for a million
iterations:
mchain3 <- mhsampler(dat=Y, MCMCiterations=1e+6)
You can now obtain estimates of the posterior
distribution of the parameters as before and compute the new Monte
Carlo standard error. Note whether the estimates and corresponding MC
standard error have changed with respect to the previous sampler.
Making changes to the model
If we were to change the prior distributions on some of the individual
parameters, only relatively minor changes may be needed in the
program. For instance, if the Inverse Gamma priors on b1
and b2 were
replaced by Gamma(0.01,100) priors, we would only have to
change the lines in the code corresponding to the updates of b1 and b2
(we would need to perform a Metropolis-Hastings update of each
parameter). The rest of the code would remain unchanged. Modifying the
program to make it sample from the posterior for the
modified model is a useful exercise. For
the modified full conditionals, see "A small modification" near the
end of changePointExample.pdf.
An obvious modification to this model would be to allow for more than
one change point. A very sophisticated model that may be useful in
many change point problems is one where the number of change points is
also treated as unknown. In this case the number of Poisson
parameters (only two of them in our example: theta and lambda) is also
unknown. The posterior distribution is then a mixture over
distributions of varying dimensions (the dimensions change with the
number of change points in the model). This requires an advanced
version of the Metropolis-Hastings algorithm known as
reversible-jump Metropolis-Hastings, due to Peter Green.
Several packages in R implement many well-known output analysis
techniques. Charlie Geyer's
mcmc package in R is another free resource. In
addition to deciding how long to run the sampler and how to compute
Monte Carlo standard error, there are many possibilities for choosing
how to update the parameters and more sophisticated methods used to
make the Markov chain move around the posterior distribution
efficiently. The literature on such methods is vast.
Murali Haran has provided a list of
references as a starting point.
Acknowledgment: The model is borrowed from Chapter 5 of "Bayes and Empirical Bayes Methods for Data Analysis" by Carlin and Louis (2000). The data example was provided by Konstantin Getman (Penn State University).