Random
numbers in R
It is often necessary to simulate random numbers in R. There are
many functions available to accomplish this and related tasks
such as evaluating the density, distribution function, and quantile function
of a distribution.
Distributions
intrinsic to R
R handles many common distributions easily. To see a list, type
help.search("distribution",package="stats")
Let's consider the well-known normal distribution as an example:
?Normal
The four functions 'rnorm', 'dnorm',
'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential
distribution function), the normal cumulative distribution function (CDF), and
the inverse of the normal CDF (also called the quantile function), respectively. Almost all
of the other distributions have similar sets of four functions. The 'r' versions are
rbeta,
rbinom,
rcauchy,
rchisq,
rexp,
rf,
rgamma,
rgeom,
rhyper,
rlogis,
rlnorm,
rmultinom,
rnbinom,
rnorm,
rpois,
rsignrank,
rt,
runif,
rweibull, and
rwilcox
(there is no rtukey because generally only
ptukey and qtukey
are needed).
As an example, suppose we wish
to simulate a vector of 10
independent, standard (i.e., mean 0 and standard deviation 1) normal random variables. We use the
rnorm function for this
purpose, and its defaults are mean=0 and standard deviation=1. Thus, we may simply type
rnorm(10)
Suppose we wish to simulate a large number of normal random variables with mean 10 and
standard deviation 3, then check a histogram against
two normal density functions, one based on the true parameters and one based on estimates, to
see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters
red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that
we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale
as the density functions (freq=FALSE).
z=rnorm(200000, mean=10, sd=3)
hist(z,freq=FALSE,nclass=100)
x=seq(min(z),max(z),len=200)
lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3)
lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))
We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean,
a common cutoff used by physical scientists. We can also see the true theoretical proportion:
sum(abs((z-10)/3)>3)/length(z)
2*pnorm(-3)
In the first line above, we are using
sum
to count the number of TRUE's in the logical vector (abs((z-10)/3)>3).
This works because logical values are coerced to 0's and 1's when necessary.
The function
dnorm
has a closed form: With mean=0 and sd=1, dnorm(x)
equals exp(-x2/2)/sqrt(2*pi). By contrast,
the CDF, given by
pnorm,
has no closed form and must be numerically approximated. By
definition, pnorm(x) equals the integral of dnorm(t) as t ranges from minus infinity to x. To find a p-value
(i.e., the probability of observing a statistic more extreme than the one actually observed), we use
pnorm; to construct a confidence interval (i.e., a range of reasonable values for the true parameter),
we use the inverse,
qnorm.
pnorm(1:3)-pnorm(-(1:3))
qnorm(c(.05,.95))
The first line above summarizes the well-known 68, 95, 99.7 rule for normal distributions (these are the approximate
proportions lying within 1, 2, and 3 standard deviations from the mean). The second
line gives the critical values used to construct a 90% confidence interval for a parameter when its
estimator is approximately normally distributed.
Let us now briefly consider an example of a discrete distribution,
which means a distribution on a finite or countably
infinite set (as opposed to a continuous distribution like the normal).
The
Poisson
distribution, which has a single real-valued parameter lambda, puts all of
its probability mass on the nonnegative integers. A Poisson distribution, often used
to model data consisting of counts, has mean
and variance both equal to lambda.
k=0:10
dpois(k,lambda=2.5) # or equivalently,
exp(-2.5)*2.5^k/factorial(k)
Next, simulate some Poisson variables:
x=rpois(10000,lambda=2.5)
table(x)
mean(x)
var(x)
The power-law or Pareto distribution
A commonly used distribution in astrophysics is the power-law distribution,
more commonly known in the statistics literature as the
Pareto distribution. There are no built-in R functions for dealing with this
distribution, but because it is an extremely simple distribution it is easy
to write such functions.
The density function for the Pareto is f(x)=aba/x(a+1) for x>b.
Here, a and b are fixed positive parameters, where
b is the minimum possible value.
As an example, consider the log N = -1.5 * log S
relationship, where S is the apparent brightness and N
is the number of standard candles randomly located in transparent space. Thus, a Pareto
distribution with (a+1) = 1.5 is a reasonable, if simplistic,
model for the brightness of observed standard
candles in space. The b parameter merely reflects the choice of units of measurement.
It turns out that a Pareto random variable is simply b*exp(X), where
X is an
exponential
random variable with rate=a (i.e., with mean=1/a).
However, rather than exploiting this simple relationship,
we wish to build functions for the Pareto distribution from
scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.
dpareto=function(x, a=0.5, b=1) a*b^a/x^(a+1)
Next, we integrate the density function to obtain the distribution function,
which is
F(x)=1-(b/x)a for x>=b (and naturally F(x)=0 for x < b):
ppareto=function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)
Note that (x > b) in the above function
is coerced to numeric, either 0 or 1.
Inverting the distribution function gives the quantile function. The following
simplistic function is wrong unless 0<u<1, so a better-designed function should do some
error-checking.
qpareto=function(u, a=0.5, b=1) b/(1-u)^(1/a)
Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function
is applied to a uniform random variable, the result is a random variable with the desired distribution:
rpareto=function(n, a=0.5, b=1) qpareto(runif(n),a,b)
Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of
a function may be given a default value, which is used whenever the user calls the function without specifying
the value of this parameter. Also note that each of the above functions consists of only a single line; however,
longer functions may be created by enclosing them inside curly braces { }.
A few simple plots
The commands below create plots related to the four functions just created.
par(mfrow=c(2,2))
x=seq(1,50,len=200)
plot(x,dpareto(x),type="l")
plot(x,ppareto(x),type="l",lty=2)
u=seq(.005,.9,len=200)
plot(u,qpareto(u),type="l",col=3)
z=rpareto(200)
dotchart(log10(z), main="200 random logged Pareto deviates",
cex.main=.7)
par(mfrow=c(1,1))
The above commands illustrate some of the many plotting capabilities of R.
The par
function sets many graphical
parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix
of plots, set here to two rows and two columns. In the
plot
commands, 'type' is set here to
"l" for a line plot; other common options are "p" for points (the default),
"b" for connected dots, and "n" for nothing (to create
axes only). Other options used:
'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red,
3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type
'? par'
to see a list of the many plotting parameters and options.