

   BBoooottssttrraapp RReessaammpplliinngg

        boot(data, statistic, R, sim="ordinary", stype="i",
             strata=rep(1,n), L=NULL, m=0, weights=NULL,
             ran.gen=function(d, p) d, mle=NULL, ...)

   AArrgguummeennttss::

       data: The data as a vector, matrix or dataframe.  If it
             is a matrix or dataframe then each row is consid-
             ered as one multivariate observation.

   statistic: A function which when applied to data returns a
             vector containing the statistic(s) of interest.
             When `sim="parametric"', the first argument to
             `statistic' must be the data.  For each replicate
             a simulated dataset returned by `ran.gen' will be
             passed.  In all other cases `statistic' must take
             at least two arguments.  The first argument passed
             will always be the original data. The second will
             be a vector of indices, frequencies or weights
             which define the bootstrap sample.  Further, if
             predictions are required, then a third argument is
             required which would be a vector of the random
             indices used to generate the bootstrap predic-
             tions.  Any further arguments can be passed to
             `statistic' through the `...' argument.

          R: The number of bootstrap replicates.  Usually this
             will be a single positive integer.  For importance
             resampling, some resamples may use one set of
             weights and others use a different set of weights.
             In this case `R' would be a vector of integers
             where each component gives the number of resamples
             from each of the rows of weights.

        sim: A character string indicating the type of simula-
             tion required.  Possible values are `"ordinary"'
             (the default), `"parametric"', `"balanced"',
             `"permutation"', or `"antithetic"'.  Importance
             resampling is specified by including importance
             weights; the type of importance resampling must
             still be specified but may only be `"ordinary"' or
             `"balanced"' in this case.

      stype: A character string indicating what the second
             argument of `statistic' represents.  Possible
             values of `stype' are `"i"' (indices - the
             default), `"f"' (frequencies), or `"w"' (weights).

     strata: An integer vector or factor specifying the strata
             for multi-sample problems.  This may be specified
             for any simulation, but is ignored when `sim' is
             `"parametric"'. When `strata' is supplied for a
             nonparametric bootstrap, the simulations are done
             within the specified strata.

          L: Vector of influence values evaluated at the obser-
             vations.  This is used only when `sim' is `"anti-
             thetic"'.  If not supplied, they are calculated
             through a call to `empinf'.  This will use the
             infinitesimal jackknife provided that `stype' is
             `"w"', otherwise the usual jackknife is used.

          m: The number of predictions which are to be made at
             each bootstrap replicate.  This is most useful for
             (generalized) linear models.  This can only be
             used when `sim' is `"ordinary"'.  `m' will usually
             be a single integer but, if there are strata, it
             may be a vector with length equal to the number of
             strata, specifying how many of the errors for pre-
             diction should come from each stratum.  The actual
             predictions should be returned as the final part
             of the output of `statistic', which should also
             take a vector of indices of the errors to be used
             for the predictions.

    weights: Vector or matrix of importance weights. If a vec-
             tor then it should have as many elements as there
             are observations in `data'.  When simulation from
             more than one set of weights is required,
             `weights' should be a matrix where each row of the
             matrix is one set of importance weights.  If
             `weights' is a matrix then `R' must be a vector of
             length `nrow(weights)'.  This parameter is ignored
             if `sim' is not `"ordinary"' or `"balanced"'.

    ran.gen: This function is used only when `sim' is `"para-
             metric"', where it describes how random values are
             to be generated.  It should be a function of two
             arguments.  The first argument should be the
             observed data and the second argument consists of
             any other information needed (e.g. parameter esti-
             mates).  The second argument may be a list, allow-
             ing any number of items to be passed to `ran.gen'.
             The returned value should be a simulated data set
             of the same form as the observed data which will
             be passed to statistic to get a bootstrap repli-
             cate.  It is important that the returned value be
             of the same shape and type as the original
             dataset.  If `ran.gen' is not specified, the
             default is a function which returns the original
             `data' in which case all simulation should be
             included as part of `statistic'.  Use of
             `sim="parametric"' with a suitable `ran.gen'
             allows the user to implement any type of nonpara-
             metric resampling which is not supported
             directly.

        mle: The second argument to be passed to `ran.gen'.
             Typically these will be maximum likelihood esti-
             mates of the parameters.  For efficiency `mle' is
             often a list containing all of the objects needed
             by `ran.gen' which can be calculated using the
             original data set only.

        ...: Any other arguments for `statistic' which are
             passed unchanged each time it is called.  Any such
             arguments to `statistic' must follow the arguments
             which `statistic' is required to have for the sim-
             ulation.

   DDeessccrriippttiioonn::

        Generate `R' bootstrap replicates of a statistic
        applied to data.  Both parametric and nonparametric
        resampling are possible.  For the nonparametric boot-
        strap, possible resampling methods are the ordinary
        bootstrap, the balanced bootstrap, antithetic resam-
        pling, and permutation.  For nonparametric multi-sample
        problems stratified resampling is used.  This is speci-
        fied by including a vector of strata in the call to
        boot.  Importance resampling weights may be specified.

   DDeettaaiillss::

        The statistic to be bootstrapped can be as simple or
        complicated as desired as long as its arguments corre-
        spond to the dataset and (for a nonparametric boot-
        strap) a vector of indices, frequencies or weights.
        `statistic' is treated as a black box by the `boot'
        function and is not checked to ensure that these condi-
        tions are met.
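
        As a minimal sketch of these calling conventions, the
        same mean can be written once for each value of
        `stype'.  The vector `x' and the function names
        `mean.i', `mean.f' and `mean.w' are made up for this
        illustration; only `boot' and its arguments come from
        this page.

```r
library(boot)
set.seed(1)

# Hypothetical data, used only for illustration
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7)

# stype="i": second argument is a vector of resampled indices
mean.i <- function(d, i) mean(d[i])
# stype="f": second argument gives how often each case was drawn
mean.f <- function(d, f) sum(d * f) / sum(f)
# stype="w": second argument gives the weight on each case
mean.w <- function(d, w) sum(d * w) / sum(w)

b.i <- boot(x, mean.i, R = 199, stype = "i")
b.f <- boot(x, mean.f, R = 199, stype = "f")
b.w <- boot(x, mean.w, R = 199, stype = "w")
```

        All three calls should report the sample mean of `x'
        as `t0'; they differ only in what `boot' passes as the
        second argument of `statistic'.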

        The first order balanced bootstrap is described in
        Davison, Hinkley and Schechtman (1986).  The antithetic
        bootstrap is described by Hall (1989) and is experimen-
        tal, particularly when used with strata.  The other
        nonparametric simulation types are the ordinary boot-
        strap (possibly with unequal probabilities) and permu-
        tation, which returns random permutations of cases.  All
        of these methods work independently within strata if
        that argument is supplied.
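
        The following sketch, which assumes a small invented
        two-group data frame `d', illustrates two of these
        properties using `boot.array': under the balanced
        bootstrap every case is used exactly `R' times in
        total, and under permutation every case is used
        exactly once per replicate.  The names `mean.diff' and
        `perm.diff' are hypothetical.

```r
library(boot)
set.seed(2)

# Hypothetical two-sample data: 6 cases in group 1, 4 in group 2
d <- data.frame(y = c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 7.2, 6.8, 7.9, 7.1),
                g = rep(c(1, 2), times = c(6, 4)))
R <- 99

# Difference of group means on the resampled rows
mean.diff <- function(dat, i) {
    b <- dat[i, ]
    mean(b$y[b$g == 2]) - mean(b$y[b$g == 1])
}
# Permutation version: shuffle the responses, keep the group labels
perm.diff <- function(dat, i) {
    y <- dat$y[i]
    mean(y[dat$g == 2]) - mean(y[dat$g == 1])
}

b.bal  <- boot(d, mean.diff, R = R, sim = "balanced", strata = d$g)
b.perm <- boot(d, perm.diff, R = R, sim = "permutation")

# Frequency arrays: one row per replicate, one column per case
f.bal  <- boot.array(b.bal,  indices = FALSE)
f.perm <- boot.array(b.perm, indices = FALSE)

colSums(f.bal)                   # each case used R times overall
rowSums(f.bal[, d$g == 1])       # always 6 draws from stratum 1
range(f.perm)                    # every frequency is 1
```

        Because `strata' was supplied to the balanced call,
        each replicate draws only from within each group, so
        the group sizes are preserved.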

        For the parametric bootstrap it is necessary for the
        user to specify how the resampling is to be conducted.
        The best way of accomplishing this is to specify the
        function `ran.gen' which will return a simulated data
        set from the observed data set and a set of parameter
        estimates specified in `mle'.
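
        A minimal sketch of this arrangement, assuming a
        normal model fitted to an invented vector `y', is
        given here.  The names `norm.rg', `med.fun' and `fit'
        are hypothetical; note that `statistic' takes only the
        simulated data when `sim="parametric"'.

```r
library(boot)
set.seed(42)

# Hypothetical observations, used only for illustration
y <- c(4.9, 5.6, 5.1, 6.3, 4.4, 5.8, 5.0, 5.5, 6.1, 4.7)

# ran.gen: simulate a dataset of the same shape as the original,
# using the parameter estimates supplied through mle
norm.rg <- function(data, mle)
    rnorm(length(data), mean = mle$mean, sd = mle$sd)

# statistic: receives a simulated dataset, no index vector
med.fun <- function(data) median(data)

# Estimates computed once from the original data
fit <- list(mean = mean(y), sd = sd(y))

b.par <- boot(y, med.fun, R = 199, sim = "parametric",
              ran.gen = norm.rg, mle = fit)
```

        Passing `mle' as a list keeps everything `ran.gen'
        needs in one object, so nothing has to be recomputed
        for each replicate.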

   VVaalluuee::

        The returned value is an object of class `"boot"', con-
        taining the following components:

         t0: The observed value of `statistic' applied to
             `data'.

          t: A matrix with `R' rows each of which is a boot-
             strap replicate of `statistic'.

          R: The value of `R' as passed to `boot'.

       data: The `data' as passed to `boot'.

       seed: The value of `.Random.seed' when `boot' was
             called.

   statistic: The function `statistic' as passed to `boot'.

        sim: Simulation type used.

      stype: Statistic type as passed to `boot'.

       call: The original call to `boot'.

     strata: The strata used.  This is the vector passed to
             `boot' if it was supplied, or a vector of ones if
             there were no strata.  It is not returned if `sim'
             is `"parametric"'.

    weights: The importance sampling weights as passed to
             `boot' or the empirical distribution function
             weights if no importance sampling weights were
             specified.  It is omitted if `sim' is not one of
             `"ordinary"' or `"balanced"'.

     pred.i: If predictions are required (`m>0') this is the
             matrix of indices at which predictions were calcu-
             lated as they were passed to statistic.  Omitted
             if `m' is `0' or `sim' is not `"ordinary"'.

          L: The influence values used when `sim' is `"anti-
             thetic"'.  If no such values were specified and
             `stype' is not `"w"' then `L' is returned as con-
             secutive integers corresponding to the assumption
             that data is ordered by influence values.  This
             component is omitted when `sim' is not `"anti-
             thetic"'.

    ran.gen: The random generator function used if `sim' is
             `"parametric"'. This component is omitted for any
             other value of `sim'.

        mle: The parameter estimates passed to `boot' when
             `sim' is `"parametric"'.  It is omitted for all
             other values of `sim'.
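
        As a hedged illustration of how these components can
        be used, the sketch below computes a common bootstrap
        bias and standard error summary directly from `t0' and
        `t'.  The data `x' and the function `stat' are
        invented for the example.

```r
library(boot)
set.seed(3)

# Hypothetical data, used only for illustration
x <- c(12, 15, 9, 21, 17, 11, 14, 19, 13, 16)

# A statistic returning two values: the mean and the median
stat <- function(d, i) c(mean(d[i]), median(d[i]))
b <- boot(x, stat, R = 499)

b$t0                            # observed mean and median
dim(b$t)                        # R rows, one column per statistic

# Bootstrap estimates of bias and standard error
bias <- colMeans(b$t) - b$t0
se   <- apply(b$t, 2, sd)
```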

   RReeffeerreenncceess::

        There are many references explaining the bootstrap and
        its variations.  Among them are:

        Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced
        importance resampling for the bootstrap. Annals of
        Statistics, 21, 286-298.

        Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Meth-
        ods and Their Application. Cambridge University Press.

        Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986)
        Efficient bootstrap simulation. Biometrika, 73,
        555-566.

        Efron, B. and Tibshirani, R. (1993) An Introduction to
        the Bootstrap.  Chapman & Hall.

        Gleason, J.R. (1988) Algorithms for balanced bootstrap
        simulations. American Statistician, 42, 263-266.

        Hall, P. (1989) Antithetic resampling for the boot-
        strap. Biometrika, 76, 713-724.

        Hinkley, D.V. (1988) Bootstrap methods (with Discus-
        sion).  Journal of the Royal Statistical Society, B,
        50, 312-337, 355-370.

        Hinkley, D.V. and Shi, S. (1989) Importance sampling
        and the nested bootstrap.  Biometrika, 76, 435-446.

        Johns, M.V. (1988) Importance sampling for bootstrap
        confidence intervals.  Journal of the American Statis-
        tical Association, 83, 709-714.

        Noreen, E.W. (1989) Computer Intensive Methods for
        Testing Hypotheses.  John Wiley & Sons.

   SSeeee AAllssoo::

        `boot.array', `boot.ci', `boot.object', `censboot',
        `empinf', `jack.after.boot', `tilt.boot', `tsboot'

   EExxaammpplleess::

        library(boot)  # provides boot() and the datasets used below

        # usual bootstrap of the ratio of means using the city data
        data(city)
        ratio <- function(d, w)
             sum(d$x * w)/sum(d$u * w)
        boot(city, ratio, R=999, stype="w")

        # Stratified resampling for the difference of means.  In this
        # example we will look at the difference of means between the final
        # two series in the gravity data.
        data(gravity)
        diff.means <- function(d, f)
        {    n <- nrow(d)
             gp1 <- 1:table(as.numeric(d$series))[1]
             m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
             m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
             ss1 <- sum(d[gp1,1]^2 * f[gp1]) -
                    (m1 *  m1 * sum(f[gp1]))
             ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) -
                    (m2 *  m2 * sum(f[-gp1]))
             c(m1-m2, (ss1+ss2)/(sum(f)-2))
        }
        grav1 <- gravity[as.numeric(gravity[,2])>=7,]
        boot(grav1, diff.means, R=999, stype="f", strata=grav1[,2])

        #  In this example we show the use of boot in a prediction from
        #  regression based on the nuclear data.  This example is taken
        #  from Example 6.8 of Davison and Hinkley (1997).  Notice also
        #  that two extra arguments to statistic are passed through boot.
        data(nuclear)
        nuke <- nuclear[,c(1,2,5,7,8,10,11)]
        nuke.lm <- glm(log(cost)~date+log(cap)+ne+ct+log(cum.n)+pt, data=nuke)
        nuke.diag <- glm.diag(nuke.lm)
        nuke.res <- nuke.diag$res*nuke.diag$sd
        nuke.res <- nuke.res-mean(nuke.res)

        #  We set up a new dataframe with the data, the standardized
        #  residuals and the fitted values for use in the bootstrap.
        nuke.data <- data.frame(nuke,resid=nuke.res,fit=fitted(nuke.lm))

        #  Now we want a prediction of plant number 32 but at date 73.00
        new.data <- data.frame(cost=1, date=73.00, cap=886, ne=0,
                               ct=0, cum.n=11, pt=1)
        new.fit <- predict(nuke.lm, new.data)

        nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
        {
             assign(".inds", inds, envir=.GlobalEnv)
             lm.b <- glm(fit+resid[.inds] ~date+log(cap)+ne+ct+
                  log(cum.n)+pt, data=dat)
             pred.b <- predict(lm.b,x.pred)
             remove(".inds", envir=.GlobalEnv)
             c(coef(lm.b), pred.b-(fit.pred+dat$resid[i.pred]))
        }

        nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,
             fit.pred=new.fit, x.pred=new.data)
        #  The bootstrap prediction error would then be found by
        mean(nuke.boot$t[,8]^2)
        #  Basic bootstrap prediction limits would be
        new.fit-sort(nuke.boot$t[,8])[c(975,25)]

        #  Finally a parametric bootstrap.  For this example we shall look
        #  at the air-conditioning data.  In this example our aim is to test
        #  the hypothesis that the true value of the index is 1 (i.e. that
        #  the data come from an exponential distribution) against the
        #  alternative that the data come from a gamma distribution with
        #  index not equal to 1.
        air.fun <- function(data)
        {    ybar <- mean(data$hours)
             para <- c(log(ybar),mean(log(data$hours)))
             ll <- function(k) {
                  if (k <= 0) out <- 1e200 # not NA
                  else out <- lgamma(k)-k*(log(k)-1-para[1]+para[2])
                  out
             }
             khat <- nlm(ll,ybar^2/var(data$hours))$estimate
             c(ybar, khat)
        }

        air.rg <- function(data, mle)
        #  Function to generate random exponential variates.  mle will contain
        #  the mean of the original data
        {    out <- data
             out$hours <- rexp(nrow(out), 1/mle)
             out
        }

        data(aircondit)
        air.boot <- boot(aircondit, air.fun, R=999, sim="parametric",
             ran.gen=air.rg, mle=mean(aircondit$hours))

        # The bootstrap p-value can then be approximated by
        sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)

