Review
The bootstrap: A technique for data-driven statistics. Using computer-intensive analyses to explore experimental data
Introduction
In a 1994 review Altman and Goodman [1] identified influential statistical articles and the time pattern of their citations in the medical literature. One such article described the bootstrap [2], the topic of this review. I used an Ovid Technologies Medline keyword search [“bootstrap” or “resampling”] for the period 1995 to 2004 to assess the subsequent pattern of citations in the medical literature and recovered 1679 references. These citations have increased year by year since 1995 (Fig. 1). I also performed a full-text search (numbers of citations in parentheses) of research articles in the journals BMJ (48), JAMA (51), Lancet (52), and the New England Journal of Medicine (45) over the same period (owing to archive limitations, some of these searches covered shorter periods). These findings suggest that bootstrap methods are increasingly being used in the medical literature. These techniques have also found wide application in such diverse fields as astronomy, biology, economics, engineering, genetics, molecular biology, and finance [3].
In exploring aspects of the bootstrap in this review I largely used S-Plus, version 6.2 (Insightful Corp.), R, version 1.9 [4], a series of available S-Plus and R libraries [5], [6], [7], Confidence Interval Calculator, version 2 [8], [9], Analyse-it for Microsoft Excel, version 1.72 [10], and CBstat, version 5 [11]. In passing, it is worth noting that it is perfectly possible to implement the bootstrap and jackknife within a spreadsheet program [12], [13], [14] although the random number generator may not be entirely satisfactory (see Random number generators).
All the data used to illustrate this review are typical of the type of data analysed in the practice of clinical chemistry. These data were obtained from Hand et al. [15], Harris and Boyd [16], Beck and Shultz [17], Krzanowski [18], results of proficiency testing of enzyme determinations in Ontario [19], [20], and a study of the rate of removal of lactate dehydrogenase-1 (LD-1) from serum following a myocardial infarction [21].
Section snippets
Parametric and nonparametric statistics
The normal (Gaussian) distribution is characterised by two parameters: the mean and S.D. Statistical methods that assume the Gaussian distribution of data are called parametric. Of course, data from other probability distributions whose characteristics are defined by one or more parameters can also be analysed by appropriate parametric methods. Nonparametric or distribution-free [22] statistical techniques do not assume that the data come from a particular family of probability distributions. It is
The bootstrap process
What is the bootstrap? Essentially, a set of data is randomly resampled (with replacement, i.e., when an item is sampled it is immediately replaced) multiple times (as many as 10,000 or more times) and statistical conclusions are drawn from this data collection. Excellent elementary accounts of the theory have been provided by Simon and Bruce [24], [25]. More advanced accounts are found in a 1983 Scientific American article by Diaconis and Efron [26] and a 1991 Science article by Efron and
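The resampling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in any of the packages cited in this review: the function name `bootstrap_replicates`, the data values, the seed, and the choice of 2000 resamples are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, n_resamples=10_000, seed=None):
    """Evaluate `statistic` on n_resamples pseudo-samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Sampling with replacement: every draw is made from the full data set,
        # so an item may appear several times or not at all.
        pseudo_sample = rng.choices(data, k=n)
        replicates.append(statistic(pseudo_sample))
    return replicates

# Hypothetical measurements, invented for illustration only.
values = [4.1, 4.7, 5.0, 5.2, 5.3, 5.9, 6.4, 7.8, 9.6, 12.3]

reps = bootstrap_replicates(values, statistics.median, n_resamples=2000, seed=1)
# The spread of the replicates estimates the standard error of the median.
boot_se = statistics.stdev(reps)
print(f"sample median = {statistics.median(values)}, bootstrap S.E. = {boot_se:.3f}")
```

Any function of a sample (mean, median, a regression slope) can be passed as `statistic`; the statistical conclusions are then drawn from the collection of replicates.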
Three bootstrap methods
An early application of the bootstrap was the calculation of confidence intervals of non-Gaussian distributions. By contrast, confidence intervals of Gaussian distributions (or of some other defined distributional framework) were calculated by statistical methods appropriate to the particular distribution being examined. In dealing with confidence intervals of a non-Gaussian univariate population two measurements are of interest—the confidence interval of the median and the confidence interval
The jackknife
The jackknife [44], [45] preceded the concept of the bootstrap. The name derives from JW Tukey's suggestion, in an unpublished 1958 manuscript [46], that “The approach … shares two characteristics with a Boy Scout jackknife: (i) wide applicability to many different problems, and (ii) inferiority to special tools for those problems for which special tools have been designed and built” [47].
Consider a data set x = (x1, x2, …, xn) and an estimator θ̂ = s(x). Let x(i) indicate the data set remaining
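A minimal Python sketch of the jackknife follows. It assumes the usual leave-one-out definition, in which x(i) is the data set with the i-th observation deleted, and the standard jackknife formulas for bias and standard error; the function name `jackknife` and the data values are hypothetical.

```python
import math
import statistics

def jackknife(data, statistic):
    """Return (estimate, jackknife bias, jackknife standard error)."""
    n = len(data)
    theta_hat = statistic(data)
    # Leave-one-out estimates: the statistic recomputed with item i deleted.
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    bias = (n - 1) * (loo_mean - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo))
    return theta_hat, bias, se

# Hypothetical measurements, invented for illustration only.
values = [4.1, 4.7, 5.0, 5.2, 5.3, 5.9, 6.4, 7.8, 9.6, 12.3]
estimate, bias, se = jackknife(values, statistics.mean)
print(f"mean = {estimate:.3f}, jackknife bias = {bias:.3g}, jackknife S.E. = {se:.3f}")
```

For the sample mean the jackknife standard error coincides with the conventional S.E.M. (S.D. divided by the square root of n), which makes the mean a convenient check on an implementation.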
The combinatorial algebra of the bootstrap
The combinatorial algebra of the bootstrap is quite different from the usual process of sampling without replacement. The illustrative sample consists of 10 atoms numbered from 1 to 10 (Table 1A). It is evident that resampling with replacement produces a very different sample: some atoms are not retrieved at all while others are retrieved several times, such as atoms 3 and 4, which are present 2- and 3-fold, respectively, when B = 1. This set of resampled observations constitutes a bootstrap pseudo-sample. When
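The effect of resampling 10 numbered atoms with replacement can be explored with the short Python sketch below. It does not reproduce Table 1A: the seed is arbitrary, so the particular atoms omitted or repeated will differ from those in the table. The two closing quantities are standard combinatorial results that I am supplying, not figures taken from the table: the number of distinct pseudo-samples of n items, C(2n − 1, n), and the probability that a given atom is absent from a pseudo-sample, (1 − 1/n)^n.

```python
import math
import random
from collections import Counter

atoms = list(range(1, 11))   # atoms numbered 1 to 10
n = len(atoms)

rng = random.Random(2024)    # arbitrary seed, chosen only for repeatability
pseudo_sample = rng.choices(atoms, k=n)      # one pseudo-sample (B = 1)
counts = Counter(pseudo_sample)              # how often each atom was retrieved
missing = [a for a in atoms if a not in counts]

print("pseudo-sample:", sorted(pseudo_sample))
print("not retrieved:", missing)

# Number of distinct pseudo-samples when order is ignored (multisets of size n).
n_distinct = math.comb(2 * n - 1, n)
# Probability that one particular atom never appears in a pseudo-sample.
p_omitted = (1 - 1 / n) ** n
print(n_distinct, round(p_omitted, 3))       # 92378 0.349
```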
Random number generators
The bootstrap process depends on the random selection of items from the data set using a random number generator as the basis for the selection of the bootstrap pseudo-sample. Thus in the example shown in Table 1A for B = 1 with a pre-defined seed, the atoms 5, 7, and 8 are not selected at all while atoms 1, 2, 6, 9, and 10 are each selected once. The process of random number generation is fraught with theoretical and practical problems [52], [53] and it is probably safe to suggest that there is
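The role of the pre-defined seed can be illustrated with Python's standard `random` module (a Mersenne Twister generator). This is only a sketch of the idea of a seeded, repeatable selection; the helper name `draw_pseudo_sample` and the seed values are assumptions, and the pseudo-samples it produces are unrelated to those in Table 1A.

```python
import random

def draw_pseudo_sample(items, seed):
    """Draw one bootstrap pseudo-sample using a generator started from `seed`."""
    rng = random.Random(seed)
    return rng.choices(items, k=len(items))

atoms = list(range(1, 11))

first = draw_pseudo_sample(atoms, seed=42)
repeat = draw_pseudo_sample(atoms, seed=42)   # same seed: the same selection
other = draw_pseudo_sample(atoms, seed=43)    # a different seed restarts the stream elsewhere

print(first == repeat)   # True: the selection is reproducible
print(sorted(first), sorted(other))
```

Recording the seed alongside the results allows a bootstrap analysis to be repeated exactly, whatever the quality of the underlying generator.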
Confidence intervals of an L-statistic (univariate data)
The confidence interval of the mean or the median is often required. For the sample mean the standard parametric statistical procedure is:
- calculate the sample mean and S.D.,
- calculate the sample S.E.M.,
- obtain the appropriate value of Student's t for n − 1 degrees of freedom and the confidence interval required [9],
- calculate the confidence interval x¯ ± (t × S.E.M.).
By contrast, the bootstrap determination is easier and only requires:
- bootstrap type (data, statistic = (median, appropriate CI), value of
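The two routes can be compared in a short Python sketch. The data are hypothetical, the value of Student's t (2.262 for 9 degrees of freedom, two-sided 95%) is taken from standard tables, and the bootstrap interval uses the simple percentile method with 5000 resamples; which method and how many resamples to use are assumptions made for the example, as is the helper name `percentile_ci`.

```python
import math
import random
import statistics

# Hypothetical measurements, invented for illustration only (n = 10).
values = [4.1, 4.7, 5.0, 5.2, 5.3, 5.9, 6.4, 7.8, 9.6, 12.3]
n = len(values)

# Conventional interval for the mean: mean +/- t x S.E.M.
mean = statistics.mean(values)
sem = statistics.stdev(values) / math.sqrt(n)
T_975_DF9 = 2.262   # tabulated Student's t, two-sided 95%, 9 degrees of freedom
t_interval = (mean - T_975_DF9 * sem, mean + T_975_DF9 * sem)

def percentile_ci(data, statistic, level=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    reps = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = reps[math.floor(alpha * (n_resamples - 1))]
    upper = reps[math.ceil((1 - alpha) * (n_resamples - 1))]
    return lower, upper

median_ci = percentile_ci(values, statistics.median)
print(f"mean {mean:.2f}, 95% t interval {t_interval[0]:.2f} to {t_interval[1]:.2f}")
print(f"median {statistics.median(values):.2f}, 95% bootstrap interval "
      f"{median_ci[0]:.2f} to {median_ci[1]:.2f}")
```

The same `percentile_ci` call works for the mean or any other statistic, which is the practical attraction of the bootstrap route: no tabulated distribution is needed.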
Journal articles (reviews or tutorials, ordered by year of publication):
Mathematical viewpoint [113], [114]. General biological applications [115], [116], [117], [118], [119], [120], [121], [122], [123]. Applications in specific disciplines—psychophysiology [47], calibration in analytical chemistry [124], cost-effective analysis [125], pharmacoeconomic cost analysis [126], reference interval estimation [127], imprecision profiles in biochemical analysis [128], environmental research [129], probabilistic sensitivity analysis [130], screening for early renal failure
Concluding remarks
The examples illustrated in this article merely touch the surface of the potential of the bootstrap and the jackknife but it is evident that these techniques can supplement and extend conventional statistical thinking. Some of the elementary uses of bootstrapping were illustrated by considering the calculation of confidence intervals such as for reference ranges or for experimental data sets, hypothesis testing such as comparing experimental findings, linear regression and correlation when
Acknowledgements
I am grateful to Dr Frank Harrell (Vanderbilt University), Dr Robert Platt (McGill University), Dr Tim Hesterberg (Insightful Corporation), and Elizabeth Atkinson (Mayo Clinic) for their goodwill and patience in constructively responding to my questions, and to the technical support staff at Insightful Corporation for their advice and assistance.
References (144)
- An add-in implementation of the RESAMPLING syntax under Microsoft EXCEL. Comput Methods Programs Biomed (2000)
- Thoughts on pseudorandom number generators. J Comput Appl Math (1990)
- et al. Reference intervals: an update. Clin Chim Acta (2003)
- The area above the ordinal dominance graph and the area below the receiver operating graph. J Math Psychol (1975)
- et al. Transfer of technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA (1994)
- Bootstrap methods: another look at the jackknife. Ann Stat (1979)
- Bootstrap methods
- R Development Core Team. R: A language and environment for statistical computing. http://www.R-project.org Accessed...
- Harrell F, Alzola C. An Introduction to S and the Hmisc and Design libraries....