Introduction to Robust Estimation and Hypothesis Testing (eBook)
608 pages
Elsevier Science (publisher)
978-0-08-047053-5 (ISBN)
Introduction to Robust Estimation and Hypothesis Testing, Second Edition, focuses on the practical applications of modern, robust methods which can greatly enhance our chances of detecting true differences among groups and true associations among variables.
* Covers latest developments in robust regression
* Covers latest improvements in ANOVA
* Includes newest rank-based methods
* Describes and illustrates easy-to-use software
This revised book provides a thorough explanation of the foundation of robust methods, incorporating the latest updates on R and S-Plus, robust ANOVA (Analysis of Variance), and regression. It guides advanced students and other professionals through the basic strategies used for developing practical solutions to problems, and provides a brief background on the foundations of modern methods, placing the new methods in historical context. Author Rand Wilcox includes chapter exercises and many real-world examples that illustrate how various methods perform in different situations.
Chapter 1 Introduction
Introductory statistics courses describe methods for computing confidence intervals and testing hypotheses about means and regression parameters based on the assumption that observations are randomly sampled from normal distributions. When comparing independent groups, standard methods also assume that groups have a common variance, even when the means are unequal, and a similar homogeneity of variance assumption is made when testing hypotheses about regression parameters. Currently, these methods form the backbone of most applied research. There is, however, a serious practical problem: Many journal articles have illustrated that these standard methods can be highly unsatisfactory. Often the result is a poor understanding of how groups differ and the magnitude of the difference. Power can be relatively low compared to recently developed methods, least squares regression can yield a highly misleading summary of how two or more random variables are related (as can the usual correlation coefficient), the probability coverage of standard methods for computing confidence intervals can differ substantially from the nominal value, and the usual sample variance can give a distorted view of the amount of dispersion among a population of subjects. Even the population mean, if it could be determined exactly, can give a distorted view of what the typical subject is like.
Although the problems just described are well known in the statistics literature, many textbooks written for applied researchers still claim that standard techniques are completely satisfactory. Consequently, it is important to review the problems that can arise and why these problems were missed for so many years. As will become evident, several pieces of misinformation have become part of statistical folklore, resulting in a false sense of security when using standard statistical techniques.
1.1 Problems with Assuming Normality
To begin, distributions are never normal. For some this seems obvious, hardly worth mentioning. But an aphorism given by Cramér (1946) and attributed to the mathematician Poincaré remains relevant: “Everyone believes in the [normal] law of errors, the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact”. Granted, the normal distribution is the most important distribution in all of statistics. But as an approximation to any particular continuous distribution, it can fail to the point that practical problems arise, as will become evident at numerous points in this book. To believe in the normal distribution implies that only two numbers are required to tell us everything about the probabilities associated with a random variable: the population mean μ and the population variance σ². Moreover, assuming normality implies that distributions must be symmetric.
Of course, nonnormality is not, by itself, a disaster. Perhaps a normal distribution provides a good approximation of most distributions that arise in practice, and of course there is the central limit theorem, which tells us that under random sampling, as the sample size gets large, the limiting distribution of the sample mean is normal. Unfortunately, even when a normal distribution provides a good approximation to the actual distribution being studied (as measured by the Kolmogorov distance function, described later), practical problems arise. Also, empirical investigations indicate that departures from normality that have practical importance are rather common in applied work (e.g., M. Hill and Dixon, 1982; Micceri, 1989; Wilcox, 1990a). Even over a century ago, Karl Pearson and other researchers were concerned about the assumption that observations follow a normal distribution (e.g., Hand, 1998, p. 649). In particular, distributions can be highly skewed, they can have heavy tails (tails that are thicker than a normal distribution), and random samples often have outliers (unusually large or small values among a sample of observations). Outliers and heavy-tailed distributions are a serious practical problem because they inflate the standard error of the sample mean, so power can be relatively low when comparing groups. Modern robust methods provide an effective way of dealing with this problem. Fisher (1922), for example, was aware that the sample mean could be inefficient under slight departures from normality.
A classic way of illustrating the effects of slight departures from normality is with the contaminated, or mixed, normal distribution (Tukey, 1960). Let X be a standard normal random variable having distribution Φ(x) = P(X ≤ x). Then for any constant K > 0, Φ(x/K) is a normal distribution with standard deviation K. Let ε be any constant, 0 ≤ ε ≤ 1. The contaminated normal distribution is
H(x) = (1 − ε)Φ(x) + εΦ(x/K),  (1.1)
which has mean 0 and variance 1 − ε + εK². (Stigler, 1973, finds that the use of the contaminated normal dates back at least to Newcomb, 1896.) In other words, the contaminated normal arises by sampling from a standard normal distribution with probability 1 − ε; otherwise sampling is from a normal distribution with mean 0 and standard deviation K.
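As a small illustration (a sketch of our own, not code from the book), the following R function draws observations from the contaminated normal in Eq. (1.1); the name rcnorm and the default values ε = 0.1 and K = 10 are our choices, not part of any package. The sample variance can be compared with the theoretical value 1 − ε + εK².

```r
# Sketch only: draw n observations from the contaminated normal
# H(x) = (1 - eps)*Phi(x) + eps*Phi(x/K).
# rcnorm is our own name, not a function from any package.
rcnorm <- function(n, eps = 0.1, K = 10) {
  x <- rnorm(n)                                # standard normal draws
  bad <- rbinom(n, 1, eps) == 1                # contaminated with probability eps
  x[bad] <- rnorm(sum(bad), mean = 0, sd = K)  # replace with N(0, K^2) draws
  x
}

set.seed(1)
x <- rcnorm(1e5)
mean(x)  # near 0
var(x)   # near 1 - 0.1 + 0.1 * 10^2 = 10.9
```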
To provide a more concrete example, consider the population of all adults, and suppose that 10% of all adults are at least 70 years old. Of course, individuals at least 70 years old might have a different distribution from the rest of the population. For instance, individuals under 70 might have a standard normal distribution, but individuals at least 70 years old might have a normal distribution with mean 0 and standard deviation 10. Then the entire population of adults has a contaminated normal distribution with ε = 0.1 and K = 10. In symbols, the resulting distribution is
H(x) = 0.9Φ(x) + 0.1Φ(x/10),  (1.2)
which has mean 0 and variance 10.9. Moreover, Eq. (1.2) is not a normal distribution, verification of which is left as an exercise.
To illustrate problems that arise under slight departures from normality, we first examine Eq. (1.2) more closely. Figure 1.1 shows the standard normal and the contaminated normal probability density function corresponding to Eq. (1.2). Notice that the tails of the contaminated normal are above the tails of the normal, so the contaminated normal is said to have heavy tails. It might seem that the normal distribution provides a good approximation of the contaminated normal, but there is an important difference. The standard normal has variance 1, but the contaminated normal has variance 10.9. The reason for the seemingly large difference between the variances is that σ2 is very sensitive to the tails of a distribution. In essence, a small proportion of the population of subjects can have an inordinately large effect on its value. Put another way, even when the variance is known, if sampling is from the contaminated normal, the length of the standard confidence interval for the population mean, μ, will be over three times longer than it would be when sampling from the standard normal distribution instead. What is important from a practical point of view is that there are location estimators other than the sample mean that have standard errors that are substantially less affected by heavy-tailed distributions. By “measure of location” is meant some measure intended to represent the typical subject or object, the two best-known examples being the mean and the median. (A more formal definition is given in Chapter 2.) Some of these measures have relatively short confidence intervals when distributions have a heavy tail, yet the length of the confidence interval remains reasonably short when sampling from a normal distribution instead. Put another way, there are methods for testing hypotheses that have good power under normality but that continue to have good power when distributions are nonnormal, in contrast to methods based on means. For example, when sampling from the contaminated normal given by Eq. (1.2), both Welch’s and Student’s method for comparing the means of two independent groups have power approximately 0.278 when testing at the .05 level with equal sample sizes of 25 and when the difference between the means is 1. In contrast, several other methods, described in Chapter 5, have power exceeding 0.7.
Figure 1.1 Normal and contaminated normal distributions.
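To make the effect on the standard error concrete, here is a small simulation sketch of our own (not taken from the book): it approximates the standard error of the sample mean and the sample median for samples of size 25 drawn from the standard normal and from the contaminated normal of Eq. (1.2). The helper names rcnorm and se are ours.

```r
# Our own simulation sketch: standard errors of the sample mean and median
# under N(0,1) versus the contaminated normal of Eq. (1.2).
rcnorm <- function(n, eps = 0.1, K = 10) {
  x <- rnorm(n)
  bad <- rbinom(n, 1, eps) == 1
  x[bad] <- rnorm(sum(bad), mean = 0, sd = K)
  x
}

# Monte Carlo estimate of the standard error of a statistic
se <- function(rgen, stat, n = 25, nrep = 10000) {
  sd(replicate(nrep, stat(rgen(n))))
}

set.seed(2)
se(rnorm,  mean)    # about 1/sqrt(25) = 0.20 under normality
se(rcnorm, mean)    # about sqrt(10.9/25) = 0.66, over three times larger
se(rnorm,  median)  # roughly 0.25 under normality
se(rcnorm, median)  # stays close to its value under normality
```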
In an attempt to salvage the sample mean, it might be argued that in some sense the contaminated normal represents an extreme departure from normality. The extreme quantiles of the two distributions do differ substantially, but based on various measures of the difference between two distributions they are very similar, as suggested by Figure 1.1. For example, the Kolmogorov distance between any two distributions, F and G, is the maximum value of
|F(x) − G(x)|,
the maximum being taken over all possible values of x. (If the maximum does not exist, the supremum, or least upper bound, is used.) If distributions are identical, the Kolmogorov distance is 0, and its maximum possible value is 1, as is evident. Now consider the Kolmogorov distance between the contaminated...
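The Kolmogorov distance between the standard normal and the contaminated normal of Eq. (1.2) can be approximated numerically. The short R sketch below (our own, simply evaluating both distribution functions on a fine grid) suggests the distance is only about 0.04, even though the variances are 1 and 10.9.

```r
# Our own sketch: approximate the Kolmogorov distance between the standard
# normal Phi(x) and the contaminated normal H(x) = 0.9*Phi(x) + 0.1*Phi(x/10)
# by evaluating both distribution functions on a fine grid.
x  <- seq(-20, 20, length.out = 200001)
Fx <- pnorm(x)                             # standard normal cdf
Gx <- 0.9 * pnorm(x) + 0.1 * pnorm(x / 10) # contaminated normal cdf, Eq. (1.2)
max(abs(Fx - Gx))                          # roughly 0.04
```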
Published (per publisher) | 22.1.2005
---|---
Language | English
Subject areas | Non-fiction / Guide
 | Humanities ► Psychology ► General Psychology
 | Mathematics / Computer Science ► Mathematics ► Statistics
 | Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics
 | Technology
ISBN-10 | 0-08-047053-X / 008047053X
ISBN-13 | 978-0-08-047053-5 / 9780080470535