Tests of Statistical Significance
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences (3rd Edition). Academic Press, NY, 676 pp. (WI)
Spiegel, M. R., and L. J. Stephens: Statistics, 3rd Edition, Schaum's Outline Series, McGraw-Hill. (SS)
1. Motivation: Why do we need statistical significance tests?
Statistical methods are developed to understand stochastic processes in natural and human systems. These methods are often based on parameters that describe a distribution (for example, the mean and standard deviation). In addition, one may be interested in finding temporal trends, cycles, and shifts, comparing proportions, and exploring relationships among distinct variables. Since most parameters and statistical relationships (such as correlations, regressions, etc.) are obtained from a limited number of observations, a fundamental question always arises: if we added more and more data to our sample, would these relationships and parameters remain the same? How confident can we be that sampling will not significantly change our results? Were the parameters obtained 'by chance', or do they represent the actual behavior of the population?
Statistical significance tests are introduced as an attempt to answer these questions. Therefore, from now on, it is important to understand that every statistic you obtain from your data set needs to be tested for significance. Some of these tests are discussed here.
2. Parametric versus non-parametric tests (WI, chapter 5)
There are two contexts in which hypothesis tests are performed (or two types of tests).
a) Parametric tests: those conducted in situations where one knows or assumes that a particular theoretical distribution is an appropriate representation for the data and/or the test statistic.
b) Non-parametric tests: those conducted without assuming that particular theoretical forms are appropriate in a given situation.
Statisticians have done valuable work in finding theoretical distributions for some of the parameters we calculate. Fitting such a distribution amounts to distilling the information contained in a sample of data, so that the distribution parameters can be regarded as representing the nature of the underlying physical process of interest. Thus a statistical test concerning a physical process of interest can reduce to a test pertaining to a distribution parameter, such as the Gaussian mean µ.
Non-parametric (or distribution-free) tests are constructed without the necessity of assumptions about what, if any, theoretical distribution pertains to the data at hand. Two approaches can be identified:
i) To construct the test in such a way that the distribution of the data is unimportant;
ii) Critical aspects of the relevant distribution are inferred directly from the data, by repeated computer manipulations of the observations. These nonparametric tests are known broadly as RESAMPLING procedures. They are often used in the climate sciences, and we will discuss a few cases later.
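The resampling idea can be sketched in a few lines. The sketch below is purely illustrative (the function name `bootstrap_mean_diff` and all choices are invented for the example, not taken from WI): it pools two samples, builds an empirical null distribution for the difference of means by repeated resampling under Ho ("both samples come from the same population"), and reads off a two-sided p-value.

```python
import numpy as np

def bootstrap_mean_diff(x, y, n_boot=10_000, seed=0):
    """Empirical (resampling) test of Ho: the two samples come from the
    same population, using the difference of sample means as the test
    statistic. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the pooled data with replacement, then split it
        # into two groups of the original sizes.
        resampled = rng.choice(pooled, size=pooled.size, replace=True)
        diffs[i] = resampled[:x.size].mean() - resampled[x.size:].mean()
    # Two-sided p-value: fraction of resampled differences at least as
    # extreme as the observed one.
    p_value = float(np.mean(np.abs(diffs) >= abs(observed)))
    return observed, p_value
```

Note that no theoretical distribution is assumed anywhere: the null distribution is the array `diffs` itself.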
In any case, it is important to know the sampling distribution underlying any statistical test. It is common to confuse the sampling distribution of the parameter with the distribution of the variable from which the parameter is computed. They are not necessarily the same. We will show some examples.
3. Elements of any Hypothesis test (WI)
Any hypothesis test proceeds according to the following five steps (WI):
1. Identify a test statistic that is appropriate to the data and question at hand. The test statistic is the quantity computed from the data values that will often be the sample estimate of the parameter of a relevant theoretical distribution. In nonparametric resampling tests there is nearly unlimited freedom in the definition of the test statistic.
2. Define the NULL HYPOTHESIS, usually denoted as Ho. The null hypothesis constitutes a specific logical frame of reference against which to judge the observed test statistic. Often the null hypothesis is the ONE WE HOPE TO REJECT.
3. Define the ALTERNATIVE HYPOTHESIS, HA. Often the alternative hypothesis will be as simple as "Ho is not true", although more complex alternative hypotheses are also possible.
4. Obtain the NULL DISTRIBUTION, which is simply the sampling distribution of the test statistic given that the null hypothesis is true. Depending on the situation, the null distribution may be an exactly known parametric distribution, a distribution that is well approximated by a known parametric distribution, or an empirical distribution obtained by resampling the data.
5. Compare the observed test statistic to the null distribution. If the test statistic falls in a sufficiently improbable region of the null distribution, Ho is rejected. Note that not rejecting Ho does not mean that the null hypothesis is true, only that there is INSUFFICIENT EVIDENCE to reject this hypothesis. When Ho is not rejected one can really only say that it is ‘NOT INCONSISTENT’ with the observed data.
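The five steps above can be sketched with a simple two-sided z-test on a small sample of temperature anomalies. All numbers below (the data, the assumed known σ, and Ho: µ = 0) are invented for the illustration:

```python
import numpy as np
from scipy import stats

# Step 1: the test statistic is the standardized sample mean (a z-test,
# which assumes the population standard deviation sigma is known).
anomalies = np.array([1.2, 0.5, 2.1, -0.3, 1.8, 0.9, 1.4, 0.2, 2.5, 1.7])
sigma = 1.5          # assumed known population standard deviation (°C)
mu0 = 0.0            # Step 2: Ho -- the population mean anomaly is 0
alpha = 0.05         # test level, chosen in advance

z = (anomalies.mean() - mu0) / (sigma / np.sqrt(anomalies.size))

# Step 3: HA -- "the mean anomaly differs from 0" (two-sided).
# Step 4: under Ho, z follows a standard normal null distribution.
# Step 5: compare z with the null distribution via the p-value.
p_value = 2 * stats.norm.sf(abs(z))
reject_h0 = p_value < alpha
```

Here the sample mean is 1.2 °C, giving z ≈ 2.53 and a two-sided p-value of about 0.011, so Ho is rejected at the 5% level.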
4. Test Levels and p Values
Suppose that you calculated the mean temperature anomaly (observation minus climatology) in January at a given location over 1979-2011, and that it is +15°C in La Niña years and +12°C in El Niño years. You want to test whether these two means differ from each other, given the standard deviations of the temperature observed during the whole 1979-2011 period and during El Niño and La Niña years. The null hypothesis is the one that we wish to reject: that the means cannot be considered different at some probability level. We actually know the distribution of the test statistic under Ho, which is a normal distribution with a shape similar to the one shown below, centered at zero. Therefore, in order to reject Ho (which, in simple words, means rejecting the possibility that the difference lies too close to the center of the Ho distribution), you compare your computed statistic with the Ho distribution: it should fall as far from the center of the Ho distribution as possible. Thus, we need to specify the 'sufficiently improbable' region of the Ho distribution to perform the test. This level is shown in red in Fig. 1 (for a normal distribution; the Student's t distribution is similar, with heavier tails).
Figure. 1 Example of the sampling distribution of the null Hypothesis: http://www.psychstat.missouristate.edu/introbook/sbk26m.h
This probability level (also called the significance level) should be small enough to ensure that the statistic we calculated is far enough from the center of the Ho distribution. Nonetheless, it should not be so small that we end up accepting Ho when Ho is not true. We need to find the best way to balance these issues.
In the example of the means, z on the horizontal axis represents a transformation of the calculated variable into a z-variable that has a standard normal distribution (see how to do that in the test of the difference of means). If the calculated statistic is greater than Zcrit (which separates the regions with probability p = 0.05 and 0.01 in the figure above), then we reject Ho.
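The critical values that bound the 'sufficiently improbable' regions can be computed directly. This short sketch uses `scipy.stats` to show both one-sided (as in Fig. 1) and two-sided critical z values for the 5% and 1% levels:

```python
from scipy import stats

for alpha in (0.05, 0.01):
    # Two-sided test: alpha/2 probability in each tail.
    z_two = stats.norm.ppf(1 - alpha / 2)
    # One-sided test: all of alpha in a single upper tail.
    z_one = stats.norm.ppf(1 - alpha)
    print(f"alpha = {alpha}: two-sided Zcrit = {z_two:.3f}, "
          f"one-sided Zcrit = {z_one:.3f}")
```

This prints the familiar values 1.960 and 2.576 for the two-sided 5% and 1% tests, and 1.645 and 2.326 for the one-sided versions.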
The test level is chosen in advance of the computations, but it depends on the particular INVESTIGATOR'S JUDGMENT, so there is usually a degree of arbitrariness about its specific value. Commonly the 5% level is chosen, although tests conducted at the 10% or 1% level are not unusual. In any case, try to start with the 5% level and avoid levels greater than 10% (nobody will believe you). The test level is also called the SIGNIFICANCE LEVEL, often denoted by the Greek letter α, and its complement (1 − α) is known as the CONFIDENCE LEVEL. The p value of a test statistic is the probability, under Ho, of a result at least as extreme as the one observed; Ho is rejected when p < α.
5. Error Types and Power of a Test
Another way of looking at the level of a test is as the probability of falsely rejecting the null hypothesis, given that it is true. This false rejection is called a TYPE I ERROR, and its probability (the level of the test) is often denoted α. In other words, in testing a given hypothesis, the maximum probability with which we are willing to risk a Type I error is called the LEVEL OF SIGNIFICANCE, or SIGNIFICANCE LEVEL, of the test.
If, for instance, the 0.05 (or 5%) significance level is chosen in designing a decision rule, then there are about 5 chances in 100 that we would reject the hypothesis Ho when it should be accepted; that is, we are about 95% CONFIDENT that we have made the right decision. In such a case we say that the hypothesis has been rejected at the 0.05 significance level, which means there is a 0.05 probability of rejecting Ho when it is actually true.
On the other hand, the failure to reject Ho when it is actually false is called a Type II error. The probability of a Type II error is denoted β, and 1 − β is called the POWER of the test. Figure 2 illustrates these concepts.
Figure 2. The sampling distribution on the left represents the Ho distribution, and the one on the right the HA distribution. The yellow region represents the area equal to β.
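Both error types can be made concrete by simulation. The sketch below (all settings invented for the illustration) repeatedly draws samples, applies a two-sided z-test of Ho: µ = 0 at α = 0.05, and counts rejections. When Ho is true, the rejection rate estimates the Type I error rate α; when Ho is false, it estimates the power 1 − β:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, sigma, alpha = 25, 1.0, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)   # ~1.96
n_trials = 20_000

def rejection_rate(true_mean):
    """Fraction of simulated samples for which the two-sided z-test
    rejects Ho: mu = 0 (sigma assumed known)."""
    samples = rng.normal(true_mean, sigma, size=(n_trials, n))
    z = samples.mean(axis=1) / (sigma / np.sqrt(n))
    return float(np.mean(np.abs(z) > z_crit))

type_i_rate = rejection_rate(0.0)   # Ho true: should be close to alpha
power = rejection_rate(0.5)         # Ho false: power = 1 - beta
```

With these numbers the simulated Type I rate comes out near 0.05, and the power near 0.70 (theoretically about Φ(2.5 − 1.96) ≈ 0.71).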
6. One-sided versus two-sided tests (or one-tailed or two-tailed tests) (WI,SS)
A statistical test can be either one-sided or two-sided (one-tailed or two-tailed), since it is the probability in the extremes (tails) of the null distribution that governs whether a test result is interpreted as significant. Whether a test is one- or two-sided depends on the nature of the hypothesis being tested.
For instance, suppose that you calculated the temperature anomaly in January during all El Niño years (1979-2011) and found it to equal +2.9°C. You want to test whether this anomaly is significantly different from zero, the value of the mean anomaly during 1979-2011. Anomalies can be positive or negative; therefore your test will be a two-tailed test.
Often, however, we may be interested only in extreme values to one side of the mean (i.e., in one tail of the distribution), such as when we are testing the hypothesis that the mean temperature in January during La Niña years is greater than the average temperature during neutral years. Such tests are called one-sided, or one-tailed, tests. In such cases the critical region lies to one side of the distribution, with area equal to the level of significance (Figure 1 illustrates the z distribution in this case).
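The practical difference is just where the rejection probability is placed. For a hypothetical standardized statistic z = 1.8 (an invented value), the one- and two-sided p-values are:

```python
from scipy import stats

z = 1.8   # hypothetical standardized test statistic (invented value)

# Two-sided test: HA is "the mean differs from zero", so extreme
# values in either tail count against Ho.
p_two_sided = 2 * stats.norm.sf(abs(z))

# One-sided test: HA is "the mean is greater than zero", so only the
# upper tail counts against Ho.
p_one_sided = stats.norm.sf(z)
```

With z = 1.8 the one-sided test rejects at the 5% level (p ≈ 0.036) while the two-sided test does not (p ≈ 0.072), which is why the choice of test must follow from the hypothesis, not from the result.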
7. Confidence interval: (WI,SS)
Hypothesis testing can be used to construct CONFIDENCE INTERVALS around sample statistics. These are intervals constructed to be wide enough to contain, with a specified probability, the population quantity (often a distribution parameter) corresponding to the sample statistic. A typical use of confidence intervals is to construct error bars around plotted sample statistics in a graphical display.
In essence, a confidence interval is derived from the hypothesis test whose null hypothesis is that the value of the observed sample statistic corresponds exactly to the population value it estimates. The confidence interval around this sample statistic then consists of other possible values of the sample statistic for which Ho would not be rejected.
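As a sketch of this idea (with an invented sample), a 95% confidence interval for a population mean consists of all values of the mean that a two-sided test at the 5% level would not reject; in practice it is computed directly as mean ± t_crit × standard error:

```python
import numpy as np
from scipy import stats

# Invented sample of temperature anomalies (°C).
x = np.array([1.2, 0.5, 2.1, -0.3, 1.8, 0.9, 1.4, 0.2, 2.5, 1.7])
confidence = 0.95
mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(x.size)     # standard error of the mean

# Sigma is unknown and the sample is small, so use the t distribution
# with N - 1 degrees of freedom (see Section 8).
t_crit = stats.t.ppf((1 + confidence) / 2, df=x.size - 1)
ci = (mean - t_crit * sem, mean + t_crit * sem)
```

For this sample the interval is roughly (0.57, 1.83) °C; since it excludes zero, a two-sided 5% test would reject Ho: µ = 0 for these data.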
8. Small Sampling Theory (SS)
Samples of size N > 30 are usually called large samples, and the sampling distributions of many statistics are then approximately normal; the approximation improves as N increases. For samples of size N < 30, called small samples, this approximation is not good and becomes worse with decreasing N, so appropriate modifications must be made.
A study of sampling distributions of statistics for small samples is called small sampling theory. However, a more suitable name would be ‘EXACT SAMPLING THEORY’, since the results obtained hold for large (as well as for small) samples.
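The small-sample corrections are visible in the critical values themselves. This sketch compares the two-sided 5% critical values of the Student's t distribution (introduced below) with the corresponding normal value of about 1.96, for several sample sizes:

```python
from scipy import stats

# Two-sided 5% critical values: the t distribution's heavier tails
# require a larger critical value for small N, converging to the
# normal value as N grows.
for n in (5, 10, 30, 100):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"N = {n:3d}: t_crit = {t_crit:.3f}")

z_crit = stats.norm.ppf(0.975)   # ~1.960, the large-sample limit
```

For N = 5 the critical value is about 2.78, already down to about 2.05 at N = 30, which is why the N > 30 rule of thumb works reasonably well.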
The Student t Distribution
We define the statistic

t = (x̄ − µ) / (s / √N),

where x̄ is the sample mean, µ the population mean, and s the sample standard deviation computed with N − 1 in the denominator. It is analogous to the statistic

z = (x̄ − µ) / (σ / √N),

where σ is the (known) population standard deviation. The t distribution is shown in Figure 3:
Figure 3: Student's t distributions for various values of ν = N − 1 (ν is the number of degrees of freedom). Note that the largest differences from the Gaussian occur for small values of ν.
It is called 'Student's t distribution' after its discoverer, W. S. Gosset, who published his work under the pseudonym "Student" during the early part of the 20th century.
The NUMBER OF DEGREES OF FREEDOM of a statistic, generally denoted ν, is defined as the number N of independent observations in the sample (i.e., the sample size) minus the number k of population parameters that must be estimated from the sample observations (for instance, k = 1 when the mean is estimated from the sample). In symbols, ν = N − k.
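Putting the pieces together, here is a sketch of a one-sample t-test on an invented small sample (N = 8). One parameter (the mean) is estimated, so ν = N − 1 = 7, and the manually computed t statistic matches `scipy.stats.ttest_1samp`:

```python
import numpy as np
from scipy import stats

# Invented small sample (N = 8); test Ho: mu = 0.
x = np.array([0.4, 1.1, -0.2, 0.8, 1.5, 0.3, 0.9, 0.6])
nu = x.size - 1                      # degrees of freedom: nu = N - k, k = 1

# Manual t statistic: sample mean over its estimated standard error,
# with s computed using N - 1 in the denominator (ddof=1).
t_manual = x.mean() / (x.std(ddof=1) / np.sqrt(x.size))

# The same test via scipy (two-sided by default), which evaluates the
# p-value against the t distribution with nu degrees of freedom.
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
```

For this sample t ≈ 3.65 with ν = 7, and the two-sided p-value falls below 0.01, so Ho: µ = 0 would be rejected even at the 1% level.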