The sample size determines the amount of sampling error inherent in a test result. Other things being equal, effects are harder to detect in smaller samples. Increasing sample size is often the easiest way to boost the statistical power of a test.
I wonder why it is often said that a bigger sample size can increase the power (i.e. true positive rate) of a test in general. Does bigger sample size always increase testing power? Added: Suppose at each sample size $n$, reject null iff $T_n(X) \geq c_n$. How power changes with $n$ depends on how $T_n$ and $c_n$ are defined in terms of $n$, doesn't it? Even if $c_n$ is chosen so that the size of the testing rule is a value $\alpha \in [0,1]$ fixed for all $n$ values, will the power necessarily increase with $n$? Explanations that are rigorous and intuitive are both welcome. Thanks!
asked Oct 20, 2012 at 19:30

It is trivial to show that larger sample size in the IID setting cannot decrease testing power. In the worst case, if you have a good test for $n$ data and you have collected $n'\gt n$ observations, you can always apply your test to a subset of $n$ of your observations. – Commented Oct 20, 2012 at 19:57
Thanks, @whuber. Why "larger sample size in the IID setting cannot decrease testing power"? – Commented Oct 20, 2012 at 20:57
@whuber: Do you mean that the test statistic can be chosen to be arbitrary? What I think is that the test statistic and critical value are subject to some forms. For example, in a z-test, reject the null iff $Z_n(X) \geq z_\alpha$, where $z_\alpha$ is fixed for all $n$ values, and $Z_n(X)$ has a specific form for all $n$ values. Why does its power increase as $n$ increases? Is this also the same for other tests that reject the null iff $T_n(X) \geq t$ for some test statistic $T_n$ and fixed $t$? – Commented May 16, 2013 at 1:33
The power of the test depends on the distribution of the test statistic when the null hypothesis is false. If $R_n$ is the rejection region of the test for a sample of size $n$, the power is $$\beta = \Pr(X_n \in R_n \mid H_A)$$ where $H_A$ is the alternative hypothesis and $X_n$ is the test statistic for a sample of size $n$. I am assuming a simple alternative, although in practice we usually care about a range of parameter values.
Typically, a test statistic is some sort of average whose long term behaviour is governed by the strong and/or weak law of large numbers. As the sample size gets large, the distribution of the test statistic approaches that of a point mass --- under either the null or the alternative hypotheses.
Thus, as $n$ gets large, the acceptance region (the complement of the rejection region) gets smaller and concentrates around the value specified by the null. Intuitively, probable outcomes under the null and probable outcomes under the alternative no longer overlap, meaning that the rejection probability approaches 1 under $H_A$ and 0 under $H_0$. Increasing sample size is like increasing the magnification of a telescope: from a distance, two dots might seem indistinguishably close; with the telescope, you realize there is space between them. Sample size puts "probability space" between the null and alternative.
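To make this concrete, here is a minimal sketch in Python (the values of $\alpha$, $\delta$, and $\sigma$ are illustrative choices, not taken from the thread) for the one-sided z-test mentioned in the comments: the critical value $z_{1-\alpha}$ stays fixed while the non-centrality $\delta\sqrt{n}/\sigma$ grows, so the power is strictly increasing in $n$.

```python
# Power of a one-sided z-test of H0: mu = 0 vs HA: mu = delta > 0, with
# known sigma and size alpha fixed for every n. Reject when
# sqrt(n) * xbar / sigma >= z_{1-alpha}; under HA that statistic is
# N(delta * sqrt(n) / sigma, 1), so
#   power(n) = 1 - Phi(z_{1-alpha} - delta * sqrt(n) / sigma).
from scipy.stats import norm

alpha, delta, sigma = 0.05, 0.3, 1.0   # illustrative values
z_crit = norm.ppf(1 - alpha)           # fixed critical value z_{1-alpha}

for n in [10, 30, 100, 300, 1000]:
    power = 1 - norm.cdf(z_crit - delta * n ** 0.5 / sigma)
    print(f"n = {n:4d}  power = {power:.4f}")
```

The printed power climbs monotonically toward 1, which is exactly the "probability space" separation described above.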
I am trying to think of an example where this does not occur --- but it is hard to imagine oneself using a test statistic whose behaviour does not ultimately lead to certainty. I can imagine situations where things don't work: if the number of nuisance parameters increases with sample size, things can fail to converge. In time series estimation, if the series is "insufficiently random" and the influence of the past fails to diminish at a reasonable rate, problems can arise as well.
answered Oct 20, 2012 at 22:36

In a two-sample situation, increasing the sample size of one group to infinity does not send the power of the test to 1. The power will be limited by the sample size of the smaller group (or, to be precise, by a combination of the within-group variances and the sample sizes, if you think about a $t$-test). Dependent data can lead to all sorts of problems; that's why the time-series people came up with the concept of mixing as a way to approximate independence for observations that are "far enough" from one another. – Commented Oct 21, 2012 at 3:08
These would be cases where the laws of large numbers wouldn't apply. – Commented Oct 21, 2012 at 3:36

Here's one intuitive answer: in the real world, you are almost always sampling from a finite population (although it may be very large). If you managed to measure the entire population, power would be infinite (well, 1.0, which is essentially like infinite power: you could detect any difference), because you would know the exact difference. The closer you get to the whole population (given that your sample is random), the more precise your estimate can be.
However, if you get away from random samples, this is no longer the case. Intuitively again, suppose you are testing the difference in height between adult men and adult women. One extreme way to be non-random is to test a sample of very short men (say, you sample from a population of jockeys) against a sample of very tall women (basketball players).
answered Oct 20, 2012 at 20:22 by Peter Flom

Thanks! So in what cases do people say increasing sample size will increase power? The random sampling case? BTW: I think power is in $[0,1]$, so I don't understand "power would be infinite". – Commented Oct 20, 2012 at 20:56
Well, really, power increases with sample size in almost any reasonable scheme, but it can only be proven to do so when there is random sampling. You're right, I should have said power would be 1.0. – Commented Oct 20, 2012 at 21:01
Thanks! Mathematically, why does power increase (or at least not decrease) with sample size under random sampling? – Commented Oct 20, 2012 at 21:41
With regard to populations being finite, I generally conceptualise my populations in psychological research as something more hypothetical. E.g., my population is not all adults currently alive; rather it is all humans that do or could exist, both now and in the future, assuming a similar data-generating process for other humans. I.e., the population of humans that currently exist is itself assumed to be a sample from some broader data-generating process. From this perspective, the population is infinite. – Commented May 16, 2013 at 5:17
Thinking a little more about this part of the question:
Does bigger sample size always increase testing power?
If we are only talking about tests generally covered in an introductory statistics course and the conditions for those tests hold (e.g. simple random sample, central limit theorem gives an approximate normal, null hypothesis is false, etc.) then yes, increasing the sample size will increase the power. However, here are some cases where increasing the sample size may not increase the power:
If the underlying distribution is a Cauchy (undefined mean and variance, so the CLT does not apply), then increasing the sample size may not increase the power (though I don't know what test you would be doing on such data, or even a realistic case that would follow a Cauchy); see the sketch below.
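A quick way to see this (a hypothetical simulation of my own, not part of the original answer): the sample mean of $n$ i.i.d. standard Cauchy variables has exactly the same Cauchy distribution as a single observation, so a test that rejects when the sample mean exceeds a fixed cutoff gains nothing from a larger $n$.

```python
# The mean of n i.i.d. Cauchy(loc, 1) draws is again Cauchy(loc, 1), so the
# rejection rate of "reject when the sample mean exceeds a fixed cutoff"
# does not improve with n. The location shift 0.5 is an arbitrary choice.
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(0)
c = cauchy.ppf(0.95)   # size-0.05 cutoff: under H0 the sample mean is Cauchy(0, 1)

for n in [10, 100, 1000]:
    means = cauchy.rvs(loc=0.5, size=(5000, n), random_state=rng).mean(axis=1)
    print(f"n = {n:4d}  rejection rate under HA ~ {np.mean(means >= c):.3f}")
# All three rates stay near cauchy.sf(c - 0.5), about 0.054, regardless of n.
```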
Excessive sampling causes subjects to lose interest and stop cooperating. I remember a presentation about an election in one of the Caribbean island countries where the polling got so out of hand that all the registered voters were being surveyed on average every week; they got so fed up that they stopped answering or just lied. The presentation showed that if the pollsters had used smaller samples for each survey, the population would not have been as frustrated and they probably would have received better results.
Response rates and cost: if you plan a mail-out survey and send it to 1,000 people with no other follow-up, you might only get 100 responses; but if you use the same money to send out only 200 surveys, with follow-up letters and/or incentives, you may receive 150 responses. So the actual amount of data from the smaller planned study of 200 subjects can exceed that from the planned 1,000 subjects. This can also influence data quality: an in-person interview of 50 people, or a telephone interview of 100 people, may yield better-quality data than a mail-out survey of 1,000.
The concept of power is only for the cases where the null hypothesis is false, so if the null is true then power will not be affected by sample size.
When taking multiple measurements per subject, the concept of sample size is more complicated. Which gives more power: 10 measurements on each of 20 subjects (200 measured values in total) or 2 measurements on each of 50 subjects (100 measured values in total)? Often the second will give more power even though the total number of measurements is smaller; the effective-sample-size sketch below illustrates why.
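A rough way to compare the two designs (my own sketch, assuming a compound-symmetry model with intra-subject correlation $\rho$, which is not specified in the answer): repeated measures on the same subject are correlated, so $k$ measurements on each of $m$ subjects behave like roughly $mk/(1+(k-1)\rho)$ independent observations.

```python
# Effective sample size under an assumed compound-symmetry model:
# k correlated measurements per subject, intra-subject correlation rho.
def n_eff(subjects: int, k: int, rho: float) -> float:
    return subjects * k / (1 + (k - 1) * rho)

rho = 0.5  # assumed intra-subject correlation, purely for illustration
print(n_eff(20, 10, rho))  # 200 measurements -> ~36.4 effective observations
print(n_eff(50, 2, rho))   # 100 measurements -> ~66.7 effective observations
```

With $\rho = 0.5$, the 100-measurement design is worth nearly twice as many effective observations as the 200-measurement one.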
If the parameter of interest changes over time (think of election polling) and obtaining the bigger sample requires more time in which things could change, that can affect the power. Think of comparing a sample of 100 taken on a single day vs. a sample of 1,000 taken over a 2-week period (and what if there is a publicised debate, scandal, etc. during those 2 weeks?).
If you have a test whose type I error rate is not exactly alpha and depends on the sample size, then increasing the sample size can actually decrease the power. Consider a binomial test with the null that the probability is $0.5$, the one-sided alternative that it is greater, and $\alpha=0.05$. With a sample size of $n=5$ we can only reject if we see 5 successes (type I error rate 0.03125); with a sample of $n=10$ we reject if we see 9 or 10 successes (type I error rate 0.01074; it would be 0.05469 if we also rejected at 8 successes). If the true probability is $0.6$, then the probability of rejecting (the power) with $n=5$ is $0.07776$ and with $n=10$ it is $0.04636$ (see the check below), so doubling the sample size decreased the power, but it also decreased the type I error rate, so it is not really a fair comparison. Increasing the sample size to where the power is meaningful ($>80\%$) will make it much harder to find examples like this (though there are probably some where increasing $n$ by 1 decreases the power slightly).
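The numbers above are easy to verify exactly with `scipy.stats.binom` (a check I added, not part of the original answer):

```python
# Exact size and power of the one-sided binomial test: reject H0: p = 0.5
# when the success count reaches the given cutoff.
from scipy.stats import binom

for n, cutoff in [(5, 5), (10, 9)]:
    size = binom.sf(cutoff - 1, n, 0.5)    # P(X >= cutoff | p = 0.5)
    power = binom.sf(cutoff - 1, n, 0.6)   # P(X >= cutoff | p = 0.6)
    print(f"n = {n:2d}, reject at >= {cutoff}: size = {size:.5f}, power = {power:.5f}")
# n =  5: size = 0.03125, power = 0.07776
# n = 10: size = 0.01074, power = 0.04636
```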
If you are running the wrong test, one whose assumptions are violated (e.g. a test that assumes equal variances when they are quite different), then your power might not increase; the simulation sketch below shows one such case.
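One concrete instance (an illustrative simulation of my own, not from the answer): apply the pooled, equal-variance $t$-test to two groups whose variances differ by a factor of 25, and grow only the high-variance group. The pooled standard error stays dominated by the large variance, so the rejection rate does not climb toward 1.

```python
# Pooled (equal-variance) t-test with a small fixed low-variance group and a
# growing high-variance group; the rejection rate fails to increase with n1.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
reps = 2000

for n1 in [10, 50, 200]:
    rejects = sum(
        ttest_ind(rng.normal(0.0, 5.0, n1),   # high-variance group, growing
                  rng.normal(0.5, 1.0, 10),   # low-variance group, fixed at 10
                  equal_var=True).pvalue < 0.05
        for _ in range(reps))
    print(f"n1 = {n1:3d}  rejection rate ~ {rejects / reps:.3f}")
```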
If you first run a normality test and then choose a second test based on the result, then larger samples are more likely to reject normality (even when the departure does not matter), and if the test you run as a result is less powerful than the one you would have used had normality not been rejected, then increasing the sample size could reduce the power. This is one argument against pre-testing the data for normality; the sketch below shows how the pre-test's rejection rate climbs with $n$ for trivially non-normal data.
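To illustrate the pre-testing point (my own sketch; the "trivially non-normal" data here are $t_{30}$ draws, an assumption I chose because they are almost indistinguishable from normal):

```python
# Fraction of samples in which a Shapiro-Wilk pre-test rejects normality,
# for t-distributed data with 30 df (practically normal), as n grows.
import numpy as np
from scipy.stats import shapiro, t

rng = np.random.default_rng(1)
reps = 500

for n in [20, 100, 500, 2000]:
    rejects = sum(
        shapiro(t.rvs(df=30, size=n, random_state=rng)).pvalue < 0.05
        for _ in range(reps))
    print(f"n = {n:4d}  normality rejected in {rejects / reps:.1%} of samples")
```

The rejection rate rises with $n$ even though the $t_{30}$ departure from normality is irrelevant for most purposes.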
There are probably other cases which, like these, are beyond the scope of what the Wikipedia article was trying to cover.