Published on April 3rd, 2013
The Logic behind Statistical Inference – Sample Theory, Population Size, and Stochastic Model Theory
By Tor G. Jakobsen
According to sample theory the size of our sample greatly influences our ability to generalize results back to the population we are investigating. Still, a couple of questions go unanswered: “Does the population size matter for the stars of significance?” and “What if we are examining the whole population?”
As we know, when our sample reaches 1000–1200 it is much easier to obtain significant results than when it is a mere 25. Sample theory centers on the central limit theorem, which briefly states that, as the sample size N becomes large, the sampling distribution of the mean becomes approximately normal. This sampling distribution is also centered on the variable’s population mean, and its spread shrinks as N grows.
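The effect of sample size can be sketched with a small simulation. This is only an illustration under stated assumptions: the population is made up (skewed, exponentially distributed values), and the function name `sample_means` and the seed are hypothetical choices, not part of the original article.

```python
import random
import statistics


def sample_means(population, n, draws, rng):
    """Draw `draws` random samples of size n and return the mean of each."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Assumption: a deliberately skewed, invented population (not real data).
population = [rng.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)

centers = {}
spreads = {}
for n in (25, 1000):
    means = sample_means(population, n, draws=2000, rng=rng)
    centers[n] = statistics.fmean(means)  # where the sampling distribution sits
    spreads[n] = statistics.stdev(means)  # how wide the sampling distribution is
    print(f"n={n}: center={centers[n]:.3f}, spread={spreads[n]:.3f}")

print(f"population mean={pop_mean:.3f}")
```

Even though the population itself is far from normal, the sample means cluster around the population mean, and they cluster much more tightly for samples of 1000 than for samples of 25.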
This presupposes that the units are sampled either at random or with a known probability of being chosen (for which the researcher can later adjust by weighting). The latter option is often carried out through stratified sampling, where the population is divided into strata (for example districts) with the aim of examining subgroups more closely.
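How a known selection probability can be adjusted for is easiest to see in a toy example. The sketch below assumes an invented population with two strata of very different size, draws the same number of units from each, and weights every sampled unit by the number of population units it represents. The stratum names, sizes, and values are hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Assumption: an invented population with one large and one small stratum.
strata = {
    "large_stratum": [rng.gauss(10, 2) for _ in range(8000)],
    "small_stratum": [rng.gauss(20, 2) for _ in range(2000)],
}
true_mean = statistics.fmean(v for values in strata.values() for v in values)

per_stratum = 100  # same sample size in each stratum, so the small one is oversampled
sample = []  # list of (value, weight) pairs
for values in strata.values():
    weight = len(values) / per_stratum  # inverse of the selection probability
    sample.extend((v, weight) for v in rng.sample(values, per_stratum))

unweighted_mean = statistics.fmean(v for v, _ in sample)
weighted_mean = sum(v * w for v, w in sample) / sum(w for _, w in sample)

print(f"true={true_mean:.2f}  unweighted={unweighted_mean:.2f}  weighted={weighted_mean:.2f}")
```

The unweighted mean is pulled toward the oversampled small stratum, while the weighted mean lands close to the population mean.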
Our results are then reported with a probability value (p-value), which ranges between 0 and 1. It gives us the probability of observing a correlation at least as strong as the one in our sample if there were in fact no correlation in our unobserved population. In other words, the lower our p-value is, the stronger the evidence that the observed relationship in the sample is also present in the population. I must stress that finding a correlation does not necessarily imply causation (correlation is just one of its components), and that a statistically significant result is not necessarily a substantively important one (especially when the sample size is large).
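One way to make the p-value concrete is a permutation test: shuffle one variable so that any real link is broken, and count how often chance alone produces a correlation at least as strong as the observed one. This is a swapped-in illustration, not the procedure a statistics package would normally use for a correlation, and the data, function names, and seed below are invented.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm, rng):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # +1 so the estimate is never exactly 0


rng = random.Random(1)  # fixed seed so the sketch is reproducible
x = [rng.gauss(0, 1) for _ in range(50)]
y_related = [a + 0.5 * rng.gauss(0, 1) for a in x]  # built to depend on x
y_unrelated = [rng.gauss(0, 1) for _ in range(50)]  # built independently of x

p_related = permutation_p_value(x, y_related, 2000, rng)
p_unrelated = permutation_p_value(x, y_unrelated, 2000, rng)
print(f"related: p={p_related:.4f}   unrelated: p={p_unrelated:.4f}")
```

For the pair that was built to be related, shuffling almost never reproduces a correlation as strong as the observed one, so the p-value is very small.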
Even though it is seemingly complicated, sampling theory should be familiar to most students of the social sciences.
Does the size of the population matter?
However, what about the size of the population? Do we get better results if we sample 1000 Norwegians and generalize back to the Norwegian population than if we sample 1000 Americans and generalize to the U.S. population? The answer is no.
If we look at the mathematics behind probability, we see that the size of the population really does not matter. A given sample size is equally useful in examining the opinions of Icelanders as it is in examining those of the Chinese.
However, there is an exception to this rule. If the size of the sample exceeds a few percent of the population, the confidence intervals become narrower. In other words, population size is only likely to be a factor if you are investigating a small and known group of people, such as an organization or a sports club.
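A minimal sketch of the arithmetic, assuming the standard margin-of-error formula for a proportion with a finite population correction (the article does not say which formula it has in mind). The function name, the choice of p = 0.5, and the rough population figures are illustrative assumptions.

```python
import math


def margin_of_error(n, population_size=None, p=0.5, z=1.96):
    """Half-width of the 95 % confidence interval for a proportion.

    If population_size is given, the finite population correction is applied.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        standard_error *= math.sqrt((population_size - n) / (population_size - 1))
    return z * standard_error


# Rough, illustrative population sizes (assumptions, not exact figures).
for label, size in [("Norway", 5_500_000), ("United States", 330_000_000), ("club", 2_000)]:
    print(f"{label}: +/- {100 * margin_of_error(1000, size):.2f} percentage points")
```

With 1000 respondents the two national margins are practically identical, while the same 1000 respondents drawn from a group of 2000 give a clearly narrower interval, because the sample is half of the population.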
This can be illustrated by the following table, which shows the sample size needed for a margin of error of ±5 percentage points at the 95 % confidence level:
Population | Sample |
10 | 10 |
50 | 44 |
100 | 80 |
200 | 132 |
500 | 217 |
1000 | 278 |
3000 | 341 |
100,000+ | 385 |
In essence, the size of the population matters only as long as the sample makes up a sizeable share of it (as is illustrated in the graph). As the population increases, the sample size needed for a given confidence interval also increases, but at a diminishing rate, until it levels off at slightly more than 380 cases.
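The figures in the table can be approximated with a common textbook formula: the sample size for an infinite population, adjusted by a finite population correction. That this is the formula behind the table is an assumption; the article does not state its source. The function name and defaults (95 % level, ±5 points, p = 0.5) are illustrative.

```python
import math


def required_sample_size(population_size=None, margin=0.05, p=0.5, z=1.96):
    """Sample size needed for a given margin of error (unrounded).

    With population_size=None the population is treated as infinite.
    """
    n_infinite = z ** 2 * p * (1 - p) / margin ** 2  # about 384.16 with the defaults
    if population_size is None:
        return n_infinite
    return n_infinite / (1 + (n_infinite - 1) / population_size)


for size in (10, 50, 100, 200, 500, 1000, 3000):
    print(size, round(required_sample_size(size)))

# For a very large population the requirement stops growing.
print("very large:", math.ceil(required_sample_size()))
```

Rounded to the nearest whole number, this reproduces the rows up to a population of 3000. The last row corresponds to the infinite-population value of about 384.16 rounded up to 385.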
It must be noted that if you are investigating a small and known group of people, you might want to consider a different underlying logic for your use of levels of significance. This leads us on to the next topic.
Why do I need significance levels if I am investigating the whole population?
In many cases a social scientist will be investigating a full population. It could be within international relations, study of war and peace, or in international political economy. Or it could be in business, investigating all fast food restaurants in a city.
When examining the whole population (or as close as you get, the important thing is that your aim is to examine the whole population) and not just a sample of it, you are now generalizing within stochastic model theory (rather than within sample theory).
When following sample theory, we generalize from the sample to the population. According to this logic, when one looks at the entire population there is no sampling error left, so one should get perfect predictions. This is where stochastic model theory becomes useful: we are instead generalizing from the observations made to the process or mechanism that brings about the actual data.
Our starting point is a nondeterministic experiment, which implies that the results of the experiment will vary even if we try to keep the conditions surrounding it constant. It can be roughly compared to throwing a fair die twice at your desk, without moving your cup, books, and pencils between the throws.
Thus, the use of confidence intervals and significance levels makes sense even if we are looking at the entire population. A lack of statistical significance indicates that the observed association could just as well have been produced by chance as by a systematic mechanism. In that case we are dealing with a mechanism best described as an unspecified random process.
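The die analogy can be turned into a small simulation. Assume a complete population of 30 units, where each unit's two characteristics are nothing more than two independent throws of a fair die, and let the same process generate the population many times. The number of units, the number of replays, and the function names are hypothetical choices made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_world(units, rng):
    """One complete population whose two variables are independent die throws."""
    x = [rng.randint(1, 6) for _ in range(units)]
    y = [rng.randint(1, 6) for _ in range(units)]
    return pearson_r(x, y)


rng = random.Random(3)  # fixed seed so the sketch is reproducible
abs_r = sorted(abs(one_world(30, rng)) for _ in range(2000))

# Correlation strength that pure chance exceeds in only about 5 % of replays.
chance_cutoff = abs_r[int(0.95 * len(abs_r))]
print(f"median |r| = {abs_r[len(abs_r) // 2]:.3f}, 95 % cutoff = {chance_cutoff:.3f}")
```

Every replay observes the whole population, yet the correlations are never exactly zero. A significance test on full-population data asks whether the observed association is stronger than what such a random process routinely produces.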
One could argue that for sample data we should then have a double test of significance. In theory this is true, as we have both the uncertainty about whether the sample is a true mirror image of the population and the uncertainty about whether the correlations in the population could have been produced by chance by the unspecified random process. However, we usually state that our aim is to see if the relationships are present in the population (and not if the relationships in the population are real), and we thus only have to operate within sample theory.
Concluding remarks
The statistical method is suitable for making generalizations that go beyond the collected data, and can thus assist the researcher in identifying patterns and regularities in the observable world. Mathematics, in the form of statistics, is a great tool that can assist the social scientist in getting access to and explaining the complexities of life. But the researcher must be aware of the underlying logic of why he or she makes those inferences.
Further reading:
Gold, David (1969) “Statistical Tests and Substantive Significance” American Sociologist, 4(1): 42–46.
Henkel, Ramon E. (1976) Tests of Significance. Beverly Hills: Sage.
Krejcie, Robert V. & Daryle W. Morgan (1970) “Determining Sample Size for Research Activities” Educational and Psychological Measurement, 30(3): 607–610.
*Cover photo by Trevor Blake
Prof., thanks for the good work that you are doing in the field of research. Does the methodology adopted by the researcher influence the sampling strategy? The sample size calculator has been given for the determination of sample size, but how does one draw samples from the different sub-groups out of the identified population? Is there an agreed percentage to be taken? Thanks.