Psycholinguists and psychologists often adopt the following type of data-gathering procedure: the experimenter gathers n data points, then checks for significance (p<0.05 or not). If it's not significant, he gathers n more data points and tests again. Since time and money are limited, he might decide to stop anyway at some multiple of n. One can play with different scenarios here; a typical n might be 10 or 15.

Under repeated sampling, this procedure yields a distribution of t-values and p-values. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error rate of 0.05, and that is indeed what happens in the standard fixed-sample analysis (I also track the t-statistic, in order to compare it with my stopping-rule code below).

Here is a simulation showing what happens; I wanted to ask you whether it makes sense. I assume that the experimenter gathers 10 data points, checks for significance, and, if the result is not significant, gathers 10 more, stopping in any case at a sample size of 60. First, the standard fixed-sample analysis, where the Type I error rate is 5% as expected:

## Standard:
nsim<-1000 # number of simulations
n<-10 # sample size
stddev<-1 # standard deviation
mn<-0 # true mean: the null hypothesis is true
pvals<-numeric(nsim)
tstat_standard<-numeric(nsim)
for(i in 1:nsim){
  samp<-rnorm(n,mean=mn,sd=stddev)
  fit<-t.test(samp) # run the test once and reuse the result
  pvals[i]<-fit$p.value
  tstat_standard[i]<-fit$statistic
}
## Type I error rate: about 5%, as theory says:
mean(pvals<0.05)
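
As a quick sanity check (my addition, not part of the original analysis), an exact binomial confidence interval around the observed rejection count should typically cover 0.05 here:

## exact binomial CI for the rejection rate;
## with the null true, it should typically cover 0.05
binom.test(sum(pvals<0.05),nsim)$conf.int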

But the situation quickly deteriorates as soon as we adopt the stopping rule outlined above:

pvals<-numeric(nsim)
tstat<-numeric(nsim)
## how many subjects can I run at most?
upper_bound<-n*6
for(i in 1:nsim){
  ## at the outset we have no significant result:
  significant<-FALSE
  ## the null hypothesis is true,
  ## so any rejection is a mistake.
  ## take the initial sample:
  x<-rnorm(n,mean=mn,sd=stddev)
  while(!significant && length(x)<upper_bound){
    if(t.test(x)$p.value>0.05){
      ## not significant: get more data
      x<-append(x,rnorm(n,mean=mn,sd=stddev))
    } else {
      ## significant: stop collecting data
      significant<-TRUE
    }
  }
  ## the final test will be either significant or not:
  fit<-t.test(x)
  pvals[i]<-fit$p.value
  tstat[i]<-fit$statistic
}
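
To see the inflation directly, we can compute the empirical Type I error rate under this stopping rule, mirroring the check above; it comes out well above the nominal 5% (the exact figure varies with the random seed):

## Type I error rate under the stopping rule:
## well above the nominal 5%
mean(pvals<0.05)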

Now let's compare the distribution of the t-statistic in the standard case with its distribution under the stopping rule above. The stopping rule gives us fatter tails, as shown by the histogram below.
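
One way to draw that comparison in base R (a minimal sketch; the breaks, colours, and legend placement are arbitrary choices of mine):

## overlaid histograms of the two t-statistic distributions
hist(tstat,breaks=50,freq=FALSE,col=rgb(1,0,0,0.5),
     xlab="t-statistic",
     main="t-statistic: standard vs. stopping rule")
hist(tstat_standard,breaks=50,freq=FALSE,col=rgb(0,0,1,0.5),add=TRUE)
legend("topright",legend=c("stopping rule","standard"),
       fill=c(rgb(1,0,0,0.5),rgb(0,0,1,0.5)))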