## How Much Data do I Need?

January 3rd, 2012

That is an excellent question, and the answer can be either exact or ambiguous. Formulas exist to calculate an exact answer, but until you collect some data you won’t know the values for some of the variables in the equation. In both cases, continuous data and discrete data, we assume we are sampling from an infinite population.

First, take a look at the case for continuous data. With continuous data we can describe a data set with the mean and the standard deviation. The mean gives the distribution’s location and the standard deviation measures the variability in the data set. If the estimate of concern is the mean of the data set, the minimum sample size can be calculated from the following formula.

$$n = \left(\frac{2s}{d}\right)^2$$

Here n is the estimate of the required minimum sample size, s is the standard deviation of the data we are sampling, which is unknown, and d is the +/- half-width of the confidence interval we want about the estimate of the mean. An interval of roughly +/- 2 standard errors (2s/√n) around the sample mean gives 95% confidence, which is why 2s appears in the formula. The only problem with the formula is that we have to guess at the standard deviation, s.
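As a sketch of where the factor of 2 comes from, assuming the usual normal approximation for the sample mean (an assumption that is not spelled out in the text): the standard error of the mean is s/√n, so a 95% interval has a half-width of about 2s/√n. Setting that half-width equal to d and solving for n gives:

$$d = \frac{2s}{\sqrt{n}} \quad\Rightarrow\quad \sqrt{n} = \frac{2s}{d} \quad\Rightarrow\quad n = \left(\frac{2s}{d}\right)^2$$

One consequence is that halving d quadruples the required sample size.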

To handle the ambiguity of the unknown standard deviation, the rule of thumb is to collect about 20 data points and then calculate an estimate of the standard deviation. Use that standard deviation in the above formula. If the formula calls for more than 20 data points, collect the additional data. With the number of data points the formula requires, you can estimate the mean with the confidence interval d that you have specified.
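The pilot-sample procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming the formula has the form n = (2s/d)^2 as the text describes; the function name `sample_size_from_pilot` and the pilot values are made up for the example.

```python
import math
import statistics


def sample_size_from_pilot(pilot, d):
    """Estimate s from a pilot sample, then apply n = (2s/d)^2.

    Returns (s, n, additional), where additional is how many more
    data points are needed beyond the pilot sample.
    """
    s = statistics.stdev(pilot)          # sample standard deviation (n - 1 denominator)
    n = math.ceil((2 * s / d) ** 2)      # round up to a whole data point
    additional = max(0, n - len(pilot))  # never negative
    return s, n, additional


# Hypothetical pilot sample of 20 measurements (illustrative values only).
pilot = [10.0, 12.0] * 10
s, n, additional = sample_size_from_pilot(pilot, d=0.5)
print(f"s = {s:.3f}, required n = {n}, additional = {additional}")
```

With this made-up pilot data the required sample is smaller than the 20 points already collected, so no further data is needed; a noisier process would push n above 20.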

The following example illustrates the use of the formula. We want to estimate the mean of our process, but how much data do we need? The confidence interval, d, is +/- 0.5 and the standard deviation is s = 2.0, as calculated from our initial sample of 20 data points.

$$n = \left(\frac{2 \times 2.0}{0.5}\right)^2 = 8^2 = 64$$

Based on the calculation above, the minimum sample size is 64, so an additional 44 data points are required to estimate the process mean with 95% confidence.
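The worked example can be checked with a short script. This is a sketch that assumes the formula n = (2s/d)^2 implied by the text; `min_sample_size_mean` is a hypothetical helper name, and the extra values of d are added only to show how strongly the sample size depends on the interval width.

```python
import math


def min_sample_size_mean(s, d):
    """Minimum sample size for estimating a mean to within +/- d at about 95% confidence."""
    return math.ceil((2 * s / d) ** 2)


s = 2.0           # standard deviation from the initial sample
pilot_size = 20   # data points already collected

for d in (1.0, 0.5, 0.25):
    n = min_sample_size_mean(s, d)
    additional = max(0, n - pilot_size)
    print(f"d = +/-{d}: n = {n}, additional data points = {additional}")
```

For d = 0.5 this reproduces the 64 data points, 44 of them additional, from the example. Tightening the interval to d = 0.25 raises the requirement to 256.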

Second is the case for discrete data. With discrete data we can describe a data set with frequency of occurrence or percent of occurrence. We often view a process by percent yield (goodness) or percent defective (poorness). This can also be expressed as percent agreement or percent disagreement with a point of view, such as a poll question. The estimate of concern is a proportion, or percentage, stated with 95% confidence to within some +/- number of percentage points.

In the case of discrete data, the minimum sample n required to estimate a proportion with 95% confidence can be calculated using the following formula. As with continuous data we specify the confidence interval d, but with discrete data it is a +/- percentage point spread about the proportion we are estimating.

$$n = \left(\frac{2}{d}\right)^2 p\,(1 - p)$$

Here n is the minimum sample size, p is the proportion we are trying to estimate, which is unknown, and d is the +/- percentage point spread we are specifying about the estimate of the proportion, written as a decimal (3 percentage points is d = 0.03). The 2 in the formula is the factor that provides the 95% confidence in the estimate of the proportion based upon the minimum sample size from the formula.
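As a sketch of the reasoning behind this formula, assuming the usual normal approximation for a sample proportion (an assumption that is not spelled out in the text): the standard error of a proportion is √(p(1−p)/n), so a 95% interval has a half-width of about twice that. Setting the half-width equal to d and solving for n gives:

$$d = 2\sqrt{\frac{p\,(1-p)}{n}} \quad\Rightarrow\quad d^2 = \frac{4\,p\,(1-p)}{n} \quad\Rightarrow\quad n = \left(\frac{2}{d}\right)^2 p\,(1-p)$$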

The unknown that drives the equation is the value of the proportion we are trying to estimate. Until we collect some data we really don’t know the value of the proportion. Make an educated guess at the proportion and use the formula. Collect the data and then calculate the proportion. Plug that proportion into the formula and determine if more data is required.
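The guess, collect, and re-check loop can be sketched as follows. This assumes the formula has the form n = (2/d)^2 p(1−p), consistent with the description of the factor of 2; the function names and the survey counts are hypothetical.

```python
import math


def min_sample_size_proportion(p, d):
    """Minimum sample size for estimating a proportion to within +/- d at about 95% confidence."""
    raw = (2 / d) ** 2 * p * (1 - p)
    # Subtract a tiny tolerance so floating-point noise does not push an exact integer up by one.
    return math.ceil(raw - 1e-9)


def more_data_needed(successes, collected, d):
    """Recompute the requirement from the observed proportion.

    Returns (p_hat, n_required, additional).
    """
    p_hat = successes / collected
    n_required = min_sample_size_proportion(p_hat, d)
    return p_hat, n_required, max(0, n_required - collected)


# Step 1: educated guess of p = 0.3 with d = 0.05 (illustrative numbers).
n_first = min_sample_size_proportion(0.3, 0.05)
print(f"initial sample size: {n_first}")

# Step 2: suppose 134 of those responses show the outcome of interest (hypothetical count).
p_hat, n_required, additional = more_data_needed(134, n_first, 0.05)
print(f"observed p = {p_hat:.3f}, required n = {n_required}, additional = {additional}")
```

In this made-up case the observed proportion is closer to 0.5 than the guess was, so the formula asks for more data than was first collected.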

The following is an example. We would like to know the percentage of registered voters who are dissatisfied with the performance of the United States Congress. The confidence interval is specified as +/- 3%, or d = 0.03. Because we are uncertain of the proportion, we begin with 50%, or p = 0.5.

$$n = \left(\frac{2}{0.03}\right)^2 (0.5)(1 - 0.5) \approx 4444.4 \times 0.25 \approx 1111.1$$

From the formula above, rounding up, we require a minimum sample of 1112 registered voters. Because p(1 − p) is largest at p = 0.5, if the actual proportion turns out to be either greater or less than 0.5 we already have an adequate sample. Next time you see a poll in the newspaper, check out the sample size and the value for d. Now you know where those numbers are coming from.
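The claim that p = 0.5 is the safe starting guess can be checked numerically. This is a sketch assuming the formula n = (2/d)^2 p(1−p), consistent with the description of the factor of 2; `min_sample_size_proportion` is a hypothetical helper name.

```python
import math


def min_sample_size_proportion(p, d):
    """Minimum sample size for estimating a proportion to within +/- d at about 95% confidence."""
    raw = (2 / d) ** 2 * p * (1 - p)
    # Subtract a tiny tolerance so floating-point noise does not push an exact integer up by one.
    return math.ceil(raw - 1e-9)


d = 0.03  # +/- 3 percentage points
for p in (0.2, 0.35, 0.5, 0.65, 0.8):
    print(f"p = {p}: n = {min_sample_size_proportion(p, d)}")
```

Under this formula the required sample peaks at p = 0.5 (1112 with d = 0.03) and falls off symmetrically on either side, which is why a sample sized for 50% is adequate whatever the true proportion turns out to be.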
