Does a Given Function Really Fit Your Data?


The problem stated above is one that crops up all the time in any experimental science. We can all do the trick of taking a ruler, drawing a line through our points, and judging whether or not the fit "looks good". This method is known as "chi-by-eye" because it's a crude way of applying the principles which I'm about to discuss.

In general, a function is a "good fit" if:

  1. It falls within the error bars of most of your data points (about 68% if your error bars are at the 1 sigma level), and
  2. Your data appear to be scattered randomly around the fit.

When we try to draw a curve through our data, most of us instinctively follow these criteria. The wonders of the Gaussian distribution, however, allow us to do more than this! To be rigorous, we should quantify the "goodness of fit" in some reasonable way, and then calculate the probability that the data are consistent with the fitting function.

If you have error bars on all your data points, this isn't as hard as it sounds. The parameter which measures the "goodness of fit" is called chi^2, and it is calculated like this:

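chi^2 = Sum_i [ (y_i - y_fit(x_i)) / sigma_i ]^2

Here y_i is the measured value at x_i, y_fit(x_i) is the value of the fitting function at the same x_i, and sigma_i is the uncertainty on y_i.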

In other words, you measure each point's deviation from the fit in units of its uncertainty, square it, and add them all up. Calculating it this way ensures that the same chi^2 contribution always corresponds to the same probability that a data point is consistent with the fit.
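The recipe above can be written out in a few lines. This is only a minimal sketch: the function name chi_squared and its arguments are my own choices, and the numbers in the usage lines are made up for illustration.

```python
def chi_squared(y_measured, y_fit, sigma):
    """Sum of squared deviations from the fit, each measured in units of its uncertainty."""
    if not (len(y_measured) == len(y_fit) == len(sigma)):
        raise ValueError("all three sequences must have the same length")
    return sum(((y - f) / s) ** 2 for y, f, s in zip(y_measured, y_fit, sigma))


# Made-up numbers, for illustration only: one point is 1 sigma from the
# fit and the other is 2 sigma away.
example = chi_squared([1.0, 2.0], [1.5, 1.0], [0.5, 0.5])
print(example)  # 1**2 + 2**2 = 5.0
```

Note that the 2 sigma point contributes four times as much as the 1 sigma point, whatever the absolute size of the error bars.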

The chi^2 is commonly used by fitting programs to judge curves; the best fit is the one for which this parameter is smallest. To come up with a useful measure of the probability that a curve really represents the data, however, we need to take another step.

The general form of a mathematical function will have a number of free parameters: variables which can be adjusted to change the shape of the curve. The general form of a line, for example, is y = mx + b. When a computer tries to fit a line to a data set, it adjusts m and b until the chi^2 has reached a minimum value. If there are only two data points, this minimum value will always be zero, because two points define a line.
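For a straight line the minimum can be found in closed form, without any trial and error. The sketch below uses the standard weighted least-squares formulas for m and b; the names fit_line and chi_squared_line are hypothetical, and the two data points are made up to show that the minimum comes out to zero when there are only two of them.

```python
def fit_line(x, y, sigma):
    """Return (m, b) that minimize chi^2 for the line y = m*x + b."""
    w = [1.0 / s**2 for s in sigma]  # each point is weighted by 1/sigma^2
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx**2
    if delta == 0:
        raise ValueError("need at least two distinct x values")
    m = (S * Sxy - Sx * Sy) / delta
    b = (Sxx * Sy - Sx * Sxy) / delta
    return m, b


def chi_squared_line(x, y, sigma, m, b):
    """chi^2 of the data with respect to the line y = m*x + b."""
    return sum(((yi - (m * xi + b)) / si) ** 2 for xi, yi, si in zip(x, y, sigma))


# Two made-up points: the best-fit line passes through both of them.
x, y, sigma = [1.0, 3.0], [2.0, 6.0], [1.0, 1.0]
m, b = fit_line(x, y, sigma)
print(m, b, chi_squared_line(x, y, sigma, m, b))
```

A general-purpose fitting program does the same job numerically for functions where no closed form exists.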

The same thing happens if we try to fit a quadratic curve to three points, a cubic to four points, and so forth. Because of this, the first few data points don't give us any real information about the goodness of fit. To acknowledge this problem, we introduce the concept of degrees of freedom. We say that any attempt to fit a function to a data set has d = N - f degrees of freedom, where N is the number of data points and f is the number of free parameters in the function.

To work out the probability that our function represents the data set, we use the "reduced chi^2" parameter, which is just the chi^2 divided by the number of degrees of freedom. This is a way of sharing the total deviation of our curve evenly among the significant data points. Since the expected size of any single deviation is sigma, it stands to reason that the expected value of the reduced chi^2 is 1, if the fit really does represent the data. If it comes out to much less than one, you probably overestimated your uncertainties; if it comes out to more than one, the probability that the fit is reasonable drops accordingly. For example, a reduced chi^2 of 4 says that the typical data point differs from the fit by about twice its uncertainty, which is clearly unlikely.
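As a small sketch of the bookkeeping, the function below divides a chi^2 by d = N - f and refuses to do so when there are no degrees of freedom left. The name reduced_chi_squared and the input numbers are assumptions made for illustration.

```python
import math


def reduced_chi_squared(chi2, n_points, n_params):
    """chi^2 divided by the degrees of freedom d = N - f."""
    dof = n_points - n_params
    if dof <= 0:
        raise ValueError("need more data points than free parameters")
    return chi2 / dof


# Made-up example: a chi^2 of 12 from a line (f = 2) fit to N = 5 points.
reduced = reduced_chi_squared(12.0, n_points=5, n_params=2)
print(reduced)             # 12 / 3 = 4.0
print(math.sqrt(reduced))  # typical deviation in units of sigma: 2.0
```

The square root in the last line is the "about twice its uncertainty" reading of a reduced chi^2 of 4.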

Voluminous tables of this probability as a function of reduced chi^2 and degrees of freedom are easy to find in textbooks on statistics and error analysis, or in the CRC. Here is a short table which will probably serve your purpose for any lab data:

Probability (in percent) for each reduced chi^2 (columns) and number of degrees of freedom d (rows):

  d   0.0   0.2   0.4   0.6   0.8   1.0   1.2   1.4   1.6   1.8   2.0   2.5   3.0
  1   100    65    53    44    37    32    27    24    21    18    16    11   8.3
  2   100    82    67    55    45    37    30    25    20    17    14   8.2   5.0
  3   100    90    75    61    49    39    31    24    19    14    11   5.8   2.9
  4   100    94    81    66    52    41    31    23    17    13   9.2   4.1   1.7
  5   100    96    85    70    55    42    31    22    16    11   7.5   2.9   1.0
  6   100    98    88    73    57    42    30    21    14   9.5   6.2   2.0   0.6
  8   100    99    92    78    60    43    29    19    12   7.2   4.2   1.1   0.2
 10   100   100    95    82    63    44    29    17    10   5.5   2.9   0.6   0.1
 15   100   100    98    88    68    45    26    14   6.5   2.9   1.2   0.1  <0.1
 20   100   100    99    92    72    46    24    11   4.3   1.5   0.5  <0.1  <0.1
 30   100   100   100    96    77    47    21   7.2   2.0   0.5   0.1  <0.1  <0.1
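If you need a value that falls between the entries of a table like this, the probability can also be computed directly. The sketch below assumes the tabulated quantity is the chance of getting a chi^2 at least as large as the observed one when the fit is correct, and it works only for a whole number of degrees of freedom. The name chi2_probability is hypothetical; the method is the standard recurrence for the incomplete gamma function, using only Python's math module.

```python
import math


def chi2_probability(chi2, dof):
    """Probability of a chi^2 at least this large, for integer dof >= 1."""
    if dof < 1:
        raise ValueError("dof must be a positive integer")
    if chi2 <= 0:
        return 1.0
    z = chi2 / 2.0
    # Starting values: Q(1/2, z) = erfc(sqrt(z)) and Q(1, z) = exp(-z).
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < dof / 2.0:
        q += z**a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Reduced chi^2 of 1.0 with d = 2 and d = 4, i.e. chi^2 = 2 and chi^2 = 4.
print(round(100 * chi2_probability(2.0, 2)))  # 37
print(round(100 * chi2_probability(4.0, 4)))  # 41
```

Remember to pass the full chi^2 (reduced chi^2 times d), not the reduced value.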

Now, let's go through a simple example of the process. The electrostatics experiment asks you to determine whether the deflection of an electroscope is proportional to the applied voltage. You dutifully take five data points:

[Figure: plot of the data with the best-fit line y = 0.6 + 0.0014x, R^2 = 0.86]

applied voltage (V)    deflection (degrees)
       1000                  2 +/- 1
       2000                  3 +/- 1
       3000                  6 +/- 1
       4000                  5 +/- 1
       5000                  8 +/- 1

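The voltage has an uncertainty, too, but it is so much smaller than the uncertainty on the deflection that I didn't bother to list it. To calculate the chi^2, we first need to calculate the values of the deflection predicted by the best-fit line, dfit = 0.6 + 0.0014 V, and compare them with the measured deflections d (here d is the deflection, not the number of degrees of freedom):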

V (volts)         1000    2000    3000    4000    5000
d (degrees)          2       3       6       5       8
dfit                 2     3.4     4.8     6.2     7.6
d - dfit             0    -0.4     1.2    -1.2     0.4
(d - dfit)^2         0    0.16    1.44    1.44    0.16
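The rows of this table are easy to reproduce. The sketch below takes the measured values and the best-fit line y = 0.6 + 0.0014x quoted with the plot; the variable names are my own.

```python
voltage = [1000, 2000, 3000, 4000, 5000]  # volts
deflection = [2, 3, 6, 5, 8]              # degrees, each +/- 1
intercept, slope = 0.6, 0.0014            # best-fit line quoted with the plot

d_fit = [intercept + slope * v for v in voltage]
residual = [d - f for d, f in zip(deflection, d_fit)]
squared = [r**2 for r in residual]

# Print each row rounded to two decimals to hide floating-point noise.
for label, row in (("dfit", d_fit), ("d - dfit", residual), ("(d - dfit)^2", squared)):
    print(label, [round(value, 2) for value in row])
```

The values are rounded only for display; the unrounded lists are what you would sum in the next step.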

Normally we would now divide each element of the last row by sigma^2, but since the uncertainty on each point is one degree, this has no effect. Summing all the contributions, we get a chi^2 of 3.2, and a reduced chi^2 of 3.2/3 = 1.07. We have three degrees of freedom because there are five data points and two free parameters in a linear fit. Looking this up in the table (the d = 3 row, between the 39% entry at 1.0 and the 31% entry at 1.2), we see that there is only about a 36% chance that these data are consistent with the best-fit line. Another way of putting this is that the data disagree with a linear relationship at roughly the one sigma level. This is not a large inconsistency, but it may prompt you to go back to the experiment; there may be a systematic error unaccounted for, or you may have just gotten unlucky. Taking more data will allow you to figure out which.
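The arithmetic of this last step can be checked with a short script. It is a sketch under two assumptions: the squared residuals are copied by hand from the last row of the table, and the probability uses a closed-form expression that is valid only for exactly three degrees of freedom.

```python
import math

squared_residuals = [0.0, 0.16, 1.44, 1.44, 0.16]  # (d - dfit)^2 from the table
sigma = 1.0                                        # degrees, same for every point
n_points, n_params = 5, 2                          # five data points, line has m and b

chi2 = sum(r / sigma**2 for r in squared_residuals)
dof = n_points - n_params
reduced = chi2 / dof

# Chance of a chi^2 at least this large; this formula holds for dof == 3 only.
probability = (math.erfc(math.sqrt(chi2 / 2))
               + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

print(f"chi^2 = {chi2:.2f}, reduced chi^2 = {reduced:.2f}, dof = {dof}")
print(f"probability = {100 * probability:.0f}%")
```

With these inputs the probability comes out at about 36%, which sits between the 39% and 31% entries of the d = 3 row in the table.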



Created by Ben Mathiesen

Last Updated by Andy Pawl
Sept. 2003
apawl@umich.edu