Does a Given Function Really Fit Your Data?

The problem stated above is one that crops up all the time in any experimental science. We can all do the trick of taking a ruler, drawing a line through our points, and judging whether or not the fit "looks good". This method is known as "chi-by-eye" because it's a crude way of applying the principles which I'm about to discuss. In general, a function is a "good fit" if:
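When we try to draw a curve through our data, most of us instinctively follow these criteria. The wonders of the Gaussian distribution, however, allow us to do more than this! To be rigorous, we should quantify the "goodness of fit" in some reasonable way, and then calculate the probability that the data are consistent with the fitting function. If you have error bars on all your data points, this isn't as hard as it sounds. The parameter which measures the "goodness of fit" is called χ² (chi-squared), and it is calculated like this:

$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i)}{\sigma_i} \right)^2$$

where y_i is the measured value at the i-th point, f(x_i) is the value the fitting function predicts there, σ_i is the uncertainty on that measurement, and N is the number of data points.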
In other words, you measure each point's deviation from the fit in units of its uncertainty, square it, and add up the results. Calculating it this way ensures that the same χ² contribution always corresponds to the same probability that a data point is consistent with the fit.
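The sum described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lab material: the function name `chi_squared` and the sample numbers are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def chi_squared(measured, predicted, sigma):
    """Sum of squared deviations, each in units of its own uncertainty."""
    if not (len(measured) == len(predicted) == len(sigma)):
        raise ValueError("all three sequences must have the same length")
    total = 0.0
    for y, f, s in zip(measured, predicted, sigma):
        # Deviation from the fit, measured in units of the error bar.
        total += ((y - f) / s) ** 2
    return total


# Hypothetical numbers for illustration only (not from any experiment).
measured = [1.0, 2.5, 2.5]
predicted = [1.0, 2.0, 3.0]
sigma = [0.5, 0.5, 0.5]

chi2_value = chi_squared(measured, predicted, sigma)
print(chi2_value)
```

Here the first point sits exactly on the fit and the other two are each one error bar away, so each of those two contributes 1 to the total.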
The χ² is commonly used by fitting programs to judge curves; the best fit is the one for which this parameter is smallest. To come up with a useful measure of the probability that a curve really represents the data, however, we need to take another step.

The general form of a mathematical function will have a number of free parameters, variables which can be adjusted to change the shape of the curve. The general form of a line, for example, is y = mx + b. When a computer tries to fit a line to a data set, it adjusts m and b until the χ² has reached a minimum value. If there are only two data points, this minimum value will always be zero, because two points define a line. The same thing happens if we try to fit three points to a quadratic curve, four points to a cubic, and so forth. Because of this, the first few data points don't give us any real information about the goodness of fit.

To acknowledge this problem, we introduce the concept of degrees of freedom. We say that any attempt to fit a function to a dataset has d = N - f degrees of freedom, where N is the number of data points and f is the number of free parameters in the function. To work out the probability that our function represents the data set, we use the "reduced χ²" parameter, which is just the χ² divided by the number of degrees of freedom. This is a way of sharing the total deviation of our curve evenly among the significant data points.
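To make the role of the degrees of freedom concrete, here is a rough sketch that fits a line by weighted least squares and reports the χ², the degrees of freedom, and the reduced χ². The helper names `fit_line` and `reduced_chi_squared` and all the data values are assumptions made up for this illustration; a real fitting program may minimize the χ² differently.

```python
def fit_line(x, y, sigma):
    """Weighted least-squares line y = m*x + b; returns (m, b)."""
    w = [1.0 / s ** 2 for s in sigma]
    s0 = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = s0 * sxx - sx ** 2
    m = (s0 * sxy - sx * sy) / delta
    b = (sxx * sy - sx * sxy) / delta
    return m, b


def reduced_chi_squared(x, y, sigma, n_free=2):
    """Return (chi2, degrees of freedom, reduced chi2 or None)."""
    m, b = fit_line(x, y, sigma)
    chi2 = sum(((yi - (m * xi + b)) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))
    dof = len(x) - n_free
    # With no degrees of freedom left, the reduced value is undefined.
    reduced = chi2 / dof if dof > 0 else None
    return chi2, dof, reduced


# Two hypothetical points: the line passes through both, so chi2 is zero.
chi2_two, dof_two, red_two = reduced_chi_squared([0, 1], [1, 3], [1, 1])

# Three hypothetical points: one degree of freedom remains.
chi2_three, dof_three, red_three = reduced_chi_squared(
    [0, 1, 2], [0, 1, 0], [1, 1, 1])

print(chi2_two, dof_two, red_two)
print(chi2_three, dof_three, red_three)
```

The two-point case shows why the first f points carry no information about the quality of the fit: the minimum χ² is zero no matter what the data are.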
Since the expected value of any single deviation is about one standard deviation, a good fit should have a reduced χ² close to 1. Voluminous tables of the probability that the data are consistent with the fit, as a function of reduced χ² and degrees of freedom, are easy to find in textbooks on statistics or error analysis, or in the CRC handbook. Here is a short table which will probably serve your purpose for any lab data:
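Where a printed table is not at hand, the same probability can be computed directly. The sketch below evaluates the chance of getting a χ² at least as large as the observed one, using the standard closed-form series for an integer number of degrees of freedom. The function name `chi2_probability` is hypothetical, and the grid of values printed at the end is only an example layout, not a reproduction of the original table.

```python
import math


def chi2_probability(chi2, dof):
    """P(chi-squared >= chi2) for a positive integer number of degrees of freedom."""
    if dof < 1 or chi2 < 0:
        raise ValueError("need dof >= 1 and chi2 >= 0")
    half = chi2 / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) times a finite power series in x/2.
        term = 1.0
        total = 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: complementary error function plus a finite series.
    total = math.erfc(math.sqrt(half))
    for i in range(1, (dof - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


# Example layout: rows are degrees of freedom, columns are reduced chi2.
reduced_values = (0.5, 1.0, 1.5, 2.0)
for dof in (1, 2, 3, 5, 10):
    row = [chi2_probability(r * dof, dof) for r in reduced_values]
    print(dof, " ".join(f"{p:.2f}" for p in row))
```

Note that the function takes the total χ², not the reduced value, which is why each reduced value is multiplied by the degrees of freedom before the call.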
Now, let's go through a simple example of the process. The electrostatics experiment asks you to determine whether the deflection of an electroscope is proportional to the applied voltage. You dutifully take five data points:
The voltage has an uncertainty, too, but it is so much smaller than the uncertainty on the deflection that I didn't bother to list it. To calculate the χ², we first need to calculate the predicted values of the deflection, d:
Normally we would now divide each element of the last row by σ², but since the uncertainty on each point is one degree, this has no effect. Summing all the contributions, we get a χ² of 3.2, and a reduced χ² of 3.2/3 = 1.07. We have three degrees of freedom because there are five data points and two free parameters in a linear fit. Looking this up on the table, we see that there is only about a 30% chance that these data are consistent with the best-fit line. Another way of putting this is that the data are inconsistent with a linear relationship at the one sigma level. This is not a large inconsistency, but it should prompt you to go back to the experiment; there may be a systematic error unaccounted for, or you may have just gotten unlucky. Taking more data will allow you to figure out which.
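The arithmetic in this example can be checked with a short script. This is only a sketch: the variable names are hypothetical, and the probability uses the closed-form tail of the chi-squared distribution that holds specifically for three degrees of freedom, so it should land close to, but not exactly on, the rough figure read from a coarse table.

```python
import math

n_points = 5   # data points taken in the example
n_free = 2     # free parameters of a line: slope and intercept
chi2 = 3.2     # summed chi-squared quoted in the text

dof = n_points - n_free
reduced = chi2 / dof

# Tail probability P(chi-squared >= chi2), valid for 3 degrees of freedom only.
half = chi2 / 2.0
p = math.erfc(math.sqrt(half)) + math.sqrt(2.0 * chi2 / math.pi) * math.exp(-half)

print(dof, round(reduced, 2), round(p, 2))
```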
Created by Ben Mathiesen