In this video I'm going to present three final statistical techniques. The first one I've already done was the t-test, which was for when you had numerical data for two different samples, two different treatments. The second one I've already done was the chi-square test, which was for when you had categorical data. In other words, something is A or B, or A or B or C.
And I've already done those two. Like I said, today we're going to do three more. They're sort of a potpourri of leftovers. For all three of these, I'm not going to show you how to do them mathematically. If you have a good graphing calculator it's pretty self-evident; you can just use the book and figure out how to do it. But I'm going to at least give you the idea of what they are, so that you can decide whether or not they're appropriate for your particular study.
The first of these three is called correlation. Correlation is between two variables, x and y. There's no implied or intended causality between x and y; they're just two variables. You put the data on a scatter plot, and the correlation coefficient, lowercase r, has no units. It's simply a unitless, pure number. It just tells you how much one variable goes up with the other.
In other words, if x goes up, does y go up? If, as this example way over here shows you, y always goes up as x goes up, you have the maximum correlation coefficient of 1.0. That is, it always goes up.
If, on the other hand, y goes down as x goes up, the correlation coefficient can go down to a minimum of negative 1.0. If there's really no relationship, the correlation coefficient is going to be basically 0. In other words, if x goes up and we can't tell what's going to happen to y, it's basically 0.
And here's an example I just made up, at about 0.65. I don't know the exact value offhand; you'd get it by entering the numbers into a calculator. As x goes up, y basically goes up. But you notice it's not perfect: here, as x went up, y went down. You get the idea. It goes up and down like that.
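If you'd rather let a computer do the arithmetic than a graphing calculator, here is a minimal sketch in Python, assuming the numpy and scipy libraries are available; the x and y numbers are made up purely for illustration.

# A minimal sketch of computing a correlation coefficient in Python.
# The x and y values are invented for illustration only.
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.0, 3.4, 3.1, 4.8, 4.5, 6.2, 6.0])

r, p_value = pearsonr(x, y)   # r is unitless, between -1 and +1
print(f"r = {r:.2f}")         # close to +1: y tends to rise with x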
I want to make sure you know again that correlation does not imply causality. Usually when we think of causality, we have some variable x, which we call the independent variable. And we think that that somehow causes y, the dependent variable. Well, that's not always true. You can have correlations between two variables, and not have causality at all.
For example, suppose you plotted, oh, I don't know, let's put, in honor of my lack of hair here, baldness on the x-axis, and income on the y-axis. I suspect there's a positive correlation between the two. In other words, you might think, oh, baldness causes wealth. Well, I can guarantee you that's not necessarily true. What's happening is this.
Behind both baldness and wealth there's a third variable, z, a confounding variable, which creates what I guess you'd call a spurious correlation. The confounding variable in this case is probably getting older. As you get older, you get more bald, and you probably make a little more money until you retire. Now that's correlation.
Correlation is relatively simple. I could show you the math; you don't even need to let the calculator do it. But sometimes we want to know more than whether two variables are correlated. We want to know not only whether they go up and down together or in opposite directions, but how much one variable goes up or down in proportion to the other. We call that regression.
There are different kinds of regression. Linear regression, straight-line regression, is based on the equation you know from algebra, y equals mx plus b, where the coefficient m is the slope of the line and b is the y-intercept. You're not going to see it written as y equals mx plus b in statistical packages, but you can figure it out: the coefficient on x is the slope.
So here's linear regression. Here's a really nice linear regression between x and y. It's not always true that x is the independent variable and y is the dependent variable. The independent variable is the thing that is independent and, in some sense, determines the dependent variable. Sometimes it's set up that way; sometimes there is no causality. It doesn't matter. In any case, you can get what's called a linear regression line. It's the line of best fit.
I'm sure you've seen that. It's basically a straight line that minimizes the distance from the points to the line. The method is called the method of least squares. That's what your calculator or your computer will do, the method of least squares. It takes the distances from the points to the line and minimizes the sum of their squares.
The computer finds the best-fit line really, really well. It's a lot better at this than you and I are. Humans guess, and we'll sometimes do it differently than the next person. In any case, the method of least squares gets the best-fitting line, whose slope is the m in y equals mx plus b, and whose y-intercept is the b.
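As a rough sketch of what the calculator is doing, here is a least-squares straight-line fit in Python, assuming scipy is available; the data points are invented just to show where the slope m and the intercept b come out.

# A minimal sketch of a least-squares straight-line fit, using scipy.
# The data points are invented for illustration only.
import numpy as np
from scipy.stats import linregress

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 3.1, 4.9, 7.2, 8.8, 11.1])

fit = linregress(x, y)
print(f"slope m = {fit.slope:.2f}")          # the m in y = mx + b
print(f"intercept b = {fit.intercept:.2f}")  # the b, the y-intercept
print(f"r = {fit.rvalue:.2f}")               # the correlation coefficient comes along for free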
And you use this in studies where you want to find out what happens to y as x goes up, a trend line over time, let's say. Or maybe you have age, or population, or something like that on the x-axis, and you want to find its effect on y.
There are certain things that come up with linear regression. One is anomalies, or outliers. So this looks like a really nice best-fit line, beautiful, but we've got this oddball here. That's called an outlier.
You know, sometimes outliers are really important, because they are anomalies we can look at to figure out what we're doing wrong. Other times, they're just a goof: somebody made a math error, or somebody made a measurement error. At that point you have to use your judgment as to whether or not to take the anomaly into account. One thing that is important, though, is that you acknowledge what you've done, so that other people looking at your study can evaluate it properly.
Sometimes regression is not linear; it's curvilinear. Like here: if you try to fit a straight line to this, you get what I call a pattern in the errors. Notice the errors up here, how far the points are from the line: they're too high, then too low, and then too high again. If you see that, it's probably a curvilinear relationship. Calculators give you all sorts of options to fit, say, a quadratic, a higher-order power, or an exponential. You can do that. It's still the same process: the computer or the calculator is minimizing the distance from the curve to the points.
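Here is a minimal sketch of a curvilinear fit, in this case a quadratic, using numpy's polyfit; the data are again invented, and least squares is now minimizing the distance to a curve rather than a line.

# A minimal sketch of a curvilinear (quadratic) fit with numpy's polyfit.
# The data are invented and chosen to grow roughly quadratically.
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.0, 1.8, 4.2, 8.9, 16.1, 24.8])

coeffs = np.polyfit(x, y, deg=2)   # fit y = a*x^2 + b*x + c
a, b, c = coeffs
print(f"y ~ {a:.2f}*x^2 + {b:.2f}*x + {c:.2f}")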
I just want to give you one forewarning. Let's imagine you have a bunch of points that fit a curve like this. Don't try to do that. That's a really high-power relationship. More likely than not, just because you found a really complex curve that fits your points, you haven't added anything to your analysis. Science looks for elegance, and elegance is simple and powerful. Unless you have a fundamental scientific reason to think so, a really complex relationship like that is unlikely to be the best curve to fit to your data.
Now, the last thing I want to do in this particular episode is something called the one-way analysis of variance, ANOVA. If you've seen the earlier episodes, I talked about t-tests. T-tests are for when you compare two samples, oh, I don't know, a sample of men and a sample of women for some characteristic. Each has a distribution, for the men and for the women, with an x-bar and a standard deviation.
And you want to compare this mean to this mean and see whether or not the difference is significant. That's a t-test for two samples. It's numerical data, and it gives you the level of significance. Sometimes, though, you have more than two samples: three treatment groups, four samples, whatever. It's still numerical, and you want to find out whether there are any significant differences.
That's where ANOVAs come in. Again, let your calculator do it. I'm not even going to pretend that I can do it manually. So here's an example. Let's make it practical.
Let's say you have fertilizer, and you want to try four different concentrations: 1x, 2x, 3x, 4x. You go out there and find the effect of the fertilizer on the productivity of your plot of land. For sample one, you get a distribution. Sample two, you get a distribution. Three and four, you can see here. I set it up so that 1x has the lowest productivity, then 2x, then 3x, then 4x.
In ANOVA, what you're doing is testing the null hypothesis that the populations from which these samples come all have the same mean. In other words, there's no treatment effect. Just a warning: when you do this on your calculator or your computer, one of the things you're going to come up with is something called a p-value.
I want to make sure you understand, because it's a little confusing. A low p-value means that there are differences; you want to reject H0. ANOVAs are powerful in the sense that they do a lot of work for you quickly. But you have to realize, when you look at an ANOVA result, it doesn't tell you which comparisons are important.
Imagine that the fertilizer doesn't have any effect until you reach this level, so that one slid to the right, but the rest are all at the same point. The ANOVA won't tell you that. ANOVA is just saying there's some difference somewhere. Then you have to use your good scientific reasoning to figure out which one.
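For reference, here is a minimal sketch of a one-way ANOVA for a fertilizer-style example, assuming scipy is available; the productivity numbers for the four concentrations are invented.

# A minimal sketch of a one-way ANOVA with four treatment groups, using scipy.
# The productivity values for each fertilizer concentration are invented.
from scipy.stats import f_oneway

plot_1x = [4.1, 4.5, 3.9, 4.3, 4.0]
plot_2x = [5.0, 5.4, 4.8, 5.2, 5.1]
plot_3x = [5.9, 6.3, 6.0, 5.7, 6.1]
plot_4x = [6.8, 7.2, 6.9, 7.1, 6.7]

f_stat, p_value = f_oneway(plot_1x, plot_2x, plot_3x, plot_4x)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value says reject H0: at least one group mean differs.
# It does not say which comparison is responsible for the difference.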
As part of the NIH-funded ASSET Program, students and teachers in middle and high school science classes are encouraged to participate in student-designed independent research projects. Veteran high school teacher Walter Peck, whose students regularly engage in independent research projects, presents this series of five videos to help teachers and students develop a better understanding of basic statistical procedures they may want to use when analyzing their data.