The (Pearson) correlation coefficient
{From the Institute of Phonetic Sciences (IFA): http://www.fon.hum.uva.nl/}
Characteristics:
A correlation describes the strength of an association between variables. An
association between variables means that the value of one variable can be
predicted, to some extent, from the value of the other. A correlation is a
special kind of association: there is a linear relation between the values of
the variables. A non-linear relation can often be transformed into a linear
one, e.g., by taking logarithms, before the correlation is calculated (see
the sketch below).
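A minimal Python sketch of such a transformation (the data and the
exponential form are made up for illustration):

  # An exponential relation y = c * exp(k*x) is non-linear in x, but
  # log(y) = log(c) + k*x is linear, so the correlation can be computed
  # on the pairs (x, log(y)) instead of (x, y).
  import math

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.7, 7.4, 20.1, 54.6, 148.4]   # roughly exp(x); hypothetical values
  log_y = [math.log(v) for v in y]    # (x, log_y) now has a linear relation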
For a set of variable pairs, the correlation coefficient gives the strength of
the association. The square of the correlation coefficient is the fraction
of the variance of one variable that can be explained by the variance of the
other variable. The relation between the variables is called the regression
line. The regression line is defined as the best-fitting straight line
through all value pairs, i.e., the one explaining the largest part of the
variance.
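This variance-explained interpretation can be checked directly; a
plain-Python sketch with hypothetical data:

  # R**2 equals 1 - SS_residual / SS_total: the fraction of the variance
  # of y that is accounted for by the best-fitting line.
  import math

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 7.8, 10.1]            # hypothetical pairs
  N = len(x)
  mx, my = sum(x) / N, sum(y) / N
  sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
  sxx = sum((u - mx) ** 2 for u in x)
  syy = sum((v - my) ** 2 for v in y)
  R = sxy / math.sqrt(sxx * syy)            # correlation coefficient
  a = sxy / sxx                             # regression slope
  b = my - a * mx                           # regression intercept
  ss_res = sum((v - (a * u + b)) ** 2 for u, v in zip(x, y))
  print(R ** 2, 1.0 - ss_res / syy)         # the two values agree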
The correlation coefficient is calculated under the assumption that both
variables are stochastic (i.e., bivariate Gaussian). If one of the variables
is deterministic, e.g., a time series or a series of doses, the analysis is
called regression analysis. In regression analysis, the interpretation of
the correlation coefficient differs from that in correlation analysis, and
tests of statistical significance can only be used when the conditional
probability distribution of the other variable is known or can be guessed.
The regression line, however, can still be used.
If the aim is only to prove a monotonic relation, i.e., that when one
variable increases the other always increases or always decreases, then the
Rank Correlation test is a better test (see the sketch below).
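A minimal sketch of this alternative, assuming SciPy is available (the data
are hypothetical):

  from scipy.stats import pearsonr, spearmanr

  x = [1, 2, 3, 4, 5]
  y = [1, 4, 9, 16, 25]           # monotonic but non-linear (y = x**2)

  rho, p_rank = spearmanr(x, y)   # rho == 1.0: perfectly monotonic
  r, p_lin = pearsonr(x, y)       # r is about 0.98: close to, but not, 1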
H0:
The values of the members of the pairs are uncorrelated, i.e., there are no
linear dependencies.
Assumptions:
The values of both members of the pairs are (bivariate) Normally distributed.
Scale:
Interval
Procedure:
The correlation coefficient R of the pairs (x, y) is calculated as:
R = { Sum(x*y) - Sum(x)*Sum(y)/N } /
    sqrt( { Sum(x**2) - Sum(x)**2/N } * { Sum(y**2) - Sum(y)**2/N } )
The regression line y = a*x + b is calculated as:
a = { Sum(x*y) - Sum(x)*Sum(y)/N } / { Sum(x**2) - Sum(x)**2/N }
b = Sum(y)/N - a * Sum(x)/N
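The procedure translates directly into code; a plain-Python sketch of the
raw-sum formulas above (the data are hypothetical):

  import math

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.0, 4.1, 5.9, 8.2, 9.8]   # hypothetical pairs
  N = len(x)
  sx, sy = sum(x), sum(y)
  sxy = sum(u * v for u, v in zip(x, y))
  sxx = sum(u ** 2 for u in x)
  syy = sum(v ** 2 for v in y)

  R = (sxy - sx * sy / N) / math.sqrt((sxx - sx ** 2 / N) *
                                      (syy - sy ** 2 / N))
  a = (sxy - sx * sy / N) / (sxx - sx ** 2 / N)   # regression slope
  b = sy / N - a * sx / N                         # regression intercept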
Level of Significance:
The value of t = R * sqrt( (N - 2) / (1 - R**2) ) has a
Student-t distribution with Degrees of Freedom = N - 2.
Approximation:
If the Degrees of Freedom > 30, the distribution of t can be approximated by
a Standard Normal Distribution.
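A sketch of the significance computation; the two-sided p-value below
assumes SciPy for the Student-t distribution:

  import math
  from scipy.stats import t as student_t

  def pearson_p_value(R, N):
      # t = R * sqrt((N - 2) / (1 - R**2)), with N - 2 degrees of freedom
      t_stat = R * math.sqrt((N - 2) / (1 - R ** 2))
      return 2.0 * student_t.sf(abs(t_stat), N - 2)   # two-sided p-value

  # e.g. pearson_p_value(0.9, 10) gives roughly 4e-4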
Remarks:
This could be called the most misused of statistical procedures. It can show
that two variables are connected; it cannot show that two variables are not
connected. If one variable depends on another, i.e., there is a causal
relation, then it is always possible to find some kind of correlation
between the two variables. However, if both variables depend on a third,
they can show a sizable correlation without any causal dependency between
them. A famous example is the fact that the positions of the hands of all
clocks are correlated, without one clock causing the positions of the
others. Another example is the significant correlation between human birth
rates and stork population sizes.
WARNING: the level of significance given here is only an approximation;
take care when using it (use a table if necessary).