There are several common pitfalls in using correlation. Correlation is symmetrical, not providing evidence of which way causation flows. If other variables also cause the dependent variable, then any covariance they share with the given independent variable in a correlation may be falsely attributed to that independent. Also, to the extent that there is a nonlinear relationship between the two variables being correlated, correlation will understate the relationship. Correlation will also be attenuated to the extent there is measurement error, including use of sub-interval data or artificial truncation of the range of the data. Correlation can also be a misleading average if the relationship varies depending on the value of the independent variable ("lack of homoscedasticity"). And, of course, atheoretical or post-hoc running of many correlations runs the risk that 5% of the coefficients may be found significant by chance alone.
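The attenuation due to nonlinearity is easy to demonstrate. The following Python sketch (numpy assumed; the data are artificial) correlates x with a perfectly linear function of x and with a perfectly deterministic but quadratic function of x; the quadratic relationship is perfect, yet its Pearson r is near zero.

import numpy as np

x = np.linspace(-3, 3, 101)      # independent variable over a symmetric range
y_linear = 2 * x + 1             # perfectly linear relationship
y_quadratic = x ** 2             # perfectly deterministic, but nonlinear

print(round(np.corrcoef(x, y_linear)[0, 1], 3))     # 1.0
print(round(np.corrcoef(x, y_quadratic)[0, 1], 3))  # about 0.0: r misses the relationship entirely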
Besides Pearsonian correlation (r), the most common type, there are other special types of correlation to handle the special characteristics of such variables as dichotomies, and there are other measures of association for nominal and ordinal variables. Regression procedures produce multiple correlation, R, which is the correlation of multiple independent variables with a single dependent. There is also partial correlation, which is the correlation of one variable with another, controlling both the given variable and the dependent for a third or additional variables. And there is part correlation, which is the correlation of one variable with another, controlling only the given variable for a third or additional variables. These are discussed separately.
Note: In older works, predating the prevalence of computers, special formulas were used to compute correlation by hand. For certain types of variables, notably dichotomies, these computational formulas differed from one another (e.g., the phi coefficient for two dichotomies, point-biserial correlation for an interval variable with a dichotomy). Today, however, SPSS will calculate the exact correlation regardless of whether the variables are continuous or dichotomous.
Significance of correlation coefficients is discussed below in the frequently asked questions section.
Spearman's rho is computed as rho = 1 - [6*SUM(d2)] / [n(n2 - 1)], where d is the difference in ranks for each case and n is the number of cases. In SPSS, choose Analyze, Correlate, Bivariate; check Spearman's rho.
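Outside SPSS, Spearman's rho is available in any statistics library. The Python sketch below (scipy assumed; the scores are hypothetical) computes it both with scipy and by the rank-difference formula above; the two agree when there are no ties.

import numpy as np
from scipy.stats import spearmanr, rankdata

x = np.array([56, 75, 45, 71, 62, 64, 58, 80, 76, 61])   # hypothetical scores, no ties
y = np.array([66, 70, 40, 60, 65, 56, 59, 77, 67, 63])

rho, p = spearmanr(x, y)                    # library computation (also handles ties)

d = rankdata(x) - rankdata(y)               # difference in ranks for each case
n = len(x)
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(round(rho, 4), round(rho_formula, 4)) # identical here because there are no ties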
Dichotomies
However, when the continuous variable is ordered perfectly from low to high, then even when the dichotomy is also ordered as perfectly as possible to match it, r will be less than 1.0, and resulting r's must be interpreted accordingly. Specifically, point-biserial correlation will reach 1.0 only for datasets with just two cases, and will have a maximum of around .85 even for large datasets when the independent is normally distributed. The value of r may approach 1.0 when the continuous variable is bimodal and the dichotomy is a 50/50 split. Unequal splits in the dichotomy and curvilinearity in the continuous variable both depress the maximum possible point-biserial correlation even under perfect ordering. Moreover, if the dichotomy represents a true underlying continuum, correlation will be attenuated compared to what it would be if that continuum were measured as a continuous variable.
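A short Python sketch (numpy and scipy assumed; the data are artificial) illustrates both points: point-biserial r is simply Pearson's r computed with a 0/1 dichotomy, and even a perfectly ordered median split of a continuous variable yields r well below 1.0.

import numpy as np
from scipy.stats import pointbiserialr

continuous = np.arange(1, 101, dtype=float)                 # perfectly ordered continuous variable
dichotomy = (continuous > continuous.mean()).astype(int)    # 50/50 split matching the ordering

r_pb, p = pointbiserialr(dichotomy, continuous)
r_pearson = np.corrcoef(dichotomy, continuous)[0, 1]

print(round(r_pb, 3), round(r_pearson, 3))   # about .87 for this uniform variable, not 1.0, and the two values are identical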
Note that tetrachoric correlation matrices in SEM often produce very inflated chi-square values and underestimated standard errors of estimates due to larger variability than Pearson's r. Moreover, tetrachoric correlation can yield a nonpositive definite correlation matrix because eigenvalues may be negative (reflecting violation of normality, sampling error, outliers, or multicollinearity of variables). These problems may lead the researcher away from SEM altogether, in favor of analysis using logit or probit regression.
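Whether a correlation matrix is positive definite can be checked directly from its eigenvalues. A minimal numpy sketch, using a hypothetical 3 x 3 matrix constructed to be internally inconsistent in the way a tetrachoric procedure can produce:

import numpy as np

R = np.array([[1.00, 0.95, 0.10],      # hypothetical "correlation" matrix whose
              [0.95, 1.00, 0.90],      # entries are mutually inconsistent
              [0.10, 0.90, 1.00]])

eigenvalues = np.linalg.eigvalsh(R)    # eigenvalues of the symmetric matrix
print(np.round(eigenvalues, 4))

if np.any(eigenvalues < 0):
    print("Nonpositive definite: SEM estimation on this matrix may fail or mislead.")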
Linearity can be checked visually by plotting the data. In SPSS, select Graphs, Scatter/Dots; select Simple Scatter; click Define; let the independent be the x-axis and the dependent be the y-axis; click OK. One may also view many scatterplots simultaneously by asking for a scatterplot matrix: in SPSS, select Graphs, Scatter/Dots, Matrix, Scatter; click Define; move any variables of interest to the Matrix Variable list; click OK.
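The same visual check can be done outside SPSS; a brief Python sketch (pandas and matplotlib assumed; the file and variable names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

df = pd.read_csv("survey.csv")               # hypothetical data file

df.plot.scatter(x="education", y="income")   # independent on the x-axis, dependent on the y-axis
plt.show()

scatter_matrix(df[["education", "income", "age"]], diagonal="hist")   # scatterplot matrix
plt.show()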
Even the less conservative rule is very stringent when testing many coefficients. For instance, if one is testing 50 coefficients, the highest coefficient should be tested at the .05/50 = .001 level. Since relationships in social science are often not strong and the researcher often cannot amass large samples, such a test carries a high risk of a Type II error (concluding a correlation is not significant when a true relationship exists). In reality, most researchers simply apply .05 across the board regardless of the number of coefficients, but one should realize that about 1 in 20 coefficients is apt to be found significant by chance alone when using the customary 95% confidence level. This is mainly a danger when doing post hoc analysis without a priori hypotheses to be tested.
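A minimal Python sketch of this Bonferroni-style adjustment (the 50-coefficient scenario is the one from the text; the p-values shown are hypothetical):

alpha = 0.05
n_tests = 50
per_test_alpha = alpha / n_tests                  # .001, as in the example above

p_values = [0.0004, 0.004, 0.020, 0.049, 0.300]   # hypothetical p-values for 5 of the 50 coefficients
for p in p_values:
    verdict = "significant" if p <= per_test_alpha else "not significant"
    print(p, verdict, "at the adjusted", per_test_alpha, "level")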
Fisher's z' transformation is z' = ln[(1 + r)/(1 - r)]/2, so in this case, for r = .30,
z' = ln(1.3/.7)/2 = ln(1.8571)/2 = .6190/2 = .3095 = the value shown in the table below
Conversion of r to Fisher's z':

 r      z'        r      z'        r      z'        r      z'        r      z'
0.00   0.0000    0.20   0.2027    0.40   0.4236    0.60   0.6931    0.80   1.0986
0.01   0.0100    0.21   0.2132    0.41   0.4356    0.61   0.7089    0.81   1.1270
0.02   0.0200    0.22   0.2237    0.42   0.4477    0.62   0.7250    0.82   1.1568
0.03   0.0300    0.23   0.2342    0.43   0.4599    0.63   0.7414    0.83   1.1881
0.04   0.0400    0.24   0.2448    0.44   0.4722    0.64   0.7582    0.84   1.2212
0.05   0.0500    0.25   0.2554    0.45   0.4847    0.65   0.7753    0.85   1.2562
0.06   0.0601    0.26   0.2661    0.46   0.4973    0.66   0.7928    0.86   1.2933
0.07   0.0701    0.27   0.2769    0.47   0.5101    0.67   0.8107    0.87   1.3331
0.08   0.0802    0.28   0.2877    0.48   0.5230    0.68   0.8291    0.88   1.3758
0.09   0.0902    0.29   0.2986    0.49   0.5361    0.69   0.8480    0.89   1.4219
0.10   0.1003    0.30   0.3095    0.50   0.5493    0.70   0.8673    0.90   1.4722
0.11   0.1104    0.31   0.3205    0.51   0.5627    0.71   0.8872    0.91   1.5275
0.12   0.1206    0.32   0.3316    0.52   0.5763    0.72   0.9076    0.92   1.5890
0.13   0.1307    0.33   0.3428    0.53   0.5901    0.73   0.9287    0.93   1.6584
0.14   0.1409    0.34   0.3541    0.54   0.6042    0.74   0.9505    0.94   1.7380
0.15   0.1511    0.35   0.3654    0.55   0.6184    0.75   0.9730    0.95   1.8318
0.16   0.1614    0.36   0.3769    0.56   0.6328    0.76   0.9962    0.96   1.9459
0.17   0.1717    0.37   0.3884    0.57   0.6475    0.77   1.0203    0.97   2.0923
0.18   0.1820    0.38   0.4001    0.58   0.6625    0.78   1.0454    0.98   2.2976
0.19   0.1923    0.39   0.4118    0.59   0.6777    0.79   1.0714    0.99   2.6467
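Rather than looking z' up in the table, it can be computed directly; a minimal Python sketch (Python's math.atanh is exactly this transformation):

import math

def fisher_z(r):
    # Fisher's z': ln[(1 + r)/(1 - r)]/2
    return math.log((1 + r) / (1 - r)) / 2

print(round(fisher_z(0.30), 4))     # 0.3095, matching the table entry for r = .30
print(round(math.atanh(0.30), 4))   # same value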
t = [r*SQRT(n-2)]/[SQRT(1-r2)]
where r is the absolute value of the correlation coefficient and n is sample size, and where one looks up the t value in a table of the distribution of t, for (n - 2) degrees of freedom. If the computed t value is as high or higher than the table t value, then the researcher concludes the correlation is significant (that is, significantly different from 0). In practice, most computer programs compute the significance of correlation for the researcher without need for manual methods. By default, the test is two-tailed.
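A Python sketch of this test (scipy assumed; the values of r and n are hypothetical):

import math
from scipy import stats

r = 0.45    # hypothetical correlation coefficient
n = 30      # hypothetical sample size

t = (abs(r) * math.sqrt(n - 2)) / math.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(t, df=n - 2)     # two-tailed p-value from the t distribution with n - 2 df

print(round(t, 3), round(p, 4))

Note that scipy.stats.pearsonr returns the same two-tailed p-value directly when given the raw data.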
SE = SQRT[1/(n1 - 3) + 1/(n2 - 3)]
where n1 and n2 are the sample sizes of the two independent samples.
Example. Let a sample of 15 males have a correlation of income and education of .60, and let a sample of 20 females have a correlation of .50. We wish to test if this is a significant difference. The z-score conversions of the two correlations are .6931 and .5493 respectively, for a difference of .1438. The SE estimate is SQRT[(1/12) + (1/17)] = SQRT[.1422] = .3770. The z value of the difference is therefore .1438/.3770 = .381, much smaller than 1.96 and thus not significant at the .05 level. (Source: Hubert Blalock, Social Statistics, New York: McGraw-Hill, 1972: 406-407.)
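The same test is easy to script; a Python sketch reproducing the worked example (scipy assumed):

import math
from scipy import stats

def independent_r_difference(r1, n1, r2, n2):
    # z test for the difference between correlations from two independent samples
    z1, z2 = math.atanh(r1), math.atanh(r2)            # Fisher z' of each correlation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))                      # two-tailed p-value
    return z, p

z, p = independent_r_difference(0.60, 15, 0.50, 20)    # the male/female example above
print(round(z, 3), round(p, 3))                        # z is about .38, not significant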
t = (rxy - rzy)* SQRT[{(n - 3)(1 + rxz)}/ {2(1 - rxy2 - rxz2 - rzy2 + 2rxy*rxz*rzy)}]
If the computed t value is as great or greater than the cutoff value in the t-table, then the difference in correlations is significant at that level (the t-table will have various cutoffs for various significance levels such as .05, .01, .001). (Source: Hubert Blalock, Social Statistics, New York: McGraw-Hill, 1972: 407.)
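A Python sketch of this test of two correlations sharing a common variable, computed on the same sample (scipy assumed; the correlations and n are hypothetical, and n - 3 degrees of freedom are assumed for the t lookup):

import math
from scipy import stats

def dependent_r_difference(rxy, rzy, rxz, n):
    # t test for rxy vs. rzy, two correlations with a common variable y from one sample of size n
    det = 1 - rxy**2 - rxz**2 - rzy**2 + 2 * rxy * rxz * rzy
    t = (rxy - rzy) * math.sqrt(((n - 3) * (1 + rxz)) / (2 * det))
    p = 2 * stats.t.sf(abs(t), df=n - 3)               # two-tailed p-value, df = n - 3 assumed
    return t, p

t, p = dependent_r_difference(rxy=0.50, rzy=0.30, rxz=0.40, n=103)   # hypothetical values
print(round(t, 3), round(p, 4))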
For tests of the difference between two dependent correlations or the difference between more than two independent correlations, see Chen and Popovich (2002) in the bibliography.
Variance as well as correlation must be taken into account in validation. Two proposed measures may have identical correlations with the validation measure (with amount worked in the example above), but this does not mean the two proposed measures are equal. It is perfectly possible for the two measures to differ, even substantially, in variance. Correlation only shows that the proposed measure and the validation measure go up and down together to a degree reflected by the correlation coefficient. Correlation does not say that the spread up and down will be the same for equal correlations. For two proposed measures with the same correlation, the one with lower variance would be preferred.
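A tiny numpy sketch (artificial data) makes the point: rescaling a proposed measure leaves its correlation with the validation measure unchanged while its variance changes dramatically.

import numpy as np

rng = np.random.default_rng(0)
criterion = rng.normal(size=200)            # hypothetical validation measure
noise = rng.normal(size=200)

measure_a = criterion + noise               # proposed measure A
measure_b = 5 * (criterion + noise)         # proposed measure B: same pattern, five times the spread

r_a = np.corrcoef(measure_a, criterion)[0, 1]
r_b = np.corrcoef(measure_b, criterion)[0, 1]
print(round(r_a, 3), round(r_b, 3))                               # identical correlations
print(round(measure_a.var(), 2), round(measure_b.var(), 2))       # very different variances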
Copyright 1998, 2008 by G. David Garson.
Last update 1/24/08.