21 Apr 2011
Correlation and Covariance
This post will illustrate how to:
- Create a correlation matrix of variables using the correlate command.
- Display a correlation matrix as a covariance matrix.
- Obtain the statistical significance of a correlation using the pwcorr command.
Correlation Matrix
We’ll use the auto dataset for this tutorial.
We’ll create a correlation matrix of four variables – price, mpg, weight, and length.
correlate price mpg weight length
Note: We can shorten the correlate command to corr for convenience.
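For example, the abbreviated form below should produce the same correlation matrix (this shortened call is just an illustration of the abbreviation, not a separate example from the original post):
corr price mpg weight length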
Covariance Matrix
If we want to create a covariance matrix, we simply add the covariance option to the correlate command.
correlate price mpg weight length, covariance
Statistical Significance of a Correlation
The correlate command produces a clean correlation matrix (or covariance matrix with the covariance option). If we want to see the statistical significance of a correlation, we need to use the pwcorr command with the sig option.
pwcorr price mpg weight length, sig
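If we also want significant correlations flagged in the table, pwcorr accepts a star() option; for example (the .05 threshold here is my own choice, not part of the original post):
pwcorr price mpg weight length, sig star(.05)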
21 Apr 2011
Multiple Regression
This post will:
- Show how to extend bivariate regression to include multiple predictor variables.
- Show how to manually create partial and semipartial correlations using residuals from a regression model.
- Show how to use the pcorr command to obtain partial and semipartial correlations.
Multiple Predictors
We will again use the auto dataset.
Suppose we want to regress price on three variables – mpg, weight, and foreign. To add three predictors, we simply add the variables we want after the dependent variable.
regress price mpg weight foreign
This produces the standard regression output, with the ANOVA source table, fit statistics, and the coefficient table.
Partial and Semipartial Correlations - Manual Method
Recall that a partial correlation is the relationship between x and y once the shared variance between x and x2 has been removed from x and once the shared variance between y and x2 has been removed from y. A semipartial correlation is similar except that we only remove the shared variance between x and x2 (i.e., y remains untouched). Note: Although I’ve only referenced x2, we can in principle include many control variables as our example will show.
Suppose we want to obtain the partial correlation between price and mpg controlling for weight and foreign. To do this we need the part of price that is independent of weight and foreign. We also need the part of mpg that is independent of weight and foreign. We can get this information with residuals.
The part of price independent of weight and foreign
To obtain the part of price independent of weight and foreign we regress price on weight and foreign.
regress price weight foreign
We then save the residuals for price. We'll call this variable priceres.
predict priceres, residuals
We now have a new variable in our dataset called priceres.
summarize priceres, detail
The part of mpg that is independent of weight and foreign
At this point, we need to repeat what we did above, substituting mpg for price. We regress mpg on weight and foreign.
regress mpg weight foreign
We save the residuals for mpg. We’ll call this mpgres (not very original, I know).
predict mpgres, residuals
We now have a new variable in our dataset called mpgres.
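As with priceres, we could inspect the new variable (this summarize call simply parallels the earlier step and is my addition):
summarize mpgres, detail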
Partial correlation
We now have everything we need for the partial correlation between price and mpg controlling for weight and foreign. Specifically, we have priceres – which is the part of price that is independent of weight and foreign. We also have mpgres – which is the part of mpg that is independent of weight and foreign. Therefore, to obtain the partial correlation we simply need to correlate priceres and mpgres, as shown below.
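A minimal version of that step, using the residual variables we just created, would be:
correlate priceres mpgres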
Semipartial correlation
We also already have everything we need for the semipartial correlation. Recall that for the semipartial correlation we only remove the shared relationship between x and the x2 (or set of covariates). We don’t do anything with y. In our example price is our y variable. So to compute the semipartial correlation we correlate price (i.e., the “untouched” y variable) and mpgres (i.e., the part of mpg that is independent of weight and foreign).
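Again, a minimal version of that step (my own illustration of the correlation described above) would be:
correlate price mpgres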
Partial and Semipartial Correlations - pcorr
Thankfully Stata has a built-in command for computing partial and semipartial correlations – pcorr. To obtain the partial and semipartial correlations, we type:
pcorr price mpg weight foreign
Note that the first variable listed is considered the y variable. All other variables are considered x variables. Stata reports as many partial and semipartial correlations as there are x variables. Additionally, Stata reports the squared partial and squared semipartial correlations. These are interpreted as the proportion of shared variance between y and x controlling for the other x variables. The partial and semipartial correlations listed for mpg are the same as what we found above.
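As an aside (my own illustration, not from the original post), listing mpg first would instead treat mpg as the y variable and report its partial and semipartial correlations with each remaining variable:
pcorr mpg price weight foreign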
19 Apr 2011
Bivariate Regression
In this post we’ll use the system dataset auto.
To estimate the model we use the regress command in the command window. The regress command follows the general format of regress dv iv, options. Type help regress or visit the online help for regress for a description of the options available. For example, the regression of price on mpg is estimated as follows:
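regress price mpg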
The output includes:
- The ANOVA source table
- Descriptive statistics and effect sizes
- Coefficients, hypothesis tests, and confidence intervals
Standardized Coefficients
Suppose we would like Stata to report standardized coefficients. To get standardized coefficients we add the beta
option to our command.
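For the bivariate example above, the command would look like this (reusing the price/mpg model from the regression post):
regress price mpg, beta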
Visualizing Regression Lines
We can visualize the relationship between two variables with a scatterplot. Stata’s graphics provide several useful commands for including regression lines on a scatterplot. We’ll discuss the lfit and lfitci commands.
To produce a scatterplot between price (y-axis) and mpg (x-axis), we use the graph twoway scatter command.
graph twoway scatter price mpg
Now let’s add the regression line to the plot. The lfit graph command allows us to do this (lfit stands for linear fit). However, we don’t want the regression line in isolation. We want it on top of the scatterplot. Stata lets you combine twoway graphs in one of two ways: (1) using parentheses or (2) using pipes. To add the regression line with parentheses, we type:
graph twoway (lfit price mpg) (scatter price mpg)
The first set of parentheses is the regression line and the second is the scatterplot. This produces a scatterplot of price against mpg with the regression line overlaid.
To add the regression line with pipes (this produces an identical plot as above), we type:
graph twoway lfit price mpg || scatter price mpg
It can be nice to include confidence intervals on the plot. To do this we simply change the lfit command to lfitci, where the ci refers to confidence interval.
graph twoway lfitci price mpg || scatter price mpg
24 Nov 2010
Graphing Distributions
This post will demonstrate how to:
- Use the twoway function plotting command to visualize distributions
- Add colored shading to a graph to visualize portions of a distribution
The twoway function command
The twoway function plotting command is used to plot functions, such as y = mx + b. If we want to plot the density of a normal distribution across a range of x values, we type y=normalden(x). You can also include graphing options available to twoway plots (e.g., xtitle).
twoway function y=normalden(x), range(-4 4) xtitle("{it: x}") ///
ytitle("Density") title("Standard Normal Distribution")
Suppose we want to shade parts of a distribution above (or below) a particular critical value. For example, we can shade a normal distribution above 1.96 and below -1.96 if we want critical values for a two-tailed test with an alpha-level of .05. To do this we will draw 3 graphs.
- A normal curve from -4 to -1.96
- A normal curve from -1.96 to 1.96
- A normal curve from 1.96 to 4
The choice of -4 and 4 as upper and lower bounds is arbitrary. You can connect the three graphs by using a double pipe, ||, between calls to the twoway function command. We will shade the area under the curve for #1 and #3 using the recast(area) option of twoway function. We will set the shading color to dark navy blue using the color(dknavy) option. We will leave the area under the curve for #2 unshaded.
twoway function y=normalden(x), range(-1.96 1.96) color(dknavy) || ///
function y=normalden(x), range(-4 -1.96) recast(area) color(dknavy) || ///
function y=normalden(x), range(1.96 4) recast(area) color(dknavy) ///
xtitle("{it: x}") ///
ytitle("Density") title("Critial Values for Standard Normal") ///
subtitle("Two-tailed test and {&alpha}=.05") ///
legend(off) xlabel(-1.96 0 1.96)
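As a quick check (my addition, not part of the original post), the 1.96 cutoff can be recovered with Stata's invnormal() function:
display invnormal(.975)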
We can repeat for a one-tailed test.
twoway function y=normalden(x), range(-4 1.64) color(dknavy) || ///
function y=normalden(x), range(1.64 4) recast(area) color(dknavy) ///
xtitle("{it: x}") ///
ytitle("Density") title("Critial Values for Standard Normal") ///
subtitle("One-tailed test and {&alpha}=.05") ///
legend(off) xlabel(0 1.64)
We can also visualize other distributions available in Stata. Below, I provide an example of a t-distribution with 20 degrees of freedom.
twoway function y=tden(20,x), range(-2.09 2.09) color(dknavy) || ///
function y=tden(20,x), range(-4 -2.09) recast(area) color(dknavy) || ///
function y=tden(20,x), range(2.09 4) recast(area) color(dknavy) ///
xtitle("{it: x}") ///
ytitle("Density") title("Critial Values for {it: t}-distribution with 20 df") ///
subtitle("Two-tailed test and {&alpha}=.05") ///
legend(off) xlabel(-2.09 0 2.09)
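Similarly (again my own addition), the 2.09 cutoff used above can be checked with the invttail() function, which returns the t value with the given upper-tail probability:
display invttail(20, .025)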