Using Stata to "Reshape" Data with Many Variables of Interest

When we have a data file that is in "wide" format and has multiple variables in it, the reshaping process is more complicated, and we essentially have to reshape the data twice to get it into our desired, "long" format. Say that we have a data file that looks like the following:

Country
Variable
1981
1982
1983
Austria
XYZ
25
10
8
Belgium
XYZ
14
30
17
Denmark
XYZ
2
18
4
Austria
EFG
0
0
1
Belgium
EFG
1
1
1
Denmark
EFG
0
1
0

Here, there are several variables stacked vertically and indexed by country, and each year column has values for three different variables for each country. Now, it will require a few extra steps to reshape the data into a time-series format. We need to use the -reshape wide- command to separate out the stacked variables into columns by year and then use the -reshape long- command to sort out the country-year values and return to long format. Essentially, to obtain time-series data, we have to switch out the year columns for variable columns so the data are indexed by country-year instead of by country-variable, with the end result looking something like this:

Country
Year
XYZ
EFG
Austria
1981
25
0
Austria
1982
10
0
Austria
1983
8
1
Belgium
1981
14
1
Belgium
1982
30
1
Belgium
1983
17
1
Denmark
1981
2
0
Denmark
1982
18
1
Denmark
1983
14
0

Say that we have imported a text file with data into Stata and that the file contains multiple variables, such as this example:



Click here for full-size image

Here, we will first need to rename the column names so that they more clearly identify the variables and the years. Thus, we will need to enter commands along the lines of the following:

rename v1 variable
rename v2 country
rename v3 value1989, etc.

Note that column names must include a string component for Stata to accept them as "legal" variable name for use with the reshape command.

Next, if the first line of data contains column labels, we must delete it as was shown in the previous section (See -drop if _n==1- above).

In our data file here, the names of the variables we want to create come from values of the "variable" column in our current file. So, to get the data into our desired format, we also need to use the -replace- command with "if" statements so as to change the values listed in the "variable" column to the names you want the variables to be when they become the column names in our new, reshaped file. For each variable, you must use an "if" statement that refers to the relevant value(s) to be changed. Remember the following: (1) there should not be any spaces in the variable names; (2) for older versions of Stata (6 or below) the program only allows 8 characters for variable names. In the following example, 'variable' is the name of the column that identifies which variable's value is given in a particular row:

replace variable="gdp" if variable=="Gross national income (constant LCU)"

In this example, Stata will look for any case in which the value of "variable" equals "Gross national income (constant LCU)" and will replace that value with "gdp". We will need to repeat this command for each variable name listed in the "variable" column.

Once you have completed the renaming and replacing, the dataset should look like this:



Click here for full-size image

At this point, the -reshape wide- command is issued. This command takes the 'variable' variable and matches it up with each 'valueyear' column, leaving only the country as the identifying variable. With the example above, this will result in 12 observations (one for each country) and 3 times the number of columns in the original dataset.

To reshape the file, you must specify what shape (long or wide) you want the resultant file to be. You must also list the variables to be reshaped. With our example data file, these variables will be the 'valueyear' columns because those are the columns that will be changed. Then, after a "," we must specify the i() and j() variables. Here, we are reshaping from long to wide; thus, the "i" variable is the variable that will uniquely identify cases in the wide datafile, while the "j" variable will provide the suffixes for the reshaped variables. For our example datafile, the command will look like this:

reshape wide value1989 value1990 value1991 value1992 value1993 value1994 value1995 value1996 value1997 value1998 value1999 value2000, i(country) j(variable) string

[Note - in this example, we have typed "string" at the end of the command because the j() variable has string (alpha-numeric) values; this option is necessary when the variable is non-numeric.]

This command will create output that looks like the following:



Click here for full-size image

And a dataset that looks like this:



Click here for full-size image

The next step is to rename all of the variables from 'valueyearvariable' to 'variableyear'

rename value1989gdp gdp1989
rename value1990gdp gdp1990, etc

so that the data file looks like this:



Click here for full-size image

These commands rename the column names so that we can separate out the years from the variables when we reshape the data from wide to long format. Then, we will have created the long, time-series format that we want for analysis.

The final command is the -reshape long- command that changes the dataset into a country-year format. The command syntax will be something like this generic example:

reshape long [variable list], i(identifying variable) j(subobservation)

When executing the -reshape long- command, the variable-list portion lists the new variables that will be created from the old columns. For this list, we need to list the non-numeric portion of each column name (i.e. we do not include the years in the new variable names). The "i" variable will identify the "cluster" variable in the time-series ('country'), and the "j" variable will identify unique observations within each cluster ('year'). This "j" variable is not actually in our wide file; it will instead be created by the reshape process as part of the new long file. Its values will come from the numeric suffixes that are part of the variable names in the file that we are reshaping.

For our specific datafile, we will type the following:

reshape long gdp growth pop, i(country) j(year)

There are three variables listed in the example above: gdp, growth and pop (which are for GDP, percent population growth and total population, respectively). These variables will be created by the reshape process. The values for these variables will come from the columns with prefixes that equal the new variable names. For instance, all the values for the gdp variable will come from those columns whose names being with "gdp" - gdp1989, gdp1990, and so on. The "year" variable, meanwhile, will be created from the numeric suffixes of variable names in the wide file that we are reshaping. The values for a variable in a given year (e.g. gdp in 1989) in the new file will thus come from the appropriate column in the old file (e.g. gdp1989). Note that including numbers as part of the variable name that are not related to the year can be problematic, so avoid labeling your variables with numbers. If need be, you can change the names after the reshape for clarity.

With our specific example, the command creates output that looks like this:



Click here for full-size image


and a dataset that looks like this:



Click here for full-size image


Be sure to save the dataset when you finish. The Data Gods are mercurial and not prone to forgiveness.

Stata's FAQ for using the -reshape- command offers additional insights and tips.




Data Analysis

Page adapted from Electronic Data Center, Emory University Libraries
Original text by Amy Yuen