Using Stata to "Reshape" From Wide to Long with One Variable of Interest:

If the dataset contains only one variable of interest recorded over time (for instance, you want data on GDP from the WDI and plan to merge those data into a bigger dataset), there are a few simple commands in Stata you can use to change from the 'wide' to 'long' format.

Say that we have gotten some data from the WDI on GDP for several years for multiple countries and have moved the data into Stata. When you import datafiles into Stata, Stata often assigns generic names to variables - v1, v2, v3, etc.:



Click here for full-size image

Consequently, it is often necessary to rename those variables before reshaping the file. First, each year in the 'wide' format must have a string component as part of its variable name for the reshape command to work. So, if you import data from Excel into Stata, you need to use the -rename- command to rename each year with some variable name and the year of that column. Note that you will need to assign the correct year carefully. Generally, the syntax will be along the lines of this generic example:

rename v1 country
rename v2 xyzyear1
rename v3 xyzyear2
rename v4 xyzyear3

(Note that the string portion must be the same for every year, because Stata uses that as part of the variable name when it reshapes the data into 'long' format; thus the string portion of the variable name should be the name you want the variable to have in the reshaped file.)

And, after renaming the variables, you should see something like this when browsing the dataset:

Country
xyz1981
xyz1982
xyz1983
Austria
25
10
8
Belgium
14
30
17
Denmark
2
18
4

To refer to our specific example of the datafile containing data on GDP, we would enter the following commands:

rename v1 country
rename v2 gdp1981
rename v3 gdp1982
rename v4 gdp1983

And see the following when we browse our dataset:



Click here for full-size image

Once you have finished renaming the variables, check to see if if the first row of observations contains column labels. In our example, this is in fact the case:



Click here for full-size image

In this situation, you need to delete that row using this command:*

drop if _n==1

*The _n is an internal indexing mechanism that Stata automatically does for each row of data. _n==1 refers to the first row of data in the dataset. The _n values do not "stick" with their rows if the data are sorted differently, so make sure you are deleting the correct row if you have sorted the data. Also notice that the command calls for a double equals sign (==) not just a single sign.

This will get rid of the column labels as a row of data:



Click here for full-size image

In Stata's command window, we will now type the commands to reshape the data. To do so, we must type "reshape" (i.e. the command name); then, we must specify which shape (long or wide) we want to reshaped data to be in; then we must list the variable(s) to be reshaped; then, after a "," we must specify the "i" and "j" variables, which are used to uniquely identify cases. To reshape data into a 'long' format for time-series analysis, the syntax will be something like this generic example:

reshape long xyz, i(country) j(year)

This command tells Stata to create the column variable xyz indexed by country ("i") and year ("j"). Note that xyz is the stem of the original xyzyear variables and is the name that we want the variable to have in our reshaped file. "i" is the original row variable; with time-series data, this is what is called the "cluster" variable. "j" is a new variable that is created by the reshaping process; it is taken from the suffix of the xyzyear variables in our original, "wide" datafile and it uniquely identifies cases within each cluster. Stata will then report all of the changes made, the new observation and variable counts, and any errors it encountered.

To go back to our specific example, we would type

reshape long gdp, i(country) j(year)

This command will create a variable called gdp. The "i(country)" option will tell Stata that the "country" variable will identify each group of observations within the datafile. The "j(year)" option, meanwhile, will create a new variable ("year") that will uniquely identify observations within each cluster of observations.

When browsing the dataset, we then see that the data have in fact been changed into a 'long' format:



Click here for full-size image

Be sure to save the dataset when you finish. The Data Gods are mercurial and not prone to forgiveness.

Stata's FAQ for using the -reshape- command offers additional insights and tips.




Data Analysis

Page adapted from Electronic Data Center, Emory University Libraries
Original text by Amy Yuen