Research Guides: Stata: World Bank Data

Country Name	Country Code	Series Name	Series Code	2007 [YR2007]	2008 [YR2008]	2009 [YR2009]	2010 [YR2010]
Liberia	LBR	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	..	..
Madagascar	MDG	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	..	..
Malawi	MWI	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	29.27341	28.03011
Mali	MLI	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	..	..
Mozambique	MOZ	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	..	..
Nepal	NPL	Central government debt, total	GC.DOD.TOTL.GD.ZS	42.97866	43.79999	..	33.86252
Niger	NER	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	..	..
Uganda	UGA	Central government debt, total	GC.DOD.TOTL.GD.ZS	26.4924	33.1932	28.46653	27.54292
Zimbabwe	ZWE	Central government debt, total	GC.DOD.TOTL.GD.ZS	..	..	..	..

Data from database: World Development Indicators
Last Updated: 02/01/2017

The World Bank Databank, which includes the World Development Indicators among other datasets,may be the perfect source for cross-national panel data on economic, social, and health topics. However, if you download from their website using the default settings, you may find that it is not optimally set up for a panel data analysis.

There are three big problems you'll see:

There are separate columns for each year, but you actually want the year to be a variable of its own. This data layout is called wide data, but you want your data to be long.
The default download settings indicate missing values with two periods, like so: ".." But Stata records missing values as a single period. All these columns will import as string values when they should be numeric, because of the missing values, so you won't be able to make any calculations without fixing the data.
These last few lines aren't really observations, so they will create a bunch of missing values, and a bunch of problems, if you try to rearrange the data later.

Luckily, this can all be fixed with Stata. You DO NOT NEED to cut and paste in Excel. Instructions follow.

Options

Option 1: wbopendata
Use a free add-on for Stata that does the World Bank downloading for you.
Option 2: Wide to Long
Download as usual but reshape using Stata.
Option 3: Download in Long Format
Skip ahead by downloading in long format to begin with.

Option 1: wbopendata

This is an add-on module that brings World Bank data directly into Stata for you.

Install by typing into your Command window like so:
```
ssc install wbopendata
```
Next open up the interface for selecting your data
```
db wbopendata
```
This menu will give you some choices about the data you want. If you select a country or multiple countries under Options for download -- Country WDI or All Series, it will give you all variables for those countries. Alternatively, you can select one variable next to Indicators -- All series, and get that variable either for all countries, if none are selected, or for any country you select. Most likely, if you want a number of variables on the same topic, for all countries, select a topic next to Topics -- WDI series, for example 3 - Economy & Growth.

It's important to check the box next to "Import the data in the long format," as well as "Replace data in memory" if you already have data loaded into Stata.

Downloading may take a while, but give it a second, then open up the Data Editor to see your data beautifully laid out in long format.

countryname	countrycode	iso2code	region	regioncode	year	bm_gsr_cmcp_zs	bm_gsr_fcty_cd	bm_gsr_gnfs_cd
Afghanistan	AFG	AF	South Asia	SAS	1979	34.9931	2.40E+07	7.70E+08
Afghanistan	AFG	AF	South Asia	SAS	1980	17.5939	6.90E+06	9.20E+08
Afghanistan	AFG	AF	South Asia	SAS	1981	16.3703	2.00E+07	1.10E+09
Afghanistan	AFG	AF	South Asia	SAS	1982	10.3989	2.00E+07	9.70E+08
Afghanistan	AFG	AF	South Asia	SAS	1983	9.82736	2.10E+07	1.00E+09
Afghanistan	AFG	AF	South Asia	SAS	1984	6.98267	1.90E+07	1.40E+09
Afghanistan	AFG	AF	South Asia	SAS	1985	9.60415	8.60E+06	1.10E+09
Afghanistan	AFG	AF	South Asia	SAS	1986	6.12245	3.50E+07	1.30E+09
Afghanistan	AFG	AF	South Asia	SAS	1987	13.4997	1.10E+07	1.10E+09
Afghanistan	AFG	AF	South Asia	SAS	1988	8.83333	1.20E+07	8.50E+08
Afghanistan	AFG	AF	South Asia	SAS	1989	10.9284	7.90E+06	7.30E+08

Option 2: Wide to Long

I definitely recommend using wbopendata, but if you insist, here are instructions for starting with the Excel default downloads from World Databank.

Use CSV files instead of Excel. You can do this either by selecting Download Options > CSV in World Bank Databank before downloading, or, if you already have an Excel file, opening it in Excel and saving as a CSV. This is a simpler file format that creates fewer problems when importing into Stata. While you are in Excel, delete those last few lines that aren't actually data ("Data from database: World Development Indicators," etc.).
In Stata, go to File > Import > Text data (delimited, csv, etc.). This will bring up the import menu.

First, use the Browse button to find your CSV file. Next, make sure the drop down under "Use first row for variable names" is set to Always. Finally, in the data preview at the bottom of the menu, scroll over to look at your variables for each year. Because of those pesky double-period missing values, they will import as strings, which is why they are in red. You want them in numeric format. To fix this, click on the column so that they are highlighted in blue (hold down the Ctrl key to select more than one at once), then right-click and select "Force selected columns to use numeric types." Now they should turn black, indicating a numeric variable.

Now click OK. The data should now import.
The command now in the Review window should like something like this:
```
import delimited "C:\Users\yourusername\Downloads\Data_Extract_From_World_Development_Indicators.csv",
 varnames(1) numericcols(5 6) clear
```
Now you can look at your data in the Data Editor and check that everything looks as you expect. If you neglected to convert the year variables to numeric type, they will show up in red in the Data Editor. This can be fixed with a command like so:
```
destring yr2009, replace force
```
Let's make sure we deleted those stray rows at the bottom that aren't really data.
```
drop if countrycode==""
```
You still need to reshape the data so that the years are in their own variable, rather than in the variable names. This is the purpose of the reshape command. The help document for reshape gives a useful illustration (hint: type "help reshape" in the Command window). In our case, the command will look like:
```
reshape long yr, i( countryname countrycode seriesname seriescode ) j(year)
```
yr comes from yr2000, yr2001, etc.: it is a bit of text that every numeric variable starts with. This is the "stub." The year information should be the second half of the variable name. In i(), you'll put all the identifying variables that uniquely identify every country and year. In j(), you'll put the name of the new variable you are creating. It can be anything, but in this case year makes the most sense.

Now our data is long in the extreme. Take a look:

countryname	countrycode	seriesname	seriescode	year	yr
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2007	48.888
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2008	48.473
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2009	48.018
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2010	47.529
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2011	47.018
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2012	46.499
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2013	45.985
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	2014	45.483

If you only have one series, this is fine. You can drop the seriesname and seriescode variables and rename yr to the series name. You can see how you could do all these steps for multiple series and then merge them together on country and year variables. However, if you have multiple series in this file, you have a problem, because it is too long. You want each of these series to have their own column. We can do this by reshaping again.

First let's use encode to turn the seriesname variable into a numeric variable with value labels.

encode seriesname, gen(series)
codebook series

codebook will tell you what these new numeric variables are, like so:

------------------------------------------------------------------------------------------------
series                                                                               Series Name
------------------------------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  series

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/960

            tabulation:  Freq.   Numeric  Label
                           480         1  Birth rate, crude (per 1,000
                                          people)
                           480         2  Debt service on external debt,
                                          long-term (TDS, current US$)

A little more housekeeping: let's drop the seriesname and seriescode, because we don't need them. And, let's rename yr to var, because it doesn't contain the year, actually.

drop seriesname seriescode
rename yr var

Now, we reshape yet again. (The last time?)

reshape wide var, i( countryname countrycode year) j( series)

Now, the data look great.

countryname	countrycode	year	var1	var2
Angola	AGO	2007	48.888	4.40E+09
Angola	AGO	2008	48.473	1.60E+09
Angola	AGO	2009	48.018	3.50E+09
Angola	AGO	2010	47.529	2.30E+09
Angola	AGO	2011	47.018	2.80E+09
Angola	AGO	2012	46.499	4.20E+09
Angola	AGO	2013	45.985	4.60E+09
Angola	AGO	2014	45.483	5.90E+09

Refer back to the codebook output to understand which variable is which. You can easily rename the variables with the rename command.

Option 3: Download in Long Format

Many do not realize that the World Bank DataBank has advanced download options that let you select the long format from the beginning. If you are downloading more than one series, however, you will still need to reshape the data a bit to get separate columns for each series.

After you've selected the countries, series, and years you'd like to download, click on Download options in the upper right, and select "Advanced options."
Now, change the Data Format to List (this means long) and the NA Preference to Blank. I recommend using CSV files instead of Excel files as well.
If you've downloaded more than one series, once you've imported this into Stata, you'll need to follow step 6-8 from Option 2 above to reshape the data. But, you've saved some time by using the advanced download options.