Skip to Main Content

Stata: World Bank Data

Country Name Country Code Series Name Series Code 2007 [YR2007] 2008 [YR2008] 2009 [YR2009] 2010 [YR2010]
Liberia LBR Central government debt, total GC.DOD.TOTL.GD.ZS .. .. .. ..
Madagascar MDG Central government debt, total GC.DOD.TOTL.GD.ZS .. .. .. ..
Malawi MWI Central government debt, total GC.DOD.TOTL.GD.ZS .. .. 29.27341 28.03011
Mali MLI Central government debt, total GC.DOD.TOTL.GD.ZS .. .. .. ..
Mozambique MOZ Central government debt, total GC.DOD.TOTL.GD.ZS .. .. .. ..
Nepal NPL Central government debt, total GC.DOD.TOTL.GD.ZS 42.97866 43.79999 .. 33.86252
Niger NER Central government debt, total GC.DOD.TOTL.GD.ZS .. .. .. ..
Uganda UGA Central government debt, total GC.DOD.TOTL.GD.ZS 26.4924 33.1932 28.46653 27.54292
Zimbabwe ZWE Central government debt, total GC.DOD.TOTL.GD.ZS .. .. .. ..
               
Data from database: World Development Indicators    
Last Updated: 02/01/2017          

The World Bank Databank, which includes the World Development Indicators among other datasets,may be the perfect  source for cross-national panel data on economic, social, and health topics. However, if you download from their website using the default settings, you may find that it is not optimally set up for a panel data analysis.

There are three big problems you'll see:

  1. There are separate columns for each year, but you actually want the year to be a variable of its own. This data layout is called wide data, but you want your data to be long.
     
  2. The default download settings indicate missing values with two periods, like so: ".." But Stata records missing values as a single period. All these columns will import as string values when they should be numeric, because of the missing values, so you won't be able to make any calculations without fixing the data.
     
  3. These last few lines aren't really observations, so they will create a bunch of missing values, and a bunch of problems, if you try to rearrange the data later.
Luckily, this can all be fixed with Stata. You DO NOT NEED to cut and paste in Excel. Instructions follow.

Options

Option 1: wbopendata

This is an add-on module that brings World Bank data directly into Stata for you. 

  1. Install by typing into your Command window like so:
    ssc install wbopendata
  2. Next open up the interface for selecting your data
     
    db wbopendata
  3. This menu will give you some choices about the data you want. If you select a country or multiple countries under Options for download -- Country WDI or All Series, it will give you all variables for those countries. Alternatively, you can select one variable next to Indicators -- All series, and get that variable either for all countries, if none are selected, or for any country you select. Most likely, if you want  a number of variables on the same topic, for all countries, select a topic next to Topics -- WDI series, for example 3 - Economy & Growth.



    It's important to check the box next to "Import the data in the long format," as well as "Replace data in memory" if you already have data loaded into Stata.
  4. Downloading may take a while, but give it a second, then open up the Data Editor to see your data beautifully laid out in long format.

    countryname countrycode iso2code region regioncode year bm_gsr_cmcp_zs bm_gsr_fcty_cd bm_gsr_gnfs_cd
    Afghanistan AFG AF South Asia SAS 1979 34.9931 2.40E+07 7.70E+08
    Afghanistan AFG AF South Asia SAS 1980 17.5939 6.90E+06 9.20E+08
    Afghanistan AFG AF South Asia SAS 1981 16.3703 2.00E+07 1.10E+09
    Afghanistan AFG AF South Asia SAS 1982 10.3989 2.00E+07 9.70E+08
    Afghanistan AFG AF South Asia SAS 1983 9.82736 2.10E+07 1.00E+09
    Afghanistan AFG AF South Asia SAS 1984 6.98267 1.90E+07 1.40E+09
    Afghanistan AFG AF South Asia SAS 1985 9.60415 8.60E+06 1.10E+09
    Afghanistan AFG AF South Asia SAS 1986 6.12245 3.50E+07 1.30E+09
    Afghanistan AFG AF South Asia SAS 1987 13.4997 1.10E+07 1.10E+09
    Afghanistan AFG AF South Asia SAS 1988 8.83333 1.20E+07 8.50E+08
    Afghanistan AFG AF South Asia SAS 1989 10.9284 7.90E+06 7.30E+08

Option 2: Wide to Long

I definitely recommend using wbopendata, but if you insist, here are instructions for starting with the Excel default downloads from World Databank.

  1. Use CSV files instead of Excel. You can do this either by selecting Download Options > CSV in World Bank Databank before downloading, or, if you already have an Excel file, opening it in Excel and saving as a CSV. This is a simpler file format that creates fewer problems when importing into Stata. While you are in Excel, delete those last few lines that aren't actually data ("Data from database: World Development Indicators," etc.).
  2. Stata's import delimited text data menu. Check whether the first row will be the variable names, and whether the columns are importing as numeric values.

    In Stata, go to File > Import > Text data (delimited, csv, etc.). This will bring up the import menu.

    First, use the Browse button to find your CSV file. Next, make sure the drop down under "Use first row for variable names" is set to Always. Finally, in the data preview at the bottom of the menu, scroll over to look at your variables for each year. Because of those pesky double-period missing values, they will import as strings, which is why they are in red. You want them in numeric format. To fix this, click on the column so that they are highlighted in blue (hold down the Ctrl key to select more than one at once), then right-click and select "Force selected columns to use numeric types." Now they should turn black, indicating a numeric variable.

    Now click OK. The data should now import.

    The command now in the Review window should like something like this:

    import delimited "C:\Users\yourusername\Downloads\Data_Extract_From_World_Development_Indicators.csv",
     varnames(1) numericcols(5 6) clear
  3. Now you can look at your data in the Data Editor and check that everything looks as you expect. If you neglected to convert the year variables to numeric type, they will show up in red in the Data Editor. This can be fixed with a command like so:
    destring yr2009, replace force
  4. Let's make sure we deleted those stray rows at the bottom that aren't really data.
    drop if countrycode==""
  5. You still need to reshape the data so that the years are in their own variable, rather than in the variable names. This is the purpose of the reshape command. The help document for reshape gives a useful illustration (hint: type "help reshape" in the Command window). In our case, the command will look like:
    reshape long yr, i( countryname countrycode seriesname seriescode ) j(year)
    
    

    yr comes from yr2000, yr2001, etc.: it is a bit of text that every numeric variable starts with. This is the "stub." The year information should be the second half of the variable name. In i(), you'll put all the identifying variables that uniquely identify every country and year. In j(), you'll put the name of the new variable you are creating. It can be anything, but in this case year makes the most sense.

  6. Now our data is long in the extreme. Take a look:

    countryname countrycode seriesname seriescode year yr
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2007 48.888
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2008 48.473
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2009 48.018
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2010 47.529
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2011 47.018
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2012 46.499
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2013 45.985
    Angola AGO Birth rate, crude (per 1,000 people) SP.DYN.CBRT.IN 2014 45.483

    If you only have one series, this is fine. You can drop the seriesname and seriescode variables and rename yr to the series name. You can see how you could do all these steps for multiple series and then merge them together on country and year variables. However, if you have multiple series in this file, you have a problem, because it is too long. You want each of these series to have their own column. We can do this by reshaping again.
  7. First let's use encode to turn the seriesname variable into a numeric variable with value labels.

    encode seriesname, gen(series)
    codebook series
    
    codebook will tell you what these new numeric variables are, like so:
    ------------------------------------------------------------------------------------------------
    series                                                                               Series Name
    ------------------------------------------------------------------------------------------------
    
                      type:  numeric (long)
                     label:  series
    
                     range:  [1,2]                        units:  1
             unique values:  2                        missing .:  0/960
    
                tabulation:  Freq.   Numeric  Label
                               480         1  Birth rate, crude (per 1,000
                                              people)
                               480         2  Debt service on external debt,
                                              long-term (TDS, current US$)
    
    
    A little more housekeeping: let's drop the seriesname and seriescode, because we don't need them. And, let's rename yr to var, because it doesn't contain the year, actually.
    drop seriesname seriescode
    rename yr var
  8. Now, we reshape yet again. (The last time?) 
    reshape wide var, i( countryname countrycode year) j( series)
    
    Now, the data look great.
    countryname countrycode year var1 var2
    Angola AGO 2007 48.888 4.40E+09
    Angola AGO 2008 48.473 1.60E+09
    Angola AGO 2009 48.018 3.50E+09
    Angola AGO 2010 47.529 2.30E+09
    Angola AGO 2011 47.018 2.80E+09
    Angola AGO 2012 46.499 4.20E+09
    Angola AGO 2013 45.985 4.60E+09
    Angola AGO 2014 45.483 5.90E+09

    Refer back to the codebook output to understand which variable is which. You can easily rename the variables with the rename command.

Option 3: Download in Long Format

Many do not realize that the World Bank DataBank has advanced download options that let you select the long format from the beginning. If you are downloading more than one series, however, you will still need to reshape the data a bit to get separate columns for each series.

  1. Select Download options > Advanced OptionsAfter you've selected the countries, series, and years you'd like to download, click on Download options in the upper right, and select "Advanced options."
  2. Now, change the Data Format to List (this means long) and the NA Preference to Blank. I recommend using CSV files instead of Excel files as well.
  3. If you've downloaded more than one series, once you've imported this into Stata, you'll need to follow step 6-8 from Option 2 above to reshape the data. But, you've saved some time by using the advanced download options.