[go: up one dir, main page]

0% found this document useful (0 votes)
77 views1 page

Data Transformation

This document provides information on data transformation and manipulation techniques in Stata. It discusses reshaping data from wide to long format using the reshape command. It also covers selecting, filtering, and replacing parts of data through commands like drop, keep, and replace. Finally, it shows how to combine data by appending new observations to an existing dataset.

Uploaded by

light
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
77 views1 page

Data Transformation

This document provides information on data transformation and manipulation techniques in Stata. It discusses reshaping data from wide to long format using the reshape command. It also covers selecting, filtering, and replacing parts of data through commands like drop, keep, and replace. Finally, it shows how to combine data by appending new observations to an existing dataset.

Uploaded by

light
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 1

Data Transformation Reshape Data Manipulate Strings

with Stata 14.1 Cheat Sheet webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data GET STRING PROPERTIES
For more info see Stata’s reference manual (stata.com) webuse "coffeeMaize.dta" load demo dataset display length("This string has 29 characters")
MELT DATA (WIDE → LONG) return the length of the string
Select Parts of Data (Subsetting) reshape variables starting unique id create new variable which captures charlist make * user-defined package
with coffee and maize variable (key) the info in the column names
SELECT SPECIFIC COLUMNS display the set of unique characters within a string
drop make reshape long coffee@ maize@, i(country) j(year) new variable display strpos("Stata", "a")
remove the 'make' variable convert a wide dataset to long return the position in Stata where a is first found
TIDY DATASETS have
keep make price WIDE LONG (TIDY) each observation FIND MATCHING STRINGS
opposite of drop; keep only variables 'make' and 'price' country coffee coffee maize maize melt country year coffee maize
in its own row and display strmatch("123.89", "1??.?9")
2011 2012 2011 2012 Malawi 2011
FILTER SPECIFIC ROWS Malawi Malawi 2012 each variable in its return true (1) or false (0) if string matches pattern
drop if mpg < 20 drop in 1/4 Rwanda Rwanda 2011 own column. display substr("Stata", 3, 5)
Rwanda 2012
drop observations based on a condition (left)
Uganda cast Uganda 2011
return the string located between characters 3-5
or rows 1-4 (right) CAST DATA (LONG → WIDE)
Uganda 2012
When datasets are list make if regexm(make, "[0-9]")
keep in 1/30 what will be create new variables tidy, they have a list observations where make matches the regular
opposite of drop; keep only rows 1-30 create new variables named unique id with the year added consistent, expression (here, records that contain a number)
keep if inrange(price, 5000, 10000) coffee2011, maize2012... variable (key) to the column name standard format
that is easier to list if regexm(make, "(Cad.|Chev.|Datsun)")
keep values of price between $5,000 – $10,000 (inclusive) reshape wide coffee maize, i(country) j(year) return all observations where make contains
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru") manipulate and
convert a long dataset to wide analyze. "Cad.", "Chev." or "Datsun"
keep the specified values of make compare the given list against the first word in make
xpose, clear varname
sample 25 transpose rows and columns of data, clearing the data and saving list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
sample 25% of the observations in the dataset old column names as a new variable called "_varname" return all observations where the first word of the
(use set seed # command for reproducible sampling) make variable contains the listed words
Replace Parts of Data Combine Data TRANSFORM STRINGS
display regexr("My string", "My", "Your")
CHANGE COLUMN NAMES ADDING (APPENDING) NEW DATA replace string1 ("My") with string2 ("Your")
rename (rep78 foreign) (repairRecord carType) id blue pink
webuse coffeeMaize2.dta, clear replace make = subinstr(make, "Cad.", "Cadillac", 1)
rename one or multiple variables id blue pink save coffeeMaize2.dta, replace load demo data replace first occurrence of "Cad." with Cadillac
should webuse coffeeMaize.dta, clear in the make variable
CHANGE ROW VALUES contain
replace price = 5000 if price < 5000 + the same
variables
append using "coffeeMaize2.dta", gen(filenum) display stritrim(" Too much Space")
id blue pink
(columns) add observations from "coffeeMaize2.dta" to replace consecutive spaces with a single space
replace all values of price that are less than $5,000 with 5000 current data and create variable "filenum" to display trim(" leading / trailing spaces ")
recode price (0 / 5000 = 5000) track the origin of each observation
remove extra spaces before and after a string
change all prices less than 5000 to be $5,000
MERGING TWO DATASETS TOGETHER display strlower("STATA should not be ALL-CAPS")
recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2) webuse ind_age.dta, clear
save ind_age.dta, replace change string case; see also strupper, strproper
change the values and value labels then store in a new must contain a
ONE-TO-ONE
variable, foreign2 common variable webuse ind_ag.dta, clear display strtoname("1Var name")
id blue pink (id) id brown
REPLACE MISSING VALUES
id blue pink brown _merge
merge 1:1 id using "ind_age.dta" convert string to Stata-compatible variable name
+ =
3
one-to-one merge of "ind_age.dta" display real("100")
mvdecode _all, mv(9999) useful for cleaning survey datasets 3
3 into the loaded dataset and create convert string to a numeric or missing value
replace the number 9999 with missing value in all variables
variable "_merge" to track the origin
MANY-TO-ONE
mvencode _all, mv(9999) useful for exporting data
replace missing values with the number 9999 for all variables
Save & Export Data
id blue pink id brown id blue pink brown _merge
webuse hh2.dta, clear
save hh2.dta, replace compress
=
3

Label Data + .
3
webuse ind2.dta, clear compress data in memory
1
save "myData.dta", replace Stata 12-compatible file

Value labels map string descriptions to numbers. They allow the _merge code
1 row only
3
3
merge m:1 hid using "hh2.dta" saveold "myData.dta", replace version(12)
underlying data to be numeric (making logical tests simpler) (master) in ind2
2 row only
. 1 many-to-one merge of "hh2.dta" save data in Stata format, replacing the data if
while also connecting the values to human-understandable text. (using) in hh2 . . 2 into the loaded dataset and create a file with same name exists
label define myLabel 0 "US" 1 "Not US"
3 row in
(match) both variable "_merge" to track the origin export excel "myData.xls", /*
label values foreign myLabel FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID */ firstrow(variables) replace
define a label and apply it the values in foreign export data as an Excel file (.xls) with the
reclink match records from different data sets using probabilistic matching ssc install reclink variable names as the first row
label list note: data note here jarowinkler create distance measure for similarity between two strings ssc install jarowinkler export delimited "myData.csv", delimiter(",") replace
list all labels within the dataset place note in dataset export data as a comma-delimited file (.csv)
Tim Essam (tessam@usaid.gov) • Laura Hughes (lhughes@usaid.gov) inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated March 2016
follow us @StataRGIS and @flaneuseks Disclaimer: we are not affiliated with Stata. But we like it. CC BY 4.0

You might also like