Data transformation

Transform variables

There are three central commands to transform variables or compute new variables from other variables:

generate: generates new variables (egen: extensions to generate)
replace: replaces the values of existing variables
recode: recodes categorical variables (details here)

generate and replace

General syntax

generate newvar[:lblname] =exp [if] [in]
replace oldvar =exp [if] [in]

Where newvar is a variable to be created, oldvar an existing variable and exp an expression of any complexity using

Variables
Constants and system variables, e.g. _N (total number of observations in the data set), _n (number of the current observation) or _pi (the value of π)
Functions (help functions)
Operators:
- Arithmetic: +, -, *, / and ^ (raise to a power);
- Relations: == (equal), != or ~= (not equal), >, <, >= (≥) and <= (≤)
- Logical operators & (and), | (or) and ! or ~ (not)
- String: + (concatenation)

generate a1=urb+22-log(infmor)
generate b1=12	Constant value of 12
generate b2= .	Set all of b2 to missing
replace b=log(urb) if urb>=60	Replace the values of b where urbanisation is at least 60
replace age2 = age2^2
generate c=runiform()	Create a uniform random variable (uniform() is the older name of this function)
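
The following sketch puts generate and replace together with system variables, operators and an if qualifier. The data values are made up for illustration; only the variable names urb and infmor are taken from the examples above.

```stata
* Hypothetical toy data: the values are invented for illustration
clear
input urb infmor
45 80
60 35
75 12
90  6
end

* _n is the number of the current observation, _N the total number
generate id   = _n
generate nobs = _N

* Arithmetic operators combined with a function
generate a1 = urb + 22 - log(infmor)

* Relational and logical operators return 1 (true) or 0 (false)
generate high = (urb >= 60) & (infmor < 20)

* Start with missing values, then fill in where the condition holds
generate b = .
replace  b = log(urb) if urb >= 60

list
```

In this toy data set, high is 1 for the last two observations only, and b stays missing for the first observation because urb is below 60 there.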

A large number of arithmetic functions are available; here's a partial list:

sqrt(x), abs(x) (absolute value), sum(x) (running sum), max(x1,x2,...,xn), min(x1,x2,...,xn), sign(x)
Rounding and truncating functions: ceil(x), floor(x), trunc(x) [or int(x)], mod(x,y), round(x,y) or round(x)
Trigonometric functions like sin(), cos(), acos(), atanh() etc.
Logarithmic and exponential functions: ln() (or log()), log10(), exp(), logit() [log of the odds]
missing() [true (1) if missing]
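
A short sketch of how some of these functions behave, especially for negative and missing values. The variable x and its values are hypothetical.

```stata
* Hypothetical values chosen to show the differences for negative numbers
clear
input x
-2.7
 7.4
 .
end

* Absolute value: 2.7, 7.4, missing
generate x_abs = abs(x)

* floor() rounds down, ceil() rounds up, int() truncates toward zero
generate x_floor = floor(x)
generate x_ceil  = ceil(x)
generate x_int   = int(x)

* round(x) rounds to the nearest integer, round(x, 0.5) to the nearest multiple of 0.5
generate x_round = round(x)
generate x_half  = round(x, 0.5)

* missing() returns 1 for the third observation, 0 otherwise
generate x_miss = missing(x)

list

* Functions can also be used directly with display
display mod(7, 3)
display max(1, ., 5)
```

For x = -2.7, floor() gives -3 while ceil() and int() give -2. The two display commands show 1 and 5: max() ignores missing arguments as long as at least one argument is non-missing.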

Advanced functions

Several functions are shortcuts that simplify common tasks which could otherwise be achieved with a sequence of generate/replace commands.

Recoding functions autocode(x,n,x0,x1), recode(x,x1,x2,...,xn) (examples of use in Create groups)
clip(x,a,b) returns x if x is in the interval a < x < b, a if x is at or below a, b if x is at or above b
cond(x,a,b,c) (c can be omitted): if x is true, the result is a, if false the result is b, if it is missing the result is c.
inrange(z,a,b) 1 (true) if z is in the interval a ≤ z ≤ b, 0 (false) otherwise
Retrieving results saved by commands: e(name) and r(name) [value of a saved result]
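
As an illustration, the sketch below applies clip(), inrange(), cond() and r() to a small made-up variable z. The bounds 10 and 20 and the code -9 are arbitrary choices for this example.

```stata
* Hypothetical data, including a zero and a missing value
clear
input z
 0
15
25
 .
end

* Values below 10 become 10, values above 20 become 20: 10, 15, 20, missing
generate z_clip = clip(z, 10, 20)

* 1 if 10 <= z <= 20, else 0 (also 0 when z is missing): 0, 1, 0, 0
generate z_in = inrange(z, 10, 20)

* z itself is the condition: 0 is false, nonzero is true, missing gives the 4th argument
* Result: 0, 1, 1, -9
generate z_cond = cond(z, 1, 0, -9)

* summarize saves its results in r(); r(mean) is the mean of the non-missing values
summarize z
generate z_dev = z - r(mean)

list
```

Note that a condition such as z > 10 is true when z is missing, because Stata treats missing values as larger than any number. The fourth argument of cond() is used only when the first argument itself evaluates to missing.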

egen ("extensions to generate")

This command offers a number of useful functions (some of them are documented below). The general syntax is:

egen newvar = efn(exp) [if] [in] [, options]

Where efn is one of the functions offered by egen (see a partial list below) and exp is an expression, often a simple variable name.

egen urbm=median(urb)	Creates a constant containing for each observation the median of urb
egen urbsd=sd(log(urb))	Constant containing the standard deviation of the logged urb
egen urb1=mean(urb) if urb > 70	Set urb1 to the mean of urb computed over the observations where urb is larger than 70; urb1 is missing elsewhere (urb1 must not already exist)
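
The sketch below runs these three commands on a small made-up data set so that the results can be checked by hand. The four values of urb are hypothetical.

```stata
* Hypothetical values of urb
clear
input urb
 50
 60
 80
100
end

* The median of urb is 70, stored in every observation
egen urbm = median(urb)

* The argument can be an expression, here the log of urb
egen urbsd = sd(log(urb))

* The mean is taken over the observations with urb > 70 (80 and 100),
* so urb1 is 90 there and missing in the first two observations
egen urb1 = mean(urb) if urb > 70

list
```

Unlike replace, egen always creates a new variable, so it stops with an error if the name on the left-hand side is already in use.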

Functions returning a constant:

min(exp)	Minimum	max(exp)	Maximum
mean(exp)	Mean	sd(exp)	Standard deviation
mdev(exp)	Mean absolute deviation	median(exp)	Median
mad(exp)	Median absolute deviation	iqr(exp)	Interquartile range
pctile(exp) [, p(#)]	Percentile, p defaults to 50	count(exp)	Number of non-missing values
total(exp)	Sum of observations

Functions transforming a variable

std(exp) [, mean(#) std(#)]	Standardize; mean and std default to 0 and 1
pc(exp) [, prop]	Percentages (proportions/fractions with the prop option)
rank(exp) [, unique]	Transforms into ranks (ties share a rank unless the unique option is given)
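
A minimal sketch of the three transforming functions, using a hypothetical variable x whose values sum to 100 and contain one tie:

```stata
* Hypothetical data: the values sum to 100 and 20 appears twice
clear
input x
10
20
20
50
end

* Standardized values with mean 0 and standard deviation 1
egen x_std = std(x)

* Percentages of the total (10, 20, 20, 50) and proportions (.1, .2, .2, .5)
egen x_pc   = pc(x)
egen x_prop = pc(x), prop

* By default tied values receive the average rank: 1, 2.5, 2.5, 4
egen x_rank = rank(x)

* With unique, ties are broken arbitrarily so that the ranks are 1, 2, 3, 4
egen x_urank = rank(x), unique

list
```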

Functions operating across variables

These functions create a new variable with statistics obtained for each observation across a variable list.

rowmax(varlist)		rowmin(varlist)
rowmean(varlist)		rowmedian(varlist)
rowsd(varlist)		rowtotal(varlist)
rowpctile(varlist) [, p(#)]		rowmiss(varlist)	Count of missing values
rownonmiss(varlist)	Count of non missing values

The cut (recoding) function is detailed in Create groups

Missing values

Consider the following two commands:

    generate nw=(v1+v2+v3+v4)/4
    egen nw=rowmean(v1 v2 v3 v4)

If there are no missing values, the results of the two commands will be the same. If, however, one or more values are missing from an observation, the results will differ: the first command generates a missing value for nw, whereas the second computes the average of the non-missing values.
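
The difference can be seen in a two-observation sketch. The data are made up, and the two results are stored under different hypothetical names (nw_gen and nw_egen) so that both can be listed side by side.

```stata
* Hypothetical data: the second observation has a missing v3
clear
input v1 v2 v3 v4
1 2 3 4
1 2 . 3
end

* Any missing value in the sum makes the whole result missing
generate nw_gen = (v1 + v2 + v3 + v4)/4

* rowmean() averages over the non-missing values only
egen nw_egen = rowmean(v1 v2 v3 v4)

* Number of missing values per observation, useful for checking
egen nmiss = rowmiss(v1 v2 v3 v4)

list
```

In the first observation both variables are 2.5. In the second, nw_gen is missing while nw_egen is 2, the mean of 1, 2 and 3.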

Numbers of missing and non-missing observations

count(exp) counts the number of non-missing values in a variable. To obtain the number of missing values, subtract this count from the total number of observations:

egen c=count(urb)	Count the number of non-missing observations in urb

display _N-c	Display the difference between the total number of observations in the dataset and "c" (_N is a system variable)

rowmiss(varlist) and rownonmiss(varlist) can be used to inspect missing/non missing observations across several variables.
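
As a quick check, the sketch below counts the missing values of a made-up variable urb in two ways. The count command with an if qualifier is shown as one possible alternative; it is not part of egen.

```stata
* Hypothetical data: two of the five values are missing
clear
input urb
50
.
80
.
90
end

* c holds the number of non-missing values (3) in every observation
egen c = count(urb)

* Number of missing values: total observations minus non-missing ones
display _N - c[1]

* Alternative: count the observations where urb is missing; the result is also saved in r(N)
count if missing(urb)
display r(N)
```

Both display commands show 2 for this data set.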

Related commands

range varname #first #last [#obs]	Creates a variable of evenly spaced values from #first to #last, e.g. range xx 1 _N generates the values 1, 2, 3, ... up to the number of observations in the current data set.
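
A minimal sketch of range on an empty data set with five observations. The number of observations and the variable names are arbitrary choices for this example.

```stata
* Create an empty data set with 5 observations
clear
set obs 5

* Evenly spaced values from 1 to 5 over all observations: 1, 2, 3, 4, 5
range xx 1 5

* The same sequence obtained with generate and the system variable _n
generate yy = _n

list
```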
