There are three central commands to transform variables or compute new variables from other variables:
General syntax
generate newvar[:lblname] =exp [if] [in]
replace oldvar =exp [if] [in]
Where newvar is a variable to be created, oldvar an existing variable and exp an expression of any complexity using
generate a1=urb+22-log(infmor) | |
generate b1=12 | Constant value of 12 |
generate b2= . | Set all of b2 to missing |
replace b=log(urb) if urb>=60 | Replace b with the log of urb where urb is at least 60 |
replace age2 = age2^2 | Replace age2 with its square |
generate c=uniform() | Create a uniform random variable (newer Stata versions name this function runiform()) |
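The examples above can be tried on a small toy dataset. This is only a sketch: the values typed in with input are made up, and urb and infmor simply stand in for the variables of the same name used in this tutorial.

```stata
* Hypothetical toy data, not the tutorial's dataset
clear
input urb infmor
45 80
60 40
85 10
end

* New variable from an arithmetic expression
generate a1 = urb + 22 - log(infmor)

* Create b as missing everywhere, then fill it in conditionally
generate b = .
replace b = log(urb) if urb >= 60

list
```

Observations that do not satisfy the if condition keep their previous value of b, which here is missing.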
A large number of arithmetic functions are available; here's a partial list:
Several functions are shortcuts simplifying common tasks that could be achieved with a sequence of generate/replace commands.
The egen command offers a number of useful functions (some of them are documented below). The general syntax is:
egen newvar = efn(exp) [if] [in] [, options]
Where efn is one of the functions offered by egen (see a partial list below) and exp is an expression, often a simple variable name.
egen urbm=median(urb) | Creates a constant containing for each observation the median of urb |
egen urbsd=sd(log(urb)) | Constant containing the standard deviation of the logged urb |
egen urb1=mean(urb) if urb > 70 | Sets urb1 to the mean of urb computed over the observations where urb is larger than 70; all other observations are set to missing (urb1 must not already exist, since egen always creates a new variable) |
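A minimal sketch of how egen behaves with and without an if qualifier, using three made-up values of urb (the data are an assumption for illustration only):

```stata
* Hypothetical toy data
clear
input urb
50
80
90
end

* Same value (the median, 80) stored in every observation
egen urbm = median(urb)

* Mean of the observations with urb > 70 (80 and 90);
* observations that fail the condition get a missing value
egen urb1 = mean(urb) if urb > 70

list
```

Running either egen line a second time stops with an error because the target variable already exists; drop it first or use a different name.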
Functions returning a constant:
min(exp) | Minimum | max(exp) | Maximum |
mean(exp) | Mean | sd(exp) | Standard deviation |
mdev(exp) | Mean absolute deviation | median(exp) | Median |
mad(exp) | Median absolute deviation | iqr(exp) | Interquartile range |
pctile(exp) [, p(#)] | Percentile, p defaults to 50 | count(exp) | Count the number of non-missing values |
total(exp) | Sum of observations |
std(exp) [, mean(#) std(#)] | Standardize, mean and std default to 0 and 1 |
pc(exp) | Percentages (proportions/fractions with the , prop option) |
rank(exp) [, unique] | Transforms into ranks (tied values share a rank unless the , unique option is given) |
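As an illustration of std() and rank(), the following sketch uses four made-up values with one tie. The variable names z, rk and rku are hypothetical, and only the default options are used.

```stata
* Hypothetical toy data with a tie (20 appears twice)
clear
input urb
10
20
20
30
end

* Standardized values: mean 0 and standard deviation 1 by default
egen z = std(urb)

* Default ranks: tied observations share the average of their ranks
egen rk = rank(urb)

* With unique, ties are broken arbitrarily so that every rank is distinct
egen rku = rank(urb), unique

summarize z
list
```

With the default ranking, the two observations with urb equal to 20 both receive rank 2.5.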
These functions create a new variable with statistics obtained for each observation across a variable list.
rowmax(varlist) | Maximum | rowmin(varlist) | Minimum |
rowmean(varlist) | Mean | rowmedian(varlist) | Median |
rowsd(varlist) | Standard deviation | rowtotal(varlist) | Sum |
rowpctile(varlist) [, p(#)] | Percentile, p defaults to 50 | rowmiss(varlist) | Count of missing values |
rownonmiss(varlist) | Count of non-missing values |
Consider the following two commands:
generate nw=(v1+v2+v3+v4)/4
egen nw=rowmean(v1 v2 v3 v4)
If there are no missing values the results of the two commands will be the same; if however one or more values are missing from an observation the results will differ. In the first case a missing value will be generated for nw, in the second example an average will be computed for the non-missing values.
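The difference can be seen on two made-up observations, one complete and one with a missing value. In this sketch the two results are stored under the hypothetical names nw1 and nw2 so that both can be listed side by side.

```stata
* Hypothetical toy data; v3 is missing in the second observation
clear
input v1 v2 v3 v4
1 2 3 4
1 2 . 4
end

* Arithmetic with a missing operand yields a missing result
generate nw1 = (v1 + v2 + v3 + v4)/4

* rowmean() averages over the non-missing values only
egen nw2 = rowmean(v1 v2 v3 v4)

list
```

For the second observation nw1 is missing, while nw2 is the mean of the three available values, (1+2+4)/3.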
count(exp) counts the number of non-missing values in a variable. To obtain the number of missing values you can use the following:
egen c=count(urb) | Count the number of non-missing observations in urb |
display _N-c | Display the difference between the total number of observations in the dataset and "c", i.e. the number of missing values (_N is a system constant) |
rowmiss(varlist) and rownonmiss(varlist) can be used to inspect missing/non missing observations across several variables.
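A short sketch combining both approaches on made-up data (the values and the variable name nmiss are assumptions for illustration):

```stata
* Hypothetical toy data with missing values in both variables
clear
input urb infmor
45 80
. 40
85 .
. .
end

* Number of non-missing values of urb, stored in every observation
egen c = count(urb)

* Number of missing values of urb: total observations minus non-missing
display _N - c

* Per observation: how many of the listed variables are missing
egen nmiss = rowmiss(urb infmor)

list
```

Here display shows 2, since two of the four observations have a missing urb, and nmiss takes the values 0, 1, 1 and 2.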