Glossary
Alphabetical list
This glossary defines - in alphabetical order - the most
important concepts.
- Adjacent value:
-
is the value closest to, but still inside, the
inner fences. (This is where the ends of the whiskers are set
on a boxplot.) Some commands (e.g. the BOXPLOT command) will
identify the adjacent values (low and high adjacent value).
- Adjacent values
-
by extension we call adjacent values any value
lying between the hinges and the inner fences. On coded displays
these values often appear as '+' (upper adjacent values) and '-'
(low adjacent values).
In other contexts low/high are not distinguished; then '+' is
used for both. Note that these codes can be different in
your EDA installation. (STAT GRAPH DISTCODES displays them; SET
GRAPH DISTCODE lets you change them).
See also outliers (out and far out).
- Allvars mode:
-
(Usually default mode) Whenever EDA is in ALLVARS
mode and a multivariate command like FACTOR is used and no
variable list is present, EDA includes all variables in the WA
in the analysis. This of course requires that the WA be rectangular.
Commands using this feature are called "allvars sensitive" commands.
The SET ALLVARS command is used to change the current state.
If ALLVARS is OFF these commands work like normal EDA commands, i.e.
the current vlist is used. If ALLVARS is ON and a vlist is present
EDA behaves as in ALLVARS OFF mode, i.e. the vlist overrides
the mode.
- Block:
-
(A synonym for WA) A block is a WA seen as an entity
for i/o operations. A complete WA may be saved into a file
or loaded from a file. A block has a block label, block
descriptor and block status information.
- Case selection commands:
-
See <Selection commands>
- Casid
-
An alphanumeric case identifier attached to each case
in the WA. Default casid is the (original) sequence number
of the cases. Other casids might be built in (currently
Swiss cantons, European regions) and added using the CASID
command. Any other casid can be read in and stored.
These 4 character casids may be used in named values,
matrix references or simple expressions. See the CASID command
for more details. Casids should contain neither blanks nor special
symbols, to ensure correct substitution in all instances.
Casids defined by the user are automatically stored in a system
file when created; an additional possibility is to store
frequently used casids (and maps) in a casid/map library.
(see there).
- CENTER
-
A CENTER or reference
value is stored with each variable (variable attribute).
When you create a new variable,
by default the median will be
computed and stored as its reference value.
A number of commands use a CENTER option (e.g. some coding options)
to refer to this value. This allows easy comparisons with global
values (e.g. global percentages etc).
Several commands (DISP, PERCENT) modify this center value.
Reference values are stored into system files (PUT) and
restored by a GET command.
There is also a label and descriptor attached to the CENTER, this
label/descriptor is global to a whole WA, i.e. it is not always
meaningful for all variables.
The CENTER command is used to manage this attribute.
Note that reference value and center are synonyms.
- Coding
-
(coded displays)
any display where some symbols are shown instead
of the exact numeric value.
Various forms are frequently used in EDA, e.g.
codes showing whether a value is a far-out, out,
adjacent or in value, codes corresponding to specific intervals
of a variable, or symbols ('+' or '-') representing a
value in terms of its deviation from some central value; e.g.
++ might indicate that the value is two midspreads away from
the median of its distribution.
See the section on the "Art of coding" in this chapter.
- Column name:
-
The columns (cases) of the data matrix may have
a name. Default name is 'case'. Other names, e.g. 'state'
may be specified with the SET COLNAM command.
- Command files:
-
Instead of reading commands from the
keyboard, commands can be entered from a file containing
these commands.
A file named EDAINI is automatically executed when EDA is called
(startup file).
- Configurations:
-
two distinct EDA matrices C1 and C2 mainly
used to store and manipulate configurations produced by
multidimensional techniques. See C1 and C2.
- Control characters(*):
-
Sometimes you need to specify a character not available on your
keyboard (e.g. graphical characters) for symbols used in coded
lists or with the MOD tool. Whenever the manual states that
control characters can be used, the following rules apply.
EDA has a special symbol used to announce a control character, i.e.
~ (tilde) by default. This character may be followed either by a
single letter, e.g. ~A (specifies control-A), or by a three digit
decimal number, e.g. ~027 (note that three digits are always required).
If you need to insert the tilde itself (or its replacement), type ~~ to
produce a single ~. Please note that this facility is only
available where mentioned in the manual. In all other cases
no translation is performed.
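The translation rules above can be sketched as follows. This is a minimal Python illustration, not EDA code: the function name expand_control_chars is hypothetical, the mapping of control-A to character code 1 follows the usual ASCII convention, and raising an error on a malformed sequence is an assumption (the manual does not say what EDA does in that case).

```python
def expand_control_chars(text: str, marker: str = "~") -> str:
    """Translate ~A, ~027 and ~~ sequences into the characters they stand for."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != marker:
            out.append(ch)
            i += 1
            continue
        rest = text[i + 1:]
        if rest.startswith(marker):
            # ~~ produces a single marker character
            out.append(marker)
            i += 2
        elif len(rest) >= 3 and rest[:3].isascii() and rest[:3].isdigit():
            # ~027: exactly three decimal digits give the character code
            out.append(chr(int(rest[:3])))
            i += 4
        elif rest[:1].isascii() and rest[:1].isalpha():
            # ~A: control-A is code 1, control-B is code 2, and so on
            out.append(chr(ord(rest[0].upper()) & 0x1F))
            i += 2
        else:
            # Assumption: a malformed sequence is treated as an error
            raise ValueError(f"malformed control sequence at position {i}")
    return "".join(out)
```

For instance, expand_control_chars("~027") yields the escape character (code 27), and expand_control_chars("50~~") yields "50~".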
- Control options
-
are used with expressions and macros; they
provide control of the macro or expression through an index.
LET #A=#A%#TOTAL \ FOR A START=10 END=20 executes the
expression 11 times; at each step A is incremented by one.
See the LET/CALC/OUT commands, as well as the LOOP and EXECUTE commands
in the macro section.
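The count of 11 follows from START and END both being included. A minimal Python sketch of such an index (not EDA code; the function name for_index is hypothetical):

```python
def for_index(start: int, end: int, step: int = 1):
    """Yield successive index values from start up to and including end."""
    value = start
    while value <= end:
        yield value
        value += step


# START=10 END=20 visits 10, 11, ..., 20
values = list(for_index(10, 20))
print(len(values))  # 11
```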
- C1
-
The configuration area contains coordinates
produced by FACTOR and other multidimensional techniques.
This area can be manipulated by the C1 and ROTATE
commands. There are also labels attached to each
coordinate.
These labels may be case or variable oriented, depending upon
the data stored in it. If they are produced by any
multidimensional technique they are variable oriented (C2 then
is case oriented), but configurations may be stored from other
sources and they also may be exchanged.
When loading a new WA the C1 matrix is not cleared.
- C2
-
a second configuration matrix, which usually holds
individual scores or another configuration from techniques
producing two configuration matrices, or techniques like
configuration fitting and comparison which need two
configurations.
When loading a new WA the C2 matrix is not cleared.
See also C1.
- Data matrix
-
The (numeric) data to be analyzed, i.e. the WA
in a narrower sense (without the matrices and documents
etc) as a data matrix with a certain number of rows and
columns.
- Data range checking
-
Data entered from the keyboard may be
submitted to automatic checking of the range of the data, e.g. for
percent data, values outside 0-100 should be rejected. This
feature may be activated with the SET command.
- Defined variables,
-
See letter variables
- Descriptor:
-
See entry variable descriptor.
In addition to variable descriptors there are also WA descriptors,
descriptors for the C1, C2 and MATRIX matrices, as well as
descriptors for the GVAR and the currently defined variable ties.
All descriptors take the form of a short sentence, up to 48 characters
long, describing the contents of the matrix.
- Document:
-
A document is a text of any length attached to a
variable, which can be retrieved using the DOC command.
This is an optional feature and may not be present. There
is also a HEADER and a NOTE explaining the nature etc of
the data in the WA. Other named documents may also be
included (#docs). Case documents are documents referring
to specific cases. They may have a meaning either for a
variable only or for a whole WA. See the special section
on Documents [AM] and the DOC command.
- EDA-file
-
EDA specific files, where the program retrieves or
stores complete WAs, i.e. blocks. These files are
sequential access (SAM files). Other
files, which do not have an EDA specific format are called
external files.
- EDAINI
-
See command files.
- EDITor:
-
A special module within EDA with its own syntax
analysis for editing and recoding purposes. The editor
works in two modes: either the editor is entered using
the EDIT command or using an edit command preceded by /
meaning immediate mode, i.e. after execution of the command
control returns to normal EDA mode.
There is also a text editor within EDA, called TED.
- Expressions, arithmetic:
-
an algebraic expression following
usual rules for operators and precedence, with a special
notation for variables. (See also simple-expression).
- Extreme values
-
--> outliers
- Far out values
-
--> outliers
- File names:
-
file names within EDA are usually specified within
"" (i.e. as names). The file name is specified as an external
file name, i.e. should correspond to the conventions on your
system.
- Freefield input:
-
in some instances the program requires
numeric input from the keyboard, which can be entered in
freefield, i.e. without a specific format. If the number
of entries is not determined by the nature of the command,
a // sign is used to signal the end of input. Acceptable
items must conform to the definition of simple
expressions. Items are separated by blanks or commas. If
an error occurs during data entry the user is asked to
enter a replacement value for the value in error (or to
cancel, i.e. to abort the current command). Normally data
entry is submitted to range checking. In some instances
items may be skipped by specifying several commas, then a
default value is supplied (more details will be found with
the specific commands).
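The separator rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not EDA's actual parser: the function name parse_freefield is hypothetical, items are restricted to plain numbers (EDA accepts simple expressions), the default used for a skipped item is a parameter because it depends on the specific command, and a trailing comma is assumed not to create an extra item.

```python
def parse_freefield(line: str, default: float = 0.0):
    """Split one freefield input line into numbers.

    Returns (values, finished), where finished is True when the line
    contains the // end-of-input sign.
    """
    body, terminator, _ = line.partition("//")
    fields = body.split(",")
    # Assumption: an empty field after the last comma is not an item
    if fields and not fields[-1].strip():
        fields.pop()
    values = []
    for field in fields:
        items = field.split()  # blanks also separate items
        if not items:
            # Two commas in a row: the item is skipped, use the default
            values.append(default)
        else:
            values.extend(float(item) for item in items)
    return values, bool(terminator)
```

With this sketch, parse_freefield("1, 2,,4 //") returns ([1.0, 2.0, 0.0, 4.0], True): the third item was skipped and replaced by the default.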
- Format:
-
Some I/O routines dealing with formatted raw data,
can use Fortran-type formats. A format string is enclosed in
parentheses and may contain an A4 element for casids
(first element) followed by any format element pertaining
to real data (all data in EDA are real and single
precision).
See also *READ RAW (documented files) for alternatives.
(Usually people with no experience with Fortran programming
have difficulties with formats....).
- Fuzz
-
is a system defined value used in comparisons as the criterion
to determine whether a value is identical to another or not: a
value x is considered identical to a given value when it lies in
the interval
(value-fuzz) < x < (value+fuzz)
Many commands dealing with this kind of problem have an option FUZZ=val
used to set the fuzz value for that particular command.
There is also a global fuzz value, which is used unless the specific
FUZZ= option is given. See SET FUZZ for details.
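The interval test can be sketched in a few lines of Python. This is an illustration only, not EDA code; the function name is_identical is hypothetical, and the strict inequalities follow the interval as written in this entry.

```python
def is_identical(candidate: float, value: float, fuzz: float) -> bool:
    """True when candidate lies strictly inside (value - fuzz, value + fuzz)."""
    return (value - fuzz) < candidate < (value + fuzz)


print(is_identical(10.25, 10.0, 0.5))  # True: within half a unit of 10
print(is_identical(10.5, 10.0, 0.5))   # False: the interval bounds are excluded
```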
- Global options:
-
several options apply to all commands without
any distinction. Currently these are /d, which inhibits the
display of the results of the current command, and /p, which prevents
output from being written to the print file.
The /S (stack) switch is useful in macros.
See the section on
global options for more details.
- Groups
-
Cases can be grouped together using a grouping
variable called GVAR, which can be defined and manipulated
by the user.
Many commands are offered to produce groups (like cluster analysis,
coding etc) and even more commands display group memberships (lists,
histograms, plots etc.).
Groups are shown either as numbers (group numbers) or, when defined,
names whenever this is possible.
See the GVAR command for additional details.
Compare also to TIES.
- Group names
-
Groups may have names (See the GVAR command on how to define names).
Currently the number of groups which may have names is limited to
MAXC (an implementation constant, often 8), but this seems quite
enough for most situations where you are prepared to give each
individual group a specific name. If no name is defined automatic
group names are generated: Group nn, where nn is the group number.
- GVAR:
-
stands for Grouping variable (see group)
The GVAR has a descriptor attached to it, describing its origin.
GVAR is a global attribute of a WA, i.e. it is the same for
all variables. GVARs are produced by many commands and can be manipulated
explicitly with the GVAR command.
- High
-
(values etc) refers to values usually higher (larger) than the median
in the distribution. Often we will also say upper.... or use the
abbreviation HI, e.g. HI adjacent value, upper hinges etc.
See also low(er).
- Hinges:
-
Letter values at depth 1/4, roughly the quartiles,
sometimes called fourths. The hinges are the endpoints of
the box of a box and whisker plot.
- Implementation
-
the process of adapting the EDA program to a
specific computer AND user environment. Implementation
dependent features means that these features might be
different from the description in this manual. For these
the user should refer to the chapter on implementations in
this manual and to the local document describing the
differences. (See also HELP SYSTEM)
- Implementation parameters:
-
(NVAR, MCAS) Parameters specified
on creating a specific EDA implementation, NVAR is the
number (maximum) of variables in the WA; MCAS the maximal
number of cases a variable can have. These parameters
determine the program limitations. Further implementation
parameters are MAXC, the maximal number of clusters, MXDIM
the maximal number of dimensions (factor etc) and MBL, the
maximum number of blocks stored in a direct access
EDA-file.
- Installation, EDA installation
-
EDA has to be installed onto your computer. As EDA offers many
features for workgroups, teaching oriented options and so forth,
installation may vary quite a lot, i.e. the environment you are
actually using depends on how things have been installed. This is
done through profile files.
- In-values:
-
values within the hinges, as opposed to out, far out
and adjacent values. Frequently shown as '*' or blank on coded
displays. Note that this symbol might be different in your
EDA version (See SET GRAPH DISTCODE).
- Intrinsic functions
-
(obsolete) In EDA versions earlier than version 2.0 the ) symbol
was used to refer to system constants and the like. This feature
has been replaced by system constants (starting with a $).
See System constants for more information.
- Label:
-
See variable label. Note that in addition to variables in the WA, the
variable oriented matrix stored into C1 and MATRIX has a separate set
of labels.
- labels and descriptors
-
In many situations you will enter the label and the descriptor of a
variable at the same time, therefore the term "labels and descriptors"
is used to tell the user to enter descriptive documentation for
a variable, often on the same line, starting with the label
followed by the descriptor, the label being the first "word" of the
sentence.
- Ladder of powers
-
See power transformations.
- Load, loading
-
This operation always refers to copying some data into the work area
from matrices like C1, C2, MATRIX or the GVAR. A message telling you
that the GVAR has been loaded, means that the current GVAR has been
loaded as normal variable into the WA. Compare with STORE.
- Letter variable
-
(defined scalar variables)
single letter variables A..Z can be defined within EDA and then used in
variable references or option values, e.g. A=10. There are three types:
constants, auto-increment variables (i.e. after each reference the value
is automatically incremented) and indexed variables (i.e. indexed on a
numeric variable in the WA). See also ResVars (Result variables).
- Logging, command_log
-
You may ask EDA to keep all keyboard
input in a file called a log file (this action is called
logging). Default is NOT to log commands, i.e. you will
have to turn this feature explicitly on. See SET LOG and
the section on logging (chapter file connection).
- Low, lower, LO
-
refers to the position of an observation in the distribution with
respect to the median (or some other reference), e.g. lower hinge,
LO adjacent value(s) and so on. Opposite: upper or higher.
- Macro, line macro:
-
a repetitive execution of an EDA command
specified on one line using the EXECUTE facility or defined
with the DEFMAC command allowing to invoke them by name.
A macro command with more than one command line can be
defined using the MACRO command. See the specific
commands as well as the chapter on macros for more
details.
A single line macro is also called an abbreviation.
- Map:
-
EDA has the possibility to display simple maps on a
character screen. There are two types of maps: built-in
maps, i.e. maps which are part of the EDA system. EDA is
supplied with two of these maps: Swiss cantons and CEE
regions, but this might be different at your installation.
The second type of maps are called user defined maps; these
are maps stored under a very simple form either in a normal
external file, with a WA on an EDA file (in this case
the map is automatically made into the current active map
whenever the corresponding WA is read from that file) or stored
in a casid/map library.
For more details see the MAP command, as well as the
appendix, where you will find information on how to prepare
such a map.
The link between the data and any such map is established through
the casids; in fact the casid id and the map id must be the
same in order to get a map on the screen.
- Marcom
-
a special type of analysis which deals with MARginal
COMParisons between two groups (e.g. elite vs. population).
- Matrices:
-
The different EDA matrices: WA, MATRIX, C1, C2
and/or a matrix in the mathematical sense.
- MATRIX
-
Distance or similarity matrix stored by CORREL,
FACTOR and other commands. It may be manipulated by the
MATRIX command.
When loading a new WA the MATRIX is not cleared.
- Mode
-
assumed modes determining working conditions of some
commands (Analysis on whole WA, error termination of
macros etc). These conditions are controlled by the
ASSUME command.
- Modification stamp
-
If variables are altered using
arithmetic or other transformations, a *c* mark is added
at the end of the variable descriptor and in most
instances the variable descriptor is modified to show the
modification done to the variable. The table below shows
the originators of the stamps.
stamp   possible originators
---------------------------------------
*r*     recodification (RECODE, COPY, PUT)
*t*     transformation (arithmetic)
*s*     1. standardize/normalize
        2. smooth (more info in descriptor)
*c*     LET using a bracket target
*e*     editor: case editing
*i*     ICAS or AGROUP
*d*     DCAS or DGROUP
*+*     CLUSTER: group centroids added
*a*     AGGREGATE
*%*     PERCENT
These stamps are a visible mark that a variable has been
modified and the descriptor has not been modified
accordingly. The descriptor may not need modification,
e.g. in the case of a correction of an error or the like,
then the edit command CLDE for clear_descriptor may be
used to clear the stamp. The edit command SCAN allows you to
search for variables with modification stamps and to complete
the descriptor or to clear the stamp.
Note that when a default label/descriptor is created by
some commands, e.g. the NEWVAR command, the last character
of the descriptor is the '*' character to signal incomplete
documentation of the variable.
- Module
-
Several specific tasks do not fit into the "normal"
EDA syntax frame.
Therefore they are grouped in a module with its own specific
syntax. This is the case for TED (text editor), the EDITor
(data editing) and the TOOLBOX. In order to use these modules
you have to call the module; you then enter it and EDA obeys
the specific syntactical rules of the module. In order to
return to normal mode, you have to leave (quit) the module.
The EDIT and TOOL module may also be called in immediate mode,
i.e. you execute a single command within a module and return
immediately to EDA mode.
- Outliers:
-
With EDA techniques attention is very often focused
on outlier detection and treatment. An outlier is a case having
a value outside the "normal" range, where normal is defined
according to some specified criterion. An often used criterion
is a value outside the inner fences, where the inner fences
are defined as one step outside the hinges and a step is simply
the midspread of the distribution. If we step out further say
by 1.5 steps from the hinges we define the outer fences.
Values between the inner and outer fences are called "out"
values, values outside the outer fences "far out" values.
We shall also use "extreme values" meaning "out" and "far out",
i.e. all values outside the inner fences.
In this program the definition of outliers may be changed with
the SET DEFOUT command, where it is possible to act on the
inner and/or outer fences definition.
Often we shall represent outliers with symbols. On Boxplots
far out values are marked with a "@", out values with a "0".
On coded displays the standard symbols used are '@' (High far out
values), '&' (low far out values), '#' (high out values) and '=' (low
out values). Sometimes low and high are not distinguished; then the
high symbol is used for low and high.
Note that the symbols might be different in your EDA version (use of
graphics characters selected by your EDA system administrator) or
defined in your profile. Furthermore you can change these symbols using
the SET GRAPH DISTCODE command.
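The classification into in, adjacent, out and far out values can be sketched in Python as follows. This is an illustration, not EDA's implementation. The function names hinges and classify are hypothetical. The hinges are computed with Tukey's depth rule, which is an assumption about how EDA obtains them. The fence multipliers (1 step and 1.5 steps, a step being one midspread) follow the definition given in this entry and are parameters, since SET DEFOUT can change them and other conventions exist (a step of 1.5 midspreads is common elsewhere). Values lying exactly on a fence are assumed to count as inside it.

```python
def hinges(values):
    """Return (lower hinge, upper hinge) using Tukey's depth rule."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    median_depth = (n + 1) / 2
    hinge_depth = (int(median_depth) + 1) / 2
    low = int(hinge_depth) - 1
    # A depth ending in .5 means: average the two neighbouring values
    high = low if hinge_depth == int(hinge_depth) else low + 1
    lower = (data[low] + data[high]) / 2
    upper = (data[n - 1 - low] + data[n - 1 - high]) / 2
    return lower, upper


def classify(x, lower_hinge, upper_hinge, inner_steps=1.0, outer_steps=1.5):
    """Label x as 'in', 'adjacent', 'out' or 'far out'."""
    step = upper_hinge - lower_hinge  # one step = the midspread
    if lower_hinge <= x <= upper_hinge:
        return "in"
    if lower_hinge - inner_steps * step <= x <= upper_hinge + inner_steps * step:
        return "adjacent"
    if lower_hinge - outer_steps * step <= x <= upper_hinge + outer_steps * step:
        return "out"
    return "far out"
```

For the values 1 to 9 the hinges are 3 and 7, so the midspread is 4, the inner fences lie at -1 and 11, and the outer fences at -3 and 13. Under these settings 12 is an "out" value and 14 a "far out" value.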
- Namestrings
-
(option form 3): character string enclosed in ", with a
maximum length of 60 characters, used to specify file names, strings
to search for and the like.
Strings are case-sensitive. If you omit the closing " from a
string, the remainder of the command line will be included in the
string (i.e. no other option would be analyzed).
- Power transformations
-
Power transformations are essential to re-expressions. In various
commands they may be specified either with reference to the ladder
of powers (Tukey) using a vocabulary of the "move one step UP the
ladder" style or direct reference to the power, e.g. POWER=0.5
to take the square root transformation.
See the special section on power-transformations later in this
chapter and the description of commands like BOXPLOT LADDER,
REEXPRESSION, PLOT, TRACES etc.
- Print file (PF)
-
The PF is a special file, where results may
be kept for printing (or some other processing). When entering
EDA no results are kept, i.e. all results appear on the screen
and are then lost. In order to keep track of results, the user has
to open a print file (either in ALL or REQUEST mode: -> PRINT
command). Note that a print file may also be opened, but
currently be inactive, i.e. nothing is saved until the print file
is reactivated.
- Profile (file)
-
Every time you are executing EDA you are creating a profile,
i.e. a file where EDA keeps information on your environment.
The information in your current profile may be copied from a number
of other profiles; normally there is an EDA system profile, i.e. a profile
for all users. Often there is a group profile, i.e. a group of persons
sharing some information (e.g. WA archives and the like).
Furthermore you may have your own permanent profile.
[Note that profiles need not exist at any level; then of course the
profile of your session will be rather simple].
The profile contains information like the location of your WA directories,
the default settings of the main options (you may tailor them to your
needs), the location of your map directory etc. etc.
For a standard use of EDA profiles are not important to a user, but are
quite useful to group or system managers. Refer to the documentation
for details.
- Prompting
-
(1) on most systems (see implementations) EDA
prompts with a specific symbol for a new command. These symbols
are different in each EDA mode (normal mode, EDIT, TED etc).
Typically in normal EDA mode it prompts with the > symbol,
but this might be different on your machine.
(2) In many instances EDA asks for additional information with
an explanatory text or symbol. For example EDA might ask for the
number of variables on a file, for labels or descriptors for
a new variable or for the lines of a macro. In these cases answer
with the information or cancel or stop (this will be clear from
the context). If you do not know what to reply, type a question
mark as first character on a line; then EDA should give you some
more explanation [Sorry, this will not yet work in all instances].
In many instances default values are admitted; then EDA will tell
you to which value(s) it defaults. This is done either with an
explanatory text or the value is shown followed by the * symbol.
The text "Does the file contain rawdata [Y*,N]" means
that if you respond by a simple carriage return or Y, the file
contains raw data, otherwise it doesn't. [Y,N] would mean that
there is no default value.
- Protected Variables
-
Variables are protected, i.e.
alteration is refused, only when data is input from the
keyboard. The REVERT command reverts the protection, i.e.
an unprotected variable loaded can be protected using
REVERT; a protected variable can be unprotected the same
way. (See REVERT)
If protected variables are present in a WA the WA protection is
automatically turned on. You may unprotect the WA and all variables
in that WA by using SET WAPROTECT OFF.
You should clearly distinguish between protected variables (an attribute
of a specific variable) and a protected WA (a global attribute).
You may delete an unprotected variable from a protected WA.
- protected WA
-
A WA is protected against accidental overwriting either if it contains
protected variables or if the WA protection switch is set. See the
SET WAPROTECTION and SET SECURITY switches.
- RAWIN (file)
-
Designates the file used for raw data input. RAWIN is also the default
generic file name if none is specified on raw data input operations.
In ordinary use it is mostly used with the *READ RAW command, however
macro commands exist to read and parse input lines.
See the SET RAWIN command for additional
information.
- RAWOUT (file)
-
Designates the file used for raw data, text and other output operations.
RAWOUT is also the default generic name. Besides the *WRITE command a number
of commands have options to add output to the RAWOUT file. Note that
this is not the print file, but a file intended to be read by other
applications, including EDA macros. See the SET RAWOUT command for additional
information.
- Reference (value)
-
Same as CENTER, a variable attribute. See the CENTER entry in this
glossary.
- Replacement value
-
(1) A numeric value standing for an
alphanumeric string (4 characters) used on the EXTRACT
command. EDA maintains a memory stack which keeps these
replacement values or (2) a numeric value (default -1)
used where a variable has to take a value and none is
given or is an arithmetically undefined variable. This is
the case of a division by zero, the EXTRACT command, as
well as the LABEL command (options used to extract numerical
information from labels or descriptors).
- Result Variables
-
ResVars are scalar results from commands which may be inspected and used
(mainly in macros). [Earlier they were called ZVARS] There are 10
ResVars you may use by writing $0 .. $9. $0 to $4 are integer values, $5
to $9 are real values.
These ResVars are specific to each command and are documented on-line.
Use HELP RESVARS to show the ResVars of the current command; HELP
RESVARS <cmd> to see the resvars of command <cmd>.
ResVars are initially set to 0, but are not reset between commands.
A specific command may set some ResVars or none at all; the others are
left untouched. Therefore it is essential to use (e.g. store them etc)
ResVars immediately
after the command defining them, unless you are really sure that the
intervening commands do not set ResVars.
Z$ is a string result variable. See Z$.
- Row names:
-
The rows (variables) of the data matrix may have a name.
Default name is 'variable', other names may be set by the
SET ROWNAME command.
- Selection (commands)
-
let you analyze subsets of observations satisfying some specific
conditions. Selection commands do not alter the current WA; only
selected cases are included in the analysis, until the selection is
turned off, either explicitly because a new selection has been
specified or implicitly because the selection no longer makes any
sense.
For more details
refer to the section on selection commands and groups.
- Simple expressions
-
may be specified where named values are
specified (option type II) or on data input (keyboard input
only).
A simple
expression (do not confuse it with "expression") may
contain (1) numbers (2) letter variables (3) intrinsic
functions (4) case references (5) case substitution and
(6) at most one of the following operators: +,-, *,/ or %.
In some instances items (4) and (5) may not be allowed.
Examples: A+100 or 200/$NVAR.
Simple number, letter values
etc. without operator may be seen as a special case (only
one argument) of the simple expression.
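The "at most one operator" rule can be sketched in Python as follows. This is an illustration under stated assumptions, not EDA's evaluator: the function name eval_simple is hypothetical, only numbers and letter variables are handled (system constants, case references and case substitution are left out), and the % operator is omitted because this glossary does not define its meaning.

```python
import operator

# Operators supported by this sketch (% is deliberately left out)
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def eval_simple(expr: str, letters=None) -> float:
    """Evaluate 'operand' or 'operand op operand' with a single operator."""
    letters = letters or {}

    def operand(token: str) -> float:
        token = token.strip()
        if len(token) == 1 and token.isalpha():
            return float(letters[token.upper()])  # letter variable A..Z
        return float(token)

    text = expr.strip()
    # A sign in first position belongs to the number, not to an operation
    positions = [i for i, ch in enumerate(text) if ch in OPS and i > 0]
    if not positions:
        return operand(text)
    if len(positions) > 1:
        raise ValueError("a simple expression allows at most one operator")
    i = positions[0]
    return OPS[text[i]](operand(text[:i]), operand(text[i + 1:]))
```

With A defined as 10, eval_simple("A+100", {"A": 10}) gives 110.0, while eval_simple("1+2+3") is rejected.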
- Simple logical expression
-
Some commands use a very simple form of logical expressions, i.e. a
special form of a command line option.
It takes the form IF<o><val>, where IF identifies these expressions, <o> is
a logical operator (<,>,= or ~, for less than, greater than, equal or
not equal) and <val> is the value for the comparison.
In fact, IF>20 is just a special case of the named value option form, where
other symbols (<,>,~) are permitted. This is clear when you use the
equality operator: IF=24.4!
Important: no spaces are allowed in a simple logical expression.
Note that this form can be found only with a few commands.
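How such an option might be taken apart can be sketched in Python. This is an illustration, not EDA code: the function name parse_if_option is hypothetical, and equality is tested exactly here, whereas EDA may well apply its fuzz value.

```python
import operator

# The four operators named in this entry
_TESTS = {"<": operator.lt, ">": operator.gt,
          "=": operator.eq, "~": operator.ne}


def parse_if_option(option: str):
    """Turn an option such as 'IF>20' into a function testing one value."""
    if " " in option:
        raise ValueError("no spaces are allowed in a simple logical expression")
    if not option.startswith("IF") or option[2:3] not in _TESTS:
        raise ValueError(f"not a simple logical expression: {option!r}")
    test = _TESTS[option[2]]
    threshold = float(option[3:])
    return lambda x: test(x, threshold)


keep = parse_if_option("IF>20")
print(keep(25), keep(20))  # True False
```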
- startup file
-
See command files.
- store, storing
-
Refers to the operation of storing data or other information from the
WA into one of the matrices C1, C2, MATRIX or the GVAR, casids and the
like. The opposite is LOAD.
- String variables (scalar)
-
You may use a small number of string
variables (the number is implementation dependent): their names
are A$, B$, C$ etc., i.e. single letters, followed by the $ sign.
String variables may be used in connection with all command
input by invoking the substitution of the string variables:
$A$ is replaced by the current value of string variable A$
(the first $ means "substitute").
Initially the string variables
are defined as null strings, i.e. of no length
(note that on some systems string variables contain
information on the environment);
therefore if no
string is assigned to a string variable, the $A$ reference will
simply be removed from the command line. String variables are
handled with the SET command (see there for more details).
There is also a string result variable called Z$.
When using the string commands in macros special care has to
be taken to cause substitution at the right time (see macros
for more details). It is also preferable not to use string
variable A$ in macros (some commands use it in a special way).
Instead of a letter you may also use $<$ causing the string
to be read from the RAWIN file. RAWIN has to be connected
with the SET RAWIN command; otherwise an error occurs.
(This feature is explained with the RAWIN command --> section
on macros, as it is most useful for macro programmers.)
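The substitution of string variables into a command line can be sketched in Python. This is an illustration only, not EDA's implementation: the function name substitute_strings is hypothetical, the $<$ form is not handled, and the sample input is placeholder text rather than a real EDA command.

```python
import re

# A reference is a $ followed by a single letter and another $, e.g. $A$
_REFERENCE = re.compile(r"\$([A-Z])\$")


def substitute_strings(line: str, strings: dict) -> str:
    """Replace each $X$ by the value of X$; unassigned variables vanish."""
    return _REFERENCE.sub(lambda m: strings.get(m.group(1) + "$", ""), line)


variables = {"B$": "north"}
print(substitute_strings("region $B$ only", variables))  # region north only
print(substitute_strings("[$C$]", variables))            # []
```

References that do not have this exact shape, such as $NVAR or $5, are left untouched by the sketch.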
- Tables
-
are a specific variable type. Table variables can only be used with
commands meant for table analysis and management. Tables are created
from variables using the MAKE TABLE command or from commands like the
XTAB or BREAK command.
- Ties
-
you may define groups of variables (bundles, predefined variable lists,
variable groups). The tie defining group membership is an attribute of
a variable like the label, descriptor or reference value, except that a
variable tie need not be defined (no default). You refer to a variable
group using the #list-number convention on the variable list. The ties
are stored with the variable and saved like any other information
pertaining to the WA. The DESCRIBE command shows the group membership
(=tie) of each variable (blank if no tie is defined). Whenever a WA is
transposed a tie becomes a GVAR membership, and GVAR memberships are used
as ties. Ties may also be defined by a cluster analysis on groups.
- System (operating)
-
EDA runs on various operating systems; some features of EDA might
depend on capabilities of the operating system, i.e. features
marked as "system dependent" might or might not be available in
the actual EDA software you are running; for instance only the
PC version of EDA is able to read and write spreadsheet files.
- System constants
-
or constants may be used instead of specific values in various
places (options, expressions, input values). A constant is invoked
by writing a $ sign followed by the constant's name and, in some
cases, additional information. See also ResVars, where
a $ precedes a number.
There are three types of constants. Only the first three characters
of a name are significant; any further characters up to
the next separator are ignored.
The first type contains
implementation constants or information on the sizes of various
matrices. They are used as is with no additional specification.
Name     Explanation
-------------------------------------------------
$NVAR    max. number of variables a WA may contain
$MCAS    max. number of cases a WA may contain
$MXF     max. number of dimensions C1 or C2 may contain
$NVR     number of variables in the WA
$MDIM    size of MATRIX
$C1V     number of variables (coordinates) in C1
$C1D     number of dimensions in C1
$C2V     number of variables (cases, coordinates) in C2
$C2D     number of dimensions in C2
$NGR     number of groups in GVAR
$RVAL    replacement value
$MISS    replacement value
$VLN     number of variables in the current vlist
$UDL     upper range limit (data range)
$LDL     lower range limit
$NSL     number of cases selected (selection)
$NTT     total number of cases (before selection)
$GET     get a value from the user
A second type is used to inquire about specific variables, or a
specific location in the current variable list, therefore a variable
reference (name or number) is required.
These constants take the form:
$name.var
Where <name> is a constant name and <var>
is either a variable name or the position (number) in the WA.
With $VLS a variable name does not make sense, as it refers to
the current variable list, i.e. <var> points to var-th element of
that list.
Name       Explanation
$NOC.var   number of cases
$MIN.var   minimum
$MAX.var   maximum
$CEN.var   center
$TIE.var   variable tie
$PRT.var   1=protected, 0=unprotected
$TYP.var   variable type
$VLS.pos   variable at <pos> in the current vlist
Note that if you do not use variable names, any number within 1..$NVAR
will be accepted, even for empty variables.
The third type needs two arguments. Currently only $DAT exists:
$DAT.var.case
$DAT refers to a specific data value in the matrix. <var> refers to the
variable and has the same syntax as above. <case> refers to a case
(position, casid or letter variable). Note that no checks are performed
if positional numbers are used, provided they are in the range
1..$MCAS for the case reference and 1..$NVAR for the variable reference.
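The three reference forms can be illustrated with a small Python sketch.
This is not EDA code: the function name parse_constant and the returned
tuple are hypothetical, and folding names to upper case is an assumption.
The sketch only shows how a reference splits at the "." separators and how
a name is reduced to its first three characters.

```python
def parse_constant(ref):
    """Split a reference such as '$DAT.3.ZH' into (name, arguments).

    Hypothetical helper, not part of EDA. Only the first three
    characters of the name are kept, as described in the text.
    """
    if not ref.startswith("$"):
        raise ValueError("a constant reference starts with '$'")
    parts = ref[1:].split(".")
    name = parts[0][:3].upper()   # upper-casing is an assumption
    args = parts[1:]
    if len(args) > 2:
        raise ValueError("at most two arguments (var and case) are expected")
    return name, args


print(parse_constant("$NVAR"))       # type 1: no argument
print(parse_constant("$MIN.ICult"))  # type 2: one variable reference
print(parse_constant("$DAT.3.ZH"))   # type 3: variable and case reference
```

Under this reading $NVAR and a longer spelling such as $NVARIABLES would
both reduce to the same three-character name.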
- Terminal input
-
--> Freefield input
- Toolbox:
-
A separate module within EDA containing
several tools used in connection with EDA (defining special
files, computing matrices directly into MATRIX) and a series
of general purpose file handling commands (sort, merge-checking,
etc.).
- User
-
You, the person using the EDA Software. Here users are explorers,
eager to explore a data set and not afraid of using multiple commands.
If you are looking for automatic handling of some data analysis sequences
you are attempting to use the wrong software package.
- User, advanced
-
A user who is not afraid of looking behind the scenes in order to
get things done the way he or she wants things to be done, a person
who is not afraid to read some technical stuff to have background
information or is prepared to learn how to write macros etc.
- User profile file:
-
The current
user profile defines the user's working environment.
When you call the EDA program, a user profile will be created
before you are able to enter the first command. This profile
may be minimal or include much contextual information built from
a system-wide profile, a group profile and a permanent
user-specific profile file. Ask your group administrator or the
person taking care of the EDA installation on your computer
for further information on profiles or help whenever you suspect
trouble linked to profiles (for instance files you no longer can
GET, commands working differently from what you were used to etc).
Note to advanced users and administrators: There is a special document
called the "Administrators Guide" which explains profiles in full
detail.
- Variable
-
a data array together with all the information
attached to it: label (8 characters), descriptor (48
characters), status information (type: 1=numeric
variable, 2=GVAR, 3=alpha; the number of cases and a
tie) and three associated values: minimum, maximum and a reference
value (default= median). The user refers to a variable
using an integer number referring to the relative position
of the variable in the WA. Instead of numbers variable
labels or ties can be specified using the #
substitution. When storing a WA in
an EDA file variables are packed; i.e. variables are
stored consecutively with no empty variables in between.
The same can be done by the PACK command. Variables can
be protected or unprotected and have different usages
(type of variable).
You should very clearly distinguish letter variables, i.e. scalar
variables and "normal" variables, i.e. vectors.
- Variable bundles
-
Variables tied together, using a TIE. (See there for details)
- Variable descriptor:
-
a text of up to 48 characters describing the variable. If the user does
not supply a new descriptor, transformation commands may modify the
descriptor automatically (by encoding the transforming command), or a
mark, called a modification stamp, is added at the end of
the descriptor to signal that the variable has been modified. In some
instances, when the user does not supply a descriptor, a default label and
descriptor are created, e.g. by the NEWVAR command. These default
labels/descriptors are considered "incomplete documentation", i.e. their
meaning may easily be forgotten if the user comes back to that
variable only some time later. The SCAN (edit) command is used to scan
the WA for this type of variable and to correct the descriptor.
- Variable groups:
-
Variables may form groups using ties. See TIES for more details.
- Variable labels:
-
a label of up to 8 characters. If you want to use the labels as variable
references everywhere, including expressions, you should not use blanks
or any special symbols within a label. If you are using single-letter
variable names you need, in some instances, to distinguish them from
letter variables: #A refers to a letter variable, #A'
to a variable stored in the WA.
- Variable list
-
The second field on a command line, where the user
specifies the variables to be analyzed. Variable lists may contain
integer reference numbers, variable names, references to a
predefined list, names with wildcards and letter variables. All
elements except the integer numbers are preceded by a #
sign (the # symbol is considered a numerical symbol in EDA).
In some instances commands need several variable lists.
In such a case the "/" separates the different lists.
There are a number of commands you can use to build
a variable list using more or less complex criteria: based on
information in the labels or descriptors, statistical criteria in
the variables and the like.
These commands can be used to create a variable list; the subsequent
command then takes it up (you specify no variable list) and analyses
those variables.
>VARS 1-10 SORT MEDIAN
>LIST
In this example the variables are sorted on their median, i.e.
VARS takes variables 1 through 10 and builds a new variable list
by computing the medians and sorting the variables accordingly; they will
then appear in that order on the LIST command: the first variable will be
the variable with the smallest median, the last the one with the largest.
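What VARS ... SORT MEDIAN does to the variable list can be mimicked in a
few lines of Python. This is only a sketch of the idea, not EDA's
implementation: the dictionary wa standing in for the work area and the
function name sort_vlist_by_median are hypothetical.

```python
from statistics import median


def sort_vlist_by_median(wa, vlist):
    """Return the variable numbers in vlist ordered by ascending median."""
    return sorted(vlist, key=lambda v: median(wa[v]))


# Hypothetical work area: variable number -> list of values
wa = {1: [5, 9, 7], 2: [1, 2, 3], 3: [4, 4, 8]}

# The medians are 7, 2 and 4, so the new list starts with variable 2
print(sort_vlist_by_median(wa, [1, 2, 3]))  # [2, 3, 1]
```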
- Variable type:
-
(variable attribute)
EDA uses different types of variables. Most variables are
normal numerical variables (type 1, nominal, ordinal or binary). There
are also table variables.
More types will be used in the future.
- Wildcard
-
in several instances (most frequently on variable lists)
you may specify a string (e.g. a label reference) containing
wildcard characters. A wildcard character tells EDA that in the
position(s) marked by such a character any character may be present
for a match. The EDA wildcard character is the '*' symbol, which
has two different meanings depending upon the position in the
string: (1) at the end of the string it means "match all";
(2) within a string it means "match any single character".
VAR*
matches all variables starting with VAR, the characters beyond
do not matter.
***X
looks for any four-character long label, having an 'X' in position
4 and any 3 characters preceding it.
Of course the two meanings may be combined:
***X*
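The two meanings of '*' can be made explicit by translating a pattern into
a regular expression. The Python sketch below is an illustration only; it
assumes that just a '*' in the very last position means "match all" and
that every other '*' stands for exactly one character. The function names
are hypothetical and are not EDA commands.

```python
import re


def eda_pattern_to_regex(pattern):
    """Translate a label pattern into a compiled regular expression.

    Assumption: only a '*' in the last position means "match all";
    every other '*' stands for exactly one character.
    """
    match_all = pattern.endswith("*")
    body = pattern[:-1] if match_all else pattern
    regex = "".join("." if ch == "*" else re.escape(ch) for ch in body)
    if match_all:
        regex += ".*"
    return re.compile(regex)


def matches(pattern, label):
    """True if the whole label matches the pattern."""
    return eda_pattern_to_regex(pattern).fullmatch(label) is not None


print(matches("VAR*", "VAR12"))    # True: anything may follow VAR
print(matches("***X", "ABCX"))     # True: four characters, X in position 4
print(matches("***X", "ABCXY"))    # False: the label is too long
print(matches("***X*", "ABCXYZ"))  # True: X in position 4, anything after
```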
- Toolbox:
-
an EDA module where a collection of tools can be found.
These tools are programs needed in connection with data analysis
in a large sense (data manipulation, dictionary creation,
report writing ...). These tools are not necessarily EDA specific. They
are self-contained. This collection contains tools I need and use.
Some users might find them useful also, others won't.
- Work area:
-
(WA) an NVAR variables by MCAS cases
(implementation constant: see chapter on implementations)
data matrix, as well as all the information (labels,
documents, status) attached to it. Analyses within EDA are
always performed on variables residing in this WA. In a
larger sense the WA includes also the MATRIX and the
CONFIGURATION stored. For data saving/retrieving purposes
the WA is treated as a whole ("block"). I/O operations are
performed on whole blocks, with the exception of the *COPY
command. The WA need not be a rectangular data matrix.
The type of command used determines whether the whole WA
need be rectangular or not. The WA has a label and a
descriptor. Three values of importance to the user are
attached to the WA: The number of variables, the type of
the WA (1 if the WA is rectangular) and the number of
cases, if the WA is rectangular. Multivariate commands,
like FACTOR or CLUSTER operate on the whole WA or on the
variables in the vlist depending on the mode set by ASSUME
ALLVARS.
- Work area name
-
(waname) The name of a work area is a string of up to 8
characters. Upper and lower case letters are not
distinguished. The WA name is mainly used to refer to a WA
stored in a file (GET command) or to store it into a WA
archive (PUT).
For other types of files the WA name is just provided for
documentation purposes and has no special meaning.
The work area name may often appear together with the WA
archive (directory) name; then the reference might take
the form DIREC:WANAME (=> files).
- Work area archive
-
A WA archive contains collections of WA; in order to access a
particular WA you need only to know the name of the WA, i.e.
the 8 character name known to EDA; system specific file names
are not required. Furthermore the DIR command allows you to
do sophisticated searching on information stored in order to
find out what WA contains the information you need.
Note that WA archives are optional, i.e. they may not
be installed or active on your system. Use STAT WAARCHIVE
to find out.
- Work area directory
-
-> Work area archive
- Work area library
-
-> Work area archive
- XNAME
-
is used to set the designation for an X-variable in regression
type situations. By default the x-variables are called
"independent", but you might prefer to call them "predictor",
"explanatory" or whatever term is suitable.
The XNAME can be changed with the SET XNAME command or alternatively
it can be changed in the profile.
- YNAME
-
See XNAME for details. The default name is "dependent".
- ZVAR
-
ZVARS --> see ResVars (Result variables)
They are called ZVARS because in versions
earlier than 2.0 they were referred to by putting a Z in front of them.
- Z$
-
is the string result variable and contains information like
file names. As with the other ResVars,
any command may change its value (in fact only a few do),
therefore before using it you have to make sure that it
contains the information you need; in case of doubt make
a copy of it into a permanent string variable using the
SET command.
The Art of Coding
Introduction
In many situations it is preferable to replace numbers by some
well chosen symbol, reflecting the specific information we are
interested in.
Coded displays are used very often in EDA and quite a number
of commands have options to produce different forms of
coded displays.
The purpose of this chapter is to introduce the principles of coding.
Note that depending upon particular needs of a command, these
principles may vary; for instance it is not interesting to use a
blank space as a code on a coded histogram, whereas blank space is a
highly efficient code in a coded list.
Forms of coding
Coding is used to stress important aspects of information we are looking
for, by replacing the numerical form (which is often not easily
readable) by some well chosen symbol.
The following forms of coding are used in EDA:
Distributional coding
The symbols shown reflect the position of each observation within the
distribution of the variable, i.e. symbols are used to show whether a
case is a far out, out, adjacent or in value. Normally different symbols
are used for values below and above the median, e.g. we distinguish low
and high out values.
Bin coding
The cases are grouped into a number of bins according
to some criterion, and for each bin a different symbol is used. Several
criteria are possible: Each bin contains an equal number of observations
(Fractile coding); each bin corresponds to an interval of equal width
(interval coding); or the bins are defined by the user (who indicates the
bin boundaries).
Reference coding
Each symbol reflects the position of a case with respect to some
criterion. A criterion often used is the median; then values below
the median are marked differently from cases above the median.
Instead of only marking low/high positions it is also possible to
indicate the distance of a case from a reference value, by either
using different symbols or by using more than one symbol, e.g.
a single plus sign for a case close to the median, and 2, 3 etc plus
signs for cases farther away.
Marking
All cases corresponding to some criterion are marked with a special
symbol; all other cases are not marked.
"as-is coding"
The (integer) numerical value of the variable is used directly "as is",
i.e. no intervals are computed. This is useful for categorical
variables, if you want to show a different code for each value of
the variable.
In the following section we shall examine the various forms of coding
in some detail. Note that the examples will mainly use the LIST command
as an illustration; several other commands work essentially the same
way.
Distributional coding
The symbols shown reflect the position of each observation within the
distribution of the variable, i.e. symbols are used to show whether a
case is a far out, out, adjacent or in value. Normally different symbols
are used for values below and above the median, e.g. we distinguish low
and high out values.
>LIST 1-4 DISTCODE
26 cases
Distributional coding (full); Symbols:(lo)"&=-*+#@"(hi)
ZBLUSONGZFSBBSAASGATTVVNGJ
HEURZWWLGROSLHRIGRGGIDSEEU
-----------------------------------
ICult |+***----***+**--****++*++*
XCult |****-+***+*****+*+-*+-&--+
Form |+****---**++**---***+***++
Sucre |-****+*-*#-=-******+-++**+
The following symbols are used: '@' for a high far out value,
'#' for a high out value, '+' for high adjacent, a star for an in value,
and '-','=','&' for low adjacent, out and far out values.
Note that these symbols might be different on your screen, as they can
be changed by the EDA administrator (because your system has nicer
looking symbols). You may also change the symbols for yourself, either
by putting them into your profile or by setting the symbols differently.
Symbols may be changed using the SET GRAPH DISTCODE command (STAT GRAPH
DISTCODE shows them) for the whole interactive session or locally using
a "codes" string on the command line producing that particular output.
E.g. the command
LIST 1-4 DISTCODE "AB FG"
will use A and B for high far-out and out values, F and G for low
far-out and out values, and blank for others.
The DISTCODE option has an additional option SIMPLE, which does not
distinguish between 'high' and 'low', i.e. the low and high
far-out, out and adjacent values will have the same symbol. Note that
it is also possible to use SET DISTCODE SIMPLE to produce the same
effect for all commands.
The following example uses the same coding scheme:
>SHOW 1-4 CODED
26 cases
Showing :ICult ( 1) Culture initiative
legend for coded values: (HI far)@ # + - = & (LO far)
canton ICult XCult Form Sucre
16 AI 7.2 + -
6 OW 7.6 + - +
8 GL 9.3 - -
15 AR 10.1 -
7 NW 10.2 -
5 SZ 10.2 -
20 TG 11.0 +
17 SG 11.1 -
3 LU 13.6
18 GR 13.8 +
14 SH 14.1
2 BE 14.3
10 FR 15.6 + #
23 VS 15.7 & +
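The classification behind distributional coding can be sketched in Python.
The sketch is not EDA's implementation and rests on assumptions: hinges are
computed as the medians of the lower and upper halves of the sorted data,
the inner fences lie 1.5 midspreads and the outer fences 3 midspreads
beyond the hinges (Tukey's usual convention), and the default symbol string
is the one shown in the LIST output above. The function names are
hypothetical.

```python
from statistics import median

SYMBOLS = "&=-*+#@"  # low far out, low out, low adjacent, in, high adj., high out, high far out


def hinges(values):
    """Lower and upper hinge: medians of the lower and upper half."""
    s = sorted(values)
    n = len(s)
    return median(s[: (n + 1) // 2]), median(s[n // 2:])


def distcode(values, symbols=SYMBOLS):
    """Return one code per value, in the original order of the values."""
    lo_h, hi_h = hinges(values)
    spread = hi_h - lo_h                       # midspread
    inner_lo, inner_hi = lo_h - 1.5 * spread, hi_h + 1.5 * spread
    outer_lo, outer_hi = lo_h - 3.0 * spread, hi_h + 3.0 * spread
    codes = []
    for x in values:
        if x < outer_lo:
            k = 0          # low far out
        elif x < inner_lo:
            k = 1          # low out
        elif x < lo_h:
            k = 2          # low adjacent
        elif x <= hi_h:
            k = 3          # in value (between the hinges)
        elif x <= inner_hi:
            k = 4          # high adjacent
        elif x <= outer_hi:
            k = 5          # high out
        else:
            k = 6          # high far out
        codes.append(symbols[k])
    return "".join(codes)


# Hinges are 11 and 16, so the fences are 3.5 / 23.5 (inner) and -4 / 31 (outer)
print(distcode([1, 10, 11, 12, 13, 14, 15, 16, 30, 50]))  # =-******#@
```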
Bin coding
The cases are grouped into a number of bins according to some criterion,
for each bin a different symbol is used.
Several criteria are possible.
In the following example the values are divided into four groups
containing approximately the same number of cases.
(Four groups is the default value).
This means that the distribution is broken up into four pieces
(fourths, quartiles), i.e. the bin boundaries are the hinges and the median.
>LIST 1-4 FRACTILES
26 cases
Bins of equal size (APPROX.); Symbols:.:*#
ZBLUSONGZFSBBSAASGATTVVNGJ
HEURZWWLGROSLHRIGRGGIDSEEU
-----------------------------------
ICult |#*::....:**#**..::::##*###
XCult |:*:#.#*::#***::#:#.##....#
Form |#::*:...::##**...*::##**##
Sucre |.**:*#*.:#...::#:*:#.##:*#
The symbols used here are the default symbols, '.' for the lowest
fourth and '#' highest fourth.
On many systems these symbols are replaced by nicer looking graphical
symbols.
If you want more groups (or less) you will use the "symbols" string
to indicate the number of bins, by specifying a series of codes, each
character standing for a bin to be defined.
Then if you specify three symbols, the variable will be divided into
thirds, if you specify 10, tenths and so on.
We call that first form 'fractile coding'. Other forms are available.
This is usually the default coding used, unless, e.g. in the case of
the LIST command, interval coding is used by default (for "historical
reasons").
You may request to define the bins by cutting the variable into
intervals of equal width (interval length). Again by default it will
be divided into four intervals; if you need more or less use
the "symbols" string to indicate the number.
This is called 'interval coding'.
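Interval coding can be sketched as follows in Python. This is an
illustration under assumptions, not EDA's code: the intervals are taken to
span the range from the minimum to the maximum of the variable, a value
lying exactly on a boundary is put into the upper interval, and the
function name is hypothetical. The number of symbols determines the number
of intervals.

```python
def interval_code(values, symbols=".:*#"):
    """Code each value by the equal-width interval it falls into."""
    lo, hi = min(values), max(values)
    nbins = len(symbols)
    width = (hi - lo) / nbins
    codes = []
    for x in values:
        if width == 0:
            k = 0                                      # constant variable
        else:
            k = min(int((x - lo) / width), nbins - 1)  # the maximum goes into the top bin
        codes.append(symbols[k])
    return "".join(codes)


# Range 0..8 cut into four intervals of width 2
print(interval_code([0, 1, 2, 3, 4, 5, 6, 7, 8]))  # ..::**###
```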
The READ option can be used to enter your own bin boundaries. The number
of bins depends upon the number of codes in the "symbols" string (4 by
default). You will then be asked to enter the bin boundaries (one fewer
than the number of bins requested).
Finally there is a special option for fractile coding, EXACT.
The normal form proceeds as follows (example: four bins, i.e. fourths):
EDA determines the hinges and the median, then all cases below the lower
hinge are assigned to the first bin, the cases below the median but
above the lower hinge to the second, and so on.
This is fine as long as (e.g.) the lower hinge has a value occurring
only once in the variable. If there are several or many cases with the
same value this procedure will not define bins with equal numbers of
cases, but might - in some cases - produce quite different counts for
the bins. In many cases this is however what you want, because the
hinges and the median have some meaning to you. In other situations
however (especially when experimenting with theoretical distributions
and the like) you really want to have identical counts in the bins.
Here the EXACT option will help to do exactly this.
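The difference between the normal form and EXACT shows up as soon as
values are tied. The Python sketch below contrasts the two for four bins.
It is not EDA's implementation: the hinge rule (medians of the lower and
upper halves), the treatment of a value equal to a boundary (it goes to the
bin below) and the function names are assumptions made for the
illustration.

```python
from bisect import bisect_left
from statistics import median


def fourths_by_value(values, symbols=".:*#"):
    """Normal form: compare each value with lower hinge, median, upper hinge."""
    s = sorted(values)
    n = len(s)
    cuts = [median(s[: (n + 1) // 2]), median(s), median(s[n // 2:])]
    # Assumption: a value equal to a cut goes to the bin below that cut
    return "".join(symbols[bisect_left(cuts, x)] for x in values)


def fourths_exact(values, symbols=".:*#"):
    """EXACT form: assign bins by rank, so the counts are (nearly) equal."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # ties keep their input order
    codes = [""] * n
    for rank, i in enumerate(order):
        codes[i] = symbols[rank * len(symbols) // n]
    return "".join(codes)


data = [1, 2, 2, 2, 2, 2, 3, 4]    # five tied values
print(fourths_by_value(data))      # ......##  (bin counts 6, 0, 0, 2)
print(fourths_exact(data))         # ..::**##  (bin counts 2, 2, 2, 2)
```

With the rank-based form, tied values may end up in different bins; that
is the price for identical counts.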
Reference coding
Each symbol reflects the position of a case with respect to some
criterion. A criterion often used is the median; then values below
the median are marked differently from cases above the median.
>LIST 1-4 REFERENCE FUZZ=2.5
26 cases
Reference coding below/above center; Symbols:- +
ZBLUSONGZFSBBSAASGATTVVNGJ
HEURZWWLGROSLHRIGRGGIDSEEU
-----------------------------------
ICult |+ ---- ++ --- -++ +++
XCult | -+-++- + - + +-++----+
Form |+ - ----- +++ --- --++++++
Sucre |-+ ++ --+--- +- -+-++ +
In this example the symbols express whether a case
is below or above the reference (center) value, i.e. a reference value
stored with each variable. By default this value is the median, but it
can be changed to contain other meaningful information (e.g. global
percentages and the like).
Values equal or close to the reference value appear in this example as
blanks. The example uses the FUZZ=2.5 option to tell EDA that equality
is not strict equality but means lying within the range of the reference
value plus or minus 2.5. Note that if
the fuzz option is not used, the EDA system-wide fuzz value is used (it
can be set using the SET FUZZ command).
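The role of the fuzz value can be sketched in Python. The sketch is an
illustration, not EDA's code: the function name is hypothetical, the three
symbols are read as "below, near, above" following the "- +" string in the
output above, and treating a distance of exactly fuzz as "near" is an
assumption.

```python
def reference_code(values, reference, fuzz=0.0, symbols="- +"):
    """Code each value as below, near or above the reference value."""
    below, near, above = symbols
    codes = []
    for x in values:
        if abs(x - reference) <= fuzz:   # within reference +/- fuzz
            codes.append(near)
        elif x < reference:
            codes.append(below)
        else:
            codes.append(above)
    return "".join(codes)


# Reference 12.5 with fuzz 2.5: values from 10.0 to 15.0 count as "equal"
print(repr(reference_code([7.2, 10.0, 12.5, 14.9, 15.1, 20.0], 12.5, fuzz=2.5)))
```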
Some commands (e.g. the LIST command) have additional options for the
reference value.
Instead of only marking low/high positions it is also possible to
indicate the distance of a case from a reference value, by either using
different symbols or by using more than one symbol, e.g. a single plus
sign for a case close to the median, and 2, 3 etc. plus signs for cases
farther away.
In the following example the reference is the median, and the
distance to the median is expressed using units of 1/2 midspreads, i.e.
each + symbol shows a distance of 1/2 midspread.
>LIST 1-4 CODED
26 cases
variable listing
units of 1/2.0 midspread
case ICult XCult Form Sucre
1 ZH + + -
2 BE +
5 SZ -
6 OW - ++ - +
7 NW -
8 GL - - -
9 ZG
10 FR + +++
11 SO + -
12 BS ++ ++ ---
13 BL --
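The number of symbols shown for a case can be sketched as follows in
Python. This is a guess at the principle, not EDA's implementation: the
sketch assumes that the distance is truncated to whole units of
midspread/2 and that at most three symbols are printed (the listing above
never shows more); the function name and parameters are hypothetical.

```python
def distance_code(x, reference, midspread, fraction=2.0, max_symbols=3):
    """Return '+' or '-' repeated once per unit of midspread/fraction."""
    if midspread <= 0:
        raise ValueError("midspread must be positive")
    unit = midspread / fraction
    steps = min(int(abs(x - reference) / unit), max_symbols)  # truncation is an assumption
    return ("+" if x > reference else "-") * steps


# Reference (median) 10, midspread 4, so one unit is 2
print(repr(distance_code(10.5, 10, 4)))  # ''     less than one unit away
print(repr(distance_code(7, 10, 4)))     # '-'    1.5 units below
print(repr(distance_code(16, 10, 4)))    # '+++'  3 units above
```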
Marking
All cases corresponding to some criterion are marked with a special
symbol; all other cases are not marked.
>list 1-4 mark if>45
26 cases
Mark values greater than <val>; Symbols:@
ZBLUSONGZFSBBSAASGATTVVNGJ
HEURZWWLGROSLHRIGRGGIDSEEU
-----------------------------------
ICult |
XCult | @ @ @ @ @ @@ @
Form |
Sucre | @ @ @ @ @@ @
In the previous example the cases above 45 are marked with the
EDA marking symbol (it might be different in your EDA version, and you can
change it with the GRAPHSYMBOL command).
"As-is coding"
This special form of coding is mainly useful for categorical
variables as it takes up the numerical (integer) values and codes
them directly, i.e. no computation (intervals, reference values)
is performed.
The default codes used are "0123456789", i.e. a "1" represents
a numerical value of 1, i.e. the default form does not really code
values, except for values below 0 (code "-") and values above 9
(code "+").
If you specify DICHOTOMY the only codes used will be 0 (or space) and 1.
Positive values are shown as 1 and values of 0 or less as 0 (you may specify
different codes).
Command syntax
These various forms of coding are available with a number
of commands, namely the LIST, MAP, PLOT, CASID and HISTOGRAM
commands.
Some commands might present slight differences with respect to the
forms shown above and the syntax explained below. Namely default
values might be different and some default codes might be changed.
E.g. It is not always desirable to show blank "symbols" on a plot, as
you will see nothing at that particular location...
The syntax chart below is taken from the HISTOGRAM command:
<code.opt>
BINS ¦ [FRAC] | EXACT | READ ["symbols"]
DISTRIBUTIONAL [SIMPLE] ["symbols"]
REFERENCE=value ["Symbols"] [FUZZ=val]
MARK|=val | IF>val |IF=val| IF<val | IF~val
["symbols"] [FUZZ=val]
ASIS ["symbols"]
DICHOTOMY ["symbols"]
It shows the various forms of coding.
The first line shows the various forms of 'bin coding':
FRACTILE (the default option) requests fractile coding, BINS requests
interval coding, EXACT defines exact fractiles and READ allows you to enter
bin boundaries. Optionally "symbols" is used to define alternative symbols;
the number of symbols you specify determines the number of bins
to create.
DISTRIBUTIONAL requests distributional coding. SIMPLE does not
distinguish between lower and upper far-out, out and adjacent values.
Finally symbols is used to enter alternative symbols.
REFERENCE requests reference coding. You may specify in addition
different symbols and a fuzz value.
MARK requests marking. There are four conditions: equality (IF=val, or
just MARK=val), greater or less than a value (IF>val, IF<val) and
inequality (IF~val). For the equality/inequality options you may
specify a FUZZ value.
The "symbols" string is used to specify other symbols than
the default symbols.
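The four marking conditions can be sketched in Python. This is an
illustration only: the function name and the way a condition is passed
(as one of the characters >, <, = and ~) are hypothetical, and the sketch
assumes that FUZZ widens equality to "within val plus or minus fuzz" and
narrows inequality accordingly.

```python
def mark(values, condition, val, fuzz=0.0, symbol="@"):
    """Return the marking symbol for cases meeting the condition, blank otherwise."""
    tests = {
        ">": lambda x: x > val,
        "<": lambda x: x < val,
        "=": lambda x: abs(x - val) <= fuzz,   # equal within the fuzz range
        "~": lambda x: abs(x - val) > fuzz,    # not equal, outside the fuzz range
    }
    test = tests[condition]
    return "".join(symbol if test(x) else " " for x in values)


print(repr(mark([40, 45, 50, 46], ">", 45)))      # '  @@'
print(repr(mark([1, 2, 3], "=", 2, fuzz=0.5)))    # ' @ '
print(repr(mark([1, 2, 3], "~", 2)))              # '@ @'
```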
ASIS requests as-is mode. By default the symbols are "0123456789"
(zero replaced by a space with LIST ASIS); values above 9 appear
as "+" and values below 0 as "-". If you specify your own symbols,
e.g. "abcd", "a" will stand for 0, "b" for 1, "c" for 2 and "d" for
3; values larger than 3 will appear as "+", values smaller than 0 as
"-". Please note that you need not specify the codes for values outside
the range ("-" and "+").
DICHOTOMY treats the variable as a binary variable. Values of zero or
less appear as "0" (blank with LIST DICHOTOMY), positive values as "1".
You may specify alternative symbols (note that only the first two symbols
will be taken).
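The ASIS and DICHOTOMY rules can be sketched in Python as follows. The
function names are hypothetical and the sketch is not EDA's code; it simply
applies the rules stated above, with integer values assumed for ASIS.

```python
def asis_code(values, symbols="0123456789"):
    """Use each integer value directly as an index into the symbols."""
    codes = []
    for v in values:
        v = int(v)
        if v < 0:
            codes.append("-")          # below the coded range
        elif v >= len(symbols):
            codes.append("+")          # above the coded range
        else:
            codes.append(symbols[v])
    return "".join(codes)


def dichotomy_code(values, symbols="01"):
    """Positive values get the second symbol, all others the first."""
    zero, one = symbols[0], symbols[1]   # only the first two symbols are used
    return "".join(one if v > 0 else zero for v in values)


print(asis_code([0, 1, 9, 10, -1]))            # 019+-
print(asis_code([0, 1, 2, 3, 4, -2], "abcd"))  # abcd+-
print(dichotomy_code([-3, 0, 2, 7]))           # 0011
```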
Power transformations/reexpressions
Power transformations are important for reexpressions. Many commands
within EDA let you perform reexpressions or assist you in finding
an appropriate re-expression of your data.
Power transformations are frequently used to transform data; John
Tukey has developed a very useful framework for re-expressions,
called the ladder of powers, sometimes called Tukey's simple
family of power transformations, i.e. all positions on the ladder
of powers can be written as powers of the original variable.
The EDA software offers in many situations two ways of dealing with
power transformations (1) reference to the ladder of powers (moving up
or down) or (2) giving directly the power of the re-expression you
want to obtain.
In the following example we will refer to the REEXPRESS module; please
note that you will encounter similar options and commands in
other contexts (like PLOT INSPECT).
Moving up or down the ladder
When you are looking for an appropriate transformation (e.g. to symmetrize
a distribution) you are usually less interested in the
mathematical formulation of potential transformations
than in seeing what is done to your variable. Here the image of the
ladder of powers comes in handy: starting with the raw data you may
move up the ladder of powers (i.e. taking squares), and when you
see - from the boxplot that will be displayed - that the symmetry problem
gets even worse you might move down the ladder, let's say two steps.
If the boxplot tells you that the current step on the ladder is still
not enough you may move down a further step (without worrying about
what kind of transformation it takes to do that) or maybe correct
by going back one step.
Direct specifications of powers
The ladder of powers is - for practical reasons - limited to the
most used transformations (in the EDA implementation for instance
the highest step up is 3, i.e. cube). If you need to take the
fourth power you might then use (in the case of the REEXPRESS module)
the POWER=4 command or - in the case of the TRACES command - the
POWER=4 option. These options or commands also let you select
a fractional power such as 2.5.
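The two ways of choosing a re-expression, moving along the ladder or giving
the power directly, can be sketched in Python. The sketch is an
illustration under assumptions and not EDA's implementation: only the top
rung (3, the cube) is taken from the text, the other rungs follow the
customary ladder of powers with 0 standing for the logarithm, negative
powers are given a minus sign so that the order of the values is preserved,
and the data are assumed to be positive. The names LADDER, move and
reexpress are hypothetical.

```python
import math

LADDER = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # assumed rungs; 0 stands for the logarithm


def move(power, steps):
    """Move `steps` rungs up (positive) or down (negative), staying on the ladder."""
    i = LADDER.index(power) + steps
    i = max(0, min(i, len(LADDER) - 1))
    return LADDER[i]


def reexpress(values, power):
    """Re-express positive values by the given power (0 means log)."""
    if power == 0:
        return [math.log(x) for x in values]
    if power < 0:
        return [-(x ** power) for x in values]   # minus sign keeps the order of the values
    return [x ** power for x in values]


p = move(1, 1)       # from the raw data one step up: squares
p = move(p, -2)      # two steps down: square roots
print(p)                           # 0.5
print(reexpress([1, 4, 9], p))     # [1.0, 2.0, 3.0]
print(reexpress([1, 4, 9], 2.5))   # a directly specified power off the ladder
```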