To be analysable with statistical tools, data needs to be presented as a rectangular data matrix. Each column of a data matrix contains a variable (an indicator, a measurement, a question in a survey, ...) and each row an observation (countries [example below], individuals in a survey, subjects in an experiment, institutions, groups of persons, ...).
Each cell contains a single value for a particular variable and observation, e.g. the GDP per capita for Albania. If the value is not available, the cell shows in some way that the value is missing (missing value indicator); statistical software generally skips such values automatically in computations.
Here's a schematic representation of a data matrix (example with country data):
Country | Continent | ContNum | GDPperCapita | Variable_i | ... | Variable_k |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1 | value | value | value | value |
Albania | Europe | 3 | value | value | value | value |
... | ... | .... | .... | .... | .... | .... |
Switzerland | Europe | 3 | value | value | value | value |
country_n | ... | value | value | value | value | value |
Each column has a name (the variable name) used to refer to it. All values of a particular variable have to be of the same type (numeric, string, ...). In the example above, the country and continent names are strings, while ContNum and GDP per capita contain numerical information.
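As a rough illustration, the country example could be sketched in Python with the pandas library (an assumption here; the text does not prescribe any particular software). The GDP figures below are invented placeholders, not real data, and `np.nan` stands in for the missing value indicator.

```python
import numpy as np
import pandas as pd

# Rows are observations (countries), columns are variables.
# The GDPperCapita numbers are invented placeholders, not real figures;
# np.nan marks a value that is not available.
countries = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Albania", "Switzerland"],
        "Continent": ["Asia", "Europe", "Europe"],
        "ContNum": [1, 3, 3],
        "GDPperCapita": [1000.0, np.nan, 3000.0],
    }
)

# Each column holds values of a single type (text or numeric).
print(countries.dtypes)

# The missing cell is skipped by default when computing the mean,
# so only the two available placeholder values are averaged.
mean_gdp = countries["GDPperCapita"].mean()
n_missing = int(countries["GDPperCapita"].isna().sum())
print(mean_gdp, n_missing)
```

Columns are addressed by their variable name, e.g. `countries["Continent"]`.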
A second example with survey data:
Id | Language | PolInt | PartyPref | Gender | Age |
---|---|---|---|---|---|
100 | Italian | 1 | 4 | 1 | 28 |
101 | German | 4 | 1 | 1 | 64 |
102 | French | 1 | 11 | 2 | 37 |
103 | German | 3 | -1 | 2 | 41 |
... | ... | .... | .... | .... | .... |
i_n | ... | value | value | value | value |
Survey data is usually coded data, i.e. the textual answers to questions are replaced by codes: high political interest (PolInt) is coded as 1, low interest as 4; a 4 on PartyPref records the answer of someone preferring the Socialist Party. Age simply records the age of the person.
If a person does not answer a question (don't know, refuses to answer, etc.), a specific code is used. In the example, the interviewee with id=103 did not select one of the parties for PartyPref, so a code of -1 has been entered for the missing answer. You will then need to instruct the statistical software to treat -1 as missing for that variable, i.e. to exclude it from statistical computations.
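How a missing-value code is declared depends on the software. A minimal sketch with pandas (an assumed tool, not one named in the text), using the four survey rows shown above:

```python
import numpy as np
import pandas as pd

# The four respondents from the survey table above.
survey = pd.DataFrame(
    {
        "Id": [100, 101, 102, 103],
        "Language": ["Italian", "German", "French", "German"],
        "PolInt": [1, 4, 1, 3],
        "PartyPref": [4, 1, 11, -1],
        "Gender": [1, 1, 2, 2],
        "Age": [28, 64, 37, 41],
    }
)

# Declare -1 as missing for PartyPref only; other variables are untouched.
survey["PartyPref"] = survey["PartyPref"].replace(-1, np.nan)

# count() and value_counts() ignore missing values by default.
valid_answers = int(survey["PartyPref"].count())
party_counts = survey["PartyPref"].value_counts()
print(valid_answers)
print(party_counts)
```

Recoding only the affected column matters: -1 could be a legitimate value in another variable.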
The rectangular data matrix is mandatory for statistical analysis; if data is presented in a different way it has to be restructured first to produce a rectangular data matrix.
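One common case of restructuring is data delivered in "long" form, with one row per observation and variable, which has to be pivoted into one row per observation. The sketch below assumes pandas and a hypothetical long layout (the column names `Variable` and `Value` are illustrative); the values are taken from the survey table above.

```python
import pandas as pd

# Hypothetical long layout: one row per (respondent, variable) pair.
long_records = pd.DataFrame(
    {
        "Id": [100, 100, 101, 101, 102, 102],
        "Variable": ["PolInt", "Age", "PolInt", "Age", "PolInt", "Age"],
        "Value": [1, 28, 4, 64, 1, 37],
    }
)

# Pivot so that each respondent becomes one row and each variable one column.
wide = long_records.pivot(index="Id", columns="Variable", values="Value")
wide = wide.reset_index()
wide.columns.name = None  # drop the leftover "Variable" axis label

print(wide)
```

The result has one row per `Id` and one column per variable, i.e. the rectangular shape required for analysis.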
A data matrix needs to be documented to be meaningful in any analysis.
Statistical software provides a way of describing and documenting the data matrix, typically by attaching descriptive labels to variables and to coded values.
In addition to documenting the data using descriptive labels, it is essential to document the data source (who has provided, collected or produced the data, how it has been produced or collected, etc.). Information on data quality and the measurement process is also important, as are, if the data is a sample, the details of the sampling procedure, the sample size and the non-responses. For a survey, the questionnaires need to be available in all languages used for interviewing people.
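Statistical packages each have their own mechanism for such labels. As a software-neutral sketch, the documentation of the survey variables could be kept in a small codebook structure. The `VariableDoc` class and its fields are hypothetical names introduced for illustration, and only the codes explained in the text are labelled.

```python
from dataclasses import dataclass, field


@dataclass
class VariableDoc:
    """Documentation for one column of the data matrix (hypothetical helper)."""

    name: str
    label: str
    value_labels: dict[int, str] = field(default_factory=dict)
    missing_codes: tuple[int, ...] = ()

    def describe(self, code: int) -> str:
        """Return the label for a code, or the code itself if it is undocumented."""
        if code in self.missing_codes:
            return "missing"
        return self.value_labels.get(code, str(code))


# Only the codes mentioned in the text are documented here.
codebook = {
    "PolInt": VariableDoc(
        "PolInt", "Political interest", {1: "high interest", 4: "low interest"}
    ),
    "PartyPref": VariableDoc(
        "PartyPref", "Party preference", {4: "Socialist Party"}, missing_codes=(-1,)
    ),
    "Age": VariableDoc("Age", "Age of the respondent"),
}

print(codebook["PartyPref"].describe(4))
print(codebook["PartyPref"].describe(-1))
```

Information about the data source, the sampling procedure and the questionnaires would be recorded alongside such a codebook.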