Diagnose a continuous variable
Introduction

When examining a variable you will consider several aspects at the same time...

Statistical aspects

Besides the suitability of the variable as an indicator for your concept, missing values and other measurement problems, the following statistical aspects need to be considered:

Illustration of various key aspects

It is important to understand that all statistical tools rely on a number of assumptions; if the assumptions do not hold, the tool will mislead you. For instance, the mean requires the distribution to be symmetrical. As the mean is central in classical statistics, the requirement of symmetry applies to many tools (correlation, regression, ...).

Ideally, the shape of the distribution should be symmetrical, as shown in the histogram to the left (top).

Asymmetrical distributions need to be diagnosed, and you should be aware that the mean, for instance, cannot be used to describe their central tendency.

Whenever more than one peak is present, a simple summary will not be enough. In this case a single mean is completely misleading, as it does not inform you about the essential feature of the distribution, i.e. that it has two peaks.
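
The point about asymmetrical and two-peaked distributions can be checked with a few numbers. The following is only a minimal sketch: the two data sets and the helper name `centre_summaries` are invented for illustration and do not come from the figures discussed here.

```python
from statistics import mean, median

# Hypothetical data, invented purely for illustration.
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]          # most values are small, one is far to the right
two_peaks = [1, 2, 2, 3, 3, 17, 18, 18, 19, 19]   # two separate groups, nothing in between


def centre_summaries(values):
    """Return the mean and the median of a list of numbers."""
    return mean(values), median(values)


for name, values in [("skewed", skewed), ("two_peaks", two_peaks)]:
    m, med = centre_summaries(values)
    print(f"{name}: mean = {m}, median = {med}")
```

In the skewed example the mean is pulled towards the single large value, while the median stays with the bulk of the observations. In the two-peaked example both summaries land in the empty region between the groups, which is why a graphical check is needed.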

When diagnosing variables it is important to use several tools. Numerical summaries like the mean and the standard deviation are clearly insufficient. Graphical tools are ideally suited for this purpose, but all tools have their strengths and weaknesses, so as a general rule it is best to use several of them. In the examples below we will use histograms, boxplots and dotplots.

Histograms

Here's a collection of different histogram shapes.

  1. A completely uniform, even distribution of the values over the range of the variable.
  2. A roughly symmetrical distribution: Peak in the middle, fewer values towards the minimum and the maximum of the distribution.
  3. The kind of ordinary histogram you will come across very often with real data. Not perfectly symmetrical, but not too far away either.
  4. There are a few outlying values at the lower end of the distribution; all others are concentrated at the other end, and you cannot see their distribution because the presence of the outlier(s) forces a scale with a lot of "empty space". Identifying the outliers and removing them from the histogram will help.
  5. A U-shaped distribution with a "hole" in the middle. A summary like the mean will be completely off the mark: it will be in the middle, where there are very few observations.
  6. There are observations at regular intervals, with none at all in between. This often means that you produced a histogram of a categorical variable.

Histograms are useful graphical summaries, but they have their shortcomings. In the collection the histograms have only a couple of bins (to the left only three). Note that the choice of the number of bins (bars) can, with some variables, lead to wrong conclusions about the shape and other aspects of the distribution.
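
How much the number of bins matters can be seen by counting the same values with two different bin settings. This is a rough sketch in plain Python; the function `bin_counts` and the data are hypothetical and only stand in for what a plotting tool does internally when it draws equal-width bins.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical, no bins can be formed")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is put into the last bin instead of a bin of its own.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data, invented purely for illustration.
values = [0, 1, 9, 10, 10, 11, 11, 12, 12, 13, 27, 28, 29, 30]

print(bin_counts(values, 3))    # [3, 7, 4]
print(bin_counts(values, 10))   # [2, 0, 0, 5, 3, 0, 0, 0, 0, 4]
```

With three bins the variable looks like a single hump with a peak in the middle. With ten bins the same values show three separate clusters with empty stretches between them.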

The histogram with three bins also "hides" the fact that the observations are not evenly distributed across the range of the observations and within the bins (the ticks mark the positions of the data values in the variable), and it fails to let us see the outlier.

The boxplot at the bottom of the graph clearly identifies that outlier and gives us an alternative picture of our distribution. Note, however, that neither the histogram nor the boxplot shows clearly that the main crowd is in the middle, with two single points to the left of it and, far away to the right, a smaller dense group.

Boxplots

Boxplots are powerful tools: they show central tendency (median = dot in the box), variation (interquartile range = width of the box) and the shape of the distribution (position of the box, relative position of the median), and they identify the presence of outliers. Some examples with comments:

  1. A nicely symmetrical distribution, median in the middle, no outliers.
  2. Same, but much smaller variation (the box is much shorter). Important: within the box you find 50% of the observations!
  3. Asymmetrical distribution: the box is to the left, and the median sits to the left within the box, so the observations are concentrated on the left with a long tail to the right (a right-skewed, or positively skewed, distribution). Note that 50% of the observations lie between the minimum value and the median.
  4. There is an outlier!
  5. More outliers, on both sides.
  6. Even stronger outliers. The box is barely visible, but there are still 50% of the observations in the box and 25% to the left of the box, i.e. a very strong concentration towards the left side of the distribution (an asymmetrical, strongly right-skewed distribution with a long tail to the right).
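
The quantities a boxplot displays can also be computed directly. The sketch below makes several assumptions that are not stated in the text: outliers are flagged with the common convention of 1.5 times the interquartile range beyond the box, the quartiles use the "inclusive" method of Python's `statistics.quantiles`, and the function name `boxplot_summary` and the data are hypothetical. Other programs may compute quartiles and outlier limits slightly differently.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Return quartiles, interquartile range and flagged outliers.

    Assumption: a value counts as an outlier when it lies more than
    `whisker` times the interquartile range outside the box.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical data, invented purely for illustration.
summary = boxplot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

For these values the box runs from 4 to 8 with the median at 6, and the single value 30 is flagged as an outlier.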

Tools compared

Let us now examine a collection of examples using three graphical tools (boxplots, histograms and dotplots) and two summary statistics, the mean and the standard deviation (commented in the text).

The collection clearly shows that you always need several tools to tell the full story of each distribution. A single tool cannot catch all aspects, and numerical summaries like the mean or the standard deviation are not always useful, as they rely on highly specific assumptions, namely symmetry.

A uniform distribution: the mean is appropriate to summarize the center, as is the standard deviation to characterize the spread, but you really need the histogram to see the perfectly uniform distribution. The boxplot only tells you that this is a symmetrical distribution with no outliers.

Only the dotplot and the histogram let you detect the main feature of this variable, i.e. there are clearly two groups of observations at both ends of the distribution, with the middle part left empty. The average is completely off the mark, as is the standard deviation.

In this case the dotplot clearly shows you the scattered, discontinuous nature of this distribution, with small groups and isolated observations. The boxplot shows you the asymmetry of the distribution (but not the discontinuous aspect); the histogram with ten bins does not let you see the discontinuities either.

Here you clearly need both the dotplot (or the histogram) and the boxplot. The dotplot lets you see a big group and a small group; the boxplot, however, tells you that the small group is not so small that its observations have to be considered outliers.

The boxplot is the best tool here, as it clearly shows the outlier and identifies it as such. The dotplot clearly shows the one observation far away from the others, but in a somewhat less extreme case only the boxplot would let you distinguish between an outlier and a single observation somewhat apart from the others.

Five clearly distinct groups (is this really a continuous variable?). Only the dotplot lets you discover this; the histogram does so to some extent, but you might be misled by the width of the bins.
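
In several of the examples above only the dotplot reveals gaps and separate groups. A dotplot is simple enough to sketch as text: one row per possible value and one symbol per observation. The function `text_dotplot` and the data are hypothetical, and the sketch assumes integer values; a real dotplot of a continuous variable would first round or stack nearby values.

```python
from collections import Counter


def text_dotplot(values, symbol="o"):
    """Return a text dotplot: one row per integer value, one symbol per observation."""
    counts = Counter(values)
    lines = []
    for value in range(min(values), max(values) + 1):
        # Values that do not occur get an empty row, which makes gaps visible.
        lines.append(f"{value:>3} | {symbol * counts[value]}")
    return "\n".join(lines)


# Hypothetical data with three separate groups, invented purely for illustration.
data = [1, 1, 2, 2, 2, 6, 6, 7, 12, 12, 13]
print(text_dotplot(data))
```

The empty rows between the groups are exactly the information that a histogram with wide bins, or a boxplot, would hide.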