Points

Comparing Two Continuous Variables using Scatter Plots

The relationship between two continuous variables is often depicted using a scatter plot. Here, we will use Anderson’s iris data set as an example of composing a comprehensive scatter plot. The data set contains four continuous variables (the width and length of both petals and sepals), with 50 observations for each of three species. A fifth, categorical variable specifies the species of iris plant. Using only sepal width and length, the setosa iris is linearly separable from versicolor and virginica, which are not linearly separable from each other. Plotting only the sepal width and length of the setosa iris is a good starting point.

Don’t obscure data by over-plotting individual data points

The biggest deficiency of this plot is that it suffers from over-plotting¹: some data points are obscured because several points are plotted on top of each other. We can get around this problem by jittering, that is, adding a small amount of random noise to each point. This is acceptable because a visualisation needs to convey the pattern in the data, not its exact values. A reader who wants to know the exact value at a given point should look it up in a table.

  ¹ A less common method, the sunflower plot, depicts overlapping points with a flower-like pattern that has as many “petals” as there are overlapping data points.
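Jittering as described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used to produce the figures: the function names `jitter` and `count_hidden`, the noise amount of 0.03 and the sample values are all assumptions made for the example, and the sample values are not real iris measurements.

```python
import random
from collections import Counter


def jitter(values, amount, rng):
    """Return a copy of values with uniform noise in [-amount, amount] added to each."""
    return [v + rng.uniform(-amount, amount) for v in values]


def count_hidden(xs, ys):
    """Count points that sit exactly on top of another point with the same coordinates."""
    counts = Counter(zip(xs, ys))
    return sum(n - 1 for n in counts.values())


# Illustrative values only (hypothetical, not taken from the iris data set).
# Repeated coordinate pairs are what cause over-plotting.
length = [5.0, 5.0, 5.1, 5.1, 5.1, 4.9]
width = [3.4, 3.4, 3.5, 3.5, 3.5, 3.0]

rng = random.Random(42)  # fixed seed so the jittered plot is reproducible
length_j = jitter(length, 0.03, rng)
width_j = jitter(width, 0.03, rng)

print(count_hidden(length, width))      # 3 points are hidden before jittering
print(count_hidden(length_j, width_j))  # hidden points after jittering
```

The jitter amount is a judgment call: it should be small relative to the spacing between distinct recorded values, so that points separate visibly without appearing to belong to a neighbouring value.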

Figure 1: Sepal width versus sepal length for 50 observations of setosa iris. Left: Over-plotting can obscure data points. Middle: Jittering is the first step in alleviating over-plotting. Right: Removal of unnecessary non-data ink aids clarity.
