67  Exploratory Data Visualization

Data visualizations are a form of graphical data analysis, i.e. a part of your analytical toolkit. Thus, we must appreciate the intimate link between graphics and statistics, in particular when using graphics as a first step in exploratory data analysis (EDA). A classic example of this concept is Anscombe’s plots, shown in Figure 67.1. In each plot a different data set is described not only by the same linear model, \(\hat{y}_i = 3.0 + 0.5x_i\), but also by the same correlation coefficient, \(r = 0.82\). Each plot tells a strikingly different story, so in this case the statistical analysis alone does not provide enough information about the underlying distribution of the data.

Figure 67.1: Anscombe’s plots. Although their distributions are distinct, each data set is described by the same linear model and correlation coefficient. Relying on numerical analysis alone does not tell the complete story.
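The point can be checked numerically. The following is a minimal sketch, assuming the standard published values of Anscombe's quartet (the data are not listed in this chapter) and a hand-written least-squares helper, `fit_line`, which is a hypothetical name introduced only for this illustration.

```python
# Anscombe's quartet (standard published values); sets I-III share the same x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (intercept, slope, r) of the ordinary least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    return intercept, slope, r


summaries = {name: fit_line(x, y) for name, (x, y) in quartet.items()}
for name, (b0, b1, r) in summaries.items():
    print(f"{name:>3}: y = {b0:.1f} + {b1:.1f}x, r = {r:.2f}")
```

After rounding, all four data sets report the same intercept, slope and correlation coefficient, so only a plot reveals how different they are.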

As another example, let’s consider the relationship between mammalian body weight and brain weight, working with a small data set of representative members from 62 species. The head and tail of our data set are given in Table 67.1:

| Species | Body weight (kg) | Brain weight (g) |
|---|---|---|
| African elephant | \(6 654.000\) | \(5 712.00\) |
| Asian elephant | \(2 547.000\) | \(4 603.00\) |
| Giraffe | \(529.000\) | \(680.00\) |
| Horse | \(521.000\) | \(655.00\) |
| Cow | \(465.000\) | \(423.00\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Musk shrew | \(0.048\) | \(0.33\) |
| Big brown bat | \(0.023\) | \(0.30\) |
| Mouse | \(0.023\) | \(0.40\) |
| Little brown bat | \(0.010\) | \(0.25\) |
| Lesser short-tailed shrew | \(0.005\) | \(0.14\) |

Table 67.1: The tail ends of the mammalian body and brain weight data set.

Looking at the data, our first problem becomes apparent — both variables have extremely large ranges. That’s not surprising, and we’d probably come to this conclusion just by thinking about what we expect to see. Further consideration would lead us to the reasonable assumption that both variables are heavily positively skewed. This is pre-existing information we have about our data before we even plot it. It’s likely that you have pre-existing knowledge of your data and the expected distribution because of your domain expertise.

Domain expertise allows us to anticipate appropriate data visualizations before actually working on the data.

If you are an experimentalist, it’s important that you don’t discount your domain knowledge. In particular, the relationships between variables, the purpose of the experiment, and the expected distributions will all come in handy when visualizing, just as they do when performing any statistical analysis. When consulting a data scientist or colleague for help, be clear and forthcoming about what you anticipate the data will look like, especially if they are not familiar with your experiments.

We can already see that it’s going to be difficult to plot such disparate values on a single plot, but let’s give it a go! To understand the relationship between the two variables, a scatter plot is a logical first choice, as shown in Figure 67.2 (left). See the section on scatter plots for more details.

Our initial plot confirms what we already guessed — both variables are heavily positively skewed — making our scatter plot difficult to interpret. This is the first and most typical use of exploratory plots:

Exploratory plots allow you to assess the quality and distribution of your data during EDA.

But there is another equally important use of exploratory plots at this stage.

Exploratory plots encompass diagnostic plots that allow you to assess the quality of your statistical methods.

In this sense we use data visualization as a statistical tool. For example, here, we’d like to describe the relationship between brain and body weight using a linear model (Figure 67.2, top).

Given that both variables are positively skewed and the two elephant species have an enormous influence, this model is not really appropriate.

In a well-fit model, the residuals, i.e. the differences between each observed and predicted \(y\) value, \(y_i - \hat{y}_i\), should be Normally distributed. In our case that means that the differences between the observed and predicted brain weights should be Normally distributed. A typical and useful diagnostic plot for assessing distributions is a Quantile-Quantile (Q-Q) plot. For now, all we need to know is that the more closely our residuals (the dots) fall onto the Q-Q line (Figure 67.2, bottom), the closer they are to a Normal distribution. Here the residuals stray from the line, which confirms our suspicions that our model is not really the best choice.
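To make the construction of a Q-Q plot concrete, here is a minimal sketch that computes its coordinates by hand. It assumes only the ten rows shown in Table 67.1 (not the full 62-species data set, so it will not reproduce Figure 67.2 exactly) and uses the plotting positions \((i - 0.5)/n\), which is one common convention among several.

```python
from statistics import NormalDist

# Assumption: only the ten rows printed in Table 67.1, not all 62 species.
body_kg = [6654.0, 2547.0, 529.0, 521.0, 465.0, 0.048, 0.023, 0.023, 0.010, 0.005]
brain_g = [5712.0, 4603.0, 680.0, 655.0, 423.0, 0.33, 0.30, 0.40, 0.25, 0.14]

# Ordinary least-squares fit of brain weight on body weight.
n = len(body_kg)
mean_x = sum(body_kg) / n
mean_y = sum(brain_g) / n
sxx = sum((x - mean_x) ** 2 for x in body_kg)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(body_kg, brain_g))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed minus predicted brain weight.
residuals = [y - (intercept + slope * x) for x, y in zip(body_kg, brain_g)]

# Q-Q coordinates: sorted residuals against standard Normal quantiles.
# Plotting positions (i - 0.5) / n are an assumed convention.
observed = sorted(residuals)
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for t, o in zip(theoretical, observed):
    print(f"{t:6.2f}  {o:10.2f}")
```

Plotting `observed` against `theoretical` gives the dots of a Q-Q plot. Points that follow a straight line suggest approximately Normal residuals, while systematic curvature or isolated extreme points suggest otherwise.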

The two extreme values from the African and Asian elephants will also have a large influence on the linear model compared to the small values clustered near the origin. Plotting influential observations is another useful diagnostic that you can use to assess your models.
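One way to quantify this is leverage, which measures how far each \(x\) value lies from the mean of \(x\). The sketch below is an illustration under stated assumptions: it uses only the ten rows shown in Table 67.1, and leverage is just one of several influence measures, not necessarily the one behind the chapter's figures. For simple linear regression, \(h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2\).

```python
# Assumption: only the ten rows printed in Table 67.1, not all 62 species.
species = [
    "African elephant", "Asian elephant", "Giraffe", "Horse", "Cow",
    "Musk shrew", "Big brown bat", "Mouse", "Little brown bat",
    "Lesser short-tailed shrew",
]
body_kg = [6654.0, 2547.0, 529.0, 521.0, 465.0, 0.048, 0.023, 0.023, 0.010, 0.005]

n = len(body_kg)
mean_x = sum(body_kg) / n
sxx = sum((x - mean_x) ** 2 for x in body_kg)

# Leverage of each observation in a simple linear regression on body weight.
leverage = {name: 1 / n + (x - mean_x) ** 2 / sxx for name, x in zip(species, body_kg)}

# List observations from most to least leverage.
for name, h in sorted(leverage.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {h:.3f}")

most_influential = max(leverage, key=leverage.get)
```

The leverages always sum to the number of fitted parameters (two here), so a single observation holding most of that total dominates the fit.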

Figure 67.2: An exploratory scatter plot of mammalian brain vs body weight described by a linear model.

So far, our exploratory plots have revealed that the data set is poorly presented in its present state and also poorly described by our linear model. A solution is to use log-transformed data. The scatter plot of the log-transformed data set allows each data point to be distinguished (Figure 67.3, top) and reveals that a linear model of the log-transformed variables is much more appropriate. The residuals of this model also fit the Normal distribution much better (Figure 67.3, bottom).

Figure 67.3: An exploratory scatter plot of the log-transformed mammal dataset.
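As a rough numerical counterpart to the two figures, the following sketch fits the same straight-line model to the raw and to the \(\log_{10}\)-transformed values and compares the coefficient of determination, \(R^2\). It assumes only the ten rows shown in Table 67.1, so the numbers are indicative rather than a reproduction of the full analysis, and `r_squared_and_slope` is a hypothetical helper written for this example.

```python
from math import log10

# Assumption: only the ten rows printed in Table 67.1, not all 62 species.
body_kg = [6654.0, 2547.0, 529.0, 521.0, 465.0, 0.048, 0.023, 0.023, 0.010, 0.005]
brain_g = [5712.0, 4603.0, 680.0, 655.0, 423.0, 0.33, 0.30, 0.40, 0.25, 0.14]


def r_squared_and_slope(x, y):
    """Return (R^2, slope) of the ordinary least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy), sxy / sxx


# Model on the original scale.
r2_raw, slope_raw = r_squared_and_slope(body_kg, brain_g)

# Model on the log10 scale for both variables.
log_body = [log10(x) for x in body_kg]
log_brain = [log10(y) for y in brain_g]
r2_log, slope_log = r_squared_and_slope(log_body, log_brain)

print(f"raw scale: R^2 = {r2_raw:.3f}")
print(f"log scale: R^2 = {r2_log:.3f}, slope = {slope_log:.3f}")
```

A higher \(R^2\) on the log scale is consistent with what the scatter plots show, but it is no substitute for inspecting the residuals.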

Remember, transformation functions are often used to adjust for some amount of positive or negative skew in the data. \(\log_e\), \(\log_2\) and \(\log_{10}\) are very common, and many models perform better with log-log, log-linear, or linear-log relationships.
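The effect of a log transformation on skew can be checked directly. This is a minimal sketch, assuming only the ten rows shown in Table 67.1 and the moment-based sample skewness \(g_1 = m_3 / m_2^{3/2}\), which is one of several skewness estimators. The helper name `skewness` is introduced here for illustration.

```python
from math import log10

# Assumption: only the ten rows printed in Table 67.1, not all 62 species.
variables = {
    "body weight (kg)": [6654.0, 2547.0, 529.0, 521.0, 465.0,
                         0.048, 0.023, 0.023, 0.010, 0.005],
    "brain weight (g)": [5712.0, 4603.0, 680.0, 655.0, 423.0,
                         0.33, 0.30, 0.40, 0.25, 0.14],
}


def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Skewness before and after a log10 transformation.
skew_raw = {name: skewness(vals) for name, vals in variables.items()}
skew_log = {name: skewness([log10(v) for v in vals]) for name, vals in variables.items()}

for name in variables:
    print(f"{name}: raw = {skew_raw[name]:.2f}, log10 = {skew_log[name]:.2f}")
```

Values well above zero indicate positive skew, and values near zero indicate a roughly symmetric distribution.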