Mosaic Plots

Comparing two or more Categorical Variables using Mosaic Plots

Mosaic plots are an excellent alternative to stacked bar plots. The major difference is that the data are represented as a contingency table: each sub-group is drawn as a box (hence "mosaic" plot) with area proportional to its sample size. Notably, scales are absent. Relying on area rather than a common scale is one of the major disadvantages of pie charts, but as we have seen, even stacked bar charts don’t adequately solve this problem. In short, we are forced to encode proportions as areas, and judging the area and side lengths of boxes is certainly more intuitive than judging the slice area and arc length of circles.
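To make the area encoding concrete, the following minimal sketch computes the boxes of a two-variable mosaic plot from a contingency table. The counts and the function name `mosaic_tiles` are hypothetical, chosen only for illustration. The width of each column is the marginal proportion of the first variable and the height of each box is the conditional proportion of the second, so the area of a box equals that sub-group's share of the sample.

```python
# Hypothetical counts for illustration only:
# outer keys = hair colour, inner keys = eye colour.
counts = {
    "dark": {"brown": 30, "blue": 10},
    "fair": {"brown": 15, "blue": 45},
}


def mosaic_tiles(table):
    """Return (row, col, x, y, width, height) boxes inside the unit square."""
    total = sum(sum(row.values()) for row in table.values())
    tiles = []
    x = 0.0
    for row_name, row in table.items():
        row_total = sum(row.values())
        width = row_total / total      # marginal proportion of the first variable
        y = 0.0
        for col_name, n in row.items():
            height = n / row_total     # conditional proportion of the second variable
            tiles.append((row_name, col_name, x, y, width, height))
            y += height
        x += width
    return tiles


tiles = mosaic_tiles(counts)
for row_name, col_name, x, y, w, h in tiles:
    # width * height recovers the sub-group's proportion of the whole sample
    print(f"{row_name}/{col_name}: width={w:.2f} height={h:.2f} area={w * h:.2f}")
```

The sketch only computes geometry. Drawing the boxes is left to whichever plotting library is at hand.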

Use mosaic plots to visualise proportional comparisons of multiple categorical variables

The following figure is a direct conversion of the stacked bar charts into mosaic plots, showing all three pairwise comparisons of sex, hair colour and eye colour.

The real strength of mosaic plots is their ability to represent three categorical variables simultaneously (Figure 1). This is the first plot that starts to tell the reader something interesting. The area of each box represents a combination of three variables, so all possible combinations can be viewed and compared in a single plot. A logical extension of the basic mosaic plot is to reveal our underlying statistical analysis, where we asked which sub-groups are over- or under-represented. Using a color scale, boxes (understood as sub-group intersections) are shaded according to their relative over- or under-representation in the data set as a whole. This communicates a story to the reader effectively, using appropriate encoding and coloring to convey a message. In this case, the story is that categories with high positive residuals are more frequent in the population than the equiprobability model would predict.
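As a rough illustration of the residuals behind the shading, the sketch below computes Pearson residuals, (observed - expected) / sqrt(expected), against an equiprobability model in which every combination is expected equally often. The counts and function names are hypothetical. A different null model could be substituted by passing other expected counts.

```python
import math

# Hypothetical counts for illustration only: (hair colour, eye colour) -> count.
observed = {
    ("dark", "brown"): 30,
    ("dark", "blue"): 10,
    ("fair", "brown"): 15,
    ("fair", "blue"): 45,
}


def equiprobable_expected(obs):
    """Expected count per cell if every combination were equally likely."""
    total = sum(obs.values())
    return {cell: total / len(obs) for cell in obs}


def pearson_residuals(obs, expected):
    """Pearson residual (O - E) / sqrt(E) for every cell."""
    return {
        cell: (obs[cell] - expected[cell]) / math.sqrt(expected[cell])
        for cell in obs
    }


expected = equiprobable_expected(observed)
residuals = pearson_residuals(observed, expected)
for cell, r in residuals.items():
    # positive = over-represented, negative = under-represented
    print(cell, round(r, 2))
```

The sum of the squared residuals is the Pearson \(\chi^{2}\) statistic, which is why shading by residual shows which boxes contribute most to a significant test result.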

Note that in Figure fig-mosaic-first-view and Figure 3, color is mapped to a continuous variable, something that we advised against earlier (Figure fig-Cont-Encoding). In this case it is permissible because the continuous scale (Pearson’s residuals) has been partitioned into discrete bins, making it a categorical (interval) variable.
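The binning step can be sketched as follows. The cut points at ±2 and ±4 and the labels are assumptions made for this example, not values taken from the figures. Each residual is assigned to one of a small number of intervals, and color is then mapped to the interval rather than to the raw value.

```python
from bisect import bisect_right

# Assumed cut points and labels, for illustration only.
CUTS = [-4.0, -2.0, 0.0, 2.0, 4.0]
LABELS = ["< -4", "[-4, -2)", "[-2, 0)", "[0, 2)", "[2, 4)", ">= 4"]


def shade_bin(residual):
    """Map a continuous residual to the label of the interval it falls in."""
    # bisect_right counts how many cut points are <= residual,
    # which is the index of the left-closed interval containing it.
    return LABELS[bisect_right(CUTS, residual)]


for r in (-5.2, -3.0, 1.0, 4.0):
    print(r, "->", shade_bin(r))
```

Because only six labels are possible, the legend can show six discrete swatches instead of a continuous gradient.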

Figure 1: Mosaic plots with three variables (hair colour, eye colour and sex) depicting all possible combinations in the data set. Uniform shading.

Figure 2: Shading according to Pearson’s residuals.

We can go one step further in our representation of these three categorical variables and use area to represent the results of a Pearson \(\chi^{2}\) test as an association plot, as shown in Figure 3. The height of each box is proportional to its standardized Pearson residual and the width is proportional to the square root of the expected value for that category under the equiprobability model. In each row, the baseline represents independence and every box is plotted on a common scale. Here position and color clearly indicate which categories are over- and under-represented, and to what degree.
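The box geometry described above can be sketched in a few lines. The function name `assoc_box` and the counts are hypothetical, and the constant of proportionality is taken to be 1 for simplicity. With height set to the Pearson residual and width set to the square root of the expected count, the signed area of each box works out to observed minus expected.

```python
import math


def assoc_box(observed, expected):
    """Return (width, height) of one association-plot box.

    height is the signed Pearson residual: positive boxes rise above the
    baseline, negative boxes hang below it.
    """
    height = (observed - expected) / math.sqrt(expected)
    width = math.sqrt(expected)
    return width, height


# Hypothetical cell: 45 observed where 25 were expected.
w, h = assoc_box(45, 25)
print(f"width={w:.1f} height={h:.1f} area={w * h:.1f}")
```

In this sketch the area (20) equals the difference between the observed and expected counts (45 - 25), so both position relative to the baseline and box size carry information about the deviation.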

Figure 3: An association plot of all 32 possible hair colour/eye colour/sex combinations.