Bars

Comparisons of Categorical Variable Sub-groups

Bar charts are commonly used to compare absolute counts or distributions of several variables. Bar charts are appropriate for nominal variables, which are neither ordered nor quantitative, and ordinal variables, which are ordered but not quantitative.[1]

  [1] In general, lines are inappropriate for comparing nominal and ordinal variables because neither is quantitative. That is, the physical distance between categories, and thus the slope of the line, is arbitrary. For nominal variables, the order is usually arbitrary as well, so that certain trends can be misrepresented. However, this is exactly what is done with parallel plots (see page @ref(sub:Parallel-plots)). Contrast this with interval variables, which are quantitative: the spacing of categories is determined by the interval being described.

Interval variables can be represented with bar charts, but since they are ordered and quantitative, they can be treated more effectively with line plots than with bar charts.

    There are several reasons why bar charts should be used only for the visualisation of absolute counts. In the bar chart below, the absolute number of samples in each sub-group of the eating habits categorical variable is depicted. The bar is an intuitive representation of this, since its value is measured as growth above the origin to a fixed point.

    Ranking plots provide a simple extension of the nominal comparison. Begin by thinking about which values you want to emphasize. This will usually be the most or least abundant category. Next, arrange the nominal scale data accordingly, using dots or bars. This modification can greatly aid the reader in following your line of thought, and it conveys an inherent order to otherwise unordered categories. Keep in mind that you should be consistent in your ordering of nominal categories throughout a paper or presentation.

    A bar chart representing the absolute count of observations in each category of eating behaviour.
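The counting and ranking steps described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library: the observations are invented, and the helper names `ranked_counts` and `text_bar_chart` are hypothetical, not taken from the text.

```python
from collections import Counter

# Hypothetical observations: one eating-behaviour label per animal.
# These labels and their frequencies are invented for illustration only.
observations = [
    "herbivore", "carnivore", "herbivore", "omnivore", "insectivore",
    "herbivore", "carnivore", "omnivore", "herbivore", "omnivore",
]


def ranked_counts(labels, descending=True):
    """Return (category, count) pairs ordered by count, ties broken by name."""
    counts = Counter(labels)
    sign = -1 if descending else 1
    return sorted(counts.items(), key=lambda item: (sign * item[1], item[0]))


def text_bar_chart(pairs):
    """Render one line per category; the bar length equals the absolute count."""
    width = max(len(name) for name, _ in pairs)
    return [f"{name:<{width}} | {'#' * n} {n}" for name, n in pairs]


ranking = ranked_counts(observations)
for line in text_bar_chart(ranking):
    print(line)
```

Sorting by count rather than alphabetically is what turns the plain nominal comparison into a ranking plot. Whichever order is chosen should then be reused for every later figure of the same categories.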

    Use bar charts to compare absolute counts, but not distributions

    We do not recommend using bar charts to depict distributions. This is primarily because the actual data is masked, making it very difficult to discern the distribution of underlying data points. Consider the alternative plots below, where each individual point is displayed.

    Jittered points are a reasonable alternative to bar plots for depicting the distribution of a data set, in this case the total sleep time according to eating behaviour. First plot: Individual observations plotted as a dot plot reveal the entire data set’s distribution. Second plot: A bar chart showing only the mean and SD masks the underlying distribution. Third plot: A simplified dot plot for mean and SD is an improvement, but still obscures the underlying data. Fourth plot: Overlaying the individual data points with the mean and SD provides both the distribution and descriptive statistics.
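To make the masking problem concrete, the sketch below builds two small samples that share the same mean and SD but have clearly different shapes, and then computes jittered x-positions for plotting the raw points. The numbers are invented (they are not the sleep data), and the helpers `mean_sd` and `jitter` are hypothetical names used only for this illustration.

```python
import random
import statistics

# Two invented samples: identical mean and SD, very different shapes.
unimodal = [2, 4, 4, 4, 5, 5, 7, 9]   # one cluster around 4-5 with a tail
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]    # two separate clusters, nothing near 5


def mean_sd(values):
    """Return the mean and the sample SD (n - 1 denominator) of the values."""
    return statistics.mean(values), statistics.stdev(values)


def jitter(position, n, width=0.2, seed=0):
    """Return n x-coordinates spread uniformly within +/- width of position."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [position + rng.uniform(-width, width) for _ in range(n)]


# A bar with an error bar would look the same for both samples.
print(mean_sd(unimodal))
print(mean_sd(bimodal))

# Jittered (x, y) pairs keep every observation visible at category position 1.
points = list(zip(jitter(1, len(bimodal)), bimodal))
```

The jitter is applied only along the categorical axis, so the measured values on the other axis are left untouched.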

    Comparing Distributions with Multiple Histograms

    Returning to the mammalian total sleep time data set, we will consider how to depict multiple distributions with histograms and density plots. Figure @ref(fig:ggplot2-density-examples) displays four sub-categories of mammals according to their eating behaviour. Their distributions are presented using four varieties of multiple histograms and two varieties of density plots:

    • Stacked histograms, where each bin is vertically stacked (Figure @ref(fig:ggplot2-density-examples), top-left).
    • Proportional stacked histograms, where the height of each bin is 1 (Figure @ref(fig:ggplot2-density-examples), top-middle).
    • Dodged histograms, where all histograms are interleaved (Figure @ref(fig:ggplot2-density-examples), top-right).
    • Frequency polygon, where each histogram is presented as an outline instead of bars (Figure @ref(fig:ggplot2-density-examples), bottom-left).
    • Overlapping density plot, where overlapping lines depict the density estimation (Figure @ref(fig:ggplot2-density-examples), bottom-middle).
    • Overlapping density area plot, where overlapping transparent areas depict the density estimation (Figure @ref(fig:ggplot2-density-examples), bottom-right).
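The histogram varieties listed above differ only in how the same per-group bin counts are arranged. The following sketch, using invented values and hypothetical helper names (`bin_counts`, `stack`, `fill`, `midpoints`), shows the arithmetic behind stacked and proportional stacked histograms and the x-positions a frequency polygon would connect. It is an assumption about how such plots are typically computed, not a description of the code behind the figure.

```python
def bin_counts(values, edges):
    """Count values per bin; bins are [lo, hi), and the last bin also includes hi."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def stack(per_group):
    """Stacked histogram: cumulative bar tops per bin, added group by group."""
    tops, running = {}, None
    for name, counts in per_group.items():
        running = counts if running is None else [a + b for a, b in zip(running, counts)]
        tops[name] = list(running)
    return tops


def fill(per_group):
    """Proportional stacking: each group's share of a bin (non-empty bins sum to 1)."""
    totals = [sum(column) for column in zip(*per_group.values())]
    return {name: [c / t if t else 0.0 for c, t in zip(counts, totals)]
            for name, counts in per_group.items()}


def midpoints(edges):
    """Bin centres: the x-positions joined by a frequency polygon."""
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


# Invented hours of sleep for two hypothetical groups (not the real data set).
groups = {"herbivore": [2, 3, 4, 9, 11, 14], "carnivore": [6, 10, 12, 13, 19]}
edges = [0, 5, 10, 15, 20]

per_group = {name: bin_counts(vals, edges) for name, vals in groups.items()}
stacked = stack(per_group)
proportional = fill(per_group)
centres = midpoints(edges)
print(per_group, stacked, proportional, centres, sep="\n")
```

A dodged histogram uses the same `per_group` counts unchanged and simply draws the bars of each bin side by side. In the stacked and proportional forms, every group except the first sits on a shifting baseline, which is one reason individual distributions are hard to read from them.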

    None of the histogram varieties allows the reader to see the underlying distribution of each group plotted. This is in contrast to overlapping density plots, which allow easy decoding of the underlying distributions. In particular, the area-shaded density plot does an effective job of allowing the reader to distinguish between the four groups.
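For readers unfamiliar with density estimation, here is a minimal sketch of how one density curve per group can be computed. It assumes a Gaussian kernel and a hand-picked bandwidth. The text does not state which estimator or bandwidth was used for the figure, and the sample values and the function name `gaussian_kde` are invented for this illustration.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


# Invented hours of sleep for two hypothetical groups (not the real data set).
groups = {"herbivore": [2, 3, 4, 9, 11, 14], "carnivore": [6, 10, 12, 13, 19]}

step = 0.1
grid = [-20 + step * i for i in range(601)]  # evaluation points from -20 to 40

# One curve per group, all evaluated on the same grid so they can be overlaid.
curves = {name: [gaussian_kde(vals, bandwidth=2.0)(x) for x in grid]
          for name, vals in groups.items()}

# Each curve encloses an area of about 1, regardless of group size.
areas = {name: sum(ys) * step for name, ys in curves.items()}
print(areas)
```

Because every curve is normalised to unit area and starts from the same zero baseline, groups of different sizes can be overlaid and compared by shape alone.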

    Plotting the distribution of four sub-groups using four varieties of histograms and two varieties of density plots. Top: stacked, proportional and dodged histograms. Bottom: Frequency polygon, density outline and density area plots.
