14 June, 2020

A brief primer on data viz

Core material

Why visualize?

  • We are visual beings
  • Graphics are effective forms of storytelling
  • Statistics can be deceptive when presented alone
    • For example, Anscombe’s Quartet:
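
As a minimal sketch of why summary statistics can mislead on their own, the code below computes the usual summaries for each of the four Anscombe datasets using the `anscombe` data that ships with base R. The helper name `get_col` and the object `quartet_summary` are made up for this illustration.

```r
data(anscombe)  # built-in dataset with columns x1..x4 and y1..y4

# Hypothetical helper: pull column "x1", "y3", etc. by prefix and index
get_col <- function(prefix, i) anscombe[[paste0(prefix, i)]]

quartet_summary <- data.frame(
  dataset = c("I", "II", "III", "IV"),
  mean_x  = sapply(1:4, function(i) mean(get_col("x", i))),
  mean_y  = sapply(1:4, function(i) mean(get_col("y", i))),
  var_x   = sapply(1:4, function(i) var(get_col("x", i))),
  var_y   = sapply(1:4, function(i) var(get_col("y", i))),
  cor_xy  = sapply(1:4, function(i) cor(get_col("x", i), get_col("y", i)))
)

# The four rows are nearly identical even though the scatterplots differ
print(quartet_summary, digits = 3)
```

The means, variances, and correlations agree to about two decimal places across all four datasets, which is exactly why plotting them is informative.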

What actually makes a figure?

Grammar of Graphics—Wilkinson (1999)

A framework for describing components of a figure and the structure that underlies statistical graphics.

  • DATA — variables from datasets
  • TRANSformations — variable transformations
  • FRAME — a set of variables, combined with operators, defining a space
  • SCALE — scale transformations
  • COORDinates — a coordinate system
  • GRAPH — graphs and their aesthetic attributes
  • GUIDE — axes, legends

Layered Grammar of Graphics—Wickham (2005)

Layers of grammar

Data

Most commonly as a data frame (ggplot requires this structure)

data(anscombe) # let's continue with the Anscombe example
library(stargazer)
stargazer(anscombe, type = "html", summary = FALSE)
    x1 x2 x3 x4      y1     y2     y3     y4
 1  10 10 10  8   8.040  9.140  7.460  6.580
 2   8  8  8  8   6.950  8.140  6.770  5.760
 3  13 13 13  8   7.580  8.740 12.740  7.710
 4   9  9  9  8   8.810  8.770  7.110  8.840
 5  11 11 11  8   8.330  9.260  7.810  8.470
 6  14 14 14  8   9.960  8.100  8.840  7.040
 7   6  6  6  8   7.240  6.130  6.080  5.250
 8   4  4  4 19   4.260  3.100  5.390 12.500
 9  12 12 12  8  10.840  9.130  8.150  5.560
10   7  7  7  8   4.820  7.260  6.420  7.910
11   5  5  5  8   5.680  4.740  5.730  6.890

Aesthetics

Every visual property of a graphic element that can be mapped to data: position, color, fill, size, shape, line type, transparency.


Aesthetics

library(ggplot2)
ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_blank()
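
A small sketch of the difference between mapping an aesthetic to a variable (inside `aes()`) and setting it to a constant (outside `aes()`). It uses the built-in `iris` data; the object names `mapped` and `fixed` are only for illustration.

```r
library(ggplot2)

# Aesthetics mapped to variables: color and size vary with the data
mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                           color = Species, size = Petal.Length)) +
  geom_point()

# Aesthetics set to constants: every point gets the same color and size
fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "midnightblue", size = 2)

mapped
fixed
```

Only mapped aesthetics get a scale and a legend; constants do not.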

Scale

The mapping between data values and aesthetic values

We can scale any aesthetic component to our liking based on our data:

  • Position: X/Y axes limits and breaks? Continuous? Discrete?
    • e.g., scale_y_discrete(), scale_y_continuous(), scale_y_date()
  • Shape: what shapes?
    • e.g., scale_shape()
  • Size: what sizes?
    • e.g., scale_size()
  • Colors: which colors? Continuous? Discrete?
    • e.g., scale_color_manual(), scale_fill_manual()
  • Line width/type: what widths, what types?
    • e.g., scale_linewidth() (scale_size() before ggplot2 3.4.0), scale_linetype()

Scale

library(ggplot2)
ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_blank() +
  scale_x_continuous(limits = c(0,20), breaks = seq(0,20,5)) +
  scale_y_continuous(limits = c(0,20), breaks = seq(0,20,5))
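
The position scales above have non-position counterparts. The following is a sketch, assuming the built-in `iris` data and an arbitrary choice of colors and point shapes, of controlling color and shape scales by hand.

```r
library(ggplot2)

# Arbitrary illustrative choices, keyed by the levels of iris$Species
species_colors <- c(setosa = "midnightblue",
                    versicolor = "darkorange",
                    virginica = "forestgreen")
species_shapes <- c(setosa = 16, versicolor = 17, virginica = 15)

scaled_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                                color = Species, shape = Species)) +
  geom_point() +
  scale_color_manual(values = species_colors) +  # data value -> color
  scale_shape_manual(values = species_shapes)    # data value -> point shape

scaled_plot
```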

Geometric objects

What physical features we actually put in the plot area.

The most vocabulary-rich layer.

  • Points
    • geom_point()
  • Lines
    • geom_line(), geom_hline(), geom_vline(), geom_abline(), geom_contour(), geom_path(), geom_segment()
  • Bars
    • geom_bar() or geom_col()
  • Polygons
    • geom_polygon(), geom_rect(), etc.
  • Tiles
    • geom_tile(), geom_rect(), geom_raster(), etc.
  • Text, annotations
    • geom_text(), geom_label(), annotate()

Geometric objects

ggplot(data = anscombe, aes(x = x1, y = y1)) + 
  geom_point()

Statistical objects

Geoms with statistical properties.

  • Boxplots
    • geom_boxplot()
  • Histograms
    • geom_histogram()
  • Distributions
    • geom_density(), geom_rug(), geom_violin()
  • Error
    • geom_errorbar(), geom_ribbon(), stat_summary()
  • Rugs
    • geom_rug()
  • Other statistical summaries
    • geom_smooth(), stat_summary()

Statistical objects

lm(y1~x1, data = anscombe)
## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Coefficients:
## (Intercept)           x1  
##      3.0001       0.5001

Statistical objects

ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(intercept = 3.0001, slope = 0.5001)
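
Two alternatives to typing the coefficients by hand, sketched under the assumption that a simple linear fit is what we want to draw: pull the coefficients from the fitted model, or let `geom_smooth()` fit the line itself. The object names are illustrative.

```r
library(ggplot2)
data(anscombe)

fit <- lm(y1 ~ x1, data = anscombe)

# Option 1: reuse the fitted coefficients instead of hard-coding them
abline_plot <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["x1"]])

# Option 2: let ggplot2 fit and draw the regression line (no confidence band)
smooth_plot <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

abline_plot
smooth_plot
```

Note that `geom_abline()` spans the whole panel, while `geom_smooth()` draws only over the range of the data.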

Facets

Small multiples!

Going beyond two dimensions.

  • Small exercise: how many dimensions can you plot at once?

    • Answer: You can easily plot SEVEN dimensions in R and still have a reasonably clear figure if you include two axes (X, Y), two facet structures, one color aesthetic, one shape aesthetic, and one size aesthetic.

    • Eight dimensions if you do all of the above with some time animation (gganimate).

    • Nine dimensions if you do all that and add a third axis (e.g., a surface plot with a Z variable), but it is doubtful this will be useful to many people.

    • Ten dimensions if you use the secret spacetime() function in developer mode to distort general relativity in your data visualization and open a portal to graphics hell.
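
A sketch of the seven-dimension case using the built-in `mtcars` data. The choice of which variable goes to which aesthetic is arbitrary and only meant to show the mechanics; whether the result is clear enough is a judgment call for each dataset.

```r
library(ggplot2)

seven_dims <- ggplot(mtcars, aes(x = wt,                # 1: X axis
                                 y = mpg,               # 2: Y axis
                                 color = hp,            # 3: color
                                 shape = factor(cyl),   # 4: shape
                                 size = qsec)) +        # 5: size
  geom_point(alpha = 0.8) +
  facet_grid(am ~ vs, labeller = label_both) +          # 6 and 7: facet rows, columns
  labs(shape = "cyl")

seven_dims
```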

Facets

How can we reproduce Anscombe’s original quartet figure?

# first we need to reshape the data
anscombe.quartet <- data.frame(X = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
                               Y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4),
                               dataset = c(rep("I", 11), rep("II", 11), 
                                           rep("III", 11), rep("IV", 11)))
head(anscombe.quartet)
##    X    Y dataset
## 1 10 8.04       I
## 2  8 6.95       I
## 3 13 7.58       I
## 4  9 8.81       I
## 5 11 8.33       I
## 6 14 9.96       I
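
The same reshape can be written as a loop over the four datasets, which avoids repeating column names by hand. This is only a sketch in base R; the name `quartet_long` is made up here, and packages such as tidyr offer other ways to do it.

```r
data(anscombe)

# Stack (x_i, y_i) pairs for i = 1..4 and label each block with a Roman numeral
quartet_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(X = anscombe[[paste0("x", i)]],
             Y = anscombe[[paste0("y", i)]],
             dataset = as.character(as.roman(i)))
}))

head(quartet_long)
table(quartet_long$dataset)
```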

Facets

Small multiples with facet_wrap() or facet_grid()

ggplot(data = anscombe.quartet, aes(x = X, y = Y)) +
  geom_point(color = "midnightblue") +
  geom_abline(intercept = 3.0001, slope = 0.5001) +
  facet_wrap(~dataset)
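
Drawing the same line in every panel is reasonable only because the four regressions are nearly identical. A quick check, sketched in base R with the illustrative name `quartet_fits`:

```r
data(anscombe)

# Fit y_i ~ x_i separately for each of the four datasets
quartet_fits <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  coef(lm(y ~ x))
})
colnames(quartet_fits) <- c("I", "II", "III", "IV")

# Rows are the intercept and the slope; columns are the datasets
round(quartet_fits, 3)
```

All four intercepts are close to 3 and all four slopes are close to 0.5.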

Coordinate system

Set of position scales combined with their geometrical arrangement

  • Cartesian: two-dimensional (X and Y), linear
    • coord_cartesian() (default), coord_equal(), coord_fixed()
  • Nonlinear (e.g., a log transformed axis)
    • coord_trans()
  • Curved (e.g., polar coordinate system) - good for periodic data, geospatial data
    • coord_polar()
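
A sketch of the same periodic data drawn in Cartesian and in polar coordinates. The monthly values are made up for illustration (a simple sine curve), not real measurements.

```r
library(ggplot2)

# Made-up seasonal signal, one value per month
monthly <- data.frame(
  month = factor(month.abb, levels = month.abb),
  value = 10 + 8 * sin(2 * pi * (1:12 - 4) / 12)
)

cartesian <- ggplot(monthly, aes(x = month, y = value, group = 1)) +
  geom_line(color = "steelblue") +
  geom_point(color = "steelblue")

# Same data and geoms; only the coordinate system changes
polar <- cartesian + coord_polar()

cartesian
polar
```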

Coordinate system

Yes, that includes spatial data (we can spend a whole day on spatial data).

Extra syntax: legends

To include or not?

  • When is it redundant, when is it useful?

  • Where to put them?

  • Not always necessary to have a standalone legend (for example, below).
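
A sketch of a redundant legend and two ways to handle it, using the built-in `iris` data. Here the fill color repeats information already on the X axis, so the legend can be dropped or moved; the object names are illustrative.

```r
library(ggplot2)

# Fill duplicates the X axis, so the default legend adds nothing new
with_legend <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()

no_legend  <- with_legend + theme(legend.position = "none")  # drop it
top_legend <- with_legend + theme(legend.position = "top")   # or move it

no_legend
```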

Extra syntax: themes

In ggplot we can modify virtually any thematic plot component with theme() and labs()

  • Axes, ticks, background grids—theme()

  • Axis labels, plot title—labs()

newthemes <- 
  ggplot(data = anscombe.quartet, aes(x = X, y = Y)) +
  geom_point(color = "midnightblue") +
  geom_abline(intercept = 3.0001, slope = 0.5001) +
  facet_wrap(.~dataset) + 
  labs(x = "new x-axis label", y = "new y-axis label", title = "Fancy title") +
  theme(axis.title = element_text(size = 16),
        axis.text = element_text(size = 14))

Extra syntax: themes

newthemes

Extra syntax: themes

We can also use the built-in themes from ggplot2 and themes from other packages like ggthemes.

theme_gray() (default)

default <- 
  ggplot(data = anscombe.quartet, aes(x = X, y = Y)) +
  geom_point(color = "midnightblue") +
  geom_abline(intercept = 3.0001, slope = 0.5001) +
  facet_wrap(.~dataset) + ggtitle("default") + theme_gray()

Extra syntax: themes

default

Extra syntax: themes

Other ggplot2 pre-installed themes.

bw <- default + theme_bw() + ggtitle("theme_bw()")
classic <- default + theme_classic() + ggtitle("theme_classic()")
minimal <- default + theme_minimal() + ggtitle("theme_minimal()")

library(ggpubr)
theme.plots <- ggarrange(default, bw, classic, minimal, nrow = 2, ncol = 2)

Extra syntax: themes

theme.plots

Extra syntax: themes

Some themes from the ggthemes package.

library(ggthemes)
economist <- default + theme_economist() + ggtitle("theme_economist()")
wsj <- default + theme_wsj() + ggtitle("theme_wsj()")
fivethirtyeight <- default + theme_fivethirtyeight() + ggtitle("theme_fivethirtyeight()") 
tufte <- default + theme_tufte() + ggtitle("theme_tufte()")

ggthemes.plots <- ggarrange(economist, wsj, fivethirtyeight, tufte, 
                            nrow = 2, ncol = 2, align = "hv")

Extra syntax: themes

ggthemes.plots

Extra syntax: themes

You can even make an xkcd style figure:

days <- seq(0,10,1)
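# I() is only needed inside model formulas; plain arithmetic avoids passing an AsIs vector to ggplot
stench <- days - 2.31 * days^2 + 0.15 * days^3 + 11 * days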
library(xkcd)
library(extrafont)

stench.plot <- 
  ggplot() + geom_line(aes(x = days, y = stench), color = "steelblue", lwd = 2) + 
  xkcdaxis(xrange = c(0,10), yrange = range(stench)) + 
  labs(x = "Days since last shower", y = "Stench", title = "The stench curve") +
  geom_segment(data = NULL, aes(x = 8, y = 25, xend = 8, yend = 40), lty = "dashed") +
  geom_segment(data = NULL, aes(x = 8, y = 25, xend = 10, yend = 25), lty = "dashed") +
  xkcdline(aes(x=4.5,y=22,xend=4.5,yend=24), data = NULL, xjitteramount = 0.12) +
  xkcdline(aes(x=9,y=24,xend=9,yend=20), data = NULL, xjitteramount = 0.12) +
  scale_x_continuous(breaks = 0:10) +
  annotate("text", x = 4.5, y = 26, label = "poop plateau", family="xkcd") +
  annotate("text", x = 9, y = 18, label = "point of no return", family="xkcd") +
  theme(text=element_text(size=16, family="xkcd"),
        axis.text.y = element_blank(), axis.ticks.y = element_blank()) 

Extra syntax: themes

stench.plot
The empirically-driven hiker stench curve (Goldspiel & Fuller, 2018).


Colors

Color palettes

Color palettes—examples

library(RColorBrewer) 
display.brewer.all(colorblindFriendly = TRUE) # colorblind friendly palettes

Color palettes

# the good old iris data (default colors)
def <- ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot(aes(fill = Species)) + 
  theme_minimal() + theme(legend.position = "top") + ggtitle("default")
# Max Ernst palette from the lisa package
library(lisa)
ernst <- def + scale_fill_manual(values = lisa$MaxErnst) + ggtitle("Woman, Old Man, and Flower")
ggarrange(def, ernst, ncol = 2) # ggarrange() comes from ggpubr, loaded earlier

Efficient graphics

  • Only including elements you need to convey your data, and nothing more

  • “Data-ink” versus “chartjunk”

  • Edward Tufte, the “data-ink ratio”

Efficient graphics

Spaghetti disaster.

  • Things to avoid:
    • Gratuitous use of colors
    • Overuse of precision (e.g., 3.1415926535897932384626433…)
      • Don’t go beyond three decimal places
      • Be careful with number of axis ticks and labels
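
A small sketch of trimming precision before numbers reach a label or a table, using base R only. The variable names are illustrative.

```r
x <- pi

rounded_value <- round(x, 3)                         # numeric, three decimal places
three_sig     <- signif(x, 3)                        # numeric, three significant digits
label_text    <- formatC(x, digits = 3, format = "f")  # character, fixed three decimals

rounded_value
three_sig
label_text
```

`round()` and `signif()` return numbers, while `formatC()` returns text that keeps trailing zeros, which is usually what an axis or table label needs.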

Efficient graphics

Climate trends in Galapagos.

Efficient graphics

Graphs are not self-promoting. They are used to convey quantitative information, not stylistic information. Effective graphs can (and should) be pretty, but focus should always be on function, not decoration. Decorative graphs are “ducks.”

  • Don’t make a duck.

Efficient graphics

  • When do you actually need bars?

    • Answer: Never. Bars are data summaries, and you can more efficiently (and effectively) convey data summaries with points and supplementary geoms to visualize distributions and uncertainty.
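
One possible version of that advice, sketched with the built-in `iris` data: show the raw observations as jittered points and overlay the group mean with its standard error, instead of a bar per group. The styling choices are arbitrary.

```r
library(ggplot2)

points_not_bars <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # Raw data, jittered horizontally only so the Y values stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.3, color = "grey40") +
  # Group mean with +/- one standard error
  stat_summary(fun.data = mean_se, geom = "pointrange", color = "midnightblue")

points_not_bars
```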

Honest graphics

  • Not distorting axes

  • Bergstrom & West (2016) “The theory of proportional ink”: using responsible sizes for shaded areas (e.g., bars, polygons). When a shaded area represents a numerical value, its size should be directly proportional to that value.
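
A worked example of how a truncated axis breaks proportional ink. The two values and the baseline are hypothetical numbers chosen for illustration.

```r
# Hypothetical values for two bars
values <- c(A = 50, B = 55)

# True ratio of the data: B is 10% larger than A
true_ratio <- values[["B"]] / values[["A"]]

# If the Y axis starts at 45 instead of 0, bar heights are measured from 45
baseline <- 45
ink_ratio <- (values[["B"]] - baseline) / (values[["A"]] - baseline)

true_ratio  # 1.1
ink_ratio   # 2
```

With the truncated axis, bar B uses twice the ink of bar A even though the value is only 10% larger, so bars should start at zero.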

Ethical graphs

Don’t include personal beliefs and biases in graphs.

Chiou and Bergey (2018)

Other resources:

Part II: Applications

Data to viz