Data Visualization

14 June, 2020

A brief primer on data viz

Core material

https://serialmentor.com/dataviz/proportional-ink.html

Why visualize?

We are visual beings
Graphics are effective forms of storytelling
Statistics can be deceptive when presented alone
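- For example, Anscombe’s Quartet: four small datasets with nearly identical means, variances, and correlations that look completely different once plotted.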
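To make the point about deceptive statistics concrete, here is a minimal sketch that computes the usual summary statistics for each of the four Anscombe pairs. It uses only the built-in `anscombe` data; the name `anscombe_summary` is ours, chosen for illustration.

```r
# Summary statistics for the four Anscombe pairs (x1, y1) ... (x4, y4)
data(anscombe)

anscombe_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset = i,
    mean_x  = mean(x),
    mean_y  = mean(y),
    var_x   = var(x),
    var_y   = var(y),
    cor_xy  = cor(x, y)
  )
}))

# Rounded to two decimals, the four rows are nearly indistinguishable
print(round(anscombe_summary, 2))
```

The numbers alone give no hint that one dataset is curved and another is dominated by a single outlier, which is why we plot.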

What actually makes a figure?

Grammar of Graphics—Wilkinson (1999)

A framework for describing components of a figure and the structure that underlies statistical graphics.

DATA — variables from datasets
TRANSformations — variable transformations
FRAME — a set of variables, combined with operators, defining a space
SCALE — scale transformations
COORDinates — a coordinate system
GRAPH — graphs and their aesthetic attributes
GUIDE — axes, legends

Layered Grammar of Graphics—Wickham (2010)

Hadley Wickham built on Wilkinson’s grammatical framework, describing a layered system that:

Has a different arrangement of grammatical components
Has a hierarchy of defaults
Can be easily implemented in a programming language (in our case R)

https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf

Layered Grammar of Graphics—Wickham (2010)

Layers of grammar

Data

Most commonly as a data frame (ggplot requires this structure)

data(anscombe) # let's continue with the Anscombe example
library(stargazer)
stargazer(anscombe, type = "html", summary = FALSE)


    x1  x2  x3  x4      y1      y2      y3      y4
 1  10  10  10   8   8.040   9.140   7.460   6.580
 2   8   8   8   8   6.950   8.140   6.770   5.760
 3  13  13  13   8   7.580   8.740  12.740   7.710
 4   9   9   9   8   8.810   8.770   7.110   8.840
 5  11  11  11   8   8.330   9.260   7.810   8.470
 6  14  14  14   8   9.960   8.100   8.840   7.040
 7   6   6   6   8   7.240   6.130   6.080   5.250
 8   4   4   4  19   4.260   3.100   5.390  12.500
 9  12  12  12   8  10.840   9.130   8.150   5.560
10   7   7   7   8   4.820   7.260   6.420   7.910
11   5   5   5   8   5.680   4.740   5.730   6.890

Aesthetics

The visual properties of a graphic element (position, color, fill, shape, size, transparency) that variables are mapped to with aes()

Aesthetics

library(ggplot2)
ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_blank()
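As a hedged illustration of how aesthetics work, the sketch below contrasts mapping an aesthetic to a variable (inside `aes()`) with setting it to a constant (outside `aes()`). It uses the built-in `iris` data rather than `anscombe` because it needs a grouping variable; the object names `mapped` and `fixed` are ours.

```r
library(ggplot2)

# Mapped aesthetic: color varies with a variable, so it goes inside aes()
mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Constant aesthetic: one fixed color for every point, set outside aes()
fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "midnightblue")
```

Printing `mapped` draws one color per species and adds a legend; printing `fixed` draws every point in the same color with no legend.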

Scale

The mapping between data values and aesthetic values

We can scale any aesthetic component to our liking based on our data:

Position: X/Y axes limits and breaks? Continuous? Discrete?
- e.g., scale_y_discrete(), scale_y_continuous(), scale_y_date()
Shape: what shapes?
- e.g., scale_shape()
Size: what sizes?
- e.g., scale_size()
Colors: which colors? Continuous? Discrete?
- e.g., scale_color_manual(), scale_fill_manual()
Line width/type: what widths, what types?
- e.g., scale_linewidth() (scale_size() in ggplot2 versions before 3.4.0), scale_linetype()

Scale

library(ggplot2)
ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_blank() +
  scale_x_continuous(limits = c(0,20), breaks = seq(0,20,5)) +
  scale_y_continuous(limits = c(0,20), breaks = seq(0,20,5))

Geometric objects

What physical features we actually put in the plot area.

The most vocabulary-rich layer.

Points
- geom_point()
Lines
- geom_line(), geom_hline(), geom_vline(), geom_abline(), geom_contour(), geom_path(), geom_segment()
Bars
- geom_bar() or geom_col()
Polygons
- geom_polygon(), geom_rect(), etc.
Tiles
- geom_tile(), geom_rect(), geom_raster(), etc.
Text, annotations
- geom_text(), geom_label(), annotate()

Geometric objects

ggplot(data = anscombe, aes(x = x1, y = y1)) + 
  geom_point()

Statistical objects

Geoms with statistical properties.

Boxplots
- geom_boxplot()
Histograms
- geom_histogram()
Distributions
- geom_density(), geom_rug(), geom_violin()
Error
- geom_errorbar(), geom_ribbon(), stat_summary()
Rugs
- geom_rug()
Other statistical summaries
- geom_smooth(), stat_summary()

Statistical objects

lm(y1~x1, data = anscombe)

## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Coefficients:
## (Intercept)           x1  
##      3.0001       0.5001

Statistical objects

ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(intercept = 3.0001, slope = 0.5001)
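Typing the coefficients by hand works, but it is easy to mistype them or forget to update them. A minimal sketch of two alternatives follows: pulling the values from the fitted model with `coef()`, or letting `geom_smooth()` fit and draw the line itself. The object names `fit`, `coefs`, `p_abline`, and `p_smooth` are ours.

```r
library(ggplot2)
data(anscombe)

fit   <- lm(y1 ~ x1, data = anscombe)
coefs <- coef(fit)  # named vector with "(Intercept)" and "x1"

# Option 1: reuse the fitted coefficients instead of retyping them
p_abline <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["x1"]])

# Option 2: let ggplot2 fit the linear model and draw it,
# including a confidence band (se = TRUE)
p_smooth <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
```

Option 1 extends the line across the whole panel, while option 2 draws it only over the range of the data and adds an uncertainty band.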

Facets

How can we reproduce Anscombe’s original quartet figure?

# first we need to reshape the data
anscombe.quartet <- data.frame(X = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
                               Y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4),
                               dataset = c(rep("I", 11), rep("II", 11), 
                                           rep("III", 11), rep("IV", 11)))
head(anscombe.quartet)

##    X    Y dataset
## 1 10 8.04       I
## 2  8 6.95       I
## 3 13 7.58       I
## 4  9 8.81       I
## 5 11 8.33       I
## 6 14 9.96       I
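The same reshaping can be sketched with `tidyr::pivot_longer()`, assuming the tidyr package is installed. The pattern `"(.)(.)"` splits each column name into its letter (`x` or `y`) and its dataset number. The names `anscombe_long` and `set` are ours, and the datasets end up labeled "1" to "4" rather than "I" to "IV".

```r
library(tidyr)
data(anscombe)

anscombe_long <- pivot_longer(
  anscombe,
  cols = everything(),
  # ".value" keeps the first character (x or y) as a column name;
  # the second character becomes the value of the new "set" column
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"
)

head(anscombe_long)
```

This is more compact than building the data frame by hand, at the cost of an extra dependency.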

Facets

Small multiples with facet_wrap() or facet_grid()

ggplot(data = anscombe.quartet, aes(x = X, y = Y)) +
  geom_point(color = "midnightblue") +
  geom_abline(intercept = 3.0001, slope = 0.5001) +
  facet_wrap(~dataset)

Coordinate system

Set of position scales combined with their geometrical arrangement

Cartesian: two-dimensional (X and Y), linear
- coord_cartesian() (default), coord_fixed() (alias: coord_equal())
Nonlinear (e.g., a log-transformed axis)
- coord_trans()
Curved (e.g., polar coordinate system), good for periodic data and geospatial data
- coord_polar(), coord_sf() for map projections
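As a small sketch of how the coordinate system changes a plot without changing its data or geoms, the example below draws the same columns in Cartesian and polar coordinates. The `monthly` data frame is hypothetical, with made-up counts chosen only to look periodic.

```r
library(ggplot2)

# Hypothetical periodic data: a made-up count for each month of the year
monthly <- data.frame(
  month = factor(month.abb, levels = month.abb),
  count = c(3, 4, 8, 12, 18, 25, 28, 26, 17, 10, 5, 3)
)

base <- ggplot(monthly, aes(x = month, y = count)) +
  geom_col()

cartesian <- base + coord_cartesian()  # the default: ordinary bars
polar     <- base + coord_polar()      # same data and geom, wrapped around a circle
```

In the polar version December sits next to January, which is the main reason to consider it for periodic data.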

Coordinate system

Yes, that includes spatial data (we can spend a whole day on spatial data).

Extra syntax: legends

To include or not?

When is it redundant, when is it useful?
Where to put them?
A standalone legend is not always necessary, for example when the groups are already labeled on an axis or directly on the plot.
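A minimal sketch of a redundant legend, using the built-in `iris` data: when the fill color repeats information already on the x-axis, the legend adds nothing and can be dropped with `theme(legend.position = "none")`. The object names are ours.

```r
library(ggplot2)

# Species is shown twice: on the x-axis and again in the fill legend
redundant <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()

# Same plot with the legend removed; the axis labels already identify the groups
no_legend <- redundant + theme(legend.position = "none")
```

The legend becomes useful again as soon as color encodes something that is not shown anywhere else on the plot.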

Extra syntax: themes

In ggplot we can modify virtually any thematic plot component with theme() and labs()

Axes, ticks, background grids—theme()
Axis labels, plot title—labs()

newthemes <- 
  ggplot(data = anscombe.quartet, aes(x = X, y = Y)) +
  geom_point(color = "midnightblue") +
  geom_abline(intercept = 3.0001, slope = 0.5001) +
  facet_wrap(.~dataset) + 
  labs(x = "new x-axis label", y = "new y-axis label", title = "Fancy title") +
  theme(axis.title = element_text(size = 16),
        axis.text = element_text(size = 14))

Extra syntax: themes

newthemes

Extra syntax: themes

We can also use the built-in themes from ggplot2 and themes from other packages such as ggthemes.

theme_gray() (default)

default <- 
  ggplot(data = anscombe.quartet, aes(x = X, y = Y)) +
  geom_point(color = "midnightblue") +
  geom_abline(intercept = 3.0001, slope = 0.5001) +
  facet_wrap(.~dataset) + ggtitle("default") + theme_gray()

Extra syntax: themes

default

Extra syntax: themes

Other ggplot2 pre-installed themes.

bw <- default + theme_bw() + ggtitle("theme_bw()")
classic <- default + theme_classic() + ggtitle("theme_classic()")
minimal <- default + theme_minimal() + ggtitle("theme_minimal()")

library(ggpubr)
theme.plots <- ggarrange(default, bw, classic, minimal, nrow = 2, ncol = 2)

Extra syntax: themes

theme.plots

Extra syntax: themes

Some themes from the ggthemes package.

library(ggthemes)
economist <- default + theme_economist() + ggtitle("theme_economist()")
wsj <- default + theme_wsj() + ggtitle("theme_wsj()")
fivethirtyeight <- default + theme_fivethirtyeight() + ggtitle("theme_fivethirtyeight()") 
tufte <- default + theme_tufte() + ggtitle("theme_tufte()")

ggthemes.plots <- ggarrange(economist, wsj, fivethirtyeight, tufte, 
                            nrow = 2, ncol = 2, align = "hv")

Extra syntax: themes

ggthemes.plots

Extra syntax: themes

You can even make an xkcd style figure:

days <- seq(0,10,1)
stench <- days - 2.31 * days^2 + 0.15 * days^3 + 11 * days  # I() is only needed inside model formulas
library(xkcd)
library(extrafont)

stench.plot <- 
  ggplot() + geom_line(aes(x = days, y = stench), color = "steelblue", lwd = 2) + 
  xkcdaxis(xrange = c(0,10), yrange = range(stench)) + 
  labs(x = "Days since last shower", y = "Stench", title = "The stench curve") +
  geom_segment(data = NULL, aes(x = 8, y = 25, xend = 8, yend = 40), lty = "dashed") +
  geom_segment(data = NULL, aes(x = 8, y = 25, xend = 10, yend = 25), lty = "dashed") +
  xkcdline(aes(x=4.5,y=22,xend=4.5,yend=24), data = NULL, xjitteramount = 0.12) +
  xkcdline(aes(x=9,y=24,xend=9,yend=20), data = NULL, xjitteramount = 0.12) +
  scale_x_continuous(breaks = 0:10) +
  annotate("text", x = 4.5, y = 26, label = "poop plateau", family="xkcd") +
  annotate("text", x = 9, y = 18, label = "point of no return", family="xkcd") +
  theme(text=element_text(size=16, family="xkcd"),
        axis.text.y = element_blank(), axis.ticks.y = element_blank())

Extra syntax: themes

stench.plot

The empirically-driven hiker stench curve (Goldspiel & Fuller, 2018).

Colors

http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

How to use effectively
Opacity is everything (alpha = ...)
Colorblind considerations
- ColorBrewer: http://colorbrewer2.org/
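The two points above, opacity and colorblind-friendly colors, can be sketched together. The example assumes the `diamonds` data that ships with ggplot2, and uses the viridis scale built into ggplot2 as one colorblind-friendly option among several.

```r
library(ggplot2)

# With about 54,000 points, full opacity hides the density of the data;
# a low alpha lets overlapping points build up into darker regions
p_alpha <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.1) +
  scale_color_viridis_d()  # discrete viridis palette
```

Note that `alpha` is set outside `aes()` because it is a constant here, not a mapping to a variable.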

Color palettes

Essentials:
- Scico
- Viridis
Other fun ones
- Wes Anderson
- Ghibli
- Lisa
The master reference
- R color palette aggregator

Or make your own palette with RColorBrewer!
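A hedged sketch of building a palette with RColorBrewer: `brewer.pal()` returns the colors of a named palette as hex codes, and base R's `colorRampPalette()` interpolates between them when more colors are needed. The palette choices ("Dark2", "YlGnBu") and the object names are ours, for illustration only.

```r
library(RColorBrewer)

# Eight qualitative colors from an existing ColorBrewer palette
base_cols <- brewer.pal(n = 8, name = "Dark2")

# Interpolate a 9-color sequential palette into 20 colors
ramp       <- colorRampPalette(brewer.pal(n = 9, name = "YlGnBu"))
my_palette <- ramp(20)

my_palette
```

The resulting vectors can be passed to `scale_color_manual(values = ...)` or `scale_fill_manual(values = ...)` in ggplot2.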

Color palettes—examples

library(RColorBrewer) 
display.brewer.all(colorblindFriendly = TRUE) # colorblind friendly palettes

Color palettes

# the good old iris data (default colors)
def <- ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot(aes(fill = Species)) + 
  theme_minimal() + theme(legend.position = "top") + ggtitle("default")
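# Max Ernst palette from the lisa package ("Woman, Old Man, and Flower")
library(lisa)
ernst <- def + scale_fill_manual(values = lisa$MaxErnst) + ggtitle("Woman, Old Man, and Flower")
ggarrange(def, ernst, ncol = 2) # ggarrange() comes from ggpubr, loaded earlier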

Efficient graphics

Only including elements you need to convey your data, and nothing more
“Data-ink” versus “chartjunk”
Edward Tufte, the “data-ink ratio”

Efficient graphics

Spaghetti disaster.

Things to avoid:
- Gratuitous use of colors
- Overuse of precision (e.g., 3.1415926535897932384626433…)
  - Don’t go beyond three decimal places
  - Be careful with number of axis ticks and labels

Efficient graphics

Climate trends in Galapagos.

Efficient graphics

Graphs are not self-promoting. They are used to convey quantitative information, not stylistic information. Effective graphs can (and should) be pretty, but focus should always be on function, not decoration. Decorative graphs are “ducks.”

Don’t make a duck.

Efficient graphics

When do you actually need bars?
- Answer: Never. Bars are data summaries, and you can more efficiently (and effectively) convey data summaries with points and supplementary geoms to visualize distributions and uncertainty.
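One way to sketch the alternative to bars, using the built-in `iris` data: show the raw observations as jittered points and overlay the mean with its standard error. This is only one possible combination of geoms, and the object name `p_points` is ours.

```r
library(ggplot2)

p_points <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # Raw data: horizontal jitter only, so the y values stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.4) +
  # Summary: group mean plus/minus one standard error
  stat_summary(fun.data = mean_se, geom = "pointrange", color = "midnightblue")
```

Unlike a bar of the mean, this shows the sample size, the spread, and the uncertainty in the same space.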

Honest graphics

Not distorting axes
Bergstrom & West (2016) “The theory of proportional ink”: use responsible sizes for shaded areas (e.g., bars, polygons). When a shaded area represents a numerical value, its size should be directly proportional to that value.
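A minimal sketch of proportional ink with bars, using a hypothetical two-group data frame (`scores`, with made-up values). With a zero baseline the bar areas are proportional to the values; zooming the y-axis keeps the values the same but breaks that proportionality.

```r
library(ggplot2)

# Hypothetical values: B is only about 5% larger than A
scores <- data.frame(group = c("A", "B"), value = c(95, 100))

# Honest: bars start at zero, so the ink is proportional to the values
honest <- ggplot(scores, aes(x = group, y = value)) +
  geom_col()

# Distorted: zooming to 90-100 leaves 5 units of A visible against 10 of B,
# so B looks twice as large as A
distorted <- honest + coord_cartesian(ylim = c(90, 100))
```

If the small difference is what matters, points on a zoomed axis are a more honest choice than truncated bars, because points do not encode value through area.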

Ethical graphs

Don’t include personal beliefs and biases in graphs.

Chiou and Bergey (2018)

Even if you think they are funny (https://tinyurl.com/sae2vpx).

Part II: Applications

Data to viz

The ultimate guide: https://www.data-to-viz.com/

Let’s go through this guide and show some examples of how to make these figures in R.

Other resources:

general figure tips

a piece by piece demo of building a nuanced plot in ggplot

STHDA: great ggplot2 tutorials

10 rules for better figures
