14 June, 2020
https://serialmentor.com/dataviz/proportional-ink.html
A framework for describing components of a figure and the structure that underlies statistical graphics.
Hadley Wickham built on Wilkinson’s grammatical framework, describing a layered system that:
https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf
Most commonly as a data frame (ggplot requires this structure)
data(anscombe) # let's continue with the Anscombe example
library(stargazer)
stargazer(anscombe, type = "html", summary = FALSE)
 | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
1 | 10 | 10 | 10 | 8 | 8.040 | 9.140 | 7.460 | 6.580 |
2 | 8 | 8 | 8 | 8 | 6.950 | 8.140 | 6.770 | 5.760 |
3 | 13 | 13 | 13 | 8 | 7.580 | 8.740 | 12.740 | 7.710 |
4 | 9 | 9 | 9 | 8 | 8.810 | 8.770 | 7.110 | 8.840 |
5 | 11 | 11 | 11 | 8 | 8.330 | 9.260 | 7.810 | 8.470 |
6 | 14 | 14 | 14 | 8 | 9.960 | 8.100 | 8.840 | 7.040 |
7 | 6 | 6 | 6 | 8 | 7.240 | 6.130 | 6.080 | 5.250 |
8 | 4 | 4 | 4 | 19 | 4.260 | 3.100 | 5.390 | 12.500 |
9 | 12 | 12 | 12 | 8 | 10.840 | 9.130 | 8.150 | 5.560 |
10 | 7 | 7 | 7 | 8 | 4.820 | 7.260 | 6.420 | 7.910 |
11 | 5 | 5 | 5 | 8 | 5.680 | 4.740 | 5.730 | 6.890 |
Every aspect of a given graphic element
library(ggplot2)
ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_blank()
The mapping between data values and aesthetic values
We can scale any aesthetic component to our liking based on our data:
scale_y_discrete(), scale_y_continuous(), scale_y_date()
scale_shape()
scale_size()
scale_color_manual(), scale_fill_manual()
scale_size(), scale_linetype()
library(ggplot2)
ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_blank() +
  scale_x_continuous(limits = c(0,20), breaks = seq(0,20,5)) +
  scale_y_continuous(limits = c(0,20), breaks = seq(0,20,5))
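Scales are not limited to the axes. As a minimal sketch of a scale mapping data values to aesthetic values, the example below colors the points by a made-up grouping (`x1_group`, invented here only for illustration) and assigns the colors by hand with `scale_color_manual()`.

```r
library(ggplot2)
data(anscombe)

# x1_group is a hypothetical variable, added to a copy of the data
# only so there is something categorical to map to color
anscombe_grouped <- transform(
  anscombe,
  x1_group = ifelse(x1 > median(x1), "high", "low")
)

scaled_plot <- ggplot(data = anscombe_grouped,
                      aes(x = x1, y = y1, color = x1_group)) +
  geom_point(size = 3) +
  # the scale maps each data value ("high", "low") to an aesthetic value (a color)
  scale_color_manual(values = c(high = "firebrick", low = "steelblue")) +
  scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20, 5)) +
  scale_y_continuous(limits = c(0, 20), breaks = seq(0, 20, 5))

scaled_plot
```

Swapping `scale_color_manual()` for another color scale changes the mapping without touching the data or the geom.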
What physical features we actually put in the plot area.
The most vocabulary-rich layer.
geom_point()
geom_line(), geom_hline(), geom_vline(), geom_abline(), geom_contour(), geom_path(), geom_segment()
geom_bar() or geom_col()
geom_polygon(), geom_rect(), etc.
geom_tile(), geom_rect(), geom_raster(), etc.
geom_text(), geom_label(), annotate()
ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point()
Geoms with statistical properties.
geom_boxplot()
geom_histogram()
geom_density(), geom_rug(), geom_violin()
geom_errorbar(), geom_ribbon(), stat_summary()
geom_rug()
geom_smooth(), stat_summary()
lm(y1~x1, data = anscombe)
##
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
##
## Coefficients:
## (Intercept)           x1
##      3.0001       0.5001
ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point() + geom_abline(intercept = 3.0001, slope = 0.5001)
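Typing the coefficients in by hand works, but it is easy to mistype them. One possible alternative, sketched here, pulls them from the fitted model with `coef()`, or lets `geom_smooth(method = "lm")` fit the same least-squares line inside the plot. The object names `fit`, `coefs`, and `fit_plot` are placeholders.

```r
library(ggplot2)
data(anscombe)

fit <- lm(y1 ~ x1, data = anscombe)
coefs <- coef(fit)  # named vector with elements "(Intercept)" and "x1"

fit_plot <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  # line drawn from the model object instead of hard-coded numbers
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["x1"]]) +
  # the same least-squares line, fitted by ggplot2; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = "dashed", color = "firebrick")

fit_plot
```

Note that `geom_abline()` spans the whole panel, while `geom_smooth()` only draws over the range of the data.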
Small multiples!
Going beyond two dimensions.
Small exercise: how many dimensions can you plot at once?
Answer: You can easily plot SEVEN dimensions in R and still have a reasonably clear figure if you include two axes (X, Y), two facet structures, one color aesthetic, one shape aesthetic, and one size aesthetic.
Eight dimensions if you do all of the above with some time animation (gganimate).
Nine dimensions if you do all that and add a third axis (e.g., a surface plot with a Z variable), though it is doubtful this will be useful to many people.
Ten dimensions if you use the secret spacetime() function in developer mode to distort general relativity in your data visualization and open a portal to graphics hell.
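A rough sketch of the seven-dimension case, using the built-in `mtcars` data as a stand-in (the choice of dataset and of which variable goes to which aesthetic is arbitrary and not from the original slides):

```r
library(ggplot2)
data(mtcars)

seven_dims <- ggplot(mtcars, aes(x = wt, y = mpg)) +      # dimensions 1 and 2: the two axes
  geom_point(aes(color = hp,                               # 3: color
                 shape = factor(cyl),                      # 4: shape
                 size = qsec),                             # 5: size
             alpha = 0.8) +
  facet_grid(am ~ vs, labeller = label_both) +             # 6 and 7: facet rows and columns
  labs(shape = "cyl")

seven_dims
```

Whether the result is still "reasonably clear" depends heavily on the data, so treat this as an upper bound rather than a target.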
How can we reproduce Anscombe’s original quartet figure?
# first we need to reshape the data
anscombe.quartet <- data.frame(X = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
                               Y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4),
                               dataset = c(rep("I", 11), rep("II", 11), rep("III", 11), rep("IV", 11)))
head(anscombe.quartet)
##    X    Y dataset
## 1 10 8.04       I
## 2  8 6.95       I
## 3 13 7.58       I
## 4  9 8.81       I
## 5 11 8.33       I
## 6 14 9.96       I
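The reshaping step can also be written as a loop over the four column pairs. This is a sketch in base R, not the original author's code; `quartet_long` and `labels` are placeholder names. It also checks the point of the quartet: the four datasets share nearly the same mean of Y and nearly the same fitted slope.

```r
data(anscombe)

# Stack the four x/y column pairs into long format (X, Y, dataset)
labels <- c("I", "II", "III", "IV")
quartet_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(X = anscombe[[paste0("x", i)]],
             Y = anscombe[[paste0("y", i)]],
             dataset = labels[i])
}))

# Mean of Y per dataset
y_means <- tapply(quartet_long$Y, quartet_long$dataset, mean)

# Least-squares slope per dataset
slopes <- sapply(split(quartet_long, quartet_long$dataset),
                 function(d) coef(lm(Y ~ X, data = d))[["X"]])

y_means
slopes
```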
Small multiples with facet_wrap() or facet_grid()
ggplot(data = anscombe.quartet, aes(x = X, y = Y)) + geom_point(color = "midnightblue") + geom_abline(intercept = 3.0001, slope = 0.5001) + facet_wrap(~dataset)
Set of position scales combined with their geometrical arrangement
coord_cartesian() (default), coord_equal(), coord_fixed()
coord_trans(), coord_munch()
coord_polar()
Yes, that includes spatial data (we can spend a whole day on spatial data).
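To illustrate how the coordinate system changes a figure without changing the data or the geom, here is a small sketch using the built-in `mtcars` data (an arbitrary choice): a single stacked bar becomes a pie chart once `coord_polar()` is added.

```r
library(ggplot2)
data(mtcars)

# One stacked bar: segment heights are counts of cars per cylinder class
stacked_bar <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  labs(x = NULL, fill = "cyl")

# Same data and geom, but the y position is drawn as an angle instead of a height
pie <- stacked_bar + coord_polar(theta = "y")

stacked_bar
pie
```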
To include or not?
When is it redundant, when is it useful?
Where to put them?
Not always necessary to have a standalone legend (for example, below).
In ggplot we can modify virtually any thematic plot component with theme() and labs()
Axes, ticks, background grids—theme()
Axis labels, plot title—labs()
newthemes <- ggplot(data = anscombe.quartet, aes(x = X, y = Y)) + geom_point(color = "midnightblue") + geom_abline(intercept = 3.0001, slope = 0.5001) + facet_wrap(.~dataset) + labs(x = "new x-axis label", y = "new y-axis label", title = "Fancy title") + theme(axis.title = element_text(size = 16), axis.text = element_text(size = 14))
newthemes
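If the same `theme()` tweaks are needed on many plots, one option is to wrap them in a small function. The sketch below assumes a helper called `theme_lecture()`, which is a made-up name and not part of ggplot2.

```r
library(ggplot2)
data(anscombe)

# theme_lecture() is a hypothetical helper, defined here for illustration
theme_lecture <- function(base_size = 14) {
  theme_minimal(base_size = base_size) +
    theme(axis.title = element_text(size = base_size + 2),
          panel.grid.minor = element_blank())
}

themed <- ggplot(data = anscombe, aes(x = x1, y = y1)) +
  geom_point(color = "midnightblue") +
  labs(x = "new x-axis label", y = "new y-axis label", title = "Fancy title") +
  theme_lecture()

themed
```

Calling `theme_set(theme_lecture())` once at the top of a script would apply it to every later plot in the session.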
We can also use some pre-installed themes from ggplot and other packages like ggthemes.
theme_gray() (default)
default <- ggplot(data = anscombe.quartet, aes(x = X, y = Y)) + geom_point(color = "midnightblue") + geom_abline(intercept = 3.0001, slope = 0.5001) + facet_wrap(.~dataset) + ggtitle("default") + theme_gray()
default
Other ggplot2 pre-installed themes.
bw <- default + theme_bw() + ggtitle("theme_bw()")
classic <- default + theme_classic() + ggtitle("theme_classic()")
minimal <- default + theme_minimal() + ggtitle("theme_minimal()")
library(ggpubr)
theme.plots <- ggarrange(default, bw, classic, minimal, nrow = 2, ncol = 2)
theme.plots
Some themes from the ggthemes package.
library(ggthemes)
economist <- default + theme_economist() + ggtitle("theme_economist()")
wsj <- default + theme_wsj() + ggtitle("theme_wsj()")
fivethirtyeight <- default + theme_fivethirtyeight() + ggtitle("theme_fivethirtyeight()")
tufte <- default + theme_tufte() + ggtitle("theme_tufte()")
ggthemes.plots <- ggarrange(economist, wsj, fivethirtyeight, tufte, nrow = 2, ncol = 2, align = "hv")
ggthemes.plots
You can even make an xkcd style figure:
days <- seq(0,10,1)
stench <- days - 2.31*I(days^2) + 0.15*I(days^3) + 11*days
library(xkcd)
library(extrafont)
stench.plot <- ggplot() +
  geom_line(aes(x = days, y = stench), color = "steelblue", lwd = 2) +
  xkcdaxis(xrange = c(0,10), yrange = range(stench)) +
  labs(x = "Days since last shower", y = "Stench", title = "The stench curve") +
  geom_segment(data = NULL, aes(x = 8, y = 25, xend = 8, yend = 40), lty = "dashed") +
  geom_segment(data = NULL, aes(x = 8, y = 25, xend = 10, yend = 25), lty = "dashed") +
  xkcdline(aes(x=4.5,y=22,xend=4.5,yend=24), data = NULL, xjitteramount = 0.12) +
  xkcdline(aes(x=9,y=24,xend=9,yend=20), data = NULL, xjitteramount = 0.12) +
  scale_x_continuous(breaks = 0:10) +
  annotate("text", x = 4.5, y = 26, label = "poop plateau", family="xkcd") +
  annotate("text", x = 9, y = 18, label = "point of no return", family="xkcd") +
  theme(text=element_text(size=16, family="xkcd"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
stench.plot
The empirically-driven hiker stench curve (Goldspiel & Fuller, 2018).
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
How to use effectively
Opacity is everything (alpha = ...)
Colorblind considerations
RColorBrewer!
library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE) # colorblind friendly palettes
# the good old iris data (default colors)
def <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(fill = Species)) +
  theme_minimal() +
  theme(legend.position = "top") +
  ggtitle("default")
# max ernst colors
library(lisa)
dali <- def + scale_fill_manual(values = lisa$MaxErnst) + ggtitle("Woman, Old Man, and Flower")
ggarrange(def, dali, ncol = 2)
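The opacity and colorblind points above can be combined in one sketch. The data here are simulated (made up for illustration), and "Dark2" is used as one of the palettes RColorBrewer flags as colorblind friendly.

```r
library(ggplot2)

# Simulated, heavily overlapping points (not real data)
set.seed(1)
overlap <- data.frame(
  x = rnorm(3000),
  y = rnorm(3000),
  group = sample(c("a", "b", "c"), 3000, replace = TRUE)
)

# Fully opaque points: dense regions turn into a solid blob
opaque <- ggplot(overlap, aes(x, y, color = group)) +
  geom_point() +
  ggtitle("alpha = 1")

# Semi-transparent points reveal density; the Brewer palette replaces the default hues
transparent <- ggplot(overlap, aes(x, y, color = group)) +
  geom_point(alpha = 0.2) +
  scale_color_brewer(palette = "Dark2") +
  ggtitle("alpha = 0.2")

opaque
transparent
```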
Only including elements you need to convey your data, and nothing more
“Data-ink” versus “chartjunk”
Edward Tufte, the “data-ink ratio”
Spaghetti disaster.
Climate trends in Galapagos.
Graphs are not self-promoting. They are used to convey quantitative information, not stylistic information. Effective graphs can (and should) be pretty, but focus should always be on function, not decoration. Decorative graphs are “ducks.”
When do you actually need bars?
Not distorting axes
Bergstrom & West (2016) “The theory of proportional ink”: use responsible sizes for shaded areas (e.g., bars, polygons). When a shaded area represents a numeric value, its size should be proportional to that value.
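A small sketch of what proportional ink means for bars, using made-up numbers: with a zero baseline the bar heights match the data, while a truncated axis exaggerates the difference.

```r
library(ggplot2)

# Made-up values, for illustration only
sales <- data.frame(product = c("A", "B"), units = c(95, 100))

# Bars start at zero, so bar height is proportional to the value
honest <- ggplot(sales, aes(product, units)) +
  geom_col()

# Zooming the y-axis to 90-100 cuts off most of each bar
truncated <- honest + coord_cartesian(ylim = c(90, 100), expand = FALSE)

# Ratio of B to A in the data versus in the visible bar heights of the truncated plot
true_ratio <- sales$units[2] / sales$units[1]
apparent_ratio <- (sales$units[2] - 90) / (sales$units[1] - 90)

honest
truncated
```

Here B is about 5% larger than A in the data, but its visible bar in the truncated plot is twice as tall.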
Don’t include personal beliefs and biases in graphs.
Chiou and Bergey (2018)
https://socviz.co/refineplots.html
ggplot: https://www.kdnuggets.com/2019/07/evolution-ggplot.html
ggplot2 tutorials: http://www.sthda.com/english/wiki/data-visualization
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833
The ultimate guide: https://www.data-to-viz.com/
Let’s go through this guide and show some examples of how to make these figures in R.