How would you plot the proportion of observations by a categorical variable, for multiple facets of your data? Here we do it with the tidyverse.
library(tidyverse)
#> -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.1.0 v purrr 0.2.5
#> v tibble 2.0.0 v dplyr 0.7.8
#> v tidyr 0.8.2 v stringr 1.3.1
#> v readr 1.3.1 v forcats 0.3.0
#> -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
We use a (private) dataset on the damage recorded on dead and alive trees at three different sites.
#> # A tibble: 1,842 x 9
#> tag sp dbh status b_u break_h crown_status comments site
#> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 3 CALCAL 17 A B 5 NA BP=236, STIL~ EV1
#> 2 5 RHEPOR 10 A U NA 40 NRP EV1
#> 3 8 HOMRAC 32 A B 8 NA <NA> EV1
#> 4 9 CASARB 14 A B 6.5 NA <NA> EV1
#> 5 10 MICTET 10 D B 6 NA <NA> EV1
#> 6 14 OCOLEU 12 A U NA 20 NRP EV1
#> 7 17 COCSWA 14 A B/U 9 30 H=1.40, W=2.~ EV1
#> 8 18 COCSWA 11 A <NA> NA 70 <NA> EV1
#> 9 34 MIRCHR 33 A <NA> NA 30 <NA> EV1
#> 10 65 COCSWA 10 A U NA 70 H=0.80, W=1.~ EV1
#> # ... with 1,832 more rows
Clean data
(This section is mainly cosmetic – not the main point of this post.)
We recode the values of three variables to make them more informative.
dmg <- dmg %>%
mutate(
site = case_when(
site == "EV1" ~ "M1",
site == "SB1" ~ "M2",
site == "SB3" ~ "OG"
)
) %>%
mutate(
b_u = case_when(
b_u == "B" ~ "broken",
b_u == "U" ~ "uprooted",
b_u == "B/U" ~ "both",
is.na(b_u) ~ "none"
)
) %>%
mutate(
status = case_when(
status == "A" ~ "Alive",
status == "D" ~ "Dead"
)
)
We also rename a variable, make it a factor, and reorder its levels, so that they appear in the plot in the order that best tells the story we want to tell with this dataset.
dmg <- dmg %>%
rename(stem_damage = b_u) %>%
mutate(stem_damage = lvls_reorder(factor(stem_damage), c(4, 2, 1, 3)))
dmg
#> # A tibble: 1,842 x 9
#> tag sp dbh status stem_damage break_h crown_status comments site
#> <dbl> <chr> <dbl> <chr> <fct> <dbl> <dbl> <chr> <chr>
#> 1 3 CALC~ 17 Alive broken 5 NA BP=236,~ M1
#> 2 5 RHEP~ 10 Alive uprooted NA 40 NRP M1
#> 3 8 HOMR~ 32 Alive broken 8 NA <NA> M1
#> 4 9 CASA~ 14 Alive broken 6.5 NA <NA> M1
#> 5 10 MICT~ 10 Dead broken 6 NA <NA> M1
#> 6 14 OCOL~ 12 Alive uprooted NA 20 NRP M1
#> 7 17 COCS~ 14 Alive both 9 30 H=1.40,~ M1
#> 8 18 COCS~ 11 Alive none NA 70 <NA> M1
#> 9 34 MIRC~ 33 Alive none NA 30 <NA> M1
#> 10 65 COCS~ 10 Alive uprooted NA 70 H=0.80,~ M1
#> # ... with 1,832 more rows
Now we compute on each site separately. We create a helper function to calculate the proportion of stem damage in each status category. Our helper does the computation for one site.
# Calculate the proportion of each damage category in a single site
proportion_each_site <- function(dmg_site) {
dmg_site %>%
select(status, stem_damage) %>%
count(status, stem_damage) %>%
# The column `n` results from `count()`
mutate(total = sum(n, na.rm = TRUE), proportion = n / total)
}
Now we apply the helper function to each site with map()
. We use nest()
to work within a dataframe.
dmg_prop <- dmg %>%
nest(-site) %>%
mutate(data = map(data, ~proportion_each_site(.x))) %>%
unnest()
dmg_prop
#> # A tibble: 24 x 6
#> site status stem_damage n total proportion
#> <chr> <chr> <fct> <int> <int> <dbl>
#> 1 M1 Alive uprooted 83 1214 0.0684
#> 2 M1 Alive broken 155 1214 0.128
#> 3 M1 Alive both 8 1214 0.00659
#> 4 M1 Alive none 824 1214 0.679
#> 5 M1 Dead uprooted 32 1214 0.0264
#> 6 M1 Dead broken 55 1214 0.0453
#> 7 M1 Dead both 5 1214 0.00412
#> 8 M1 Dead none 52 1214 0.0428
#> 9 M2 Alive uprooted 34 331 0.103
#> 10 M2 Alive broken 90 331 0.272
#> # ... with 14 more rows
Finally we plot each facet of our data in a different panel.
dmg_prop %>%
ggplot(aes(status, proportion)) +
geom_col(aes(fill = stem_damage)) +
facet_wrap(vars(site))
Note that the values within each panel add to 1.
What did NOT work
We first calculated the proportion for each site directly within the ggplot with stat()
, hoping that the calculation would be applied to each facet of the data independently. But this didn’t work: The proportion was calculated not for each facet independently but for the entire dataset.
ggplot(dmg, aes(status)) +
facet_wrap(vars(site)) +
geom_bar(aes(fill = stem_damage, y = stat(count / sum(count)))) +
labs(y = "proportion")
If you know how to adapt this approach to plot what we want please let us know.