How would you plot the proportion of observations by a categorical variable, for multiple facets of your data? Here we do it with the tidyverse.

library(tidyverse)
#> -- Attaching packages ------------------------------------------ tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.5
#> v tibble  1.4.2     v dplyr   0.7.6
#> v tidyr   0.8.1     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.3.0
#> -- Conflicts --------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

We use a (private) dataset on the damage recorded on dead and alive trees at three different sites.

#> # A tibble: 1,842 x 9
#>      tag sp       dbh status b_u   break_h crown_status comments     site 
#>    <int> <chr>  <int> <chr>  <chr>   <dbl>        <int> <chr>        <chr>
#>  1     3 CALCAL    17 A      B         5             NA BP=236, STI~ EV1  
#>  2     5 RHEPOR    10 A      U        NA             40 NRP          EV1  
#>  3     8 HOMRAC    32 A      B         8             NA <NA>         EV1  
#>  4     9 CASARB    14 A      B         6.5           NA <NA>         EV1  
#>  5    10 MICTET    10 D      B         6             NA <NA>         EV1  
#>  6    14 OCOLEU    12 A      U        NA             20 NRP          EV1  
#>  7    17 COCSWA    14 A      B/U       9             30 H=1.40, W=2~ EV1  
#>  8    18 COCSWA    11 A      <NA>     NA             70 <NA>         EV1  
#>  9    34 MIRCHR    33 A      <NA>     NA             30 <NA>         EV1  
#> 10    65 COCSWA    10 A      U        NA             70 H=0.80, W=1~ EV1  
#> # ... with 1,832 more rows

Clean data

(This section is mainly cosmetic – not the main point of this post.)

We recode the values of three variables to make them more informative.

dmg <- dmg %>% 
  mutate(
    site = case_when(
      site == "EV1" ~ "M1",
      site == "SB1" ~ "M2",
      site == "SB3" ~ "OG"
    )
  ) %>% 
  mutate(
    b_u = case_when(
      b_u == "B" ~ "broken",
      b_u == "U" ~ "uprooted",
      b_u == "B/U" ~ "both",
      is.na(b_u) ~ "none"
    )
  ) %>% 
  mutate(
    status = case_when(
      status == "A" ~ "Alive",
      status == "D" ~ "Dead"
    )
  )

We also rename a variable, make it a factor, and reorder its levels, so that they appear in the plot in the order that best tells the story we want to tell with this dataset.

dmg <- dmg %>% 
  rename(stem_damage = b_u) %>%
  mutate(stem_damage = lvls_reorder(factor(stem_damage), c(4, 2, 1, 3)))
dmg
#> # A tibble: 1,842 x 9
#>      tag sp      dbh status stem_damage break_h crown_status commen~ site 
#>    <int> <chr> <int> <chr>  <fct>         <dbl>        <int> <chr>   <chr>
#>  1     3 CALC~    17 Alive  broken          5             NA BP=236~ M1   
#>  2     5 RHEP~    10 Alive  uprooted       NA             40 NRP     M1   
#>  3     8 HOMR~    32 Alive  broken          8             NA <NA>    M1   
#>  4     9 CASA~    14 Alive  broken          6.5           NA <NA>    M1   
#>  5    10 MICT~    10 Dead   broken          6             NA <NA>    M1   
#>  6    14 OCOL~    12 Alive  uprooted       NA             20 NRP     M1   
#>  7    17 COCS~    14 Alive  both            9             30 H=1.40~ M1   
#>  8    18 COCS~    11 Alive  none           NA             70 <NA>    M1   
#>  9    34 MIRC~    33 Alive  none           NA             30 <NA>    M1   
#> 10    65 COCS~    10 Alive  uprooted       NA             70 H=0.80~ M1   
#> # ... with 1,832 more rows

Now we compute on each site separately. We create a helper function to calculate the proportion of stem damage in each status category. Our helper does the computation for one site.

# Calculate the proportion of each damage category in a single site
proportion_each_site <- function(dmg_site) {
  dmg_site %>% 
    select(status, stem_damage) %>% 
    count(status, stem_damage) %>% 
    # The column `n` results from `count()`
    mutate(total = sum(n, na.rm = TRUE), proportion = n / total)
}

Now we apply the helper function to each site with map(). We use nest() to work within a dataframe.

dmg_prop <- dmg %>% 
  nest(-site) %>% 
  mutate(data = map(data, ~proportion_each_site(.x))) %>% 
  unnest()
dmg_prop
#> # A tibble: 24 x 6
#>    site  status stem_damage     n total proportion
#>    <chr> <chr>  <fct>       <int> <int>      <dbl>
#>  1 M1    Alive  uprooted       83  1214    0.0684 
#>  2 M1    Alive  broken        155  1214    0.128  
#>  3 M1    Alive  both            8  1214    0.00659
#>  4 M1    Alive  none          824  1214    0.679  
#>  5 M1    Dead   uprooted       32  1214    0.0264 
#>  6 M1    Dead   broken         55  1214    0.0453 
#>  7 M1    Dead   both            5  1214    0.00412
#>  8 M1    Dead   none           52  1214    0.0428 
#>  9 M2    Alive  uprooted       34   331    0.103  
#> 10 M2    Alive  broken         90   331    0.272  
#> # ... with 14 more rows

Finally we plot each facet of our data in a different panel.

dmg_prop %>% 
  ggplot(aes(status, proportion)) +
    geom_col(aes(fill = stem_damage)) +
    facet_wrap(vars(site))

Note that the values within each panel add to 1.

What did NOT work

We first calculated the proportion for each site directly within the ggplot with stat(), hoping that the calculation would be applied to each facet of the data independently. But this didn’t work: The proportion was calculated not for each facet independently but for the entire dataset.

ggplot(dmg, aes(status)) +
  facet_wrap(vars(site)) +
  geom_bar(aes(fill = stem_damage, y = stat(count / sum(count)))) +
  labs(y = "proportion")

If you know how to adapt this approach to plot what we want please let us know.