John P drew our attention to various species that were not shown as common in the NVC standards, but which we frequently find in our sites, notwithstanding they have been assessed as belonging to a community that “should” not have them at high frequency. Prominent among these is Stellaria graminea. The purpose of this notebook is to track the development of an application to find these unconforming species.
Start by loading the data
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
source("db_extract.R")
## Loading required package: DBI
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## `summarise()` has grouped output by 'species_id'. You can override using the
## `.groups` argument.
the_data <- GetTheData()
And we’re going to need species_frequency by survey; but maybe could decide to go by community?
frequency_by_survey <- FrequencyBysurvey(the_data)
## `summarise()` has grouped output by 'survey_id'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'survey_id'. You can override using the
## `.groups` argument.
frequency_by_community <- FrequencyByCommunity(the_data)
## `summarise()` has grouped output by 'community', 'species_name'. You can
## override using the `.groups` argument.
Let’s plot mean frequency by survey vs frequency by community for Stellaria graminea:
fba <- frequency_by_survey %>% select(survey_id, species_name, community, freq)
fbc <- frequency_by_community %>% select(species_name,community, freq)
jf <- left_join(fbc, fba, by = c("community", "species_name")) %>% filter(species_name == "Stellaria_graminea" & grepl("MG", community))
f <- ggplot(jf, aes(freq.x, freq.y, colour = community)) +
geom_point()+
xlab("frequency by community") + ylab("frequency by survey")+
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1, colour ="red"))
print(f)
This is for S. graminea and the MG communities. So (a) there are lots more surveys than there are communities (of course) and (b) frequency by survey tends to be larger than frequency by community; aggregating over community has smoothed out a lot of variation. My feeling is that it will be more illuminating to eliminate the survey variable to start with, we can explore variability with survey later.
Next question: how to compare with standard values. Explore select into table mg_rodwell in the meadows database … in which we have columns community, species_id (and name, but not to be relied on) and p_central, the central frequency of the categories I .. V.
Stellaria graminea id 139. It so happens that p_central == 0.1 for S. graminea for all MG communities: SELECT Community, species_id, p_central FROM meadows.mg_rodwell where Community like “MG%” && species_id = 139; ( I wonder whether I can communicate directly with the DB from here using SQL?).
Anyway, it seems quite possible to collect the p_central values for all species and plot against frequency by community
q <- "SELECT Community, species_id, p_central FROM meadows.mg_rodwell where Community like 'MG%';"
std_freqs <- query(q)
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
and match up with species frequencies by community
rm(jf) # We don't need it any more
jf1 <- left_join(frequency_by_community, std_freqs, by = c("community"="Community", "species_id"))
Lots of NAs here; (a) because mires (M) are included in the survey data but not in the mg_rodwell table - yet - because as its name implies, it is just the MG communities. And (b), because it seems we have detected some species that the standards don’t list in those communities at all. So (a), filter out all the mires; (b) set the remaining NAs to zero.
# Would be easier if we'd excluded mires from the_data to start with!
jf2 <- jf1 %>% filter(grepl("MG", community)) %>% replace_na(list(p_central = 0))
rm(jf1)
So now we want filter to include only cases where CrI5 > p_central or CrI95 < p_central, and then plot (survey) frequencies against standard frequencies (p_central)
jf3 <- jf2 %>% filter(CrI5 > p_central || CrI95 < p_central)
f2 <- ggplot(jf3, aes(p_central, freq, colour = community)) +
geom_jitter(aes(text = species_name), size = 3) + #, name = "species_name")+
xlab("Standard frequency") + ylab("Survey frequency")+
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1, colour ="red"))
## Warning: Ignoring unknown aesthetics: text
ggplotly(f2) %>% layout(legend = list(orientation = "h", x = 0.4, y = -0.2))
This gives a surprising number of species in strong disagrrement with the standard (we may ignore the big cluster near (0, 0) because of the way CrI5 is calculated - it can be very near zero, but not actually zero). Interactive plot: is this useful?
Even so, there’s too many species here for this to be a useful way of exploring the data. Filter out low frequency species, and then perhaps present the data as an interactive table.
There are 22990 records. Stellaria graminea accounts for 540 or 2.3488473%, so suggest reducing the data to species with less than 2% of the count. To start with, we have 300 species.
d1 <- (the_data
%>% select(records_id, species_id)
%>% group_by(species_id)
%>% summarise(cnt = n())
%>% mutate(frac = cnt/nrow(the_data))
%>% filter(frac >= 0.02))
Removing those accounting for less than 2% of the records leaves 13 species. Make the reduced dataset and plot survey frequency vs standard again, coded by species:
reduced <- left_join(d1, the_data, by = "species_id")
jf4 <- left_join(d1, jf3, by = "species_id")
f3 <- ggplot(jf4, aes(p_central, freq, colour = species_name)) +
geom_jitter(aes(text = community), size = 3) +
xlab("Standard frequency") + ylab("Survey frequency")+
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), colour ="red")
## Warning: Ignoring unknown aesthetics: text
ggplotly(f3) %>% layout(legend = list(orientation = "h", x = 0.4, y = -0.2))
This looks much more useful. Community is now available in the interactive part. Using 13 classes: on reflection (a) it might be better to use the top 12 classes (allowing to use ColorBrewer Paired Class 12 for the legend); and (b), ultimately to analyse by community (group, as used for poster BES2019) and select the top twelve species for each community.
Doing (b) first: sort out mg5a data:
cf <- head(the_data
%>% filter(community == "MG5a")
%>% group_by(species_id)
%>% summarise(hits = n()) %>% arrange(desc(hits)), 12)
d_mg5a <- the_data %>% filter(community == "MG5a") %>% right_join(cf, by = "species_id")
fc_mg5a <- FrequencyByCommunity(d_mg5a)
## `summarise()` has grouped output by 'community', 'species_name'. You can
## override using the `.groups` argument.
# Add standard frequencies
fc_mg5a <- (left_join(fc_mg5a, std_freqs, by = c("community"="Community", "species_id"))
%>% mutate(std_freq = replace_na(p_central, 0))
%>% select(-p_central))
f4 <- ggplot(fc_mg5a, aes(std_freq, freq, colour = species_name)) +
geom_pointrange(aes(ymin = CrI5, ymax = CrI95, text = community), size = 3) +
scale_colour_brewer(palette="Paired") +
xlab("Standard frequency") + ylab("Survey frequency") +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), colour ="red")
## Warning: Ignoring unknown aesthetics: text
ggplotly(f4) %>% layout(legend = list(orientation = "h", x = 0.4, y = -0.2))
The credibility intervals are tiny because we are working with 200 - 300 hits and over 400 trials. It will be interesting to see how this changes on an survey basis, and in communities where we have less data.
Let’s try the same thing but on an survey basis:
fa_mg5a <- FrequencyBysurvey(d_mg5a)
## `summarise()` has grouped output by 'survey_id'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'survey_id'. You can override using the
## `.groups` argument.
# Add standard frequencies
fa_mg5a <- (left_join(fa_mg5a, std_freqs, by = c("community"="Community", "species_id"))
%>% filter(!is.na(species_id))
%>% filter(!is.na(community))
%>% mutate(std_freq = replace_na(p_central, 0))
%>% select(-p_central))
pd <- position_dodge(0.2)
f5 <- ggplot(fa_mg5a, aes(std_freq, freq, colour = species_name)) +
geom_pointrange(aes(text = paste(assembly_name, community), ymin = CrI5, ymax = CrI95), size = 1, position = pd) +
scale_colour_brewer(palette="Paired") +
xlab("Standard frequency") + ylab("Survey frequency") +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), colour ="red")
## Warning: Ignoring unknown aesthetics: text
ggplotly(f5) %>% layout(legend = list(orientation = "h", x = 0.4, y = -0.2))
Which certainly makes the point about the credibility intervals, but is not otherwise very useful, so have to think of another way to deal with survey-level analysis.
Single survey is nice:
f6 <- ggplot(head(fa_mg5a, 11), aes(std_freq, freq, colour = species_name)) +
geom_pointrange(aes(text = assembly_name, ymin = CrI5, ymax = CrI95), size = 1, position = pd) +
scale_colour_brewer(palette="Paired") +
xlab("Standard frequency") + ylab("Survey frequency") +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), colour ="red")
## Warning: Ignoring unknown aesthetics: text
ggplotly(f6) %>% layout(legend = list(orientation = "h", x = 0.4, y = -0.2))
but really needs interactivity so we can search survey-by-survey. Bear in mind we have total 251 surveys, of which 75 are MG5a.
Looks like this will need special treatment with a Shiny app, beyond the scope of this exploratory notebook.