QC vegetation data
Nathan Roe
2024-12-11
Source:vignettes/qc_vegetation_data.Rmd
qc_vegetation_data.Rmd
library
About
ecositer
has a category of functions used for QC, all of
which use a “QC_” prefix. Vegetation data stored in NASIS requires QC.
There are numerous issues that exist caused by different data vintages,
data origins, varying data quality, duplication caused by KSSL lab
pedons, outdated plant taxonomies, and more. This vignette discusses a
best practice methodology for QC NASIS vegetation data.
You should start your QC process by removing data at the coarsest level that meets your objectives. This will ensure that you do not spend time fixing obscure edge cases that are ultimately removed.
To begin with, we are going to build our dataset:
D104_veg <- ecositer::create_veg_df(from = "web_report",
ecositeid = "F022AD104CA")
Choose the best vegetation plot for site
Sites may have multiple vegetation plots with differing data
qualities. If there are multiple vegetation plots associated with a
site, QC_best_vegplot_for_site()
chooses the vegetation
plot with the most records.
D104_veg_best <- ecositer::QC_best_vegplot_for_site(veg_df = D104_veg)
## Warning -> There are sites with multiple vegetation plots. Reviewing these sites is preferable to automated selection. To view which sites have multiple vegetation plots:
## 'Your veg_df' |> dplyr::group_by(siteiid) |>
## dplyr::summarise(unique_vegplots = dplyr::n_distinct(vegplotiid)) |>
## dplyr::filter(unique_vegplots > 1)
To look at all the sites with multiple vegetation plots:
D104_multi_veg <- D104_veg |> dplyr::group_by(siteiid) |>
dplyr::summarise(unique_vegplots = dplyr::n_distinct(vegplotiid)) |>
dplyr::mutate(has_multiple_vegplots = unique_vegplots > 1) |>
dplyr::filter(has_multiple_vegplots == TRUE)
head(D104_multi_veg)
## # A tibble: 6 × 3
## siteiid unique_vegplots has_multiple_vegplots
## <chr> <int> <lgl>
## 1 1146902 2 TRUE
## 2 1169814 2 TRUE
## 3 1169819 2 TRUE
## 4 1169825 2 TRUE
## 5 1169826 2 TRUE
## 6 1169880 2 TRUE
Creating an aggregated abundance column
Now, we are going to create an aggregated abundance column called, “pct_cover”. There are four columns where abundance data could be stored: akstratumcoverclasspct, speciescancovpct, speciescomppct, understorygrcovpct. A couple of functions will be used to inspect what abundance columns are used in your dataset and whether any records in your dataset have multiple abundance columns used.
QC_aggregate_abundance()
aggregates the four abundance
columns into the new “pct_cover” column. If any records have multiple
abundance columns used, this function will average them and issue a
warning or error depending on the fail_on_dup
argument. For
details on this function - ?QC_aggregate_abundance()
.
D104_veg_agg <- ecositer::QC_aggregate_abundance(veg_df = D104_veg_best)
## Warning in ecositer::QC_aggregate_abundance(veg_df = D104_veg_best): Multiple
## abundance columns are used in this dataset: akstratumcoverclasspct,
## speciescancovpct
Note the two warnings above. As mentioned in the warning,
QC_find_multiple_abundance()
will allow you to inspect
which records have abundance data in multiple columns. These should be
inspected to determine why multiple abundance columns are used. The
other warning shows that two total abundance columns were used in this
dataset.
abund_dups <- ecositer::QC_find_multiple_abundance(veg_df = D104_veg_best)
abund_dups
## [1] siteiid usiteid
## [3] siteobsiid vegplotid
## [5] vegplotiid primarydatacollector
## [7] vegdataorigin ecositeid
## [9] ecositenm ecostateid
## [11] ecostatename commphaseid
## [13] commphasename cancovtotalpct
## [15] cancovtotalclass overstorycancontotalpct
## [17] overstorycancovtotalclass plantsym
## [19] plantsciname plantnatvernm
## [21] akstratumcoverclasspct speciescancovpct
## [23] speciescomppct understorygrcovpct
## [25] speciestraceamtflag vegetationstratalevel
## [27] akstratumcoverclass plantheightcllowerlimit
## [29] plantheightclupperlimit planttypegroup
## [31] horizdatnm utmzone
## [33] utmeasting utmnorthing
## [35] latdegrees latminutes
## [37] latseconds latdir
## [39] longdegrees longminutes
## [41] longseconds longdir
## [43] latstDD longstDD
## <0 rows> (or 0-length row.names)
Choosing coordinate format
The NASIS site table has several possible columns where location data
can be populated. The three categories are UTM; lat/long degrees,
minutes, seconds (DMS), and lat/long decimal degrees.
QC_location_data()
reports what percent of the required
columns are populated for each of these three categories and prompts the
user to specify which category of coordinate format they would like to
use. By default, coordinate_format = NULL
, this function is
interactive. I recommend using it that way. Alternatively, users can
specify the desired coordinate format using the
coordinate_format
argument. This option is used here
because it performs better in an Rmarkdown document.
D104_veg_agg <- ecositer::QC_location_data(veg_df = D104_veg_agg,
coordinate_format = "DD")
## Note -> UTM has 35.2907961603614% completeness
## Note -> Lat/long DMS has 35.2907961603614% completeness
## Note -> Standardized Lat/long DD has 100% completeness. These columns in NASIS imply WGS84 datum is used,
## therefore datum column is not inspected.
# test2 <- ecositer::QC_location_data(veg_df = D104_veg_agg,
# coordinate_format = "DD")
Assign minimum criteria for vegetation data quality
Now that we have an aggregated abundance column, we can begin with
our coarse filter of the data. First we will use
QC_veg_completeness()
to describe the completeness and
quality of vegplot observations.
ecositer::QC_veg_completeness(veg_df = D104_veg_agg) |> head()
## siteiid siteobsiid vegplotiid primarydatacollector vegdataorigin
## 1 1093611 1069569 277250 site existing veg table
## 2 968053 950512 958427 DE spreadsheet form
## 3 888801 864161 954378
## 4 949201 922060 958629 DE spreadsheet form
## 5 949204 922063 276227 site existing veg table
## 6 949214 922073 958642 DESK spreadsheet form
## total_records unique_species percent_to_species percent_with_abund
## 1 3 3 100.00000 0
## 2 17 16 52.94118 100
## 3 1 1 0.00000 0
## 4 16 12 68.75000 100
## 5 8 8 100.00000 100
## 6 18 9 72.22222 100
Next, we can remove plots that do not meet a minimum threshold for data quality and completeness. The threshold should be specific to your data set and study area.
D104_veg_agg <- ecositer::QC_completeness_criteria(veg_df = D104_veg_agg,
min_unique_species = 5,
min_perc_to_species = 60,
min_perc_with_abund = 80)
Update taxonomy using USDA PLANTS NASIS allows plant taxonomies to be used that are not up-to-date. This function cross-references the taxonomies in your dataset to see if there is an updated taxonomy in the USDA PLANTS database. If there is, it updates the taxonomy and provides a message notifying you of the changes made.
D104_veg_agg <- ecositer::QC_update_taxonomy(veg_df = D104_veg_agg)
## Note -> The following taxonomical changes have been made.
## Pinus latifolia changed to Pinus engelmannii
Unify taxonomies, where possible Often times, a dataset has taxonomies that can be unified. Examples include records where species were identified to subspecies or variety when other observations were recorded to species. Sometimes, such a distinction is meaningful, other times it is not. Additionally, sometimes during field data collection, observations are recorded to genus and it is possible to determine species at a later date.
For statistical reasons, taxonomies should be unified where possible. Statistical analyses, including multivariate distance measure (which are essential tools for analysis of ecological communities), do not recognize similarities between taxa. Observations that share the same species (i.e., differ at the subspecies or variety level) are not recognized to have any relationship in most statistical analyses. For this reason, unifying taxa is important. In extreme cases, you may even consider unifying species within a genus, if for example there are limited observations of one species (i.e., it would be removed from the analysis due to lack of observations) and it occupies a very similar niche to another species in the genus. The question is whether unifying creates more signal-to-noise or less. This process requires careful consideration and expert knowledge.
ecositer::QC_unifying_taxa
returns a dataframe of taxa
to QC for unification and the number of observations there are of each
taxa. The rules for this function are: an observation to species level
& an observation beyond species level (e.g., subspecies, variety)
within the same species OR an observation to genus level and an
observation to species within the same genus.
ecositer::QC_unify_taxa(veg_df = D104_veg_agg) |> head()
## plantsciname occurences
## <char> <int>
## 1: Abies magnifica 54
## 2: Abies magnifica var. shastensis 15
## 3: Achnatherum 12
## 4: Achnatherum occidentale 3
## 5: Achnatherum occidentale ssp. occidentale 2
## 6: Achnatherum occidentale ssp. pubescens 1
Another taxonomic consideration - within this package, observations to genus will be removed from statistical analyses. Genus is too variable to produce meaningful results. Consider the genus, Pinus. In California, Pinus sabiniana (foothill pine) grows nearly to sea-level. Pinus albicaulis (whitebark pine) grows up to tree-line. Observations to Pinus, in a dataset like the Sequoia and Kings Canyon National Park Soil Survey, could theoretically be foothill pine or whitebark pine, and this creatse far too much variability in that class to be meaningful.
Looking at the result above, we see Achnatherum occidentale has one observation. There are 9 observations to the genus, Achnatherum. If you thought that the genus Achnatherum was consistent enough for your analysis (e.g., similar species-site relationships), such that using Achnatherum at the genus level would create more signal-to-noise that all observations of Achnatherum being omitted, you could change these observations to ‘Achnatherum spp.’ and they would be included in analyses.
Change Achnatherum occidentale to Achnatherum
# define columns to be changed
columns_to_update <- c("plantsym", "plantsciname", "plantnatvernm")
new_values <- c("ACHNA", "Achnatherum", "needlegrass") # plantsym = ACHNA, plantsciname = Achnatherum, plantnatvernm = needlegrass
# make changes
D104_veg_agg[D104_veg_agg$plantsciname == "Achnatherum occidentale" &
!is.na(D104_veg_agg$plantsciname), columns_to_update] <- as.list(new_values)
Change Achnatherum to Achnatherum spp.
D104_veg_agg$plantsciname <- gsub(pattern = "Achnatherum",
replacement = "Achnatherum spp.",
x = D104_veg_agg$plantsciname)
An additional tool that can be useful for unifying species is looking at the distribution of taxa.
ecositerSpatial::mapping_taxon(veg_df = D104_veg_agg,
taxon = "Ribes",
x = "longstDD",
y = "latstDD",
EPSG = "EPSG:4326")