QC vegetation data

library

library(data.table)

About

ecositer has a category of functions used for QC, all of which use a “QC_” prefix. Vegetation data stored in NASIS requires QC. Numerous issues arise from different data vintages, data origins, varying data quality, duplication caused by KSSL lab pedons, outdated plant taxonomies, and more. This vignette discusses a best-practice methodology for QC of NASIS vegetation data.

You should start your QC process by removing data at the coarsest level that meets your objectives. This will ensure that you do not spend time fixing obscure edge cases that are ultimately removed.

To begin with, we are going to build our dataset:

D104_veg <- ecositer::create_veg_df(from = "web_report",
                                 ecositeid = "F022AD104CA")

Choose the best vegetation plot for site

Sites may have multiple vegetation plots with differing data qualities. If there are multiple vegetation plots associated with a site, QC_best_vegplot_for_site() chooses the vegetation plot with the most records.

D104_veg_best <- ecositer::QC_best_vegplot_for_site(veg_df = D104_veg)

## Warning -> There are sites with multiple vegetation plots. Reviewing these sites is preferable to automated selection. To view which sites have multiple vegetation plots:
##               'Your veg_df' |> dplyr::group_by(siteiid) |>
##                                dplyr::summarise(unique_vegplots = dplyr::n_distinct(vegplotiid)) |>
##                                dplyr::filter(unique_vegplots > 1)

To look at all the sites with multiple vegetation plots:

D104_multi_veg <- D104_veg |> dplyr::group_by(siteiid) |> 
  dplyr::summarise(unique_vegplots = dplyr::n_distinct(vegplotiid)) |> 
  dplyr::mutate(has_multiple_vegplots = unique_vegplots > 1) |> 
  dplyr::filter(has_multiple_vegplots == TRUE)

head(D104_multi_veg)

## # A tibble: 6 × 3
##   siteiid unique_vegplots has_multiple_vegplots
##   <chr>             <int> <lgl>                
## 1 1146902               2 TRUE                 
## 2 1169814               2 TRUE                 
## 3 1169819               2 TRUE                 
## 4 1169825               2 TRUE                 
## 5 1169826               2 TRUE                 
## 6 1169880               2 TRUE
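
The selection rule described in this section (keep the vegetation plot with the most records for each site) can be sketched in base R. This is a minimal illustration on made-up data, not the internals of QC_best_vegplot_for_site(); the ids, the toy records, and the tie-breaking rule (first plot wins) are assumptions.

```r
# Toy data with hypothetical ids: one row per plant record
toy_veg <- data.frame(
  siteiid    = c("A", "A", "A", "A", "A", "B", "B"),
  vegplotiid = c("1", "1", "2", "2", "2", "3", "3"),
  plantsym   = c("ABMA", "PIJE", "ABMA", "PIJE", "ARPA6", "PICO", "ABMA")
)

# Count records per site / vegetation plot combination
plot_counts <- aggregate(plantsym ~ siteiid + vegplotiid,
                         data = toy_veg, FUN = length)
names(plot_counts)[3] <- "n_records"

# Within each site, sort plots by descending record count and keep the first
# (assumption: ties are broken by whichever plot sorts first)
plot_counts <- plot_counts[order(plot_counts$siteiid, -plot_counts$n_records), ]
best_plots <- plot_counts[!duplicated(plot_counts$siteiid), ]

# Keep only the records that belong to the selected plots
toy_veg_best <- toy_veg[toy_veg$vegplotiid %in% best_plots$vegplotiid, ]
toy_veg_best
```

Because record count is only a proxy for data quality, manual review of the sites listed above remains preferable where it is feasible.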

Creating an aggregated abundance column

Now, we are going to create an aggregated abundance column called “pct_cover”. There are four columns where abundance data could be stored: akstratumcoverclasspct, speciescancovpct, speciescomppct, and understorygrcovpct. Two functions are used here: one aggregates the abundance columns and reports which of them are used in your dataset, and the other identifies records that have more than one abundance column populated.

QC_aggregate_abundance() aggregates the four abundance columns into the new “pct_cover” column. If any records have multiple abundance columns populated, this function averages them and issues a warning or an error, depending on the fail_on_dup argument. For details on this function, see ?QC_aggregate_abundance.

D104_veg_agg <- ecositer::QC_aggregate_abundance(veg_df = D104_veg_best)

## Warning in ecositer::QC_aggregate_abundance(veg_df = D104_veg_best): Multiple
## abundance columns are used in this dataset: akstratumcoverclasspct,
## speciescancovpct

Note the warning above: it shows that two abundance columns (akstratumcoverclasspct and speciescancovpct) are used in this dataset. QC_find_multiple_abundance() allows you to inspect which records, if any, have abundance data in more than one column. Any such records should be inspected to determine why multiple abundance columns are used.

abund_dups <- ecositer::QC_find_multiple_abundance(veg_df = D104_veg_best)
abund_dups

##  [1] siteiid                   usiteid                  
##  [3] siteobsiid                vegplotid                
##  [5] vegplotiid                primarydatacollector     
##  [7] vegdataorigin             ecositeid                
##  [9] ecositenm                 ecostateid               
## [11] ecostatename              commphaseid              
## [13] commphasename             cancovtotalpct           
## [15] cancovtotalclass          overstorycancontotalpct  
## [17] overstorycancovtotalclass plantsym                 
## [19] plantsciname              plantnatvernm            
## [21] akstratumcoverclasspct    speciescancovpct         
## [23] speciescomppct            understorygrcovpct       
## [25] speciestraceamtflag       vegetationstratalevel    
## [27] akstratumcoverclass       plantheightcllowerlimit  
## [29] plantheightclupperlimit   planttypegroup           
## [31] horizdatnm                utmzone                  
## [33] utmeasting                utmnorthing              
## [35] latdegrees                latminutes               
## [37] latseconds                latdir                   
## [39] longdegrees               longminutes              
## [41] longseconds               longdir                  
## [43] latstDD                   longstDD                 
## <0 rows> (or 0-length row.names)
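
The aggregation described in this section can be sketched in base R as a row-wise mean over the four abundance columns. This is an illustration under assumptions, not the source of QC_aggregate_abundance(): the toy values are made up, and leaving pct_cover as NA when no column is populated is an assumed behavior.

```r
abund_cols <- c("akstratumcoverclasspct", "speciescancovpct",
                "speciescomppct", "understorygrcovpct")

# Toy records (made-up values); the third record has two columns populated
toy_veg <- data.frame(
  plantsym               = c("ABMA", "PIJE", "ARPA6", "CAREX"),
  akstratumcoverclasspct = c(NA, NA, 10, NA),
  speciescancovpct       = c(40, NA, 20, NA),
  speciescomppct         = c(NA, 15, NA, NA),
  understorygrcovpct     = c(NA, NA, NA, NA_real_)
)

# Number of abundance columns populated in each record
n_populated <- rowSums(!is.na(toy_veg[, abund_cols]))

# Average the populated columns; records with no abundance data stay NA
toy_veg$pct_cover <- rowMeans(toy_veg[, abund_cols], na.rm = TRUE)
toy_veg$pct_cover[n_populated == 0] <- NA

# Records that would need review because several columns are populated
toy_veg[n_populated > 1, ]
```

Averaging is only a fallback. Records flagged by the last line are the ones worth checking by hand, since the columns may describe different strata or different measurement methods.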

Choosing coordinate format

The NASIS site table has several possible columns where location data can be populated. They fall into three categories: UTM; lat/long in degrees, minutes, and seconds (DMS); and lat/long in decimal degrees (DD). QC_location_data() reports what percent of the required columns are populated for each of these three categories and prompts the user to specify which coordinate format they would like to use. By default (coordinate_format = NULL), this function is interactive, and I recommend using it that way. Alternatively, users can specify the desired coordinate format using the coordinate_format argument. That option is used here because it works better in an R Markdown document.

D104_veg_agg <- ecositer::QC_location_data(veg_df = D104_veg_agg,
                                           coordinate_format = "DD")

## Note -> UTM has 35.2907961603614% completeness

## Note -> Lat/long DMS has 35.2907961603614% completeness

## Note -> Standardized Lat/long DD has 100% completeness. These columns in NASIS imply WGS84 datum is used,
##                   therefore datum column is not inspected.
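
One way to think about the completeness percentages reported above is as the share of rows in which every column required by a coordinate format is populated. The sketch below illustrates that idea on made-up data. The helper pct_complete() is hypothetical, and QC_location_data() may calculate completeness differently (for example, per site rather than per record).

```r
# Toy site records (made-up coordinates)
toy_sites <- data.frame(
  utmzone     = c(11, NA, 11, NA),
  utmeasting  = c(350000, NA, 351000, NA),
  utmnorthing = c(4050000, NA, NA, NA),
  latstDD     = c(36.6, 36.7, 36.8, 36.9),
  longstDD    = c(-118.7, -118.6, -118.5, -118.4)
)

# Hypothetical helper: percent of rows with all required columns populated
pct_complete <- function(df, cols) {
  100 * mean(stats::complete.cases(df[, cols, drop = FALSE]))
}

utm_pct <- pct_complete(toy_sites, c("utmzone", "utmeasting", "utmnorthing"))
dd_pct  <- pct_complete(toy_sites, c("latstDD", "longstDD"))

utm_pct  # 25: only the first row has all three UTM columns
dd_pct   # 100
```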

Assign minimum criteria for vegetation data quality

Now that we have an aggregated abundance column, we can begin with our coarse filter of the data. First we will use QC_veg_completeness() to describe the completeness and quality of vegplot observations.

ecositer::QC_veg_completeness(veg_df = D104_veg_agg) |> head()

##   siteiid siteobsiid vegplotiid primarydatacollector           vegdataorigin
## 1 1093611    1069569     277250                      site existing veg table
## 2  968053     950512     958427                   DE        spreadsheet form
## 3  888801     864161     954378                                             
## 4  949201     922060     958629                   DE        spreadsheet form
## 5  949204     922063     276227                      site existing veg table
## 6  949214     922073     958642                 DESK        spreadsheet form
##   total_records unique_species percent_to_species percent_with_abund
## 1             3              3          100.00000                  0
## 2            17             16           52.94118                100
## 3             1              1            0.00000                  0
## 4            16             12           68.75000                100
## 5             8              8          100.00000                100
## 6            18              9           72.22222                100

Next, we can remove plots that do not meet a minimum threshold for data quality and completeness. The threshold should be specific to your data set and study area.

D104_veg_agg <- ecositer::QC_completeness_criteria(veg_df = D104_veg_agg,
                                   min_unique_species = 5,
                                   min_perc_to_species = 60,
                                   min_perc_with_abund = 80)
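
To make the thresholds concrete, here is a base R sketch of the same kind of filter on made-up data. It is not the implementation of QC_veg_completeness() or QC_completeness_criteria(). In particular, treating any name with two or more words as "identified to species" is an assumption, and the thresholds are chosen only to suit the toy data.

```r
# Toy records (made-up values): plot 1 is incomplete, plot 2 is complete
toy_veg <- data.frame(
  vegplotiid   = c("1", "1", "1", "1", "2", "2"),
  plantsciname = c("Abies magnifica", "Pinus jeffreyi", "Carex", "Ribes",
                   "Pinus contorta", "Abies magnifica"),
  pct_cover    = c(30, 10, NA, 2, 25, 5)
)

# Assumption: a name with two or more words is identified to species or finer
to_species <- lengths(strsplit(toy_veg$plantsciname, "\\s+")) >= 2

# Per-plot summary; tapply() returns groups in sorted order of vegplotiid
completeness <- data.frame(
  vegplotiid         = sort(unique(toy_veg$vegplotiid)),
  unique_species     = as.vector(tapply(toy_veg$plantsciname, toy_veg$vegplotiid,
                                        function(x) length(unique(x)))),
  percent_to_species = as.vector(tapply(to_species, toy_veg$vegplotiid, mean)) * 100,
  percent_with_abund = as.vector(tapply(!is.na(toy_veg$pct_cover),
                                        toy_veg$vegplotiid, mean)) * 100
)

# Hypothetical thresholds for the toy data
keep <- completeness$unique_species >= 2 &
  completeness$percent_to_species >= 60 &
  completeness$percent_with_abund >= 80

toy_veg_kept <- toy_veg[toy_veg$vegplotiid %in% completeness$vegplotiid[keep], ]
toy_veg_kept
```

Plot 1 is dropped because only half of its records are identified to species and only three of its four records have abundance data.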

Update taxonomy using USDA PLANTS

NASIS allows plant taxonomies to be used that are not up-to-date. This function cross-references the taxonomies in your dataset to see if there is an updated taxonomy in the USDA PLANTS database. If there is, it updates the taxonomy and provides a message notifying you of the changes made.

D104_veg_agg <- ecositer::QC_update_taxonomy(veg_df = D104_veg_agg)

## Note -> The following taxonomical changes have been made.

## Pinus latifolia changed to Pinus engelmannii

Unify taxonomies, where possible

Often, a dataset has taxonomies that can be unified. Examples include records where species were identified to subspecies or variety when other observations were recorded to species. Sometimes such a distinction is meaningful; other times it is not. Additionally, observations are sometimes recorded to genus during field data collection, and it is possible to determine the species at a later date.

For statistical reasons, taxonomies should be unified where possible. Statistical analyses, including multivariate distance measures (which are essential tools for the analysis of ecological communities), do not recognize similarities between taxa. Observations that share the same species (i.e., differ only at the subspecies or variety level) are treated as entirely unrelated in most statistical analyses. For this reason, unifying taxa is important. In extreme cases, you may even consider unifying species within a genus if, for example, there are limited observations of one species (i.e., it would be removed from the analysis due to lack of observations) and it occupies a very similar niche to another species in the genus. The question is whether unifying improves or degrades the signal-to-noise ratio. This process requires careful consideration and expert knowledge.
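
A small worked example shows why this matters. The sketch below computes Bray-Curtis dissimilarity by hand for two made-up plots that contain the same species, recorded once to species and once to variety. The cover values are invented, and bray_curtis() is a helper defined here for illustration.

```r
# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(a, b) {
  sum(abs(a - b)) / sum(a + b)
}

# Two plots with the same species, recorded at different taxonomic levels
split_names <- rbind(
  plot1 = c("Abies magnifica" = 40, "Abies magnifica var. shastensis" = 0),
  plot2 = c("Abies magnifica" = 0,  "Abies magnifica var. shastensis" = 40)
)

# The names differ, so the plots share no taxa: dissimilarity is 1
d_split <- bray_curtis(split_names["plot1", ], split_names["plot2", ])

# After unifying the variety with the species, the plots are identical
unified <- rowSums(split_names)
d_unified <- bray_curtis(unified["plot1"], unified["plot2"])

d_split    # 1
d_unified  # 0
```

The distance measure sees only column names, so two ecologically identical plots are scored as completely different until the taxa are unified.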

ecositer::QC_unify_taxa() returns a data frame of taxa to review for unification, along with the number of observations of each taxon. A taxon is included if either of the following is true: (1) there is an observation to species level and an observation beyond species level (e.g., subspecies, variety) within the same species, or (2) there is an observation to genus level and an observation to species level within the same genus.

ecositer::QC_unify_taxa(veg_df = D104_veg_agg) |> head()

##                                plantsciname occurences
##                                      <char>      <int>
## 1:                          Abies magnifica         54
## 2:          Abies magnifica var. shastensis         15
## 3:                              Achnatherum         12
## 4:                  Achnatherum occidentale          3
## 5: Achnatherum occidentale ssp. occidentale          2
## 6:   Achnatherum occidentale ssp. pubescens          1
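
The two rules can be approximated with simple string handling. The sketch below is an illustration on a made-up vector of names, not the logic used inside QC_unify_taxa(). It assumes that the first word of a name is the genus and the second is the specific epithet, which would not hold for hybrid notation or other unusual names.

```r
taxa <- c("Abies magnifica", "Abies magnifica var. shastensis",
          "Achnatherum", "Achnatherum occidentale",
          "Pinus jeffreyi", "Ribes")

words   <- strsplit(taxa, "\\s+")
n_words <- lengths(words)
genus   <- vapply(words, `[`, character(1), 1)

# Genus plus specific epithet; NA for genus-only names
species <- ifelse(n_words >= 2,
                  vapply(words, function(w) paste(w[1:2], collapse = " "),
                         character(1)),
                  NA_character_)

# Rule 1: a species-level name and a finer name within the same species
finer_species <- unique(species[n_words > 2])
rule1 <- !is.na(species) &
  species %in% finer_species &
  species %in% taxa[n_words == 2]

# Rule 2: a genus-level name and a species-level (or finer) name in that genus
rule2 <- genus %in% taxa[n_words == 1] & genus %in% genus[n_words >= 2]

candidates <- taxa[rule1 | rule2]
candidates
```

Here Pinus jeffreyi and Ribes are not flagged: the former has no variety or genus-level counterpart, and the latter has no species-level record in the same genus.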

Another taxonomic consideration: within this package, observations to genus will be removed from statistical analyses. Genus is too variable to produce meaningful results. Consider the genus Pinus. In California, Pinus sabiniana (foothill pine) grows nearly to sea level, while Pinus albicaulis (whitebark pine) grows up to tree line. Observations to Pinus, in a dataset like the Sequoia and Kings Canyon National Park Soil Survey, could theoretically be foothill pine or whitebark pine, and this creates far too much variability in that class to be meaningful.

Looking at the result above, we see that Achnatherum occidentale has three observations and that there are 12 observations to the genus Achnatherum. If you thought that the genus Achnatherum was consistent enough for your analysis (e.g., similar species-site relationships), such that using Achnatherum at the genus level would give a better signal-to-noise ratio than omitting all observations of Achnatherum, you could change these observations to ‘Achnatherum spp.’ and they would be included in analyses.

Change Achnatherum occidentale to Achnatherum

# define columns to be changed 
columns_to_update <- c("plantsym", "plantsciname", "plantnatvernm")

new_values <- c("ACHNA", "Achnatherum", "needlegrass") # plantsym = ACHNA, plantsciname = Achnatherum, plantnatvernm = needlegrass

# make changes
D104_veg_agg[D104_veg_agg$plantsciname == "Achnatherum occidentale" & 
               !is.na(D104_veg_agg$plantsciname), columns_to_update] <- as.list(new_values)

Change Achnatherum to Achnatherum spp.

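# anchor the pattern so only genus-level records are renamed;
# names such as "Achnatherum occidentale ssp. pubescens" are left untouched
D104_veg_agg$plantsciname <- gsub(pattern = "^Achnatherum$",
                                  replacement = "Achnatherum spp.",
                                  x = D104_veg_agg$plantsciname)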
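
Note that gsub() matches substrings, so whether the pattern is anchored determines which names are rewritten. A minimal comparison on made-up names:

```r
sci_names <- c("Achnatherum",
               "Achnatherum occidentale ssp. pubescens",
               "Abies magnifica")

# Unanchored: every occurrence of the genus name is rewritten,
# including the one inside the subspecies name
unanchored <- gsub("Achnatherum", "Achnatherum spp.", sci_names)

# Anchored with ^ and $: only names that are exactly the genus are rewritten
anchored <- gsub("^Achnatherum$", "Achnatherum spp.", sci_names)

unanchored
anchored
```

The anchored version is also safe to run more than once, because "Achnatherum spp." no longer matches the pattern after the first pass.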

An additional tool that can be useful for unifying species is looking at the distribution of taxa.

ecositerSpatial::mapping_taxon(veg_df = D104_veg_agg,
                               taxon = "Ribes",
                               x = "longstDD",
                               y = "latstDD",
                               EPSG = "EPSG:4326")

Nathan Roe

2024-12-11