| Title: | Standardized NEON Organismal Data for Biodiversity Research |
|---|---|
| Description: | Cleaned, simplified, and standardized NEON organismal data for biodiversity research. The following taxonomic groups are included so far: algae, beetles, birds, macroinvertebrates, mosquitoes, plants, small_mammals, ticks, tick_pathogens, and zooplankton. NEON input data (<https://data.neonscience.org>) were processed and standardized. |
| Authors: | Daijiang Li [aut, cre] (ORCID: <https://orcid.org/0000-0002-0925-3421>), Sydne Record [aut], Eric Sokol [aut], Matthew Bitters [aut], Melissa Chen [aut], Anny Chung [aut], Matthew Helmus [aut], Ruvi Jaimes [aut], Lara Jansen [aut], Marta Jarzyna [aut], Michael Just [aut], Jalene LaMontagne [aut], Brett Melbourne [aut], Wynne Moss [aut], Kari Norman [aut], Stephanie Parker [aut], Natalie Robinson [aut], Bijan Seyednasrollah [aut], Colin Smith [aut], Sarah Spaulding [aut], Thilina Surasinghe [aut], Sarah Thomsen [aut], Phoebe Zarnetske [aut] |
| Maintainer: | Daijiang Li <[email protected]> |
| License: | CC0 |
| Version: | 0.2.3 |
| Built: | 2026-05-18 21:11:40 UTC |
| Source: | https://github.com/daijiang/neonDivData |
This dataset was derived from NEON data portal with data product ID 'DP1.20166.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.20166.001.
data_algae
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
unique_sample_id: Identity of unique samples (equals sampleID).
observation_datetime: Observation date and time.
taxon_id: Accepted taxon code.
taxon_name: Scientific name associated with the taxon ID.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Cell density value.
unit: Either cells/mL (water column) or cells/cm2 (benthic).
sampleCondition: Condition of the sample (all "Condition OK").
perBottleSampleVolume: Sample volume per bottle (milliliter); fallback-filled where originally missing.
release: Version of data release by NEON.
habitatType: Habitat type sampled.
algalSampleType: Type of algal sample collected.
samplerType: Type of sampler used to collect the sample.
benthicArea: Area of the benthos sampled (square meter).
samplingProtocolVersion: The NEON document number and version where detailed information regarding the sampling method used is available; format NEON.DOC.######vX.
substratumSizeClass: Size class of the substratum sampled.
phytoDepth1: First phytoplankton sample depth (meter) at sampling location.
phytoDepth2: Second phytoplankton sample depth (meter) at sampling location.
phytoDepth3: Third phytoplankton sample depth (meter) at sampling location.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
To clean the data, we:
Filtered alg_biomass to analysisType == "taxonomy" records only.
Computed a fallback bottle volume (preservativeVolume + labSampleVolume) used when perBottleSampleVolume is NA or 0.
Joined alg_taxonomyProcessed to alg_biomass to alg_fieldData via sampleID and parentSampleID.
Filtered to algalParameterUnit == "cellsPerBottle" and sampleCondition == "Condition OK".
Computed cell density: water column samples (seston/phytoplankton) as cells/mL; benthic samples as cells/cm².
Resolved within-sample duplicates by summing density across records sharing the same sampleID and acceptedTaxonID.
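The fallback-volume and duplicate-summing steps above can be sketched in base R as follows. This is a minimal illustration on an invented table, not the package's cleaning script; the `density` column and all values are assumptions made for the example.

```r
# Toy records; every value is invented for illustration only.
alg <- data.frame(
  sampleID              = c("S1", "S1", "S1", "S2"),
  acceptedTaxonID       = c("TAXA", "TAXA", "TAXB", "TAXA"),
  density               = c(10, 5, 2, 7),      # hypothetical density column
  perBottleSampleVolume = c(NA, NA, NA, 0),
  preservativeVolume    = c(5, 5, 5, 10),
  labSampleVolume       = c(45, 45, 45, 90)
)

# Fallback volume, applied only where perBottleSampleVolume is NA or 0.
fallback <- alg$preservativeVolume + alg$labSampleVolume
needs_fallback <- is.na(alg$perBottleSampleVolume) |
  alg$perBottleSampleVolume == 0
alg$perBottleSampleVolume[needs_fallback] <- fallback[needs_fallback]

# Sum density across records sharing sampleID and acceptedTaxonID.
alg_sum <- aggregate(density ~ sampleID + acceptedTaxonID,
                     data = alg, FUN = sum)
```

After these steps each sampleID and acceptedTaxonID pair appears once in `alg_sum`.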
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location.
Lara Jansen
This dataset was derived from NEON data portal with data product ID 'DP1.10022.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10022.001.
data_beetle
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
unique_sample_id: Identity of unique samples (equals sampleID).
trapID: Identifier for the trap.
observation_datetime: Observation date (collection date).
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Abundance (count per trap day); NA for zero-catch events.
unit: Unit of the values in the value column.
boutID: Sampling bout identifier (siteID_collectDate).
nativeStatusCode: The process by which the taxon became established in the location.
'A': Presumed absent; 'N': Native; 'I': Introduced; 'UNK': Status unknown.
release: Version of data release by NEON.
remarks: Remarks (technical notes) of record.
samplingProtocolVersion: The NEON document number and version where detailed information regarding the sampling method used is available; format 'NEON.DOC.######vX'.
samplingImpractical: Flag indicating whether sampling was impractical.
trapConditionFlag: Consolidated trap condition flag from cup, lid, and fluid level status.
trappingDays: Number of days between trap setting and collecting events.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
nlcdClass: National Land Cover Database Vegetation Type Name.
To clean the data, we:
Filtered bet_fielddata to sampleCollected == "Y"; computed trappingDays from set/collect dates; adjusted for traps collected multiple times from the same set date; consolidated trap condition issues into trapConditionFlag from cupStatus, lidStatus, and fluidLevel.
Built a taxonomy resolution hierarchy: Expert ID overrides Parataxonomist ID; both override Sorting taxonomy. Unpinned other carabid individuals are excluded because their sorting-level taxonomy is too coarse (2013-2016 protocol).
Computed unpinned counts per subsample (sorting total minus pinned individuals); clamped to zero where anomalous.
Left-joined downward from field effort to taxonomy to preserve zero-catch trap events (value = NA).
Abundance = count / trappingDays (count per trap day).
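The left join that preserves zero-catch trap events, followed by the abundance calculation, might look like the sketch below. It uses base R `merge()` on invented tables; the table names `effort` and `taxa`, the taxon codes, and the counts are hypothetical.

```r
# Toy field-effort table: one row per collected trap (values invented).
effort <- data.frame(
  sampleID     = c("T1", "T2", "T3"),
  trappingDays = c(14, 14, 7)
)

# Toy taxonomy table: trap T3 caught nothing, so it has no rows here.
taxa <- data.frame(
  sampleID = c("T1", "T1", "T2"),
  taxon_id = c("SPA", "SPB", "SPA"),   # hypothetical taxon codes
  count    = c(28, 7, 14)
)

# Left join from effort to taxonomy: all.x = TRUE keeps trap T3.
beetle <- merge(effort, taxa, by = "sampleID", all.x = TRUE)

# Abundance as count per trap day; it stays NA for the zero-catch trap.
beetle$value <- beetle$count / beetle$trappingDays
```

The zero-catch trap remains in `beetle` with `taxon_id` and `value` set to NA, so sampling effort is not lost.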
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location.
Kari Norman
This dataset was derived from NEON data portal with data product ID 'DP1.10003.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10003.001.
data_bird
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
pointID: Identifier for a point location.
unique_sample_id: Identity of unique samples (equals eventID).
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Cluster size (count of individuals observed); NA for zero-bird points.
unit: Unit of the values in the value column.
pointCountMinute: The minute of sampling within the point count period.
targetTaxaPresent: Indicator of whether the sample contained individuals of the target taxa.
nativeStatusCode: The process by which the taxon became established in the location.
'A': Presumed absent; 'N': Native; 'I': Introduced; 'UNK': Status unknown.
observerDistance: Radial distance between the observer and the individual(s) being observed (meter).
detectionMethod: How the individual(s) was (were) first detected by the observer.
visualConfirmation: Whether the individual(s) was (were) seen after the initial detection.
sexOrAge: Sex of individual if detectable, age of individual if individual cannot be sexed.
release: Version of data release by NEON.
startCloudCoverPercentage: Observer estimate of percent cloud cover at start of sampling.
endCloudCoverPercentage: Observer estimate of percent cloud cover at end of sampling.
startRH: Relative humidity as measured by handheld weather meter at the start of sampling.
endRH: Relative humidity as measured by handheld weather meter at the end of sampling.
observedHabitat: Observer assessment of dominant habitat at the sampling point at sampling time.
observedAirTemp: The air temperature (Celsius) measured with a handheld weather meter.
kmPerHourObservedWindSpeed: The average wind speed measured with a handheld weather meter, in kilometers per hour.
samplingProtocolVersion: The NEON document number and version where detailed information regarding the sampling method used is available; format 'NEON.DOC.######vX'.
remarks: Remarks of record.
clusterCode: Alphabetic code (A-Z) linking clusters of individuals of the same species spanning multiple records.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
nlcdClass: National Land Cover Database Vegetation Type Name.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
To clean the data, we:
Filtered brd_perpoint to samplingImpractical == "OK" to retain only surveyed points.
Left-joined downward from point survey records to brd_countdata to preserve zero-bird survey points.
Retained valid cluster size values and zero-bird survey rows; value = cluster size (count of individuals).
unique_sample_id = eventID.
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location. The sampling protocol has evolved over time, so users are advised to check whether the samplingProtocolVersion fits their data requirements and subset as necessary.
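One way to inspect and subset by protocol version is sketched below. The data frame `birds` is an invented stand-in for data_bird, and the version strings and the choice of which version to keep are hypothetical.

```r
# Toy stand-in for data_bird; protocol version strings are invented.
birds <- data.frame(
  unique_sample_id        = c("E1", "E2", "E3", "E4"),
  samplingProtocolVersion = c("NEON.DOC.000001vA", "NEON.DOC.000001vA",
                              "NEON.DOC.000001vB", "NEON.DOC.000001vB"),
  value                   = c(2, NA, 1, 3),
  stringsAsFactors        = FALSE
)

# Count records per protocol version before deciding what to keep.
version_counts <- table(birds$samplingProtocolVersion)

# Keep only the versions judged suitable (hypothetical choice).
keep_versions <- "NEON.DOC.000001vB"
birds_sub <- birds[birds$samplingProtocolVersion %in% keep_versions, ]
```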
Daijiang Li, Eric Sokol
This dataset was derived from NEON data portal with data product ID 'DP1.20107.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.20107.001. Sampling methods and the design are detailed here: https://www.neonscience.org/data-collection/fish and https://www.neonscience.org/observatory/observatory-blog/one-fish-two-fish-learn-how-neon-samples-fish
data_fish
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
pointID: NEON sampling point identifier.
unique_sample_id: Identity of unique samples; usually includes location and date information.
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Value of the variable(s) specified by variable_name.
unit: Unit of the values in the value column.
reachID: An identifier for the set of information associated with the reach.
samplerType: Type of sampler used to collect the sample.
fixedRandomReach: An indication of whether the reach is fixed or random.
measuredReachLength: The length of the reach as measured by the technicians when the reach was established (meters).
efTime: Operational time of the electrofisher (second).
passStartTime: The start time of the pass.
passEndTime: The end time of the pass.
mean_efishtime: Average electrofishing time (second).
release: Version of data release by NEON.
netSetTime: Time the net was set.
netEndTime: Time the net was pulled.
netDeploymentTime: Total time of deployment of the net (hours).
netLength: Length of the net (meter).
netDepth: Deployment depth of the net (meter).
efTime2: Operational time of the electrofisher for the second electrofisher (second).
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
We downloaded all fish data (i.e., fsh_perPass, fsh_fieldData, fsh_bulkCount, fsh_perFish), including the complete taxon table for fish, via the NEON API for both stream and lake sites.
We joined the ‘fsh_perPass’, ‘fsh_fieldData’, and ‘fsh_bulkCount’ datasets to produce a table of bulk-processed data, and merged ‘fsh_perPass’, ‘fsh_fieldData’, and ‘fsh_perFish’ to produce a table of individual-level data.
Finally, both the individual-level and bulk-processed datasets were appended into a single table. Where the ‘fsh_bulkCount’ dataset lacked a ‘taxonRank’ column, we added that information based on the ‘scientificName’ column, particularly to separate species-level identifications. For each finer-resolution taxon in the individual-level dataset, we set the relative abundance to one, since each row represents a single individual fish.
Whenever possible, we substituted missing data by cross-referencing other data columns, omitted completely redundant data columns, and retained records with species-level taxonomic resolution. For the appended dataset, we also calculated the relative abundance for each species per sampling reach or segment at a given site.
To calculate species-specific catch per unit effort ('catch_per_effort'), we normalized the relative abundance by either the average electrofishing time (i.e., ‘efTime’, ‘efTime2’) or the trap deployment time (i.e., the difference between ‘netEndTime’ and ‘netSetTime’). In doing so, we assumed that the size of the traps used, water depths, number of netters, and reach lengths (a significant proportion of bouts had missing reach lengths) were comparable across sampling reaches and segments.
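The catch-per-effort normalization described above can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: the `count` column stands in for relative abundance, the rule for choosing between electrofishing and net effort (`is_efish`) is hypothetical, and all values are invented.

```r
# Toy records: two electrofishing rows and one net row (values invented).
fish <- data.frame(
  taxon_id   = c("SPA", "SPB", "SPA"),   # hypothetical taxon codes
  count      = c(12, 3, 5),              # stand-in for relative abundance
  efTime     = c(600, 600, NA),          # seconds
  efTime2    = c(400, 400, NA),          # seconds
  netSetTime = as.POSIXct(c(NA, NA, "2020-06-01 08:00:00"), tz = "UTC"),
  netEndTime = as.POSIXct(c(NA, NA, "2020-06-01 18:00:00"), tz = "UTC")
)

# Average electrofishing time across the two electrofishers (seconds).
mean_efishtime <- rowMeans(fish[, c("efTime", "efTime2")], na.rm = TRUE)

# Net deployment time as the difference between end and set times (hours).
net_hours <- as.numeric(difftime(fish$netEndTime, fish$netSetTime,
                                 units = "hours"))

# Assumed rule: use electrofishing effort when efTime is recorded,
# otherwise fall back to net deployment time.
is_efish <- !is.na(fish$efTime)
fish$effort <- ifelse(is_efish, mean_efishtime, net_hours)
fish$catch_per_effort <- fish$count / fish$effort
```

In this sketch the effort units differ by gear (seconds for electrofishing, hours for nets), so the resulting values should only be compared within a gear type.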
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location.
Stephanie Parker, Thilina Surasinghe
https://data.neonscience.org; https://data.neonscience.org/data-products/DP1.20107.001#collectionAndProcessing
Jensen, B., S. Parker, and C. Scott. 2017. NEON user guide to fish electrofishing, gill netting, and fyke netting counts (NEON.DP1.20107). NEON, National Ecological Observatory Network, Boulder, CO, USA.
This dataset was derived from NEON data portal with data product ID 'DP1.10022.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10022.001.
data_herp_bycatch
A data frame (tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
unique_sample_id: Identity of unique samples; usually includes location and date information.
trapID: Identifier for trap.
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Value of the variable(s) specified by variable_name.
unit: Unit of the values in the value column.
trappingDays: Cleaned decimal days between trap setting and collecting events.
release: Version of data release by NEON.
nativeStatusCode: The process by which the taxon became established in the location; only provided for vert bycatch herp.
remarksSorting: Technician notes; free text comments accompanying the record from the sorting table.
remarksFielddata: Technician notes; free text comments accompanying the record from the fielddata table.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
nlcdClass: National Land Cover Database Vegetation Type Name.
To process the data, we:
Cleaned trappingDays so that it is the number of days a trap was set before being collected.
Corrected trap days to account for entries where the trap set date was not updated after a previous collection.
Created a boutID that identifies all trap collection events at a site in the same bout, essentially replacing eventID.
Updated collectDate to reference the most common collection day in a bout, maintaining one collectDate per bout.
Recoded sampleType, which gives the group that was caught in the pitfall trap, to three levels:
"vert bycatch herp" - these are the herpetofauna bycatch samples
"no data collected" - samples present in fielddata but not in sorting
"not herp" - an aggregate of all the other types: "other carabid", "invert bycatch", "carabid", "vert bycatch mam"
We kept only "vert bycatch herp" in the final data product.
The processing script was derived from the script written by Kari Norman to process the beetle pitfall trap data. Additional variables were added and missing samples were retained in herp_bycatch.
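The three-level sampleType recode described in the processing steps can be sketched as below. The input vector is invented, and treating a missing sorting-table sampleType (NA after joining field data to sorting data) as "no data collected" is an assumption of this example.

```r
# Toy sampleType values after joining fielddata to sorting; NA marks a
# field record with no matching sorting record (assumed representation).
sorting_type <- c("vert bycatch herp", "carabid", NA, "invert bycatch",
                  "vert bycatch mam", "other carabid")

# Collapse to three levels.
sample_type3 <- ifelse(
  is.na(sorting_type), "no data collected",
  ifelse(sorting_type == "vert bycatch herp", "vert bycatch herp", "not herp")
)

# Flag the records of the herp bycatch level.
herp_only <- sample_type3 == "vert bycatch herp"
```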
Matt Helmus and Kari Norman
Hoekman, David, Katherine E. LeVan, Cara Gibson, George E. Ball, Robert A. Browne, Robert L. Davidson, Terry L. Erwin, et al. “Design for Ground Beetle Abundance and Diversity Sampling within the National Ecological Observatory Network.” Ecosphere 8, no. 4 (2017): e01744.
This dataset was derived from NEON data portal with data product ID 'DP1.20120.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.20120.001.
data_macroinvertebrate
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
unique_sample_id: Identity of unique samples (equals sampleID).
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Density (count per square meter).
unit: Unit of the values in the value column.
estimatedTotalCount: Estimated total count (summed across size classes).
individualCount: Raw individual count (summed across size classes).
subsamplePercent: Percent of the total sample contained in the subsample.
release: Version of data release by NEON.
benthicArea: Area sampled (square meter).
habitatType: Habitat type sampled.
samplerType: Type of sampler used to collect the sample.
substratumSizeClass: Size class of the substratum sampled.
remarks: Remarks of record.
ponarDepth: Depth (meter) of petite ponar sample.
snagLength: Length (meter) of snag sampled.
snagDiameter: Diameter (meter) of snag sampled.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
To clean the data, we:
Deduplicated inv_fieldData by sampleID using slice(1) to guard against NEON's known aquatic duplicate metadata issue.
Filtered inv_taxonomyProcessed to targetTaxaPresent == "Y"; summed estimatedTotalCount and individualCount across size-class records sharing the same sampleID and acceptedTaxonID.
Inner-joined taxonomy to field data (inner join required because benthicArea is needed to compute density; records without field metadata are dropped).
Density = estimatedTotalCount / benthicArea (count per square meter).
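The four cleaning steps above can be sketched in base R as follows. The tables are invented, and `!duplicated()` is used here as a base-R analogue of the `slice(1)` deduplication; it is not necessarily how the package implements it.

```r
# Toy field data: M1 appears twice (duplicate metadata); values invented.
field <- data.frame(
  sampleID    = c("M1", "M1", "M2"),
  benthicArea = c(0.25, 0.25, 0.5)
)

# Toy taxonomy data: two size-class rows for the same taxon in M1, and a
# sample (M3) with no matching field record.
taxa <- data.frame(
  sampleID            = c("M1", "M1", "M2", "M3"),
  acceptedTaxonID     = c("TAXA", "TAXA", "TAXB", "TAXA"),
  estimatedTotalCount = c(10, 15, 8, 4)
)

# Keep the first field record per sampleID.
field1 <- field[!duplicated(field$sampleID), ]

# Sum counts across size-class rows sharing sampleID and acceptedTaxonID.
taxa_sum <- aggregate(estimatedTotalCount ~ sampleID + acceptedTaxonID,
                      data = taxa, FUN = sum)

# Inner join: M3 has no benthicArea and is therefore dropped.
inv <- merge(taxa_sum, field1, by = "sampleID")

# Density as count per square meter.
inv$value <- inv$estimatedTotalCount / inv$benthicArea
```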
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location.
Stephanie Parker, Eric Sokol
This dataset was derived from NEON data portal with data product ID 'DP1.10043.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10043.001.
data_mosquito
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
unique_sample_id: Identity of unique samples (equals sampleID).
subsampleID: Unique identifier associated with each subsample per sampleID.
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Abundance (count per trap hour); NA for zero-catch traps.
unit: Unit of the values in the value column ('count per trap hour').
nativeStatusCode: The process by which the taxon became established in the location.
'A': Presumed absent; 'N': Native; 'I': Introduced; 'UNK': Status unknown.
proportionIdentified: Proportion of the total catch that was subsampled and identified.
release: Version of data release by NEON.
remarks_sorting: Technician notes; free text comments accompanying the sorting record.
samplingProtocolVersion: The NEON document number and version where detailed information regarding the sampling method used is available; format 'NEON.DOC.######vX'.
sex: M for male, F for female, U for unknown.
sortDate: Date sample was sorted.
trapHours: Number of hours between trap setting and collecting events.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
nlcdClass: National Land Cover Database Vegetation Type Name.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
To clean the data, we:
Joined mos_trapping to mos_sorting to mos_expertTaxonomistIDProcessed.
Filtered to targetTaxaPresent == "Y", sampleCondition == "No known compromise", and taxonRank != "family".
Estimated total individuals per subsample = individualCount / proportionIdentified.
Abundance = estimated total individuals / trapHours (count per trap hour).
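As a small worked example of the last two steps, the sketch below scales identified counts up to the full catch and converts them to counts per trap hour. The table and the `estimated_total` column name are invented for illustration.

```r
# Toy subsamples; the third row represents a trap that caught no mosquitoes.
mos <- data.frame(
  subsampleID          = c("A", "B", "C"),
  individualCount      = c(10, 30, NA),
  proportionIdentified = c(0.5, 1, NA),
  trapHours            = c(16, 12, 16)
)

# Estimated total individuals per subsample.
mos$estimated_total <- mos$individualCount / mos$proportionIdentified

# Abundance as count per trap hour; NA is kept for the zero-catch trap.
mos$value <- mos$estimated_total / mos$trapHours
```

Subsample A, with half of the catch identified, gives 10 / 0.5 = 20 estimated individuals and 20 / 16 = 1.25 per trap hour.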
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location. We retained records without a taxon_id (where value is NA) to preserve sampling effort for traps that caught zero mosquitoes.
Natalie Robinson, Daijiang Li
This dataset was derived from NEON data portal with data product ID 'DP1.10058.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10058.001
data_plant
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
unique_sample_id: Identity of unique samples; usually includes location and date information.
subplotID: This is the NEON provided subplot ID in the format of subplot_id, then subsubplot_id and 1 or 10 (m^2); if the sampling unit is 100 m^2, the values are 31, 32, 40, and 41.
subplot_id: Subplot ID; each plot normally has four 100 m^2 subplots (31, 32, 40, 41).
subsubplot_id: Subsubplot ID (1, 2, 3, 4) for sampling units at 1 m^2 or 10 m^2.
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen. Values are 'genus', 'species', 'speciesGroup', 'subSpecies', or 'variety.' Species accounts for the majority of the entries. Higher ranks have already been filtered out because we think they are too vague for biodiversity research.
variable_name: The variable name(s) represented by the value column.
value: Value of the variable(s) specified by variable_name. If the individual was observed outside the 1 square meter subplots, the value will be NA (i.e., presence only).
unit: Unit of the values in the value column.
presence_absence: All 1s, since every record represents a species.
boutNumber: Sampling bout number; most sites sample only one bout.
nativeStatusCode: Whether the species is a native or non-native species. 'A': Presumed absent, due to lack of data indicating a taxon's presence in a given location; 'N': Native; 'N?': Probably Native; 'I': Introduced; 'I?': Probably Introduced; 'NI': Native and Introduced, some infrataxa are native and others are introduced; 'NI?': Probably Native and Introduced, some infrataxa are native and others are introduced; 'UNK': Status unknown.
heightPlantOver300cm: Indicator of whether individuals of the species in the sample are taller than 300 cm.
heightPlantSpecies: Ocular estimate of the height (centimeter) of the plant species (if height is < 300 cm).
release: Version of data release by NEON.
sample_area_m2: The area of the sampling unit that the observed plant was located in. Potential values are 1, 10, or 100.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
nlcdClass: National Land Cover Database Vegetation Type Name.
accepted_wcvp_name: WCVP-accepted full scientific name (binomial or trinomial with authors) for species-, subspecies-, and variety-level taxa. For names matched to WCVP synonyms, this is the accepted name; for names with no acceptable WCVP match, this falls back to the original NEON taxon_name. Genus-level and speciesGroup-level taxa retain their original NEON names.
accepted_wcvp_name_binomial: Binomial (genus + specific epithet, without authors) extracted from accepted_wcvp_name. Useful for joining with other datasets that use binomial names only.
The detailed design of the NEON plant survey can be found in Barnett et al. 2019. Here, we:
Removed 1 m^2 data with targetTaxaPresent = N.
Removed rows missing values for plotID, subplotID, boutNumber, endDate, and/or taxonID.
Removed duplicate taxa between nested subplots (each taxon should be represented once for the bout/plotID/year). For example, if a taxon/date/bout/plot combination is present in the 1 m^2 data, it is removed from the 10 m^2 and larger scales.
Standardized taxon names against the World Checklist of Vascular Plants (WCVP) using the rWCVP package. Specifically:
Only species-, subspecies-, and variety-level taxa were matched; genus-level and speciesGroup-level taxa were excluded.
NEON scientific names (which include author strings) were split into name and author components before matching. For subspecies and variety names, the rank keyword (subsp., ssp., or var.) was used to locate the split point; for species-level names, the first two tokens were taken as the name and the remainder as the author.
Names were matched against WCVP using wcvp_match_names(fuzzy = TRUE).
All exact matches (both "Exact with author" and "Exact without author") were accepted, as minor author-string formatting differences between NEON and WCVP conventions do not indicate a different taxon.
Fuzzy matches (edit distance or phonetic) were accepted only when match similarity >= 0.9 and edit distance == 1.
For names matched to WCVP synonyms, the name was replaced with the corresponding WCVP-accepted name. The standardized names are stored in accepted_wcvp_name and accepted_wcvp_name_binomial; the original NEON name is retained in taxon_name.
Names with no acceptable WCVP match retain their original NEON name in both accepted_wcvp_name and accepted_wcvp_name_binomial.
Stacked species occurrences from different scales into a long data frame. Therefore:
to get the species list at the 1 m^2 scale, use all the data with sample_area_m2 == 1 (e.g. dplyr::filter(plants, sample_area_m2 == 1)); the unique sample unit id can then be generated with paste(plants$plotID, plants$subplot_id, plants$subsubplot_id, sep = "_")
to get the species list at the 10 m^2 scale, use all the data with sample_area_m2 <= 10 (e.g. dplyr::filter(plants, sample_area_m2 <= 10)); the unique sample unit id can then be generated with paste(plants$plotID, plants$subplot_id, plants$subsubplot_id, sep = "_")
to get the species list at the 100 m^2 scale, use the whole data set, since the maximum value of sample_area_m2 is 100 (i.e. a 10 m by 10 m subplot); the unique sample unit id can then be generated with paste(plants$plotID, plants$subplot_id, sep = "_")
to get the species list at the 400 m^2 scale (i.e. one plot with four subplots), aggregate the data at the plotID level (the sample unit is now the plot).
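The scale-specific species lists can be assembled as in the base-R sketch below. The `plants` data frame is an invented stand-in for data_plant; the plot name, taxon codes, and the use of NA for subsubplot_id on 100 m^2 rows are assumptions of this example.

```r
# Toy stand-in for data_plant with one plot and two subplots.
plants <- data.frame(
  plotID         = c("SITE_001", "SITE_001", "SITE_001", "SITE_001"),
  subplot_id     = c(31, 31, 31, 40),
  subsubplot_id  = c(1, 1, NA, NA),
  sample_area_m2 = c(1, 10, 100, 100),
  taxon_id       = c("SPA", "SPB", "SPC", "SPA"),
  stringsAsFactors = FALSE
)

# 10 m^2 scale: 1 m^2 and 10 m^2 records together.
p10 <- plants[plants$sample_area_m2 <= 10, ]
p10$sample_unit <- paste(p10$plotID, p10$subplot_id, p10$subsubplot_id,
                         sep = "_")
spp_10 <- unique(p10[, c("sample_unit", "taxon_id")])

# 100 m^2 scale: all records, keyed by plot and subplot.
p100 <- plants
p100$sample_unit <- paste(p100$plotID, p100$subplot_id, sep = "_")
spp_100 <- unique(p100[, c("sample_unit", "taxon_id")])

# 400 m^2 scale: aggregate to the plot.
spp_400 <- unique(plants[, c("plotID", "taxon_id")])
```

Because occurrences are stacked without duplicates across nested scales, each larger scale has to include the records from all smaller scales, which is why the filters use `<=` rather than `==`.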
Details of locations (e.g., latitude/longitude coordinates) can be found in neon_location.
Daijiang Li, Michael Just
Barnett, D.T., Adler, P.B., Chemel, B.R., Duffy, P.A., Enquist, B.J., Grace, J.B., Harrison, S., Peet, R.K., Schimel, D.S., Stohlgren, T.J. and Vellend, M., 2019. The plant diversity sampling design for the national ecological observatory network. Ecosphere, 10(2), p.e02603.
This dataset was derived from NEON data portal with data product ID 'DP1.10072.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10072.001.
data_small_mammal
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
unique_sample_id: Identity of unique samples; usually includes location and date information.
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name, associated with the taxonID. This is the name
of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Value of the variable(s) specified by variable_name.
unit: Unit of the values in the value column.
year: Observation year.
month: Observation month.
n_trap_nights_per_bout_per_plot: Number of trap nights per bout per plot.
n_nights_per_bout: Number of nights per bout.
nativeStatusCode: Whether the species is a native or non-native species.
'A': Presumed absent, due to lack of data indicating a taxon's presence in a
given location; 'N': Native; 'I': Introduced; 'UNK': Status unknown.
release: Version of data release by NEON.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
nlcdClass: National Land Cover Database Vegetation Type Name.
To process data we:
Exclude nights in mam_perplotnight where samplingImpractical is not "OK" (this field was added in 2020; pre-2020 records without it are retained).
Compute trapping effort as the number of unique trap coordinates deployed per night-uid, summed across all valid nights within a bout per plot (n_trap_nights_per_bout_per_plot).
Identify bouts using eventID from mam_perplotnight; fall back to year_month for records lacking an eventID.
Retain only capture records with trapStatus "5 - capture" or "4 - more than 1 capture in one trap", restricting to taxa identified to genus or finer rank.
Resolve within-bout recaptures: tagged individuals (non-missing tagID) are counted once per bout; untagged individuals (missing tagID) are each treated as unique using their row identifier.
Compute value as 100 * raw_count / n_trap_nights_per_bout_per_plot (unique individuals per 100 trap nights per plot per month).
Retain effort-only rows (bouts with no qualifying captures) with taxon_id, taxon_name, taxon_rank, and value set to NA.
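The recapture and effort-scaling steps above can be sketched in base R as follows. This is a minimal illustration on made-up records, not the package's actual code: the site and taxon codes, the `row_id` and `bout` columns, and the effort value of 200 trap nights are all assumptions chosen for the example.

```r
# Toy capture records for one plot and one bout (all values are made up).
captures <- data.frame(
  plotID  = "SITE_001",
  bout    = "2021-06",
  taxonID = c("PELE", "PELE", "PELE", "MIPE", "PELE"),
  tagID   = c("T1", "T1", "T2", NA, NA),
  row_id  = c("r1", "r2", "r3", "r4", "r5"),  # hypothetical row identifier
  stringsAsFactors = FALSE
)
n_trap_nights <- 200  # hypothetical effort summed over the valid nights of the bout

# Tagged animals are keyed by tagID, so recaptures collapse to one individual;
# untagged animals are keyed by their row identifier, so each row stays unique.
captures$individual <- ifelse(is.na(captures$tagID), captures$row_id, captures$tagID)
unique_ind <- unique(captures[, c("plotID", "bout", "taxonID", "individual")])

# Count unique individuals per taxon, then scale to 100 trap nights.
counts <- aggregate(individual ~ plotID + bout + taxonID, data = unique_ind, FUN = length)
names(counts)[names(counts) == "individual"] <- "raw_count"
counts$value <- 100 * counts$raw_count / n_trap_nights
counts
```

In this toy bout the individual tagged T1 is counted once even though it was caught twice, while the two untagged captures each count as a separate individual.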
Details of locations (e.g. latitude/longitude coordinates) can be found in neon_location.
Marta Jarzyna
One row per taxonomic group. Computed directly from the cleaned datasets by data-raw/02_clean_save_data.R.
data_summary
A data frame with the following columns:
taxon_group: Taxonomic group name (e.g. "ALGAE", "BEETLES").
neon_product_id: NEON data product ID (e.g. "DP1.20166.001").
r_object: Name of the R data object for this group (e.g. "data_algae").
n_taxa: Number of unique taxa included.
n_sites: Number of NEON sites with data.
sites: All site codes that have data, separated by |.
start_date: The earliest observation date in the dataset.
end_date: The latest observation date in the dataset.
variable_names: The variable name(s) in the value column.
units: The unit(s) of the value column.
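As a rough illustration of how one such summary row could be derived from a cleaned dataset, consider the base R sketch below. The helper `summarise_group()` and the toy data are hypothetical, and joining multiple variable names or units with | is an assumption borrowed from the sites column; the real logic lives in data-raw/02_clean_save_data.R and may differ.

```r
# Toy cleaned dataset with the standardized columns (values are made up).
toy <- data.frame(
  siteID = c("AAAA", "AAAA", "BBBB"),
  taxon_id = c("T1", "T2", "T1"),
  observation_datetime = as.Date(c("2016-05-01", "2017-07-15", "2018-09-30")),
  variable_name = "cell density",
  unit = "cells/mL",
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse one cleaned dataset into a one-row summary.
summarise_group <- function(d, taxon_group, neon_product_id, r_object) {
  data.frame(
    taxon_group = taxon_group,
    neon_product_id = neon_product_id,
    r_object = r_object,
    n_taxa = length(unique(d$taxon_id[!is.na(d$taxon_id)])),
    n_sites = length(unique(d$siteID)),
    sites = paste(sort(unique(d$siteID)), collapse = "|"),
    start_date = min(d$observation_datetime, na.rm = TRUE),
    end_date = max(d$observation_datetime, na.rm = TRUE),
    variable_names = paste(unique(d$variable_name), collapse = "|"),
    units = paste(unique(d$unit), collapse = "|"),
    stringsAsFactors = FALSE
  )
}

s <- summarise_group(toy, "ALGAE", "DP1.20166.001", "data_algae")
s
```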
This dataset was derived from NEON data portal with data product ID 'DP1.10093.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10093.001.
data_tick
A data frame (also a tibble) with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
unique_sample_id: Identity of unique samples (plotID_collectDate).
observation_datetime: Observation date and time.
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name associated with the taxonID. This is the name of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Abundance (count per square meter).
unit: Unit of the values in the value column.
LifeStage: Life stage of the tick (Larva, Nymph, or Adult).
release: Version of data release by NEON.
remarks_field: Technician notes; free text comments accompanying the record.
samplingMethod: Name or code for the method used to collect or test a sample.
targetTaxaPresent: Indicator of whether the sample contained individuals of the target taxa ('Y' or 'N').
totalSampledArea: Total area sampled (square meter).
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
nlcdClass: National Land Cover Database Vegetation Type Name.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
To clean the data, we:
Mapped life stage from tck_taxonomyProcessed sex/age codes to Larva, Nymph, or Adult; rows with unmappable values are dropped.
Aggregated taxonomy counts by sampleID, acceptedTaxonID, and LifeStage; concatenated release values.
Applied the IXOSP2 reconciliation fix: when field counts (from tck_fielddata) exceed lab counts for a given life stage, the difference is assigned to Order Ixodida (IXOSP2, "Ixodida sp.") rather than using fragile string-matching.
Left-joined downward from field effort to lab taxonomy to preserve empty drag events (targetTaxaPresent == "N"), which receive individualCount = 0.
Abundance = individualCount / totalSampledArea (count per square meter); invalid or non-finite values are dropped except for empty drag rows.
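The reconciliation and density steps above can be sketched in base R as follows. This is a simplified illustration, not the package's code: field counts are assumed to be already reshaped to one row per sample and life stage, and the column names `field_count`, `lab_count`, and `surplus`, the taxon code "IXOSCA", and all numbers are made up for the example.

```r
# Field counts per drag event and life stage (hypothetical values).
field <- data.frame(
  sampleID = c("S1", "S1", "S2"),
  LifeStage = c("Nymph", "Larva", "Nymph"),
  field_count = c(5, 10, 0),
  totalSampledArea = c(160, 160, 150),
  stringsAsFactors = FALSE
)
# Lab-identified counts per sample, taxon, and life stage (hypothetical values).
lab <- data.frame(
  sampleID = c("S1", "S1"),
  LifeStage = c("Nymph", "Larva"),
  acceptedTaxonID = c("IXOSCA", "IXOSCA"),
  individualCount = c(3, 10),
  stringsAsFactors = FALSE
)

# Where the field count exceeds the lab total, assign the surplus to IXOSP2.
lab_totals <- aggregate(individualCount ~ sampleID + LifeStage, data = lab, FUN = sum)
names(lab_totals)[names(lab_totals) == "individualCount"] <- "lab_count"
cmp <- merge(field, lab_totals, by = c("sampleID", "LifeStage"), all.x = TRUE)
cmp$lab_count[is.na(cmp$lab_count)] <- 0
cmp$surplus <- pmax(cmp$field_count - cmp$lab_count, 0)
extra <- cmp[cmp$surplus > 0, ]
extra_rows <- data.frame(
  sampleID = extra$sampleID,
  LifeStage = extra$LifeStage,
  acceptedTaxonID = rep("IXOSP2", nrow(extra)),
  individualCount = extra$surplus,
  stringsAsFactors = FALSE
)
taxa <- rbind(lab, extra_rows)

# Left join from field effort so empty drags survive with a count of zero.
out <- merge(field[, c("sampleID", "LifeStage", "totalSampledArea")], taxa,
             by = c("sampleID", "LifeStage"), all.x = TRUE)
out$individualCount[is.na(out$individualCount)] <- 0
out$value <- out$individualCount / out$totalSampledArea  # count per square meter
out
```

Here the two nymphs counted in the field but not identified in the lab end up as an IXOSP2 row, and the empty drag S2 is kept with a value of 0 and no taxon.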
Details of locations (e.g. latitude/longitude coordinates) can be found in neon_location.
Wynne Moss, Melissa Chen, Brendan Hobart, Matt Bitters
This dataset was derived from NEON data portal with data product ID 'DP1.10092.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.10092.001.
data_tick_pathogen
A data frame (also a tibble) with the following columns:
location_id: Location id (named location).
siteID: NEON site code.
plotID: Plot identifier.
unique_sample_id: Identity of unique samples (namedLocation_collectDate).
observation_datetime: Observation date and time.
taxon_id: Pathogen name (standardized).
taxon_name: Pathogen name (same as taxon_id).
taxon_rank: Taxonomic rank inferred from name ("genus" for sp./spp. names, otherwise "species").
variable_name: The variable name(s) represented by the value column.
value: Positivity rate (n_positive_test / n_tests).
unit: Unit of the values in the value column ("positive tests per pathogen per sampling event").
lifeStage: Life stage of the host tick (extracted from subsampleID).
testProtocolVersion: The protocol version used to test the sample.
release: Version of data release by NEON.
n_tests: Number of tests conducted.
n_positive_test: Number of tests that were positive.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
nlcdClass: National Land Cover Database Vegetation Type Name.
plotType: NEON plot type in which sampling occurred: tower, distributed or gradient.
To clean the data, we:
Removed tests from batches that failed quality criteria (criteriaMet != "Y" in tck_pathogenqa).
Removed samples where sampleCondition != "OK" or testResult is NA.
Applied the DNA quality fix: identified ticks (testingID) whose HardTick DNA Quality test was not "Positive" and dropped all test rows for those ticks (not just the DNA quality row itself), because pathogen results from ticks with degraded DNA are unreliable.
Removed HardTick DNA Quality and Ixodes pacificus test rows; unified "Borrelia burgdorferi" into "Borrelia burgdorferi sensu lato".
Extracted lifeStage from the last dot-delimited segment of subsampleID.
Aggregated to one row per location x date x pathogen x life stage: value = n_positive_test / n_tests.
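The DNA quality fix and the final aggregation can be sketched in base R as follows. This is a minimal illustration on made-up test records, assuming one row per tick and test with columns named `testingID`, `testPathogenName`, and `testResult`; the location name, dates, and the helper column `positive` are hypothetical.

```r
# Toy test records: three ticks, each with a DNA quality test and one pathogen test.
tests <- data.frame(
  testingID = c("t1", "t1", "t2", "t2", "t3", "t3"),
  namedLocation = "SITE_001.tck",
  collectDate = "2019-06-01",
  lifeStage = "Nymph",
  testPathogenName = c("HardTick DNA Quality", "Borrelia burgdorferi",
                       "HardTick DNA Quality", "Borrelia burgdorferi",
                       "HardTick DNA Quality", "Borrelia burgdorferi sensu lato"),
  testResult = c("Positive", "Positive", "Negative", "Positive", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

# Drop every row for ticks whose DNA quality test was not "Positive".
is_qc <- tests$testPathogenName == "HardTick DNA Quality"
bad_ticks <- unique(tests$testingID[is_qc & tests$testResult != "Positive"])
tests <- tests[!tests$testingID %in% bad_ticks, ]

# Remove the DNA quality rows themselves and unify the Borrelia names.
tests <- tests[tests$testPathogenName != "HardTick DNA Quality", ]
tests$testPathogenName[tests$testPathogenName == "Borrelia burgdorferi"] <-
  "Borrelia burgdorferi sensu lato"

# Aggregate to one row per location x date x pathogen x life stage.
tests$positive <- as.integer(tests$testResult == "Positive")
keys <- positive ~ namedLocation + collectDate + testPathogenName + lifeStage
out <- aggregate(keys, data = tests, FUN = sum)
names(out)[names(out) == "positive"] <- "n_positive_test"
out$n_tests <- aggregate(keys, data = tests, FUN = length)$positive
out$value <- out$n_positive_test / out$n_tests
out
```

Tick t2 fails the DNA quality check, so its positive Borrelia result is discarded along with its quality row; the remaining two tests give a positivity rate of 0.5.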
Details of locations (e.g. latitude/longitude coordinates) can be found in neon_location.
Melissa Chen, Wynne Moss, Brendan Hobart, Matt Bitters
This dataset was derived from NEON data portal with data product ID 'DP1.20219.001'. Details about this data product can be found at https://data.neonscience.org/data-products/DP1.20219.001. Zooplankton are collected from the water column of lakes near NEON sensor infrastructure. The type of sampler used depends on the depth of water at the sampling location. Multiple tows or traps are collected at each location and composited into a single sample. Zooplankton sampling is quantitative and based on the volume of water collected during sampling.
data_zooplankton
A data frame (also a tibble) with the following columns:
location_id: Location id (named location).
siteID: NEON site code.
unique_sample_id: Identity of unique samples (equals sampleID).
observation_datetime: Observation date and time.
taxon_id: Accepted taxon code.
taxon_name: Scientific name associated with the taxon ID.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
variable_name: The variable name(s) represented by the value column.
value: Density (count per liter).
unit: Unit of the values in the value column ("count per liter").
release: Version of data release by NEON.
samplerType: Type of sampler used to collect the sample.
towsTrapsVolume: Sample volume (liter) collected for zooplankton.
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
To clean the data, we:
Deduplicated zoo_fieldData by sampleID using slice(1) to guard against NEON's known aquatic duplicate metadata issue.
Filtered zoo_taxonomyProcessed to sampleCondition == "condition OK" (case-insensitive); summed adjCountPerBottle across size-class measurement records sharing the same sampleID and taxonID. adjCountPerBottle is already corrected for subsampling by the laboratory.
Inner-joined taxonomy to field data (inner join required because towsTrapsVolume is needed to compute density; records without field metadata are dropped).
Density = sum(adjCountPerBottle) / towsTrapsVolume (count per liter), per user guide equation 2.
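The cleaning steps above can be sketched in base R as follows. This is an illustration on made-up records rather than the package's code (which the text says uses slice(1) for deduplication); the taxon codes "DAPSP" and "CYCSP" and all counts and volumes are hypothetical.

```r
# Toy field data with one duplicated sampleID (values are made up).
field <- data.frame(
  sampleID = c("Z1", "Z1", "Z2"),
  towsTrapsVolume = c(50, 50, 20),
  stringsAsFactors = FALSE
)
# Keep the first row per sampleID (base R stand-in for slice(1) by group).
field <- field[!duplicated(field$sampleID), ]

# Toy taxonomy records; Z3 has no matching field record.
taxa <- data.frame(
  sampleID = c("Z1", "Z1", "Z1", "Z2", "Z3"),
  taxonID = c("DAPSP", "DAPSP", "CYCSP", "DAPSP", "DAPSP"),
  adjCountPerBottle = c(100, 50, 30, 40, 10),
  sampleCondition = c("condition OK", "Condition OK", "condition OK",
                      "sample damaged", "condition OK"),
  stringsAsFactors = FALSE
)

# Case-insensitive filter on sample condition.
taxa <- taxa[tolower(taxa$sampleCondition) == "condition ok", ]

# Sum counts across size-class records that share sampleID and taxonID.
sums <- aggregate(adjCountPerBottle ~ sampleID + taxonID, data = taxa, FUN = sum)

# Inner join: merge() drops taxonomy rows without field metadata by default.
joined <- merge(sums, field, by = "sampleID")
joined$value <- joined$adjCountPerBottle / joined$towsTrapsVolume  # count per liter
joined
```

Only sample Z1 survives in this example: Z2 is removed by the condition filter and Z3 by the inner join, since no volume is available to compute its density.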
Details of locations (e.g. latitude/longitude coordinates) can be found in neon_location.
Lara Jansen, Stephanie Parker
We extracted location information from all taxonomic groups and saved it as a single data frame. Note that some aquatic sites do not have latitude/longitude information.
neon_location
A data frame with the following columns:
location_id: Location id.
siteID: NEON site code.
plotID: Plot identifier (NEON site code_XXX).
latitude: The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area.
longitude: The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area.
elevation: Elevation (in meters) above sea level.
nlcdClass: National Land Cover Database Vegetation Type Name for terrestrial sites.
aquaticSiteType: Type of aquatic systems ('lake', 'river', 'stream').
Full names, types, and coordinates of all 81 NEON sites.
neon_sites
A data frame with the following columns:
Site Name: Full site name.
siteID: NEON site code.
Domain Name: Full domain name.
domainID: Unique identifier of the NEON domain.
State: The state in which the site is located.
Latitude: Latitude of the site (in decimal degrees, WGS84).
Longitude: Longitude of the site (in decimal degrees, WGS84).
Site Type: The type of the site (e.g. Core Terrestrial).
Site Subtype: Second level site type, for aquatic sites only (e.g. Lake, Wadeable Stream, Non-wadeable River).
Site Host: Host organization of the site.
This data frame was assembled from all taxonomic data products in the package. It is updated each release by data-raw/02_clean_save_data.R and preserves taxa from previous releases so that names are not lost between NEON data versions.
neon_taxa
A data frame with the following columns:
taxon_id: Accepted species code, based on one or more sources.
taxon_name: Scientific name associated with the taxon ID. This is the name of the lowest level taxonomic rank that can be determined.
taxon_rank: The lowest level taxonomic rank that can be determined for the individual or specimen.
taxon_group: The taxonomic group the taxon belongs to (e.g. "ALGAE", "BEETLES").
NEON source tables use either taxonID or acceptedTaxonID depending on the data product. Both are standardized to taxon_id here.