# Load R libraries
library(ggplot2)
library(tidyverse)
library(dplyr)

# New library: Lin's CCC
library(agRee) 
  # Must have JAGS installed: https://sourceforge.net/projects/mcmc-jags/

# Alternative library for Lin's CCC: DescTools
library(DescTools)

# New library: Bland-Atman analysis
library(blandr) 

Introduction & Background

Species richness in fish is measured using a variety of methods. High-throughput sequencing of environmental DNA (eDNA), or DNA metabarcoding, is a method of detecting and measuring ecological communities. It is advantageous because it can detect trace amounts of species in a variety of different environments and may detect rare or elusive fish species that are difficult to detect with conventional surveys (McElroy et. al. 2020).

This dataset collected peer-reviewed studies that used eDNA metabarcoding and conventional methods of surveying fish species richness. The dataset contains 37 studies that covered 121 independent sites where species richness was measured using conventional methods and eDNA metabarcoding.

We will use linear regression as covered in this course as well as two new techniques, Lin’s Cocordance coefficient and Bland-Altman analysis. We will analyze the following research questions:

Research Questions

Using Lin’s Concordance Correlation Coefficient (CCC)
This method will complement Pearson’s R coefficient when analyzing the relationship between the variables. It measures both correlation and agreement between variables. The data is evaluated on how well two variables on a scatter plot fall on a 45° line through the origin (Lin 1989). It compares two measurements of the same variable and is used to compare those methods. It is often used to measure if a method can act as well as a “gold standard”. It is similar to Pearson’s r where the values lie between -1 and 1, and relationships are considered strong the closer the value is to +1 or -1. Altman (1991) interprets Lin’s CCC as anything above ±0.8 as “excellent” and anything below ±0.2 as “poor”.

Using Bland-Atman Analysis
Otherwise known as Tukey’s mean difference plot, the Bland-Atman Analysis plot, like Lin’s CCC, compares the differences between two different methods of measurement (Giavarina 2015). Correlation studies using Pearson’s R does not measure agreement, and it is not recommended to compare two different methods of measurement (Giavarina 2015). The result is a plot of the difference between two paired measurements against the mean of two measurements. It is recommended 95% of data points lie withn 2 standard deviations of the mean difference (Giavarina 2015). They also show directionality in performance.

Both these methods will be used to compare how well eDNA metabarcoding measures species richness compared to conventional methods.

Analysis of Dataset

Step 1: Explore the dataset.

## 'data.frame':    121 obs. of  31 variables:
##  $ author       : Factor w/ 37 levels "Balasingham",..: 1 1 2 3 4 5 6 6 7 8 ...
##  $ year_pub     : int  2018 2018 2020 2018 2019 2018 2016 2016 2019 2019 ...
##  $ site         : Factor w/ 117 levels "Ain River","Backwater lake 1",..: 30 112 13 62 29 29 39 113 14 102 ...
##  $ dna_richness : int  50 54 133 13 106 131 14 14 54 38 ...
##  $ trad_richness: int  47 56 185 7 203 203 18 16 72 11 ...
##  $ dna_only     : int  12 19 66 6 NA 12 1 0 33 31 ...
##  $ trad_only    : int  9 21 120 0 NA 84 5 2 51 4 ...
##  $ shared       : int  38 35 67 7 NA 119 13 14 21 7 ...
##  $ continent    : Factor w/ 6 levels "Africa","Asia",..: 5 5 3 3 6 6 4 4 5 4 ...
##  $ system       : Factor w/ 2 levels "freshwater","marine": 1 1 2 1 1 1 1 1 2 2 ...
##  $ lentic_lotic : Factor w/ 3 levels "lentic","lotic",..: 2 2 3 2 2 2 1 2 3 3 ...
##  $ marker_count : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ locus        : Factor w/ 8 levels "12s","12s_16s",..: 7 7 6 1 1 1 1 1 1 4 ...
##  $ marker_cat   : Factor w/ 2 levels "multiple","single": 2 2 2 2 2 2 2 2 2 1 ...
##  $ gear_type    : Factor w/ 18 levels "angling","electrofishing",..: 5 12 18 2 9 9 5 2 5 5 ...
##  $ gear_cat     : Factor w/ 3 levels "multiple","single",..: 1 1 1 2 1 1 2 1 2 1 ...
##  $ match_effort : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 2 1 2 ...
##  $ volume_L     : num  90.5 83 32.7 96 1700 ...
##  $ sample_depth : Factor w/ 3 levels "multidepth","subsurface",..: 1 1 3 3 3 3 3 3 1 3 ...
##  $ sample_type  : Factor w/ 2 levels "continuous","separate": 2 2 2 2 1 1 1 1 2 2 ...
##  $ capture_type : Factor w/ 2 levels "filtration","precipitation": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Filter_type  : Factor w/ 10 levels "","CN","GF","MCE",..: 3 3 4 3 10 8 8 8 9 8 ...
##  $ pore_size_µm : Factor w/ 9 levels "","0.2","0.2 and 0.45",..: 9 9 3 9 5 8 8 8 4 4 ...
##  $ Ext_method   : Factor w/ 8 levels "CI","DNeasy BT",..: 5 5 2 7 4 2 2 2 2 6 ...
##  $ fish_reads   : Factor w/ 19 levels "","1,050,515",..: 11 15 3 1 8 2 18 17 1 1 ...
##  $ primer_count : int  1 1 1 1 1 1 1 1 1 4 ...
##  $ primer_refs  : Factor w/ 23 levels "12S-F1nioz/R1",..: 23 23 10 19 9 9 9 9 19 21 ...
##  $ primer_bp    : Factor w/ 23 levels "","100","100, 235",..: 19 19 1 14 22 22 22 22 14 15 ...
##  $ pcr_reps     : int  NA NA 2 3 12 12 12 12 3 2 ...
##  $ seq_platform : Factor w/ 8 levels "BGISEQ-500","Illumina HiSeq",..: 6 6 4 4 3 5 4 4 4 4 ...
##  $ X            : logi  NA NA NA NA NA NA ...
##        author      year_pub                    site      dna_richness   
##  Fujii    :31   Min.   :2012   French Guiana     :  2   Min.   :  0.00  
##  Valentini:23   1st Qu.:2016   Lake Windemere, UK:  2   1st Qu.:  8.00  
##  Li       :14   Median :2019   Rhone River       :  2   Median : 14.00  
##  Sard     : 8   Mean   :2018   Tier River        :  2   Mean   : 21.03  
##  Collins  : 4   3rd Qu.:2019   Ain River         :  1   3rd Qu.: 23.00  
##  Hanfling : 3   Max.   :2020   Backwater lake 1  :  1   Max.   :133.00  
##  (Other)  :38                  (Other)           :111                   
##  trad_richness       dna_only        trad_only          shared      
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  8.00   1st Qu.: 1.000   1st Qu.:  1.00   1st Qu.:  5.25  
##  Median : 13.00   Median : 3.000   Median :  2.00   Median :  9.00  
##  Mean   : 24.39   Mean   : 7.305   Mean   : 10.21   Mean   : 12.92  
##  3rd Qu.: 20.00   3rd Qu.: 7.750   3rd Qu.:  7.00   3rd Qu.: 14.00  
##  Max.   :203.00   Max.   :69.000   Max.   :188.00   Max.   :119.00  
##                   NA's   :3        NA's   :3        NA's   :3       
##          continent         system    lentic_lotic  marker_count  
##  Africa       : 1   freshwater:104   lentic:77    Min.   :1.000  
##  Asia         :40   marine    : 17   lotic :27    1st Qu.:1.000  
##  Australia    : 5                    marine:17    Median :1.000  
##  Europe       :57                                 Mean   :1.347  
##  North America:16                                 3rd Qu.:2.000  
##  South America: 2                                 Max.   :4.000  
##                                                                  
##           locus       marker_cat                             gear_type 
##  12s         :74   multiple:37   nets_traps_electrofishing_angling:31  
##  12s_cytb    :19   single  :84   electrofishing                   :23  
##  12s_16s     :11                 unspecified                      :11  
##  co1         : 5                 nets                             :10  
##  12s_co1     : 4                 nets_traps_electrofishing        : 9  
##  12s_16s_cytb: 3                 traps                            : 9  
##  (Other)     : 5                 (Other)                          :28  
##         gear_cat  match_effort    volume_L           sample_depth
##  multiple   :75   no :85       Min.   :   0.60   multidepth:19   
##  single     :42   yes:36       1st Qu.:   1.00   subsurface: 5   
##  unspecified: 4                Median :  20.00   surface   :97   
##                                Mean   : 111.25                   
##                                3rd Qu.:  54.75                   
##                                Max.   :3540.00                   
##                                NA's   :3                         
##      sample_type         capture_type  Filter_type  pore_size_µm
##  continuous: 19   filtration   :107   GF     :38   0.45   :43   
##  separate  :102   precipitation: 14   PES    :19   0.7    :34   
##                                       CN     :18          :15   
##                                              :15   1      :13   
##                                       MCE    :15   0.22   : 8   
##                                       VigiDNA: 6   1.2    : 5   
##                                       (Other):10   (Other): 3   
##            Ext_method     fish_reads   primer_count  
##  DNeasy BT      :55            :103   Min.   :1.000  
##  QIAamp Tissue  :22   1,050,515:  1   1st Qu.:1.000  
##  PowerWater     :21   1,071,824:  1   Median :1.000  
##  PowerSoil      : 8   1,178,682:  1   Mean   :1.455  
##  NucleoSpin Soil: 7   104,055  :  1   3rd Qu.:2.000  
##  PCI            : 5   2,543,400:  1   Max.   :4.000  
##  (Other)        : 3   (Other)  : 13                  
##                                                   primer_refs
##  MiFish-U (Miya 2015)                                   :36  
##  12S teleo (Valentini 2016)                             :34  
##  12S (Riaz 2011), CytB (Kocher 1989)                    :15  
##  12S (Riaz 2011), 16S (Deagle 2009)                     :10  
##  MiFish-U (Miya 2015), LerayXT, SeaDNA-short, SeaDNA-mid: 4  
##  L14735/H15149c, Am12S (Evans 2016), Ac16S (Evans 2016) : 2  
##  (Other)                                                :20  
##              primer_bp     pcr_reps                  seq_platform
##  171              :33   Min.   : 2.000   Illumina MiSeq    :103  
##  60               :33   1st Qu.: 3.000   Illumina HiSeq2500:  7  
##                   :15   Median : 8.000   Ion Torrent PGM   :  4  
##  106, 414         :15   Mean   : 7.402   Ion Chef          :  2  
##  171, 313, 55, 130: 4   3rd Qu.:12.000   Roche             :  2  
##  170              : 2   Max.   :12.000   BGISEQ-500        :  1  
##  (Other)          :19   NA's   :19       (Other)           :  2  
##     X          
##  Mode:logical  
##  NA's:121      
##                
##                
##                
##                
## 
##        author year_pub                    site dna_richness trad_richness
## 1 Balasingham     2018        Grand River, CAN           50            47
## 2 Balasingham     2018     Sydenham River, CAN           54            56
## 3      Bessey     2020      Browse Island, AUS          133           185
## 4    Bylemans     2018 Murrumbidgee River, AUS           13             7
## 5     Cantera     2019           French Guiana          106           203
## 6    Cilleros     2018           French Guiana          131           203
##   dna_only trad_only shared     continent     system lentic_lotic marker_count
## 1       12         9     38 North America freshwater        lotic            1
## 2       19        21     35 North America freshwater        lotic            1
## 3       66       120     67     Australia     marine       marine            1
## 4        6         0      7     Australia freshwater        lotic            1
## 5       NA        NA     NA South America freshwater        lotic            1
## 6       12        84    119 South America freshwater        lotic            1
##   locus marker_cat                 gear_type gear_cat match_effort volume_L
## 1   co1     single                      nets multiple           no     90.5
## 2   co1     single nets_traps_electrofishing multiple           no     83.0
## 3   16s     single                    visual multiple           no     32.7
## 4   12s     single            electrofishing   single           no     96.0
## 5   12s     single             nets_toxicant multiple           no   1700.0
## 6   12s     single             nets_toxicant multiple           no   1950.0
##   sample_depth sample_type capture_type Filter_type pore_size_µm
## 1   multidepth    separate   filtration          GF          1.2
## 2   multidepth    separate   filtration          GF          1.2
## 3      surface    separate   filtration         MCE 0.2 and 0.45
## 4      surface    separate   filtration          GF          1.2
## 5      surface  continuous   filtration     VigiDNA         0.45
## 6      surface  continuous   filtration         PES            1
##        Ext_method fish_reads primer_count                primer_refs primer_bp
## 1             PCI    344,725            1 PS1 COI (Balasingham 2018)       247
## 2             PCI    610,087            1 PS1 COI (Balasingham 2018)       247
## 3       DNeasy BT  1,071,824            1           16S (Berry 2017)          
## 4      PowerWater                       1       MiFish-U (Miya 2015)       171
## 5 NucleoSpin Soil 22,488,969            1 12S teleo (Valentini 2016)        60
## 6       DNeasy BT  1,050,515            1 12S teleo (Valentini 2016)        60
##   pcr_reps                 seq_platform  X
## 1       NA                     Ion Chef NA
## 2       NA                     Ion Chef NA
## 3        2               Illumina MiSeq NA
## 4        3               Illumina MiSeq NA
## 5       12           Illumina HiSeq2500 NA
## 6       12 Illumina MiSeq and HiSeq2500 NA

Descriptive Statistics & Plots

##       mean       sd
## 1 21.03306 25.57177
## # A tibble: 2 x 3
##   system      mean    sd
##   <fct>      <dbl> <dbl>
## 1 freshwater  16.9  21.9
## 2 marine      46.1  32.1
##       mean       sd
## 1 24.38843 37.30223
## # A tibble: 2 x 3
##   system      mean    sd
##   <fct>      <dbl> <dbl>
## 1 freshwater  18.1  29.4
## 2 marine      63.1  54.9

Step 2: Check dataset assumptions.

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_data
## W = 0.73825, p-value = 2.126e-13
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_data_1
## W = 0.75797, p-value = 8.43e-12
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_data_2
## W = 0.94867, p-value = 0.4358

Step 3: Set null and alternative hypotheses

H~o1~: There is no relationship between traditional and DNA metabrcoding richness  
H~a1~: There is a relationship between traditional and DNA metabrcoding richness    

H~o2~: There is no relationship between traditional and DNA metabrcoding richness in freshwater systems  
H~a2~: There is a relationship between traditional and DNA metabrcoding richness in freshwater systems  

H~o3~: There is no relationship between traditional and DNA metabrcoding richness in marine systems  
H~a3~: There is a relationship between traditional and DNA metabrcoding richness in marine systems  

Step 4: Set α, level of significance

α = 0.05  

Step 5: Conduct the appropriate test in R

Using Conventional Analyses: Linear Regression

## 
## Call:
## lm(formula = dna_richness ~ trad_richness, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.086  -5.892  -2.449   3.540  60.568 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.70642    1.68632    4.57  1.2e-05 ***
## trad_richness  0.54643    0.03795   14.40  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.51 on 119 degrees of freedom
## Multiple R-squared:  0.6354, Adjusted R-squared:  0.6323 
## F-statistic: 207.4 on 1 and 119 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = dna_richness ~ trad_richness, data = freshwater)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.977  -3.982  -1.509   3.232  52.147 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.81395    1.10534   4.355 3.17e-05 ***
## trad_richness  0.67076    0.03213  20.876  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.593 on 102 degrees of freedom
## Multiple R-squared:  0.8103, Adjusted R-squared:  0.8085 
## F-statistic: 435.8 on 1 and 102 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = dna_richness ~ trad_richness, data = marine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.432 -18.290  -1.869   9.810  59.085 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    30.0901    11.2392   2.677   0.0172 *
## trad_richness   0.2542     0.1361   1.867   0.0816 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.91 on 15 degrees of freedom
## Multiple R-squared:  0.1886, Adjusted R-squared:  0.1345 
## F-statistic: 3.485 on 1 and 15 DF,  p-value: 0.08158
##           R
## 1 0.7970958
## # A tibble: 2 x 2
##   system         R
##   <fct>      <dbl>
## 1 freshwater 0.900
## 2 marine     0.434
## [1] 0.6353618
## [1] 0.8103448
## [1] 0.1885516

Lin’s Concordance Correlation Coefficient (CCC)

##         CCC     Lower     Upper
## 1 0.7379296 0.6691936 0.8051726
##         CCC    Lower     Upper
## 1 0.8609523 0.816461 0.8988629
##         CCC      Lower    Upper
## 1 0.3426183 0.01196099 0.657929

In marine systems, CCC confidence interval overlaps 0, and thus there is no significant difference from 0.

Bland-Atman Analysis

The blue band is the mean (dashed line) with 95% confidence intervals, and the green and red bands indicate two standard deviations above and below the mean respectively with 95% confidence intervals. Anything above the mean difference would mean eDNA metabarcoding identifying more species, and anything below the mean difference would indicate conventional methods identifying more species.

Step 6: If p < α, reject null.

Reject Ho1
Reject Ho2
Accept Ho3

Step 7: Draw conclusion.

A linear regression was run to determine the relationship between traditional methods of measuring species richness and eDNA metabarcoding methods in both marine and freshwater systems. In both systems, there was a significant positive relationship (F(1, 119) = 207.4, p < 2.2e-16), with approximately 64% of variation explained (R^2 = 0.6353618). However, separating the measured species richness by system, there is only a significant relationship between the methods in freshwater systems (F(1, 102) = 435.8, p < 2.2e-16), where there is a positive relationship and 81% of variation is explained (R^2 = 0.8103448).

Lin’s CCC showed moderate agreement in both systems (n = 121, CCC = 0.74, CI95%: 0.66, 0.80), which indicates agreement between traditional and eDNA metabarcoding methods. However, evaluating methods by different systems showed only good agreement in freshwater systems (n = 104, CCC = 0.86, CI95%: 0.81, 0.90) but not in marine systems (n = 17, CCC = 0.35, CI95%: -0.04, 0.65). Lin’s CCC is a more appropriate measure for evaluating if eDNA metabarcoding methods measure the same amount of species richness as conventional methods. High correlation does not mean there is agreement.

Bland-Altman plots were also constructed between traditional and metabarcoding methods for both systems, freshwater systems, and marine systems. The results were similar to Lin’s CCC results where eDNA metabarcoding agrees with, or calibrates well with traditional methods of measuring species richness.

Based on these results, it can be concluded that eDNA methods perform just as well as traditional methods in freshwater systems. It is not clear if there is agreement in marine systems and there were not as many observational sites in marine systems than freshwater systems. Some species were only detected by one or the other method, meaning they would work best as complementary methods to capture all species diversity.

Citation of Dataset

McElroy, Mary E.; Dressler, Terra L.; Titcomb, Georgia C.; Wilson, Emily A.; Deiner, Kristy; Dudley, Tom L.; et al. (2020): Table_7_Calibrating Environmental DNA Metabarcoding to Conventional Surveys for Measuring Fish Species Richness.XLSX. Frontiers. Dataset. https://doi.org/10.3389/fevo.2020.00276.s007

References

McElroy, M. E., Dressler, T. L., Titcomb, G. C., Wilson, E. A., Deiner, K., Dudley, T. L., … & Lamberti, G. A. (2020). Calibrating environmental DNA metabarcoding to conventional surveys for measuring fish species richness. Frontiers in Ecology and Evolution, 8, 276.

Lawrence, I., & Lin, K. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 255-268.

Giavarina D. (2015). Understanding Bland Altman analysis. Biochemia medica, 25(2), 141–151. https://doi.org/10.11613/BM.2015.015

Altman DG. (1991). Practical statistics for medical research. London: Chapman and Hall.

Bland, J. M., and Altman, D. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327, 307–310. doi: 10.1016/s0140-6736(86)90837-8

R Packages Used

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.2. https://CRAN.R-project.org/package=dplyr

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Datta, D. (2017). blandr: a Bland-Altman Method Comparison package for R. Zenodo. DOI:10.5281/zenodo.824514 https://github.com/deepankardatta/blandr

Dai Feng (2020). agRee: Various Methods for Measuring Agreement. R package version 0.5-3. https://CRAN.R-project.org/package=agRee

Andri Signorell et mult. al. (2020). DescTools: Tools for descriptive statistics. R package version 0.99.38.