Comparative taxonomic analysis with R

KRONA is not very good for comparing multiple samples though. Instead we will use R as for the functional data. First combine the data from the different samples with the script sum_rdp_annot.pl (made by us) into a table. By default the script uses a bootstrap support of 0.80 to include a taxonomic level (this can be changed easily by changing the number on row 16 in the script). You may need to change the input file names in the beginning of the script ($in_file[0] = ...). The script will only import the 16S bacterial data:

cd ~/metagenomics/cta/rdp
sum_rdp_annot.pl > summary.rrna.silva-bac-16s-database-id85.fasta.class.0.80.tsv

Let’s import this table into R:

tab_tax <- read.delim("summary.rrna.silva-bac-16s-database-id85.fasta.class.0.80.tsv")

And assign the descriptor column to a vector:

tax <- tab_tax[,1]

And put the counts into a matrix:

tax_counts <- tab_tax[,2:ncol(tab_tax)] # counts of taxa in the different samples

Since you will compare this dataset with the functional dataset you generated before it’s great if the samples come in the same order. Check the previous order:

sample

And the current order:

colnames(tax_counts)

if they are not in the same order contact an assistant for help. Otherwise:

colnames(tax_counts) <- sample
rownames(tax_counts) <- tax

Make a normalised version of tax_counts:

norm_tax_counts <- tax_counts
for (i in 1:ncol(tax_counts)) {
    norm_tax_counts[,i] <- tax_counts[,i]/sum(tax_counts[,i])
}

What different taxa do we have there?:

tax

From the tax_counts matrix we can create new matrices at defined taxonomic levels. If you open the text file /proj/g2013206/metagenomics/r_commands.txt you can copy and paste all of this code into R (or use the source command) and this will give you the matrices and vectors below (check carefully that you didn’t get any error messages!):

phylum_counts             matrix with counts for different phyla
norm_phylum_counts          normalised version of phylum_count
phylum              vector with phyla (same order as in phyla matrix)

class_counts             matrix with counts for different classes
norm_class_counts        normalised version of class_count
class                 vector with classes

phylumclass_counts         matrix with counts for different phyla and proteobacteria classes
norm_phylumclass_counts    normalised version of phylumclass_count
phylumclass              vector with phyla and proteobacteria classes

The “other” clade in each of the above sums reads that were not classified at the defined level. Having these more well defined matrices we can compare the taxonomic composition in the samples. We can apply the commands that we did for the functional analysis:

library(vegan)
library(RColorBrewer)

Barplots:

par(mar=c(1,2,1,22)) # Increase the MARgins, to make space for a legend

barplot(norm_phylum_counts, col = rainbow(11), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+25, y=1, adj=c(0,0)))
barplot(norm_phylumclass_counts, col = rainbow(15), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+30, y=1, adj=c(0,0)))
barplot(norm_class_counts, col = rainbow(18), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+32, y=1, adj=c(0,0)))

If you can’t see the legends, they’re just in a bad position. Try altering the x and y parameters in the args.legend.

Calculate beta-diversity based on class-level taxonomic counts:

class_dist <- as.matrix(vegdist(t(norm_class_counts[1:(nrow(norm_class_counts) - 1),]), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE))

Note that by “[1:(nrow(norm_class_counts) - 1),]” we exclude the last row in norm_class_counts when we calculate the distances because this is the “others” column that contains all kinds of unclassified taxa.

Hierarchical clustering:

library(cluster)
cluster <- agnes(class_dist, diss = TRUE, method = "average")
plot(cluster, which.plots = 2, hang = -1)

Heatmaps with clusterings:

heatmap(norm_class_counts, scale = "none")
heatmap(norm_phylumclass_counts, scale = "none")

And ordinate the data by NMDS:

color = brewer.pal(length(sample), "Reds")
mds <- metaMDS(class_dist)
plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color)

Does the pattern look similar as that obtained by functional data?:

mds <- metaMDS(cogc_dist)
plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color)

We can actually check how beta diversity generated by the two approaches is correlated:

plot(cogf_dist, class_dist)
cor.test(cogf_dist, class_dist)

(For comparing matrices it is common to use a mantel test, but the r-value (but not the p-value) is in fact the same.)

Finally, let’s check how alpha-diversity fluctuates over the year and compares between taxonomic and functional data. Since alpha-diversity is influenced by sample size it is advisable to subsample the datasets to the same number of reads. We can make a subsampled table using the vegan function rrarefy:

sub_class_counts <- t(rrarefy(t(class_counts), 100))

This will be difficult to achieve for the functional data at this point, however, so let’s skip that for the functional data.

Let’s use Shannon diversity index since this is pretty insensitive to sample size. Shannon index combines richness (number of species) and evenness (how evenly the species are distributed); many, evenly distributed species gives a high Shannon. There is a vegan function for getting shannon:

class_shannon <- diversity(class_counts[1:(nrow(norm_class_counts) - 1),], MARGIN = 2)
sub_class_shannon <- diversity(sub_class_counts[1:(nrow(norm_class_counts) - 1),], MARGIN = 2)
cogf_shannon <- diversity(cogf_cov, MARGIN = 2)

How does subsampling influence shannon?:

plot(class_shannon, sub_class_shannon)

Is functional and taxonomic shannon correlated?:

plot(sub_class_shannon, cogf_shannon)
cor.test(sub_class_shannon, cogf_shannon)