Mapping reads back to the assembly¶
Overview¶
There are many different mappers available to map your reads back to the assemblies. Usually they result in a SAM or BAM file (http://genome.sph.umich.edu/wiki/SAM). Those are formats that contain the alignment information, where BAM is the binary version of the plain text SAM format. In this tutorial we will be using bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
The SAM/BAM file can afterwards be processed with Picard (http://picard.sourceforge.net/) to remove duplicate reads. Those are likely to be reads that come from a PCR duplicate (http://www.biostars.org/p/15818/).
BEDTools (http://code.google.com/p/bedtools/) can then be used to retrieve coverage statistics.
There is a script available that does it all at once. Read it and try to understand what happens in each step:
less `which map-bowtie2-markduplicates.sh`
map-bowtie2-markduplicates.sh -h
Bowtie2 has some nice documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Question: what does bowtie2-build do?
Picard’s documentation also exists! Two bioinformatics programs in a row with decent documentation! Take a moment to celebrate, then have a look here: http://sourceforge.net/apps/mediawiki/picard/index.php
Question: Why not just remove all identitical pairs instead of mapping them and then removing them?
Question: What is the difference between samtools rmdup and Picard MarkDuplicates?
Mapping reads with bowtie2¶
Take an assembly and try to map the reads back using bowtie2. Do this on an interactive node again, and remember to change the ‘out_21’ part to the actual output directory that you generated:
# Create a new directory and link files
mkdir -p ~/asm-workshop/bowtie2
cd ~/asm-workshop/bowtie2
ln -s ../velvet/out_21/contigs.fa contigs.fa
ln -s ../sickle/pair1.fastq pair1.fastq
ln -s ../sickle/pair2.fastq pair2.fastq
# Run the everything in one go script.
map-bowtie2-markduplicates.sh -t 1 -c pair1.fastq pair2.fastq pair contigs.fa contigs map > map.log 2> map.err
Inspect the map.log
output and see if all went well.
Question: What is the overall alignment rate of your reads that bowtie2 reports?
Add the answer to the doc.
Some general statistics from the SAM/BAM file¶
You can also determine mapping statistics directly from the bam file. Use for instance:
# Mapped reads only
samtools view -c -F 4 map/contigs_pair-smds.bam
# Unmapped reads only
samtools view -c -f 4 map/contigs_pair-smds.bam
From:
http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/.
The number is different from the number that bowtie2 reports, because these are
the numbers after removing duplicates. The -smds
part stands for running
samtools sort
, MarkDuplicates.jar
and samtools sort
again on the
bam file. If all went well with the mapping there should also be a
map/contigs_pair-smd.metrics
file where you can see the percentage of
duplication. Add that to the doc as well.
Coverage information from BEDTools¶
Look at the output from BEDTools:
less map/contigs_pair-smds.coverage
The format is explained here
http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html. The
map-bowtie2-markduplicates.sh
script also outputs the mean coverage per
contig:
less map/contigs_pair-smds.coverage.percontig
Question: What is the contig with the highest coverage? Hint: use sort -k