Assembling reads with Velvet

In this exercise we will learn how to perform an assembly with Velvet. Velvet takes your reads as input and turns them into contigs. It consists of two steps. In the first step, velveth, the de Bruijn graph is created. Afterwards the graph is traversed and contigs are created with velvetg. When constructing the de Bruijn graph, a kmer has to be specified. Reads are cut up into pieces of length k, each representing a node in the graph, edges represent an overlap (some de Bruijn graph assemblers do this differently, but the idea is the same). The advantage of using kmer overlap instead of read overlap is that the computational requirements grow with the number of unique kmers instead of unique reads. A more detailled explanation can be found at

Pick a kmer

Please work in pairs for this assignment. Every group can select a kmer of their likings - pick a random one if you haven’t developed a preference yet. Write you and your partner’s name down at a kmer on the Google doc for this workshop.


Create the graph data structure with velveth. Again like we did with sickle, first create a directory with symbolic links to the pairs that you want to use:

mkdir -p ~/asm-workshop/velvet
cd ~/asm-workshop/velvet
ln -s ../sickle/qtrim1.fastq pair1.fastq
ln -s ../sickle/qtrim2.fastq pair2.fastq

The reads need to be interleaved for velveth: pair1.fastq pair2.fastq pair.fastq

Run velveth over the kmer you picked (21 in this example):

velveth out_21 21 -fastq -shortPaired pair.fastq

Check what directories have been created:



To get the actual contigs you will have to run velvetg on the created graph. You can vary options such expected coverage and the coverage cut-off if you want, but we do not do that in this tutorial. We only choose not to do scaffolding:

velvetg out_21 -scaffolding no


After the assembly one wants to look at the length distributions of the resulting assemblies. You can use the assemstats script for that:

assemstats 100 out_*/contigs.fa

Try to find-out what each of the stats represent by varying the cut-off. One of the most often used statistics in assembly length distribution comparisons is the N50 length, a weighted median, where you weight each contig by it’s length. This way you assign more weight to larger contigs. Fifty procent of all the bases in the assembly are contained in contigs shorter or equal to N50 length. Once you have gotten an idea of what it all the stats mean, it is time to compare your results with the other attendees of this workshop. Generate the results and copy them to the doc:

assemstats 100 out_*/contigs.fa

Do the same for the cut-off at 1000 and add it to the doc. Compare your kmer against the others. If there are very little available yet, this would be an ideal time to help out some other attendees or do the same exercise for a kmer that has not been picked by somebody else yet. Please write down you and your partners name again at the doc in that case.

Question: What are the important length statistics? Do we prefer sum over length? Should it be a combination?

Think of a formula that could indicate the best preferred length distribution where you express the optimization function in terms of the column names from the doc. For instance only n50_len or sum * n50_len.

(Optional exercise) Ray

Try to create an assembly with Ray over the same kmer. Ray is an assembler that uses MPI to distribute the assembly over multiple cores and nodes. The latest version of Ray was made to work well with metagenomics data as well:

mkdir -p ~/asm-workshop/ray
cd ~/asm-workshop/ray
ln -s ../sickle/qtrim1.fastq pair1.fastq
ln -s ../sickle/qtrim2.fastq pair2.fastq
mpiexec -n 1 Ray -k 21 -p pair1.fastq pair2.fastq -o out_21

Add the assemstats results to the doc as you did for Velvet. There is a separate tab for the Ray assemblies, compare the results with Velvet.

(Optional exercise) VelvetOptimiser

VelvetOptimiser is a script that runs Velvet multiple times and follows the optimization function you give it. Use VelvetOptimiser to find the assembly that gets the best score for the optimization function you designed in assemstats. It requires BioPerl, which you can get on uppmax with module load BioPerl.