Search amino acid sequences with HMMER against the Pfam database
It is time to do the actual Pfam annotation of our metagenomes!
Running hmmsearch on the translated sequence data sets
Before we run hmmsearch, we will look at its available options:
hmmsearch -h
As you will see, the program takes a substantial number of options.
In this workshop we will work with the table output from HMMER, which
you get by specifying the --tblout option together with a file
name. We also want to make sure that we only get statistically
relevant matches, which we can do using the E-value option. The
E-value (expect value) is an estimate of how many hits with an equally
good or better score we would expect to find by chance, given the size
of the database. To avoid getting a lot of noise matches, we will
specify an E-value cutoff of 10^-5, meaning that we would expect a
match with a similarly good alignment to turn up by chance in only
about 1 out of 100,000 cases. This can be set with the -E 1e-5
option. To speed up the process a little, we will also use the --cpu
option to get multi-core support. On the Uppmax machines you can
use up to 16 cores for the HMMER runs.
To specify the HMM-file database and the input data set, we just type in
the names of those two files at the end of the command. Finally, we add
> /dev/null to avoid cluttering the screen with the sequence alignments
that HMMER outputs. That should give us the following command:
hmmsearch --tblout <output file> -E 1e-5 --cpu 8 ~/Pfam/Pfam-mobility.hmm <input file (protein format)> > /dev/null
Now run this command on all four input files that we have just downloaded. When the command has finished for all files, we can move on to the normalization exercise.
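If you prefer not to type the command four times, a small shell loop can do the repetition for you. The sketch below is one possible way to set this up, not part of the official exercise. The input file names (sample1.fasta and so on), the .tblout output naming, and the RUN switch are assumptions made for illustration, so replace the names with those of the files you actually downloaded. By default the helper only prints each command so you can check it first; set RUN=1 to execute the searches.

```shell
# Helper that handles one input file.
# By default it only prints the hmmsearch command (dry run);
# set RUN=1 in the environment to actually execute it.
run_hmmsearch() {
    input=$1
    # Assumed naming scheme: strip the file extension and add .tblout
    output="${input%.*}.tblout"
    if [ "${RUN:-0}" = "1" ]; then
        hmmsearch --tblout "$output" -E 1e-5 --cpu 8 ~/Pfam/Pfam-mobility.hmm "$input" > /dev/null
    else
        echo "hmmsearch --tblout $output -E 1e-5 --cpu 8 ~/Pfam/Pfam-mobility.hmm $input > /dev/null"
    fi
}

# Hypothetical input file names; replace with your four protein files.
for f in sample1.fasta sample2.fasta sample3.fasta sample4.fasta; do
    run_hmmsearch "$f"
done
```

Saved as a script (for example under the hypothetical name run_pfam.sh), running sh run_pfam.sh prints the four commands, and RUN=1 sh run_pfam.sh executes them one after another, writing one table file per input.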