Position-Specific Scoring Matrix vs. profile HMM?

Discussion:

harald

2006-02-12 14:40:08 UTC

Hi there.

I am a computer science student working on my first project in
sequence analysis. My first step is to extract some homologs of
a query protein from a big database. Based on what I have read in
textbooks and on the web, I am going to use PSI-BLAST or HMMER. So far
I cannot see a big difference between the Position-Specific
Scoring Matrix (PSSM) used by PSI-BLAST and the profile HMM
used by HMMER, except that a profile HMM can be computed from
unaligned sequences.

Now I have two questions:

1 - Are the following two approaches nearly equivalent?
A - Run a BLAST search with the query protein.
Take the top hits and build a multiple alignment with CLUSTALW or similar.
Use this alignment to train an HMM.
Use hmmsearch to search the database with the model.
B - Use two iterations of PSI-BLAST to search with the query protein.

2 - The HMMER tool hmmt, which was used to build a profile HMM from an
unaligned set of sequences, is no longer part of HMMER. Does anyone
know why? Is it generally better to build a multiple alignment of the
sequences before training an HMM instead of using hmmt?

With kind regards,
Harald

Michael Spitzer

2006-02-12 21:19:31 UTC

Hi,

[...] I cannot see a big difference between the Position-Specific
Scoring Matrix (PSSM) used by PSI-BLAST and the profile HMM
used by HMMER, except that a profile HMM can be computed from
unaligned sequences.

There are two papers (I know of) that compare profile HMMs and
PSI-BLAST quite nicely:

Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia
C. Sequence comparisons using multiple sequences detect three times as
many remote homologues as pairwise methods. J Mol Biol. 1998 284(4):
1201-1210

Madera M, Gough J. A comparison of profile hidden Markov model
procedures for remote homology detection. Nucleic Acids Res. 2002
30(19): 4321-4328

Both conclude (more or less) that profile HMMs retrieve approx. 10%
more true homologs than PSI-BLAST. The main (and only?) disadvantages
of profile HMMs seem to be higher computational cost and a less solid
statistical theory corresponding to E-value calculation, compared to
PSI-BLAST. The latter may have changed by now, but I'm not aware of
current publications on this at the moment.

HTH, Micha

Kevin Karplus

2006-02-14 00:40:43 UTC

Post by Michael Spitzer
The main (and only?) disadvantages
of profile HMMs seem to be higher computational cost and a less solid
statistical theory corresponding to E-value calculation, compared to
PSI-BLAST. The latter may have changed by now, but I'm not aware of
current publications on this at the moment.

Look at

Kevin Karplus, Rachel Karchin, George Shackelford, and Richard
Hughey. Calibrating E-values for hidden Markov models with
reverse-sequence null models. Bioinformatics 2005 21(22):4107-4115;
doi:10.1093/bioinformatics/bti629

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/bti629

------------------------------------------------------------
Kevin Karplus ***@soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
(Senior member, IEEE) (Board of Directors & Chair of Education Committee, ISCB)
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Affiliations for identification only.

Michael Spitzer

2006-02-19 18:50:24 UTC

Hi,

Post by Kevin Karplus
Kevin Karplus, Rachel Karchin, George Shackelford, and Richard
Hughey. Calibrating E-values for hidden Markov models with
reverse-sequence null models.

Thanks for the reference! I haven't followed the HMM literature very
closely for the last year or so. It comes too late for inclusion in my
thesis (just printed), but it is a good item for the defense, if the
topic arises. :-)

Cheers, Michael

Kevin Karplus

2006-02-13 00:35:51 UTC

Post by harald
1 - Are the following two approaches nearly equivalent?
A - Run a BLAST search with the query protein.
Take the top hits and build a multiple alignment with CLUSTALW or similar.
Use this alignment to train an HMM.
Use hmmsearch to search the database with the model.
B - Use two iterations of PSI-BLAST to search with the query protein.

These are fairly close equivalents. The biggest differences are in the
quality of the multiple alignment and in the handling of gaps: the
HMM method has position-specific gap costs, whereas PSI-BLAST applies
the same gap penalties at every position.
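
To make the gap-handling difference concrete, here is a toy sketch in Python. It is not how PSI-BLAST or HMMER actually score anything: the profile values and gap costs are invented, the alignment is taken as given, and a flat per-gap penalty stands in for real affine gap costs. It only shows how a position-specific gap cost can favor a gap in a variable region.

```python
# Toy illustration only: all scores and gap costs below are made up.

def score_alignment(columns, aligned_seq, gap_cost):
    """Score an already-aligned sequence against a profile.

    columns     : list of dicts, one per profile position (residue -> score)
    aligned_seq : string of the same length; '-' marks a gap at that position
    gap_cost    : function mapping a position index to the penalty for a gap
    """
    if len(columns) != len(aligned_seq):
        raise ValueError("alignment and profile must have the same length")
    total = 0.0
    for pos, (column, residue) in enumerate(zip(columns, aligned_seq)):
        if residue == "-":
            total -= gap_cost(pos)
        else:
            # Residues not listed in the column get a small mismatch score.
            total += column.get(residue, -1.0)
    return total

# Hypothetical 4-column profile; column 2 is a variable loop position.
profile = [
    {"A": 2.0, "G": 1.0},
    {"L": 3.0, "I": 2.0},
    {"K": 1.0, "R": 1.0},
    {"W": 5.0},
]

def uniform_gap(pos):
    """PSSM-like behavior: the same gap penalty everywhere."""
    return 4.0

PER_POSITION_GAP = [6.0, 6.0, 1.0, 6.0]  # cheap gap only in the loop column

def position_specific_gap(pos):
    """HMM-like behavior: the gap penalty depends on the position."""
    return PER_POSITION_GAP[pos]

aligned = "AL-W"  # the sequence lacks the loop residue
pssm_like_score = score_alignment(profile, aligned, uniform_gap)
hmm_like_score = score_alignment(profile, aligned, position_specific_gap)
print(pssm_like_score, hmm_like_score)
```

With these invented numbers the same gapped alignment scores higher under the position-specific costs, because the gap falls where the profile tolerates it.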

Post by harald
2 - The HMMER tool hmmt, which was used to build a profile HMM from an
unaligned set of sequences, is no longer part of HMMER. Does anyone
know why? Is it generally better to build a multiple alignment of the
sequences before training an HMM instead of using hmmt?

HMMer was built primarily as a support for Pfam, which does not
require fancy training of HMMs. If you want to build HMMs from
unaligned sequences, you are better off using the SAM package.
[Julian Gough, Kevin Karplus, Richard Hughey, and Cyrus Chothia.
Assignment of homology to genome sequences using a library of hidden
Markov models that represent all proteins of known structure.
Journal of Molecular Biology, 313:903-919, 2001.]

The current release of SAM
http://www.soe.ucsc.edu/research/compbio/sam.html
(free to academics, non-profits, and government labs) includes the
target04 script, which does a similar job to that of psiblast, but
slower and better. If you only have a few sequences, you can use the
SAM-T02 website, which uses the older target2k script, which is not
quite as sensitive.

If you are willing to risk crashes, there is a beta test site up for
the SAM-T05 server, which runs both the target2k and the target04 scripts:
http://www.soe.ucsc.edu/research/compbio/SAM_T05test
It also does local structure prediction, tertiary structure
prediction, and contact prediction, though there are still some bugs to
be worked out in the tertiary predictions before the new CASP season.

Disclaimer: I am one of the developers of SAM and of the target2k and
target04 scripts, so my views about HMMer and psiblast may be biased.

harald

2006-02-26 13:57:38 UTC

Hello,

Thanks a lot for your quick and detailed answers.
The papers comparing PSI-BLAST and HMM profiles and the one about the
statistical theory were pretty interesting.

But since the database that I want to search for homologs is very big
(~3 million sequences), I think that a tool like hmmsearch would be too slow.

Kind regards,
Harald

Kevin Karplus

2006-02-26 21:54:26 UTC

Post by harald
Thanks a lot for your quick and detailed answers.
The papers comparing PSI-BLAST and HMM profiles and the one about the
statistical theory were pretty interesting.
But since the database that I want to search for homologs is very big
(~3 million sequences), I think that a tool like hmmsearch would be too slow.

If you are doing 3 million vs. 3 million, then HMM-based methods are
probably too slow for you. If you are doing a few hundred vs. 3
million, then HMM-based methods are OK. It takes a while to do the
iterative search and alignment needed to build a decent HMM, but
scoring sequences with it is not too terrible. I routinely score all
of PDB (about 22,000 sequences), and it usually takes a couple of
minutes for an HMM of length 140. Since running time is proportional to
the number of characters, scoring 3 million sequences with one model
would take about 5 hours (less on a more modern computer). This is
feasible for hundreds of models, but not millions of models.
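
The estimate above can be reproduced with a few lines of Python. This is only back-of-the-envelope arithmetic under two assumptions that are not stated exactly in the post: "a couple of minutes" is taken as 2 minutes, and the average sequence length in the large database is taken to be similar to that in PDB, so that time scales with the number of sequences. The function name is made up for illustration.

```python
def estimated_hours(ref_seqs, ref_minutes, target_seqs):
    """Linearly scale a measured scoring time to a larger database.

    Assumes similar average sequence length in both databases, so that
    running time is proportional to the number of sequences.
    """
    return ref_minutes * (target_seqs / ref_seqs) / 60.0

# Reference point from the post: about 22,000 PDB sequences, assumed 2 minutes.
hours_per_model = estimated_hours(ref_seqs=22_000, ref_minutes=2.0,
                                  target_seqs=3_000_000)
print(f"{hours_per_model:.1f} hours per model")
```

This gives roughly four and a half hours for one model, in line with the "about 5 hours" figure.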