PHYLIP and DNADIST
Chris Hoffmann
2007-06-27 15:44:37 UTC
Hi everybody,
I was wondering about DNADIST, from the PHYLIP package.
I am conducting a big sequencing project that will have several phases. I
would like to construct a distance matrix using DNADIST with an initial
dataset and later add more sequences to the set, but I don't want
to have to re-run the program on all the sequences again. Is there a way
to insert only the new data into the matrix?
For example:
initially I want to calculate the distances among the sequences in
group A;
then, when I get group B, calculate the distances among the
sequences in group B;
and calculate the distances between the sequences in groups A and B without
having to re-calculate the distances within group A.
This is a simple example; I am actually likely to have 5 or more sets of
sequences, ranging from 5000 to 20000 sequences per group (perhaps more).
I realize I may have to adapt the code (another issue entirely), but what I
am concerned about is whether the methods used by DNADIST give reliable
results if I calculate the distances in this fashion.
I wanted to use the F84 model, the default, but I am open to suggestions.
Any help on this would be great.
Thanks
Chris
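
The bookkeeping described above (distances within A, then within B, then between A and B) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DNADIST code: the function names `jc_distance` and `extend_distances` and the toy sequences are hypothetical, all sequences are assumed to be already aligned to the same length, and the Jukes-Cantor distance is swapped in for F84 because it depends only on the two sequences being compared.

```python
import math
from itertools import combinations


def jc_distance(s1, s2):
    """Jukes-Cantor distance between two aligned sequences (illustrative, not DNADIST)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(s1.upper(), s2.upper()):
        # Skip sites with gaps or ambiguity codes in either sequence.
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = diffs / compared
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def extend_distances(dist, old, new):
    """Add distances involving `new` sequences to `dist`; old entries are left untouched.

    `dist` maps a sorted (name, name) pair to a distance; `old` and `new`
    map unique sequence names to aligned sequences.
    """
    # Distances within the new group.
    for a, b in combinations(sorted(new), 2):
        dist[(a, b)] = jc_distance(new[a], new[b])
    # Distances between the new group and the sequences already present.
    for a in new:
        for b in old:
            dist[tuple(sorted((a, b)))] = jc_distance(new[a], old[b])
    return dist


# Hypothetical toy data: phase 1 (group A), then phase 2 (group B).
group_a = {"a1": "ACGTACGTAC", "a2": "ACGTACGTAT"}
group_b = {"b1": "ACGTTCGTAC", "b2": "AAGTACGTAC"}

dist = extend_distances({}, {}, group_a)        # within A only
dist = extend_distances(dist, group_a, group_b)  # within B, plus A versus B
```

The result matches a single run over all four sequences only because this particular distance uses nothing beyond the pair being compared; a distance that uses quantities estimated from the whole dataset would not have that property.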
Joe Felsenstein
2007-06-30 02:01:22 UTC
Post by Chris Hoffmann
I was wondering about DNADIST, from the PHYLIP package.
I am conducting a big sequencing project that will have several phases. I
would like to construct a distance matrix using DNADIST with an initial
dataset and later add more sequences to the set, but I don't want
to have to re-run the program on all the sequences again. Is there a way
to insert only the new data into the matrix?
initially I want to calculate the distances among the sequences in
group A;
then, when I get group B, calculate the distances among the
sequences in group B;
and calculate the distances between the sequences in groups A and B without
having to re-calculate the distances within group A.
This is a simple example; I am actually likely to have 5 or more sets of
sequences, ranging from 5000 to 20000 sequences per group (perhaps more).
I realize I may have to adapt the code (another issue entirely), but what I
am concerned about is whether the methods used by DNADIST give reliable
results if I calculate the distances in this fashion.
1. Dnadist cannot add new distances to an existing matrix in this way; it
recomputes all of the distances each time it is run.
2. In any case, for the F84 distances the formulas use the base frequencies
found (empirically) in the input sequences. If you add more input
sequences, you will most likely have slightly altered empirical frequencies,
so you would want to recompute the original distances anyway.
3. I suspect our formulas can compute this many distances, but
4. With 20,000 sequences there are 400,000,000 distances in all (a 20,000 x
20,000 table), which, if each takes about 10 bytes, makes a table 4 GB in
size. That is too big to use. You ought therefore to reconsider your
motivation for doing this.
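
Points 2 and 4 above can both be checked with a little arithmetic. The sketch below is an illustration only, not how Dnadist is implemented: the function names and toy sequences are hypothetical, ambiguity codes and gaps are simply ignored, and the 10 bytes per entry is the rough figure from point 4.

```python
from collections import Counter


def base_frequencies(seqs):
    """Empirical A, C, G, T frequencies pooled over all sequences (illustrative)."""
    counts = Counter()
    for s in seqs:
        # Assumption: anything other than A, C, G, T is ignored.
        counts.update(ch for ch in s.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}


def square_matrix_bytes(n_seqs, bytes_per_entry=10):
    """Approximate size of a full n x n distance table written as text."""
    return n_seqs * n_seqs * bytes_per_entry


def unique_pairs(n_seqs):
    """Number of distinct unordered pairs of sequences."""
    return n_seqs * (n_seqs - 1) // 2


# Hypothetical toy groups: B is much richer in G than A.
group_a = ["AAAACCGT", "AAACCCGT"]
group_b = ["GGGGCCGT", "GGGTCCGT"]

freq_a = base_frequencies(group_a)                # frequencies from A alone
freq_pooled = base_frequencies(group_a + group_b)  # frequencies after adding B

print(freq_a["A"], freq_pooled["A"])   # 0.4375 0.21875
print(square_matrix_bytes(20_000))     # 4000000000
print(unique_pairs(20_000))            # 199990000
```

In this toy case the frequency of A drops from 0.4375 to 0.21875 once group B is pooled in, so any distance formula that uses the pooled frequencies would give different values for the pairs within group A after the addition. The pair count also shows that about half of the 400,000,000 entries in a square table are mirror images of the other half.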

I have posted this rather than emailing to the original poster because
it might be educational for others using our programs.

----
Joe Felsenstein ***@removethispart.gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA
