Discussion:
[Computational-biology] Re: processor system
e***@cox.net
2006-04-24 19:39:20 UTC
Permalink
Thank you so much for the information. I'm a student at GMU in Fairfax, VA, working on a project for a high-performance computer. Would there be any interest from bioinformatics companies in an inexpensive multiprocessor environment with a Linux OS that would be faster than desktops by using parallel processing?
Date: 2006/04/24 Mon AM 11:15:54 EDT
Subject: [Computational-biology] Re: processor system
Hello, I was wondering if anyone could tell me the processor system that
they use, or that is most commonly used, for BLAST or other bioinformatics
algorithms?
Bioinformatics algorithms are run on a variety of different machines.
Most popular are Linux boxes (with various Intel or AMD processors),
because they provide the cheapest compute power. One also sees a lot
of Mac OS X boxes (G5 and G4, particularly G4 laptops) and occasional
Windows or Solaris machines.
Note: I have mentioned operating systems rather than processors,
because they have more effect on whether or not a particular
application can run.
Almost all applications that have much penetration in the field run
under most versions of Unix and Linux, that being the closest thing
the field has to a vendor-independent system.
------------------------------------------------------------
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
(Senior member, IEEE) (Board of Directors & Chair of Education Committee, ISCB)
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Affiliations for identification only.
_______________________________________________
Comp-bio mailing list
http://www.bio.net/biomail/listinfo/comp-bio
Kevin Karplus
2006-04-24 20:00:54 UTC
Permalink
There has been some interest in parallel processing for bioinformatics
applications, but most bioinformatics applications are "embarrassingly
parallel"---that is, they consist of doing thousands of independent
computations, so simple clusters work fine. Most "high performance
computing" projects are designed by and for physicists---the tight
coupling they need for handling differential equations on large grids
is simply irrelevant to most bioinformatics applications.

The big problem is not multi-processing, but properly distributing and
maintaining data. You can't afford to have 1000 processors hitting a
single file server---even 40 processors will bring most servers to
their knees. But making thousands of copies of terabytes of data is also
impractical.

The real problems in computer architecture for bioinformatics have to
do with handling the data, not the computation. We need
high-performance data architectures, which the computer engineering
community has not paid nearly enough attention to.

Kevin Karplus
------------------------------------------------------------
Kevin Karplus ***@soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
Professor of Biomolecular Engineering, University of California, Santa Cruz
(formerly a Computer Engineering professor)
Scott Harper
2006-05-09 20:09:09 UTC
Permalink
On Mon, 24 Apr 2006 13:00:54 -0700, Kevin Karplus wrote:
[...]
The real problems in computer architecture for bioinformatics have to do
with handling the data, not the computation. We need high-performance
data architectures, which the computer engineering community has not paid
nearly enough attention to.
I agree that the main issue in bioinformatic processing is data movement.
The paragraph above may be a bit misleading, however. There is an active
community of researchers investigating the issues of parallel processing
for bioinformatics. All of the efforts consider both data movement and
processing issues.

Academically, efforts like MPIBlast and BeoBlast tend to target clusters
of standard servers. Clusters running this type of software are already
in active use by research centers around the world. Unfortunately, data
movement can be difficult in clusters, hindering realization of potential
performance. Of course, large clusters also tend to be expensive to build,
house, and maintain.

Commercially, companies like Adaptive Genomics, Cray Inc, Paracel,
TimeLogic, and (to some extent) Starbridge Systems have all made inroads
into parallel processing for bioinformatic data. Paracel and Starbridge
have recently left the commercial bioinformatic arena, but the efforts
of all of these companies have resulted in significant boosts to
biosequence data processing rates. For example, Cray has benchmarked a
Smith-Waterman algorithm (SSEARCH34) on a single 64-bit AMD Opteron at 100
million cell updates per second (MCUPS). Aligning the human X and Y chromosomes at
this rate would take (154824267*57701691 cells) / (100 MCUPS) seconds, or
about 2.8 years. An algorithm implemented on the new Cray XD1 claims to
speed up this analysis by 28 times, dropping the alignment time to 36.5
days. Tests run on the latest base model HyperSeq system from Adaptive
Genomics show a reduction of this wait time to less than 30 hours. Both
of these improvements were the result of designs that synergistically
considered basic processing needs and data flow through the systems.
Clearly architectures have been moving forward with a combined approach to
processing and data flow that provides significant improvements to
bioinformatic processing.
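The arithmetic above is easy to check. This quick sanity check in Python reproduces the quoted figures (the post's 36.5 days appears to come from rounding to 2.8 years before dividing; computing directly gives about 37 days):

```python
# Back-of-envelope check of the alignment-time figures quoted above.
# Sequence lengths are the human X and Y chromosome sizes from the post.
x_len = 154_824_267
y_len = 57_701_691
rate = 100e6  # 100 MCUPS: 100 million cell updates per second on one Opteron

cells = x_len * y_len              # full Smith-Waterman matrix
seconds = cells / rate
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")        # roughly 2.8 years

xd1_days = seconds / 28 / (24 * 3600)  # claimed 28x speedup on the XD1
print(f"{xd1_days:.1f} days")      # roughly 37 days
```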

Addressing the original poster's question, the bioinformatic community
does have an interest in multiprocessing. Whether that interest
originates at bioinformatic companies or end users may be something of a
question. If GMU can develop a system to increase processing rates beyond
those provided by the current crop of sequence alignment systems, I'm
sure someone would be interested. System results should at least be
publication-worthy, even if they fail to attract corporate attention.
Sequence databases are always growing in size, and someone is always
looking for a faster alignment system.

Some references for interested parties:
MPIBlast : http://mpiblast.lanl.gov/
BeoBlast :
http://bioinformatics.fccc.edu/software/OpenSource/beoblast/beoblast.shtml

Adaptive Genomics : http://www.adaptivegenomics.com
Cray Inc : http://www.cray.com/products/xd1/smithwaterman.html
TimeLogic : http://www.timelogic.com/
--
. Dr. Scott Harper
. Adaptive Genomics Corp.
. 620 N. Main St, Suite 103
. Blacksburg, VA 24060
. ***@AdaptiveGenomics.com, 540-552-2700
Eric Lynum
2006-05-11 00:51:43 UTC
Permalink
I'd like to know: have there been any new computational architectures that
people are looking at today other than multiprocessing? As a hardware
engineer interested in the field of bioinformatics (I'm not in the field
myself), it seems from reading the posts that Cray machines and Opteron
processors are used. Why is this; are there no other computational
architectures that are appropriate? Also, what algorithms require these
processors, and why?

-----Original Message-----
From: comp-bio-***@oat.bio.indiana.edu
[mailto:comp-bio-***@oat.bio.indiana.edu] On Behalf Of Scott Harper
Sent: Tuesday, May 09, 2006 4:09 PM
To: bionet-biology-***@moderators.isc.org
Subject: Re: [Computational-biology] Re: processor system

[...]
Scott Harper
2006-05-11 15:57:47 UTC
Permalink
Post by Eric Lynum
I'd like to know have there been any new computational architectures that
people are looking at today other than multiprocessing? As a hardware
That depends on what you mean by new architectures. Most of the
custom machines (FPGA or ASIC) are essentially new processing
architectures. Examples include the FPGA-based hardware units
from Adaptive Genomics and TimeLogic. Paracel's processing
hardware is an ASIC solution. All of these designs were
(probably) done by engineers working from the specification of an
algorithm. In the case of Adaptive Genomics, our goal was to
design a hardware solution that provides both flexibility and
high throughput for the Smith-Waterman algorithm. The result is
not based on any other particular processing architecture, although
it does make use of well-known matrix solution techniques. Of
FPGA-based solutions, Cray's XD1 solution is probably the closest to
a traditional computational architecture, since it was developed much
like a standard co-processor (their approach is described pretty well
in Steve Margerm's article "Reconfigurable Computing in Real-World
Applications," FPGA and Structured ASIC Journal, Feb 2006).

There are people developing Smith-Waterman (and other bioinformatic)
solutions for many other non-microprocessor architectures as well.
For example, Cray's traditional vector-based machines (not really "new")
have a bioinformatics library.
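For readers unfamiliar with the recurrence these machines accelerate, here is a minimal pure-software Smith-Waterman sketch. The scoring parameters are illustrative only; production tools use substitution matrices and affine gaps, and hardware designs evaluate many cells of this same recurrence per clock cycle:

```python
# Minimal Smith-Waterman local alignment score (linear gap penalty).
# Illustrative parameters only; real tools use substitution matrices
# and affine gap penalties. Hardware accelerators compute this same
# recurrence, but pipeline an entire anti-diagonal of cells at once.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]  # DP matrix, zero boundary
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                       # local: never go negative
                          h[i - 1][j - 1] + sub,   # match/mismatch
                          h[i - 1][j] + gap,       # gap in b
                          h[i][j - 1] + gap)       # gap in a
            best = max(best, h[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # 12
```

The inner max over three neighbors is why the algorithm maps so well to custom hardware: each anti-diagonal of the matrix depends only on the previous two, so all its cells can be computed in parallel.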
Post by Eric Lynum
engineer interested in the field of Bioinformatics since I'm not in the
field, it seems from reading the posts that Cray machines and Opteron
processors are used? Why is this, are there no other computational
Processors of all sorts are used. Cray probably benchmarked
the SSEARCH34 algorithm on an Opteron since that processor is a
primary component of the XD1. Of course, these days comparisons
are generally made against standard microprocessors for most
things, since that provides perspective to a broad audience.
Post by Eric Lynum
architectures that are appropriate? Also, what algorithms require these
processors and why?
Many of the algorithms used in bioinformatics are inherently
parallel (perhaps even embarrassingly so, as Kevin mentioned earlier).
While they don't _require_ a parallel architecture, they tend to perform
well on systems that exploit parallelism. For more information on how
processing architectures and algorithms work together (i.e. "what
algorithms and why") you might like H.S. Stone's "High-Performance
Computer Architecture" (Addison-Wesley) or D. Culler, J.P. Singh and A.
Gupta's "Parallel Computer Architecture : A Hardware/Software Approach"
(Morgan Kaufmann).
--
. Dr. Scott Harper
. Adaptive Genomics Corp.
. 620 N. Main St, Suite 103
. Blacksburg, VA 24060
. ***@AdaptiveGenomics.com, 540-552-2700
Jeff
2006-05-20 15:46:49 UTC
Permalink
Post by Eric Lynum
I'd like to know have there been any new computational architectures that
people are looking at today other than multiprocessing? As a hardware
engineer interested in the field of Bioinformatics since I'm not in the
field, it seems from reading the posts that Cray machines and Opteron
processors are used? Why is this, are there no other computational
architectures that are appropriate? Also, what algorithms require these
processors and why?
There are some architectures where you can simulate neurons in VLSI
chips using capacitors, resistors, and other analog electronic elements.
I think they did this at Caltech.

There are some parallel computers that are well-suited to brain simulation.

Jeff
