Discussion:
IBM to build Opteron-Cell hybrid supercomputer of 1 PetaFlop performance
A***@gmail.com
2006-09-06 01:00:41 UTC
Permalink
http://news.zdnet.com/2100-9584_22-6112439.html

IBM to build Opteron-Cell hybrid supercomputer
By Stephen Shankland, CNET News.com
Published on ZDNet News: September 5, 2006, 1:12 PM PT



IBM has won a bid to build a supercomputer called Roadrunner that will
include not just conventional Opteron chips but also the Cell processor
used in the Sony PlayStation, CNET News.com has learned.

The supercomputer, for the Los Alamos National Laboratory, will be the
world's fastest machine and is designed to sustain a performance level
of a "petaflop," or 1 quadrillion calculations per second, said U.S.
Sen. Pete Domenici earlier this year. Bidding for the system opened in
May, when a congressional subcommittee allocated $35 million for the
first phase of the project, said Domenici, a Republican from New
Mexico, where the nuclear weapons lab is located.

Now sources familiar with the machine have said that IBM has won the
contract and that the National Nuclear Security Administration is
expected to announce the deal in coming days. The system is expected to
be built in phases, beginning in September and finishing by 2007 if the
government chooses to build the full petaflop system.

There's plenty of competition in the high-end supercomputing race,
though. Japan's Institute of Physical and Chemical Research, called
RIKEN, announced in June that it had completed its Protein Explorer
supercomputer. The Protein Explorer reached the petaflop level, RIKEN
said, though not using the conventional Linpack supercomputing speed
test.

Representatives of IBM and Los Alamos declined to comment for this
story. The NNSA, which oversees U.S. nuclear weapons work at Los Alamos
and other sites, didn't immediately respond to a request for comment.

Hybrid supercomputers
The Roadrunner system, along with the Protein Explorer and the
seventh-fastest supercomputer, Tokyo Institute of Technology's Tsubame
system built by Sun Microsystems, illustrates a new trend in
supercomputing: combining general-purpose processors with
special-purpose accelerator chips.

"Roadrunner is emphasizing acceleration technologies. Coprocessor
acceleration is intrinsic to that particular design," said John
Gustafson, chief technology officer of start-up ClearSpeed
Technologies, which sells the accelerator add-ons used in the Tsubame
system. (Gustafson was referring to the Roadrunner project in general,
not to IBM's winning bid, of which he disclaimed knowledge.)

IBM's BladeCenter systems are amenable to the hybrid approach. A single
chassis can accommodate both general-purpose Opteron blade servers and
Cell-based accelerator systems. The BladeCenter chassis includes
high-speed communications links among the servers, and one source said
the blades will be used in Roadrunner.

Advanced Micro Devices' Opteron processor is used in supercomputing
"cluster" systems that spread computing work across numerous small
machines joined with a high-speed network. In the case of Roadrunner,
the Cell processor, designed jointly by IBM, Sony and Toshiba, provides
the special-purpose accelerator.

Cell originally was designed to improve video game performance in the
PlayStation 3 console. The single chip's main processor core is
augmented by eight special-purpose processing cores that can help with
calculations such as simulating the physics of virtual worlds. Those
engines also are amenable to scientific computing tasks, IBM has said.

Using accelerators "expands dramatically" the amount of processing a
computer can accomplish for a given amount of electrical power,
Gustafson said.

"If we keep pushing traditional microprocessors and using them as
high-performance computing engines, they waste a lot of energy. When
you get to the petascale regions, you're talking tens of megawatts when
using traditional x86 processors" such as Opteron or Intel's Xeon, he
said.

"A watt is about a dollar a year if you have the things on all the
time," so 10 megawatts equates to $10 million per year in operating
expenses, Gustafson said.

A new partnership
The Los Alamos-IBM alliance is noteworthy for another reason as well.
The Los Alamos lab has traditionally favored supercomputers from
manufacturers other than IBM, including Silicon Graphics, Compaq and
Linux Networx. Its sister lab and sometimes rival, Lawrence Livermore,
has had the Big Blue affinity, housing the current top-ranked
supercomputer, Blue Gene/L.

Livermore also houses earlier Big Blue behemoths such as ASC Purple,
ASCI White and ASCI Blue Pacific. (ASCI stood for the Accelerated
Strategic Computing Initiative, a federal effort to hasten
supercomputing development to perform nuclear weapons simulation work,
but has since been modified to the Advanced Simulation and Computing
program.)

Blue Gene/L has a sustained performance of 280 teraflops, just more
than one-fourth of the way to the petaflop goal.

The U.S. government has become an avid supercomputer customer, using
the machines for simulations to ensure nuclear weapons will continue to
work even as they age beyond their original design lifespans. Such
physics simulations have grown increasingly sophisticated, moving from
two to three dimensions, but more is better. Los Alamos expects
Roadrunner will increase the detail of simulations by a factor of 10,
one source said.

For the twice-yearly ranking of supercomputers called the Top500 list,
computers are ranked on the basis of a benchmark called Linpack that
measures how many floating-point operations per second--"flops"--a machine can
perform. Linpack is a convenient but incomplete representation of a
machine's total ability, but it's nevertheless widely watched.

IBM has dominated the Top500 list with its Blue Gene/L supercomputing
designs. But U.S. models haven't always led, and there's been some
international rivalry: A Japanese system, NEC's Earth Simulator, topped
the list for years.

IBM and petaflop computing are no strangers. Although customers can buy
the current Blue Gene/L systems or rent their processing power from
IBM, Blue Gene actually began as a research project in 2000 to reach
the petaflop supercomputing level.
YKhan
2006-09-06 04:27:33 UTC
Permalink
Post by A***@gmail.com
http://news.zdnet.com/2100-9584_22-6112439.html
IBM to build Opteron-Cell hybrid supercomputer
By Stephen Shankland, CNET News.com
Published on ZDNet News: September 5, 2006, 1:12 PM PT
Hmm, I wonder if this is part of AMD's Torrenza initiative? That is, is
the Cell processor going to use Coherent Hypertransport links?

Yousuf Khan
George Macdonald
2006-09-06 19:46:42 UTC
Permalink
Post by YKhan
Post by A***@gmail.com
http://news.zdnet.com/2100-9584_22-6112439.html
IBM to build Opteron-Cell hybrid supercomputer
By Stephen Shankland, CNET News.com
Published on ZDNet News: September 5, 2006, 1:12 PM PT
Hmm, I wonder if this is part of AMD's Torrenza initiative? That is, is
the Cell processor going to use Coherent Hypertransport links?
And/Or this could be the explanation for AMD taking a paid license to
Rambus IP a while back??
--
Rgds, George Macdonald
Del Cecchi
2006-09-06 21:10:48 UTC
Permalink
Post by George Macdonald
Post by YKhan
Post by A***@gmail.com
http://news.zdnet.com/2100-9584_22-6112439.html
IBM to build Opteron-Cell hybrid supercomputer
By Stephen Shankland, CNET News.com
Published on ZDNet News: September 5, 2006, 1:12 PM PT
Hmm, I wonder if this is part of AMD's Torrenza initiative? That is, is
the Cell processor going to use Coherent Hypertransport links?
And/Or this could be the explanation for AMD taking a paid license to
Rambus IP a while back??
I would say "neither" based on the following in the press release..

"Designed specifically to handle a broad spectrum of scientific and
commercial applications, the supercomputer design will include new,
highly sophisticated software to orchestrate over 16,000 AMD Opteron(TM)
processor cores and over 16,000 Cell B.E. processors in tackling some of
the most challenging problems in computing today. The revolutionary
supercomputer will be capable of a peak performance of over 1.6
petaflops (or 1.6 thousand trillion calculations per second).

The machine is to be built entirely from commercially available hardware
and based on the Linux(R) operating system. IBM(R) System x(TM) 3755
servers based on AMD Opteron technology will be deployed in conjunction
with IBM BladeCenter(R) H systems with Cell B.E. technology. Each system
used is designed specifically for high performance implementations."

So you can look up the Cell Blades and the 3755 server.
--
Del Cecchi
"This post is my own and doesn't necessarily represent IBM's positions,
strategies or opinions."
Yousuf Khan
2006-09-07 03:43:27 UTC
Permalink
Post by Del Cecchi
I would say "neither" based on the following in the press release..
"Designed specifically to handle a broad spectrum of scientific and
commercial applications, the supercomputer design will include new,
highly sophisticated software to orchestrate over 16,000 AMD Opteron(TM)
processor cores and over 16,000 Cell B.E. processors in tackling some of
the most challenging problems in computing today. The revolutionary
supercomputer will be capable of a peak performance of over 1.6
petaflops (or 1.6 thousand trillion calculations per second).
The machine is to be built entirely from commercially available hardware
and based on the Linux(R) operating system. IBM(R) System x(TM) 3755
servers based on AMD Opteron technology will be deployed in conjunction
with IBM BladeCenter(R) H systems with Cell B.E. technology. Each system
used is designed specifically for high performance implementations."
I wonder what the rationale is behind using two different instruction
set architectures? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?

Yousuf Khan
Scott Michel
2006-09-07 16:30:12 UTC
Permalink
Post by Yousuf Khan
Post by Del Cecchi
I would say "neither" based on the following in the press release..
"Designed specifically to handle a broad spectrum of scientific and
commercial applications, the supercomputer design will include new,
highly sophisticated software to orchestrate over 16,000 AMD Opteron(TM)
processor cores and over 16,000 Cell B.E. processors in tackling some of
the most challenging problems in computing today. The revolutionary
supercomputer will be capable of a peak performance of over 1.6
petaflops (or 1.6 thousand trillion calculations per second).
The machine is to be built entirely from commercially available hardware
and based on the Linux(R) operating system. IBM(R) System x(TM) 3755
servers based on AMD Opteron technology will be deployed in conjunction
with IBM BladeCenter(R) H systems with Cell B.E. technology. Each system
used is designed specifically for high performance implementations."
I wonder what the rationale is behind using two different instruction
set architectures? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.
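To give a concrete flavor of that hand-vectorization burden: on the SPUs this
would be done with spu_intrinsics.h types like vec_float4, but a rough,
portable sketch of the same idea can be written with GCC's generic vector
extension (the four-wide vector type and the function names below are my own
illustrative stand-ins, not the actual Cell SPU API):

```c
#include <stddef.h>
#include <string.h>

/* what the "average C coder" writes: a plain scalar loop */
void scale_add_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* the same loop hand-vectorized four floats at a time using GCC's
   generic vector extension; n is assumed to be a multiple of 4 to
   keep the sketch short (a real kernel needs a scalar tail loop) */
typedef float v4sf __attribute__((vector_size(16)));

void scale_add_vec(size_t n, float a, const float *x, float *y)
{
    v4sf va = {a, a, a, a};
    for (size_t i = 0; i < n; i += 4) {
        v4sf vx, vy;
        memcpy(&vx, x + i, sizeof vx);  /* unaligned-safe vector load */
        memcpy(&vy, y + i, sizeof vy);
        vy += va * vx;                  /* one 4-wide multiply-add */
        memcpy(y + i, &vy, sizeof vy);
    }
}
```

The point is less the code than the mental shift: the programmer, not the
compiler, picks the vector width, handles alignment, and deals with the
leftover elements.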

Having x86_64 around means that you can run a chunk of code using
well-understood tools.

I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. IBM's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.

<shameless plug> Some of the afternoon speakers at my "General-Purpose
GPU: Practice and Experience" workshop will be talking about these very
issues. Workshop's web page is at http://www.gpgpu.org/sc2006/workshop/
</shameless plug>
Chris Thomasson
2006-09-07 23:20:14 UTC
Permalink
Post by Scott Michel
Post by Yousuf Khan
Post by Del Cecchi
I would say "neither" based on the following in the press release..
"Designed specifically to handle a broad spectrum of scientific and
commercial applications, the supercomputer design will include new,
highly sophisticated software to orchestrate over 16,000 AMD Opteron(TM)
processor cores and over 16,000 Cell B.E. processors in tackling some of
the most challenging problems in computing today. The revolutionary
supercomputer will be capable of a peak performance of over 1.6
petaflops (or 1.6 thousand trillion calculations per second).
The machine is to be built entirely from commercially available hardware
and based on the Linux(R) operating system. IBM(R) System x(TM) 3755
servers based on AMD Opteron technology will be deployed in conjunction
with IBM BladeCenter(R) H systems with Cell B.E. technology. Each system
used is designed specifically for high performance implementations."
I wonder what the rationale is behind using two different instruction
set architectures? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.
Isn't the instruction-set for the Cell dependent on what memory accesses you
are going to use? Access to local memory vs. accessing remote memory of
sorts...
Scott Michel
2006-09-09 00:17:54 UTC
Permalink
Post by Chris Thomasson
Post by Scott Michel
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.
Isn't the instruction-set for the Cell dependent on what memory accesses you
are going to use? Access to local memory vs. accessing remote memory of
sorts...
No question that data and message orchestration are going to be keeping
compiler researchers very happy for the foreseeable future. Your
question only applies to the SPUs, however. Existing tools will work
just fine on the PPC64 primary processor.

But the original question remains: why both Cell and Opteron...?
Tom Horsley
2006-09-09 22:43:28 UTC
Permalink
Post by Scott Michel
But the original question remains: why both Cell and Opteron...?
Opteron so they can get the performance they need?
Cell because IBM makes 'em and they can unload a bunch
of them on the gummint while they are at it?
(Just a theory :-).
Chris Thomasson
2006-09-08 00:02:20 UTC
Permalink
[...]
Post by Scott Michel
I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. IBM's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too.
Some nit picking here, sorry:


What threading intricacies, exactly? FWIW, I address scalability with
lock-free reader patterns and high-performance memory allocators:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/24c40d42a04ee855/c36b50d37c2ebaca?hl=en#c36b50d37c2ebaca


I would not feel intimidated by Niagara.. No special compilers are needed...
Just C, POSIX, and SPARC V9 assembly language will get you outstanding
scalability and throughput characteristics on UltraSPARC T1...


Any thoughts?



BTW, I would be happy to discuss 64-bit lock-free programming on Niagara...
I have a T2000 and I can assert that all of the "threading intricacies" are
efficiently solved through clever use of lock-free programming...
Scott Michel
2006-09-09 00:24:49 UTC
Permalink
Wouldn't be USENET if there weren't... :-)
Post by Chris Thomasson
What threading intricacies, exactly? FWIW, I address scalability with
Hot lock contention that ends up serializing threads. More of a poor
programming practice in multithreaded applications than a processor
problem. It's something that has to be considered, although compilers
won't necessarily dig one out of that hole.
Post by Chris Thomasson
I would not feel intimidated by Niagara.. No special compilers are needed...
Just C, POSIX, and SPARC V9 assembly language will get you outstanding
scalability and throughput characteristics on UltraSPARC T1...
I'm not intimidated by Niagara. My agenda is twofold: (a) doing
technology refresh risk assessments for various customers, (b) looking
for the next cool research topic for the next 5-year research epoch.
Lock-free is usually good (personally, I've always been a fan of
LL-SC), but sometimes seemed to lead to pathological conditions.
Pathological conditions are generally bad for embedded or space
systems.
Post by Chris Thomasson
BTW, I would be happy to discuss 64-bit lock-free programming on Niagara...
I have a T2000 and I can assert that all of the "threading intricacies" are
efficiently solved through clever use of lock-free programming...
Cool. Would like to hear more about better practices.
Chris Thomasson
2006-09-09 01:19:38 UTC
Permalink
Post by Scott Michel
Wouldn't be USENET if there weren't... :-)
Post by Chris Thomasson
What threading intricacies, exactly? FWIW, I address scalability with
Hot lock contention that ends up serializing threads.
Yeah... You can distribute the locks with a hash to help out in this area:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?lnk=gst&q=multi-mutex&rnum=1#3ca11e0c3dcf762c


Something like lock-based transactional memory...
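A minimal POSIX-threads sketch of that hashed-lock ("multi-mutex") idea; the
table size, the bit-mixing step, and the names are my own illustration, not
taken from the linked post:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NLOCKS 64  /* power of two, so we can mask instead of mod */

static pthread_mutex_t lock_table[NLOCKS];

/* call once before any lock_for() use */
static void lock_table_init(void)
{
    for (size_t i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&lock_table[i], NULL);
}

/* hash an object's address to one of NLOCKS mutexes, so contention is
   spread across the table instead of serializing on one hot lock */
static pthread_mutex_t *lock_for(const void *addr)
{
    uintptr_t h = (uintptr_t)addr;
    h ^= h >> 9;  /* mix bits so neighboring objects land on different locks */
    return &lock_table[h & (NLOCKS - 1)];
}
```

Two threads touching unrelated objects now usually take different mutexes; the
cost is that an operation spanning several objects must acquire its locks in a
fixed order (say, by table index) to stay deadlock-free, which is where the
lock-based transactional-memory flavor comes from.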




FWIW, here are some of my thoughts on transactional memory:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f6399b3b837b0a40/5f4afc338f3dd221?hl=en#5f4afc338f3dd221


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9c572b709248ae64/eefe66fd067bdb67?hl=en#eefe66fd067bdb67


http://groups.google.com/group/comp.programming.threads/msg/7c4f5ba87e36fd79?hl=en


As you can see, I don't like transactional memory very much...

;^(...
Post by Scott Michel
More of a poor
programming practice in multithreaded applications than a processor
problem. It's something that has to be considered, although compilers
won't necessarily dig one out of that hole.
Agreed.
Post by Scott Michel
Post by Chris Thomasson
I would not feel intimidated by Niagara.. No special compilers are needed...
Just C, POSIX, and SPARC V9 assembly language will get you outstanding
scalability and throughput characteristics on UltraSPARC T1...
I'm not intimidated by Niagara.
Good to hear... Stuff like this has me wary of programmers' skills wrt
multi-threading:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b192c5ffe9b47926/5301d091247a4b16?hl=en#5301d091247a4b16
(read all)


IEEE fellow seems to think threads are far too complicated for any "normal"
programmer to even begin to grasp...
Post by Scott Michel
My agenda is twofold: (a) doing
technology refresh risk assessments for various customers, (b) looking
for the next cool research topic for the next 5-year research epoch.
Lock-free is usually good (personally, I've always been a fan of
LL-SC),
Yeah.. More on this at *end of msg...
Post by Scott Michel
but sometimes seemed to lead to pathological conditions.
Pathological conditions are generally bad for embedded or space
systems.
Please clarify...

Well, it has been my experience that "loopless" lock-free algorithms are the
best for real-time systems... For instance, take a lock-free
single-producer/single-consumer queue... If a real-time system is
going to use this queue, it has to have an explicit answer for exactly how
long its push and pop operations will take, no matter what the load on the
system is like... For a lock-free queue to be usable in a hard real-time
system, it has to be able to assert that its push operation is loopless and
takes exactly X instructions, and its pop operation is loopless and takes
exactly X instructions.
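As a hedged sketch of what "loopless" means here (written with C11 atomics,
which are newer than this thread; the ring size and names are illustrative):
both operations below execute a fixed, bounded instruction sequence with no
retry loop, which is exactly the property a hard real-time budget needs. It
is safe only for one producer thread and one consumer thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 16  /* power of two */

typedef struct {
    int buf[QCAP];
    _Atomic size_t head;  /* consumer-owned index */
    _Atomic size_t tail;  /* producer-owned index */
} spsc_queue;

/* producer side: bounded work, fails (rather than spins) when full */
static bool spsc_push(spsc_queue *q, int v)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)
        return false;                  /* full */
    q->buf[t & (QCAP - 1)] = v;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

/* consumer side: bounded work, fails (rather than spins) when empty */
static bool spsc_pop(spsc_queue *q, int *out)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h)
        return false;                  /* empty */
    *out = q->buf[h & (QCAP - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```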


Here is an example of my implementation of such a queue:


http://appcore.home.comcast.net/

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/205dcaed77941352/d154b56f0f233cef?hl=en#d154b56f0f233cef


LL/SC does not really fit the bill... You have to implement logic that uses
LL/SC in a loop. You can predict exactly how many times a thread will retry.
It's similar to the live-lock-like situations that are inherent in
obstruction-free algorithms...


Is this the kind of 'pathological conditions' you were getting at?
Post by Scott Michel
Post by Chris Thomasson
BTW, I would be happy to discuss 64-bit lock-free programming on Niagara...
I have a T2000 and I can assert that all of the "threading intricacies" are
efficiently solved through clever use of lock-free programming...
Cool. Would like to hear more about better practices.
*Well, read all of this to start off:


http://groups.google.com/group/comp.arch/browse_frm/thread/71f8e0094e353e5/04cb5e2ca2a7e19a?lnk=gst&q=chris+thomasson&rnum=6#04cb5e2ca2a7e19a


Where do you want to go from here?

Humm...
Chris Thomasson
2006-09-09 01:21:55 UTC
Permalink
Ooops!
[...]
Post by Chris Thomasson
LL/SC does not really fit the bill... You have to implement logic that
uses LL/SC in a loop. You can predict exactly how many times a thread will
retry.
^^^^^^^^^^


You CAN'T predict exactly how many times a thread will retry.
Post by Chris Thomasson
It's similar to the live-lock-like situations that are inherent in
obstruction-free algorithms...
Is this the kind of 'pathological conditions' you were getting at?
[...]


Sorry for any confusion.
Yousuf Khan
2006-09-08 01:43:29 UTC
Permalink
Post by Scott Michel
Post by Yousuf Khan
I wonder what the rationale is behind using two different instruction
set architectures is? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.
But that's quite the hedge, 16,000 Opterons to back up 16,000 Cells?

What I was really getting at was whether there's some particular set of
FP problems that are done better on Opteron, while others are done better on
Cell?

Also, Cray seems to create systems with management processors, where a
few processors are dedicated to tasks such as traffic management and
I/O access. Perhaps the Opterons are better at this sort of task than
the Cells?

Speaking of Cray, they seem to be getting very fond of pairing Opterons
with Clearspeed processors now.

Yousuf Khan
YKhan
2006-09-08 12:01:16 UTC
Permalink
Post by Yousuf Khan
Speaking of Cray, they seem to be getting very fond of pairing Opterons
with Clearspeed processors now.
Yousuf Khan
Sorry, instead of Clearspeed that should read: DRC Computer's chips.

DRC Computer Corporation
http://www.drccomputer.com/

I think Sun is packaging Clearspeed chips with their Opterons, rather
than Cray. Lots of choices available I guess.

Yousuf Khan
Scott Michel
2006-09-09 00:31:16 UTC
Permalink
Post by Yousuf Khan
Post by Scott Michel
Post by Yousuf Khan
I wonder what the rationale is behind using two different instruction
set architectures is? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.
But that's quite the hedge, 16,000 Opterons to back up 16,000 Cells?
What I was really getting at was whether there's some particular set of
FP problems that are done better on Opteron, while others are done better on
Cell?
Cell's single FP is just like nVidia and ATI GPUs: They round to 0
(truncate). This means that you have to resort to iterative
error-correcting algorithms to compensate for the inevitable numerical
drift. You don't want to take a significant double FP performance hit on
Cell (LLNL already has a paper out on this that circulated in the
newsgroup a while back.)

It turns out that even with this implementation of single FP and having
to iterate, you're still going to be faster than the double FP unit.
Turns out to be true on Intel's superscalar too.
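The "iterative error-correcting" step described above is typically
Newton-Raphson refinement: start from the hardware's crude truncated estimate
and apply a correction that roughly doubles the number of correct bits per
step. A scalar sketch for refining a reciprocal (the hardware estimate
instruction itself is not modeled; the starting guess in the usage below is
just a deliberately bad value):

```c
/* one Newton-Raphson step toward 1/d: r' = r * (2 - d*r).
   If r has about k correct bits, r' has about 2k. */
static float recip_refine(float d, float r)
{
    return r * (2.0f - d * r);
}
```

Three steps from even a one-decimal-digit guess already approach full single
precision, which is why a fast-but-truncating unit plus a couple of cheap
iterations can still beat a slow fully rounded one.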
Post by Yousuf Khan
Also Cray seems to create systems with management processors, where a
few processors are dedicated to the tasks such as traffic management and
i/o access. Perhaps the Opterons are better at this sort of task than
the Cells?
Dunno. My personal opinion is that it's just risk reduction given the
state of the developer tools.
Chris Thomasson
2006-09-08 03:57:00 UTC
Permalink
Post by Scott Michel
Niagara's threading intricacies too
[...]

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b192c5ffe9b47926/5301d091247a4b16?hl=en#5301d091247a4b16

(read all)
Sander Vesik
2006-09-08 21:03:24 UTC
Permalink
Post by Scott Michel
I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. IBM's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially more so compilers that can work
their magic on bad old code.
--
Sander

+++ Out of cheese error +++
Thor Lancelot Simon
2006-09-09 15:39:50 UTC
Permalink
Post by Sander Vesik
Post by Scott Michel
I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. IBM's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially more so compilers that can work
their magic on bad old code.
That sounds about right to me.

Even for traditional SIMD MPP designs compilers don't do a very good
job with naive code, and those designs (and compilers targeting them)
have been around for decades now. Conversely, there have been several
languages and language extensions targeting such hardware that do let
programmers write good parallel code with a minimum of pain -- which
almost nobody has adopted. In my own personal experience, llc and mpc
from the mid-1980s come immediately to mind: simple extensions to C
giving parallel datatypes and operations, with an open-source compiler,
which nobody outside of one small research group ever adopted. Such
a language might be well suited to things like Cell -- but don't count
on anyone ever learning to use it.
--
Thor Lancelot Simon ***@rek.tjls.com

"We cannot usually in social life pursue a single value or a single moral
aim, untroubled by the need to compromise with others." - H.L.A. Hart
Tom Horsley
2006-09-09 22:40:26 UTC
Permalink
Post by Sander Vesik
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially more so compilers that can work
their magic on bad old code.
Hey, the compiler writers had all their brain cells used up
trying to generate code for the x86 architecture.
You're gonna have to wait for a whole new generation
of compiler writers, which is gonna be tricky since
practically every university computer science program
is now nothing but web design and javascript :-).
a?n?g?e? (The little lost angel)
2006-09-10 02:30:05 UTC
Permalink
On Sat, 09 Sep 2006 22:40:26 GMT, Tom Horsley
Post by Tom Horsley
You're gonna have to wait for a whole new generation
of compiler writers, which is gonna be tricky since
practically every university computer science program
is now nothing but web design and javascript :-).
That's bull, it's web design and JAVA ;)
--
A Lost Angel, fallen from heaven
Lost in dreams, Lost in aspirations,
Lost to the world, Lost to myself
Scott Michel
2006-09-11 16:59:58 UTC
Permalink
Post by Sander Vesik
Post by Scott Michel
I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. IBM's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially more so compilers that can work
their magic on bad old code.
gcc doesn't really help you if you don't know what you're doing. Loop
unrolling comes to mind: can't tell you how many times I've had to
forcibly do loop unrolling where one would have expected gcc to do it
with "-O3 -funroll-loops".

There is some hope on the horizon, like LLVM from UIUC, which you'll
see under the hood in OS X "Leopard". I'm not sure if I'd expect to
see Cell SPU support in Java, although IBM will likely make that
happen. Sure, compilers can take hints, but it seems to me that it will
take an interpretive system, like LLVM or Python, to take the "bird's eye"
view and dispatch tasks to SPUs. Simple loop-level parallelism, while
common, is likely the wrong level of granularity.
Robert Redelmeier
2006-09-12 13:15:34 UTC
Permalink
Post by Scott Michel
gcc doesn't really help you if you don't know what you're doing.
Agreed `gcc` can be cantankerous.
Post by Scott Michel
Loop unrolling comes to mind: can't tell you how many times
I've had to forcibly do loop unrolling where one would have
expected gcc to do it with "-O3 -funroll-loops".
Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.

-- Robert
Scott Michel
2006-09-13 16:37:40 UTC
Permalink
Post by Robert Redelmeier
Post by Scott Michel
gcc doesn't really help you if you don't know what you're doing.
Agreed `gcc` can be cantankerous.
Post by Scott Michel
Loop unrolling comes to mind: I can't tell you how many times
I've had to do loop unrolling by hand where one would have
expected gcc to do it with "-O3 -funroll-loops".
Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU, and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" update 16x to get around a GFLOP
on single-precision numbers. By contrast, "-O3 -funroll-loops" and
plain "-O3" were very disappointing at around 40 MFLOPs or so (although
it did show that a GPU can outperform the AMD-64 and gcc by far.)
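For reference, the kind of hand-unrolling I mean looks roughly like this (a
sketch, not my measured code; the 4x factor shown here just illustrates the
pattern, which I extended to 16 copies per iteration):

```c
#include <stddef.h>

/* saxpy: y_new[i] = y_old[i] + alpha * x[i] */
void saxpy_naive(float *y_new, const float *y_old, const float *x,
                 float alpha, size_t n) {
    for (size_t i = 0; i < n; i++)
        y_new[i] = y_old[i] + alpha * x[i];
}

/* The same kernel unrolled 4x by hand; the independent statements let
 * the out-of-order core overlap the loads, multiplies, and adds. */
void saxpy_unrolled(float *y_new, const float *y_old, const float *x,
                    float alpha, size_t n) {
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        y_new[i]     = y_old[i]     + alpha * x[i];
        y_new[i + 1] = y_old[i + 1] + alpha * x[i + 1];
        y_new[i + 2] = y_old[i + 2] + alpha * x[i + 2];
        y_new[i + 3] = y_old[i + 3] + alpha * x[i + 3];
    }
    for (; i < n; i++)  /* remainder iterations */
        y_new[i] = y_old[i] + alpha * x[i];
}
```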
Robert Redelmeier
2006-09-13 17:59:22 UTC
Permalink
Post by Scott Michel
Post by Robert Redelmeier
Loop unrolling is not as useful on modern processors (I do not
consider the Pentium4 "modern") as it used to be: It dilutes the
I-cache and forces more fetches, and the cost of branching/looping
is relatively low with decent branch prediction and parallel
OoO exec. An unroll of 2x or 4x should be more than enough for
the ROB to chew on.
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU, and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" update 16x to get around a GFLOP
on single-precision numbers. By contrast, "-O3 -funroll-loops" and
plain "-O3" were very disappointing at around 40 MFLOPs or so (although
it did show that a GPU can outperform the AMD-64 and gcc by far.)
If I understand you correctly, the GPU benefitted from
the unrolling. I'm hardly surprised. But are you sure you
weren't comparing memory speeds more than processing speeds?
Try it on a working set size that fits inside L1.

40 MFLOPS corresponds to about 480 Mbyte/s, which might be
all that system can sustain for interleaved read-read-write.
GPUs (graphics processing units, I assume) have _much_ higher
bandwidth, at least to local memory.
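My back-of-the-envelope for that figure (my own sketch: each saxpy update
reads y_old[i] and x[i] and writes y_new[i], i.e. three 4-byte floats):

```c
/* Bandwidth implied by a measured saxpy rate.  Each update moves
 * 12 bytes (two float reads, one float write).  Counting the fused
 * update as one "op", 40 Mops/s works out to 480 Mbyte/s. */
double saxpy_bandwidth_mbps(double updates_per_sec) {
    const double bytes_per_update = 12.0;  /* 3 floats x 4 bytes */
    return updates_per_sec * bytes_per_update / 1e6;
}
```

If you count the multiply and add separately (2 flops per update), 40 MFLOPS
would instead imply 240 Mbyte/s; either way the kernel is plainly memory-bound.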

-- Robert
Scott Michel
2006-09-14 17:11:27 UTC
Permalink
Post by Robert Redelmeier
Post by Scott Michel
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU, and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" update 16x to get around a GFLOP
on single-precision numbers. By contrast, "-O3 -funroll-loops" and
plain "-O3" were very disappointing at around 40 MFLOPs or so (although
it did show that a GPU can outperform the AMD-64 and gcc by far.)
If I understand you correctly, the GPU benefitted from
the unrolling. I'm hardly surprised. But are you sure you
weren't comparing memory speeds more than processing speeds?
Try it on a working set size that fits inside L1.
40 MFLOPS corresponds to about 480 Mbyte/s, which might be
all that system can sustain for interleaved read-read-write.
GPUs (graphics processing units, I assume) have _much_ higher
bandwidth, at least to local memory.
The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise.) It was the AMD-64 for which I had to do the
manual unrolling.

gcc is not your friend.
Phil Armstrong
2006-09-14 18:50:55 UTC
Permalink
Post by Scott Michel
Post by Scott Michel
I still find it useful. I was doing some basic performance measurements
on saxpy to compare an AMD-64 to a GPU, and found I had to unroll the
"y_new[i] = y_old[i] + alpha * x[i]" update 16x to get around a GFLOP
on single-precision numbers. By contrast, "-O3 -funroll-loops" and
plain "-O3" were very disappointing at around 40 MFLOPs or so (although
it did show that a GPU can outperform the AMD-64 and gcc by far.)
[snip]
Post by Scott Michel
gcc is not your friend.
Was the loop not being unrolled at all by gcc? Did -funroll-all-loops
help?

Phil
--
http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt
Bernd Paysan
2006-09-15 12:22:15 UTC
Permalink
Post by Scott Michel
The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise.) It was the AMD-64 for which I had to do the
manual unrolling.
gcc is not your friend.
More than 10 years ago, when I was still a student, one of the PhD
students in the numerics faculty ran a matrix multiply competition for
the HP PA-RISC CPUs we had in our workstations. He estimated that
30 MFLOPs would be possible, even though a naive C loop got less than
1 MFLOP, and the HP Fortran compiler, with a built-in "extremely fast"
matrix multiplication, got no more than 10 MFLOPs.

After doing some experiments, I did indeed get 30 MFLOPs out of the thing,
by doing several levels of blocking. The inner loop kept a small submatrix
accumulator (as much as would fit; I think I got 5x5 into the registers), so
that several rows and columns could be multiplied together in one go
(saving a lot of loads and stores). The next blocking level was the (quite
large) cache of the PA-RISC machine, i.e. subareas of both matrices were
multiplied together.

I never got around to making the matrix multiplication routine general
purpose (the benchmark one could only multiply 512x512 matrices), but
today this sort of blocking is state of the art in high-performance
numerical libraries. GCC isn't your friend, because loop unrolling here is
really the wrong approach. The inner loop I used just did all the
multiplications for the 5x5 submatrix, and no further unrolling was
necessary.
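The cache-level blocking looks roughly like this (my own sketch of the idea,
not the original code; N and BS are made-up sizes, and the 5x5 register
accumulator is omitted for brevity):

```c
#include <stddef.h>

enum { N = 64, BS = 16 };  /* hypothetical sizes; the competition used 512x512 */

/* C += A * B, with each BS x BS block of the matrices kept cache-resident
 * while it is reused.  A register-level block (e.g. a 5x5 accumulator)
 * would go inside the two innermost loops. */
void matmul_blocked(double C[N][N], const double A[N][N], const double B[N][N]) {
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double a = A[i][k];     /* reused across the j loop */
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```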
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Greg Lindahl
2006-09-15 18:01:16 UTC
Permalink
Post by Bernd Paysan
I never got around to making the matrix multiplication routine general
purpose (the benchmark one could only multiply 512x512 matrices), but
today this sort of blocking is state of the art in high-performance
numerical libraries.
And in state of the art compilers. While calling Atlas or whatever is
generally fastest, it's nice for apps that don't use BLAS to get a
speedup from the compiler. And blocking is also useful for loops which
aren't matrix multiply; BLAS won't help you there, but the compiler can.

A tier-1 company did an evaluation of our compiler a long time ago and
had a fun comparison of three matrix multiplies: one cache-blocked, one
naive, and one which was almost naive but wrote the output in sequential
order.

With our compiler, the naive version was fastest, because the
cache-blocked version had picked the wrong blocking size for the CPU it
was running on. But there was only about a 10% difference among the three.

With competing compilers, the cache blocked version was fastest, and
the naive version was much slower.

-- greg
(employed by, not speaking for, QLogic/PathScale.)

Rick Jones
2006-09-07 17:28:03 UTC
Permalink
Might one describe a Cell as vector processor(s) on a chip?

rick jones
--
No need to believe in either side, or any side. There is no cause.
There's only yourself. The belief is in your own precision. - Jobert
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Greg Lindahl
2006-09-07 16:26:24 UTC
Permalink
Post by Yousuf Khan
I wonder what the rationale is behind using two different instruction
set architectures? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?
In the current generation, Cell is great on single precision but hard
to compile for. Opteron is easy to compile for, and does
double-precision, too.

Reduced cross-posting.

-- greg