What do you want from a parallel debugger?

KSG

2003-08-04 02:22:31 UTC

I've been investigating parallel debuggers recently, and one
of the big questions I have is: What do MPI users want in their
debugger? Are there any debuggers that people find satisfactory, or
are they all lacking in one thing or another?

In terms of features, obviously process and thread-level breakpoints
are invaluable, but what else beyond that?

What are visualizations that one would find useful?
* Message queues
* Timeline of events?
* Distributed object display?
Are features like "Debug insertion of barriers" useful?
What about keeping message logs with some partial ordering of
messages?
Additional things that may have never even crossed my mind?
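
A minimal sketch of what "message logs with some partial ordering" could mean in practice, assuming each rank stamps its events with a vector clock and piggybacks that clock on every message it sends. Vector clocks are one common technique for this, not something any particular debugger is known to use here; `RankLogger`, `LoggedEvent`, `happened_before` and `concurrent` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedEvent:
    rank: int
    label: str
    clock: tuple  # vector clock snapshot, one counter per rank


class RankLogger:
    """Hypothetical per-rank logger that stamps events with a vector clock."""

    def __init__(self, rank, size):
        self.rank = rank
        self.clock = [0] * size
        self.events = []

    def _record(self, label):
        # Every logged event advances this rank's own counter.
        self.clock[self.rank] += 1
        event = LoggedEvent(self.rank, label, tuple(self.clock))
        self.events.append(event)
        return event

    def local(self, label):
        return self._record(label)

    def send(self, label):
        # The returned event's clock would travel with the message.
        return self._record(label)

    def recv(self, label, sender_clock):
        # Merge the sender's knowledge, then count the receive itself.
        self.clock = [max(a, b) for a, b in zip(self.clock, sender_clock)]
        return self._record(label)


def happened_before(a, b):
    """True if event a causally precedes event b."""
    return a.clock != b.clock and all(x <= y for x, y in zip(a.clock, b.clock))


def concurrent(a, b):
    """True if neither event causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)
```

With three ranks, a send on rank 0 and its matching receive on rank 1 come out ordered, while an unrelated computation on rank 2 is reported as concurrent with both. That is the information a timeline view would need in order to avoid implying an order that was never enforced.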

I'm finding that debugging parallel apps is ridiculously difficult
compared to serial apps, and I'd like to see if there are things that
could be done to make it easier.

Thanks in advance!

KSG

Alexander Supalov

2003-08-04 07:25:38 UTC

Hi!

Post by KSG
I've been doing investigation on parallel debuggers recently and one
of the big questions I have is: What do MPI users want in their
debugger?

I hope I'm not too wide of the mark in saying that most people would
love to have approximately what TotalView provides, but at a fraction of
the cost and with a tad less clumsy GUI. Any (properly functioning)
extra features are good, but if a new debugger fails on either of the
above two counts, it won't fly, or so I think.

Best regards.

Alexander

--
Dr Alexander Supalov
Senior Software Engineer
--------------------------------------------------------------------
//// pallas / A Member of the ExperTeam Group
Pallas GmbH / Hermuelheimer Str. 10 / 50321 Bruehl / Germany
***@pallas.com / www.pallas.com
Tel +49-2232-1896-34 / Fax +49-2232-1896-29
--------------------------------------------------------------------

KSG

2003-08-04 16:40:31 UTC

Post by Alexander Supalov
Hi!

Post by KSG
I've been doing investigation on parallel debuggers recently and one
of the big questions I have is: What do MPI users want in their
debugger?

I didn't realize that price was such a big concern -- then again I
don't actually know what TotalView's price is (and oddly it's not on
their webpage).

I must admit I'm surprised to hear your statement (that TotalView
approximates what most people want), as some people have told me in the
past that parallel debugging is not in a great state, although that
was before I started pressing for details as to why.

And a clumsy GUI can destroy any experience (either a GUI that
responds too slowly, has poor layout, or simply feels like it's always
about to break). I agree that this is a given for any good GUI
application.

KSG

RxR

2003-08-04 23:17:45 UTC

TotalView charges by the number of machines. In a PC Beowulf
cluster you pay for x machines even though it's essentially just one.
I once tried an evaluation copy and found it OK once you get used
to it, but couldn't justify the high price they quoted.

It would be nice if a parallel debugger could also do profiling to help
identify where bottlenecks occur.

Bruce Scott TOK

2003-08-06 17:12:31 UTC

RxR wrote:

|> TotalView charges by the number of machines. In a PC Beowulf
|> cluster you pay for x machines even though it's essentially just one.
|> I once tried an evaluation copy and found it OK once you get used
|> to it, but couldn't justify the high price they quoted.

Charge per machine is naturally a killer. Sounds stupid as well. Are
they arrogant or just ignorant? Time was, someone from TV would peek in
these newsgroups and put questions in just this sort of direction...

That was before the current golden age of clusters, though.

|> It would be nice if a parallel debugger could also do profiling to help
|> identify where bottlenecks occur.

Lots of platforms have in-house stuff for this (example, IBM).

--
cu,
Bruce

drift wave turbulence: http://www.rzg.mpg.de/~bds/

Crash N Burn

2003-08-08 22:19:14 UTC

Post by RxR
TotalView charges by the number of machines. In a PC Beowulf
cluster you pay for x machines even though it's essentially just one.

Other companies selling "real software" to corporations that cannot or
will not employ legions of systems developers do some of the following.
When you get out of the free/university/research world and take on P&L
responsibility, life looks very different regarding software. The following
are all actual software examples:

A hierarchical storage manager - $US200,000 up front for ~200 TB under
management, plus a 20% per year maintenance charge. It is based on the TB
under management - if you make 2 copies of archival tapes, that counts double.
Was not the purpose of 2 copies to protect in case
one got trashed? I.e., the above is really a 100 TB license. They tell me this
provides value. Another package with 75% of the capability (the 75%
I really need) that works just fine is only $30,000 all up. You can guess
which we use. :-)

A tape library company - you get 350 mounts per day free and then must upgrade
the operating license as well as the maintenance to load more. Ever buy a car
where a taxi meter started running at 12,000 miles a year, and if you did not insert
your Visa card the car would stop? (The library does not really stop, but your friendly
sales staff will call the matter to your attention.)

Compiler on a box with 32 CPUs - $US3000. Same box operated as 8
LPARs, $US12,000. Now operate it as 16 LPARs and it goes to $US48,000.
Truly arbitrary! The justification: each LPAR has its own OS and disk and is thus
a machine.

A math library on a very powerful 16 CPU system, $US8000 per annum for the
research license, but change that to a machine with ~150 CPUs that are each only
25% as powerful and pay closer to $US200,000. Priced per CPU regardless of
performance. My users found an acceptable alternative for but a small fraction
of that, after substantial prodding. Not sure if we are saving $8K or $200K when
all is said and done...and yup, the alternative has similar average performance and
all the routines my users need.

As another post noted, it is all negotiable - in many cases very negotiable indeed,
especially when you throw the software out of your site. I tell the salespeople - so
you tried to rape my budget and I threw you out, so now you are offering me a deal
to keep you? So we are going to do this again next year, right? Good bye.

Etnus is not out of line with their industry, but their industry has become as arbitrary
as the airlines in what and how they charge.

In the 1980s you paid $US30 million for a supercomputer and maybe $5 million
for software. Now you pay $US5 million for the supercomputer and $30 million
for software... what a great savings.... I wish there was a group called
comp.stupid.vendor.practices.venting <G>

Alexander Supalov

2003-08-05 08:21:20 UTC

Hi!

Post by KSG
I must admit I'm surprised to hear your statement (that Totalview
approximates what most people want) as some people have told me in the
past that parallel debugging is not in a great state, although this
was a time before I was pressing for more details as to why.

Debugging in general is still more of an art than science, and as such
it will never be in a particularly great shape on average. But one can
fare pretty well with what TV has to offer once one gets used to the
logic of this tool.

If you ask me, a debugger is anyway (or should be) a tool of last
resort. First of all, there should be no bugs to start with (TM). If
they just dare to creep in, the built-in program tracing facility should
help in eliminating them at the source. And only if something goes badly
astray (e.g., core is dumped mysteriously) may a debugger be useful.
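
A minimal sketch of what such a built-in tracing facility could look like, assuming each process knows its own rank and the trace can be switched off for production runs. `Tracer` and `traced` are hypothetical names, and the rank is passed in explicitly rather than taken from any particular MPI library.

```python
import functools
import sys


class Tracer:
    """Hypothetical built-in trace facility, switched on or off per run."""

    def __init__(self, rank, enabled=True, stream=None):
        self.rank = rank
        self.enabled = enabled
        self.stream = stream if stream is not None else sys.stderr
        self.depth = 0

    def emit(self, message):
        # Prefix every line with the rank so interleaved output stays readable.
        if self.enabled:
            indent = "  " * self.depth
            self.stream.write(f"[rank {self.rank}] {indent}{message}\n")

    def traced(self, func):
        """Decorator that logs entry, exit and exceptions of func."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            shown = ", ".join(map(repr, args))
            self.emit(f"enter {func.__name__}({shown})")
            self.depth += 1
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                self.depth -= 1
                self.emit(f"raise {func.__name__}: {exc!r}")
                raise
            self.depth -= 1
            self.emit(f"leave {func.__name__} -> {result!r}")
            return result
        return wrapper
```

Decorating the routines of interest with `@tracer.traced` and constructing the tracer with `enabled=False` by default keeps the facility in the code at little cost, so it is already there when a run goes wrong.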

There remains a large gray area of synchronization- and timing-related
problems that can only be attacked using (nearly) nonintrusive
monitoring techniques. And on the other side of the spectrum, some gross
algorithmic errors can sometimes be tracked down with a little help from
the MPI profiling tools.

Post by KSG
And a clumsy GUI can destroy any experience (either a GUI that
responds too slowly, has poor layout, or simply feels like it's always
about to break). I agree that this is a given for any good GUI
application.

The new TV's Motif interface is quite stable (I've not seen it breaking
yet, that is). It's just very heavyweight and sometimes quite slow even
on a very fast machine. And one has to learn one's way around before
some of the defaults and menu items start making sense.

But otherwise it's a nice product, the only huge drawbacks of which are
the out-of-touch pricing policy and a couple of funny choices about the
supported platforms. I just hope they know what kind of entry
opportunity they open up by persisting in their archaic decisions.

Best regards.

Alexander

Joachim Worringen

2003-08-05 08:39:41 UTC

Seen from another perspective: HPC business is a machine for
turning tax money into subsidies for one's own homegrown
industry. I mean, why are the Japanese vector computers
used solely in Japan? Do Japanese institutions have vector
problems to solve while the rest of the world has scalar
problems to solve?

Well, I have to object here. While it is true that the "concentration" of
vector computers is higher in Japan than in other places of the world, it
can be observed that the better suitability of vector computers for a large
class of HPC problems is increasingly recognized outside of Japan, too
(leading to installations of new systems like recently in Australia,
England, Brazil and also Germany).

And TotalView is available for these systems, too... but don't ask about
pricing.

Joachim

--
reply to joachim at domain ccrl-nece dot de

Opinion expressed is personal and does not constitute
an opinion or statement of NEC Laboratories.

Bruce Scott TOK

2003-08-06 17:10:32 UTC

KSG wrote:

|> Alexander Supalov <***@pallas.com> wrote in message news:<***@pallas.com>...

|> > I hope I'm not too wide of the mark in saying that most people would
|> > love to have approximately what TotalView provides, but at a fraction of
|> > the cost and with a tad less clumsy GUI. Any (properly functioning)
|> > extra features are good, but if a new debugger fails on either of the
|> > above two counts, it won't fly, or so I think.

That's about it... TV will do the simple stuff: you can check that things got
to the right cells on the right PE, and see exactly where things hang when
they hang. Beyond that, though, I still have to use write statements, so
if I had to _pay_ for TV I wouldn't (here, we have or had some
sort of site license).

|> I didn't realize that price was such a big concern -- then again I
|> don't actually know what Totalview's price is (and oddly it's not on
|> their webpage).

They are apparently _very_ spendy...

|> I must admit I'm surprised to hear your statement (that Totalview
|> approximates what most people want) as some people have told me in the
|> past that parallel debugging is not in a great state, although this
|> was a time before I was pressing for more details as to why.
|>
|> And a clumsy GUI can destroy any experience (either a GUI that
|> responds too slowly, has poor layout, or simply feels like it's always
|> about to break). I agree that this is a given for any good GUI
|> application.

Totalview is only any good when you can run it on the platform your code
runs on, and that requires a local interactive login possibility not
everyone has.

I think for really serious codes a lot of people still use write
statements...

--
cu,
Bruce

drift wave turbulence: http://www.rzg.mpg.de/~bds/

Bruce Scott TOK

2003-08-11 13:06:22 UTC

KSG wrote:

|> Bruce Scott TOK <Use-Author-Supplied-Address-Header@[127.1]> wrote in message news:<***@ipp.mpg.de>...
|>
|> > Totalview is only any good when you can run it on the platform your code
|> > runs on, and that requires a local interactive login possibility not
|> > everyone has.
|>
|> So are you saying you'd like to debug traces of past runs? I haven't
|> really considered debugging non-interactive jobs, but maybe this is
|> something that people would like to debug.

No, I like to debug interactively, walking through the code (of course,
with problem sizes small enough to do it) and looking at the various
array pieces as I go. This requires interactive login privilege, which
not everyone on such systems has.

--
cu,
Bruce

drift wave turbulence: http://www.rzg.mpg.de/~bds/

Adrian

2003-08-14 05:38:14 UTC

Most of the time, when I'm debugging I need something
lightweight, fast and reliable. That generally means
no GUI. In fact, the tool I found most useful by a long
shot was "debugview" on Unicos/mk. Too bad none of
the current parallel systems have anything that comes
close to that.

Please don't add any more features before you can reliably
launch and debug 10^3 tasks from the command-line or
examine the core from a similar job.
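
One way a command-line tool might stay readable at 10^3 tasks is to group ranks that share an identical call stack instead of printing one stack per task. The sketch below assumes the per-rank stacks have already been collected by some other means; `group_ranks_by_stack` and `compress` are hypothetical helpers, not part of any existing debugger.

```python
from collections import defaultdict


def group_ranks_by_stack(stacks):
    """Map each distinct call stack to the sorted list of ranks showing it.

    `stacks` is a hypothetical dict: rank -> sequence of frame names,
    outermost frame first.
    """
    groups = defaultdict(list)
    for rank, frames in stacks.items():
        groups[tuple(frames)].append(rank)
    return {frames: sorted(ranks) for frames, ranks in groups.items()}


def compress(ranks):
    """Render a non-empty sorted rank list as ranges, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r != prev + 1:
            # The current run ended at prev; emit it and start a new one.
            parts.append(f"{start}-{prev}" if start != prev else f"{start}")
            start = r
        prev = r
    parts.append(f"{start}-{prev}" if start != prev else f"{start}")
    return ",".join(parts)
```

If 999 tasks sit in the same receive and one task sits in a send, the summary collapses to two lines, which is usually enough to see which task to look at first.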

It seems to me that the best debugger will come from
a system that recognizes a parallel job as a true
OS entity (not some loose collection of processes).
Ooops, I guess that means a true distributed OS, no
harm in dreaming....

Adrian
