Oprofile callgraph question

Hi all,

I am profiling gnuradio-3.1.1/gr-atsc/src/python/ example.
I’ve already collected seinfeld.dat (the captured data from fox TV).

opcontrol --setup --vmlinux=my_path_to_vmlinux/vmlinux
opcontrol --start
opcontrol --reset

run atsc example ( *.py files)

xterm -e ./interp.py seinfeld.dat &
xterm -e ./xlate.py
xterm -e ./btl-fsd.py
xterm -e ./viterbi-out.py seinfeld.mpg

opcontrol --save=my_session_name
opcontrol --dump
opcontrol --stop
opreport -l session:my_session_name

So I do get the seinfeld.mpg correctly.
And I can actually play with xine.
Also I do get the profiled result.

Here are top 10 application and symbols that consumed resources.

atsci_equalizer_lms 26.466%
.loop from libgnuradio-core.so.0.0.0 12.52%
atsci_single_viterbi 10.5%
.cleanup from libgnuradio-core.so.0.0.0 6.7%
__ieee754_atan2 from libm-2.5.so 5.53 %
.loop2 from libgnuradio-core.so.0.0.0 5.5%
memcpy from libc-2.5 so 3.03%
gr_freq_xlating_fir_filter_ccf 2.68%
atsc_fpll 1.8%


After this point, I wanted to find out who called .loop and .cleanup and
.loop2, etc.
So I tried callgraph function for oprofile.

I recomplied gnuradio code with
./configure --with-gprof (since this sets -pg option)

And then did
opcontrol --setup --vmlinux=my_path_to_vmlinux/vmlinux
opcontrol --start
opcontrol --reset
opcontrol --callgraph=5

run atsc example ( *.py files)

xterm -e ./interp.py seinfeld.dat &
xterm -e ./xlate.py
xterm -e ./btl-fsd.py
xterm -e ./viterbi-out.py seinfeld2.mpg

opcontrol --save=my_session_name
opcontrol --dump
opcontrol --stop
opreport -cl --demangle=smart session:my_session_name

I do get seinfeld2.mpg correctly.
But opreport does not give me any depth.
It just says every symbol calls [self]

Do you have any idea what I should do?
The machine I tried is 64bit server with 4 cores.
for which uname -a gives
Linux 2.6.18-53.1.19.el5 #1 SMP Tue Apr 22 03:01:10 EDT 2008 x86_64
x86_64
x86_64 GNU/Linux

Any suggestion will be greatly appreciated.

Thanks!

Mikie

On Fri, Jun 20, 2008 at 04:32:35PM -0400, Mikyung H. wrote:

xterm -e ./interp.py seinfeld.dat &
And I can actually play with xine.
memcpy from libc-2.5 so 3.03%
gr_freq_xlating_fir_filter_ccf 2.68%
atsc_fpll 1.8%

.loop and .loop2 are the inner loops of the SSE filter kernels.
They’ll be called from gr_fir_XXX.cc, which is called from
gr_fir_filter_XXX.cc

One of my developer branches has these unfortunate labels renamed to
something more enlightening. Those changes will eventually make it to
the trunk.

BTW, I think that you’ll find the ATSC code in the trunk is
considerably more up-to-date than that in 3.1.1. Chuck Swiger has
been working on it.

Eric

Thanks eric,
Well, yes it will be helpful to have better symbol names,
but the core problem is doing callgraph.
Is there a way that I can figure out .loop and .loop2 is called?

How do I know that gr_fit_xxx.cc, gr_fir_filter_xxx.cc calls .loop and
.loop2?

Thanks!

Mikie

On Fri, Jun 20, 2008 at 06:03:51PM -0400, Mikyung H. wrote:

Thanks eric,
Well, yes it will be helpful to have better symbol names,
but the core problem is doing callgraph.
Is there a way that I can figure out .loop and .loop2 is called?

How do I know that gr_fit_xxx.cc, gr_fir_filter_xxx.cc calls .loop and
.loop2?

Thanks!
Mikie

Uhhh, because I wrote the code and told you :wink:

Additional challenges in the code you’re looking at are that there are
multiple symbols named .loop and .loop2, and that oprofile’s default
report merges them.

To find them, grep in gnuradio-core/src/lib/filter/*.S

You could also use a static analysis tool (grep would work) to find
all the potential callers. I haven’t had any success with oprofile
and the call graph, but I haven’t spend more than a few minutes with
it. If you use the gprof options, you’re going to be running
instrumented code.

I’d ask over on the oprofile mailing list. If you get an answer,
please let us know.

Eric

On Sat, Jun 21, 2008 at 11:01:22AM -0400, Charles S. wrote:

and allows the pll to run at 16MHz instead of 19.2MHz. An open question
(beyond me at the moment) is what are the optimal interpolator taps in
the bit_timing_loop for 16MHz? I’ll check in what I have

Please do this on a branch.

and people can work with the taps since they are elsewhere in
gnuradio-core. I made some taps:

static const int NTAPS = 12;
static const int NSTEPS = 128;
static const double BANDWIDTH = 0.25;

that work but that’s just ‘twiddling knobs and got a picture’ , I can’t
rigorously prove those are the best numbers :wink:

You’re still basically hosed if you leave BANDWIDTH at 0.25.

Our bandwidth of interest is 6MHz (TV channel) and you want to run at
16MHz, you’d need it to be 0.375. With regard to NTAPS, you’d want to
look at the error between the ideal and actual transfer function over
the frequency range of interest. IIRC the code in gen_interpolator_taps
computes this as its metric. Keep NTAPS a a multiple of 4 since the
SIMD code is going to round it up to that anyway (padding with zeros).

Didn’t you try the J.O. Smith resampler? How did that work out?

Please also consider testing out Achilleas’ suggestion:

further processing. I believe this works (i didn’t see any point where
there is a possibility for aliasing) and it can result in some savings in
complexity without sacrifising performance.
Hopefully I didn’t miss anything big…

Take a look at the block diagram here:

http://www.eecs.umich.edu/~anastas/gnuradio/pll.png

Achilleas

There’s a simple modification that can be made to the EQ to
drastically the CPU requirements. That would be to use the fast
fir code in the body of the frame, and run the slow one (that’s doing
the adaptation) only on the training symbols. Please don’t make this
change until the basic architecture of the CPLL is sorted out and that
the receiver is working well even on so-so signals.

Also, we may actually want to drive up the EQ’s CPU requirements by
going to a decision feedback strategy.

40527 1.8594 libgnuradio-core.so.0.0.0 decode_rs_char
40256 1.8469 libgnuradio-core.so.0.0.0
gr_single_pole_iir<std::complex, std::complex,
double>::filter(std::complex)
36013 1.6523 _atsc.so atsc_cpll::work(int,
std::vector<void const*, std::allocator<void const*> >&,
std::vector<void*, std::allocator<void*> >&)
31789 1.4585 libm-2.6.so __ieee754_rem_pio2f
29662 1.3609 libm-2.6.so sincosf

–Chuck

Eric

On Fri, 2008-06-20 at 13:45 -0700, Eric B. wrote:

considerably more up-to-date than that in 3.1.1. Chuck Swiger has
been working on it.

Eric

Ah, I still need to check in the complex pll stuff that gets rid of

gr_freq_xlating_fir_filter_ccf 2.68%

and allows the pll to run at 16MHz instead of 19.2MHz. An open question
(beyond me at the moment) is what are the optimal interpolator taps in
the bit_timing_loop for 16MHz? I’ll check in what I have and people can
work with the taps since they are elsewhere in gnuradio-core. I made
some taps:

static const int NTAPS = 12;
static const int NSTEPS = 128;
static const double BANDWIDTH = 0.25;

that work but that’s just ‘twiddling knobs and got a picture’ , I can’t
rigorously prove those are the best numbers :wink:

Using cpll the oprofile top 10 looks like:

samples % app name symbol name
535275 24.5581 _atsc.so
atsci_equalizer_lms::filter1(float const*)
467354 21.4419 libgnuradio-core.so.0.0.0 .loop1
277648 12.7383 _atsc.so
atsci_single_viterbi::decode(float)
246879 11.3267 libgnuradio-core.so.0.0.0 .loop1
64547 2.9614 libgnuradio-core.so.0.0.0 gr_fast_atan2f(float, float)
40527 1.8594 libgnuradio-core.so.0.0.0 decode_rs_char
40256 1.8469 libgnuradio-core.so.0.0.0
gr_single_pole_iir<std::complex, std::complex,
double>::filter(std::complex)
36013 1.6523 _atsc.so atsc_cpll::work(int,
std::vector<void const*, std::allocator<void const*> >&,
std::vector<void*, std::allocator<void*> >&)
31789 1.4585 libm-2.6.so __ieee754_rem_pio2f
29662 1.3609 libm-2.6.so sincosf

–Chuck

Mikyung H. wrote am 2008-06-21 00:03:

Thanks eric,
Well, yes it will be helpful to have better symbol names,
but the core problem is doing callgraph.
Is there a way that I can figure out .loop and .loop2 is called?

How do I know that gr_fit_xxx.cc, gr_fir_filter_xxx.cc calls .loop and
.loop2?

There is some option to let oprofile comment the code with the
statistics numbers. Then you can ‘grep’ over the tree, find all calls
and compare the call statistics for each of them. From the context it
might be easier to find the right called object if names collide.

That’s how I did it, but I just played with it a little, and that was 2
years ago.

Patrick

Engineers motto: cheap, good, fast: choose any two
Patrick S.
Student of Telematik, Techn. University Graz, Austria

In r8669 in the trunk, the .loop1 and .loop2 labels in the SIMD code
have been renamed to include the name of the file they’re contained
in. This should make opreport’s default output more useful.

Eric

Thanks for the quick reflection, Eric!

Mikie

On Sat, 2008-06-21 at 11:07 -0700, Eric B. wrote:

On Sat, Jun 21, 2008 at 11:01:22AM -0400, Charles S. wrote:

the bit_timing_loop for 16MHz? I’ll check in what I have

Please do this on a branch.

Ok.

static const double BANDWIDTH = 0.25;

You’re still basically hosed if you leave BANDWIDTH at 0.25.
Our bandwidth of interest is 6MHz (TV channel) and you want to run at
16MHz, you’d need it to be 0.375.

Oh, so thats what it means! (Remember the thing about frequency in
radians/sample :wink:

With regard to NTAPS, you’d want to
look at the error between the ideal and actual transfer function over
the frequency range of interest. IIRC the code in gen_interpolator_taps
computes this as its metric. Keep NTAPS a a multiple of 4 since the
SIMD code is going to round it up to that anyway (padding with zeros).

Didn’t you try the J.O. Smith resampler? How did that work out?

Yes - I make a block called gr_samplerate_ff linked against Erik de
Castro’s libsamplerate and it worked fine with audio - dial tone at
bizzare sample rates came out sounding normal to the ear - but it just
did not work for the bit_timing_loop. What with 1) an external
dependancy, 2) it would use up to 25% cpu just at audio rates on a
1.5Ghz Pentium-M, I just moved on to trying new taps for your existing
fractional resampler.

–Chuck

BTW, any suggestion for finding out who caused calling the symbols such
as
__ieee754_atan2
and __kernel_cosf, __kernel_sinf, __ieee754_rem_pio2f from libm-2.5.so?

Also .cleanup?

Thanks!
Mikie

On Mon, Jun 23, 2008 at 05:42:16PM -0400, Mikyung H. wrote:

Also .cleanup?

I just make it a local symbol. You shouldn’t see it in opreport any
more.

Eric

On Mon, Jun 23, 2008 at 05:42:16PM -0400, Mikyung H. wrote:

BTW, any suggestion for finding out who caused calling the symbols such as
__ieee754_atan2
and __kernel_cosf, __kernel_sinf, __ieee754_rem_pio2f from libm-2.5.so?

What kind of a machine are you running on?

The first is of course the guts of atan2. I’m surprised to see it.
I think the only place it could be getting called is by way of the
complex arg(z) function. We’re replace all the others with calls to
gr_fast_atan2

[eb@octo trunk]$ dgs . '\barg[ \t](’
./gnuradio-core/src/lib/general/gr_remez.cc: * Change: ColumnVector
x=arg(i).vector_value();
./gnuradio-core/src/lib/general/gr_remez.cc: * to: ColumnVector
x(arg(i).vector_value());
./gnuradio-core/src/lib/general/gr_mpsk_receiver_cc.cc: return
-arg(sample
conj(d_constellation[d_current_const_point]));
./gnuradio-core/src/lib/general/gr_ofdm_frame_sink.cc: float angle =
arg(accum_error);
./gnuradio-core/src/lib/general/gr_complex_to_xxx.cc: // out[i] =
std::arg (in[i]);
./gnuradio-core/src/lib/general/gr_quadrature_demod_cf.cc: // out[i]
= d_gain * arg (product);

The last 3 are from the guts of cosf and sinf.

What’s your test case?

Also .cleanup?

.cleanup is a just a label toward the bottom of the SIMD code.
It’s unlikely you’re getting any serious amount of hits there, it’s
not in a loop.

Eric

On Mon, Jun 23, 2008 at 06:53:07PM -0400, Mikyung H. wrote:

uname -a gives
2.6.18-53.1.19.el5 #1 SMP Tue Apr 22 03:01:10 EDT 2008 x86_64 x86_64 x86_64
GNU/Linux

When inserting the output of opreport in email, can you please keep
your mail tool from wrapping it?

I guess I need to study the mapping from python code to c files and
assembly codes. What would you suggest doing, reading, etc in order
to figure out this mapping and skeleton of ofdm example? My
eventual goal is to firgure out which piece of code is taking most
of the resources for ofdm.

Try using grep and your favorite editor, or tags files and your
favorite editor. The python to C++ naming is generally
straight-forward:
gr.foo -> gr_foo.cc.

Eric

On Mon, Jun 23, 2008 at 6:37 PM, Eric B. [email protected] wrote:

On Mon, Jun 23, 2008 at 05:42:16PM -0400, Mikyung H. wrote:

BTW, any suggestion for finding out who caused calling the symbols such
as
__ieee754_atan2
and __kernel_cosf, __kernel_sinf, __ieee754_rem_pio2f from libm-2.5.so?

What kind of a machine are you running on?

uname -a gives
2.6.18-53.1.19.el5 #1 SMP Tue Apr 22 03:01:10 EDT 2008 x86_64 x86_64
x86_64
GNU/Linux

The first is of course the guts of atan2. I’m surprised to see it.
I think the only place it could be getting called is by way of the
complex arg(z) function. We’re replace all the others with calls to
gr_fast_atan2

Yeah, I think I used 3.1.1 for atsc example.

./gnuradio-core/src/lib/general/gr_complex_to_xxx.cc: // out[i] =
std::arg (in[i]);
./gnuradio-core/src/lib/general/gr_quadrature_demod_cf.cc: // out[i] =
d_gain * arg (product);

The last 3 are from the guts of cosf and sinf.

What’s your test case?

Besides, atsc example, I tried profiling benchmark_ofdm_rx/tx.py

  • profiling benchmark_ofdm_rx.py now gives

samples % app name symbol name
226875 22.8492 libgnuradio-core.so.0.0.0
.fcomplex_dotprod_sse64_loop2
79310 7.9875 libgnuradio-core.so.0.0.0 .float_dotprod_sse64_loop2
61142 6.1578 libm-2.5.so __ieee754_rem_pio2f
57980 5.8393 libm-2.5.so __kernel_rem_pio2f
31631 3.1856 libgnuradio-core.so.0.0.0
gr_fir_ccf_simd::filter(std::complex const*)
30052 3.0266 libm-2.5.so __kernel_sinf
26588 2.6778 libm-2.5.so sincosf
25214 2.5394 libm-2.5.so __kernel_cosf
23153 2.3318 libgnuradio-core.so.0.0.0 gr_fast_atan2f(float, float)
22000 2.2157 libgnuradio-core.so.0.0.0
gr_single_threaded_scheduler::main_loop()
20853 2.1002 libgnuradio-core.so.0.0.0 .cleanup
13089 1.3182 libmozjs.so (no symbols)
12536 1.2625 libgnuradio-core.so.0.0.0
gr_ofdm_frame_acquisition::coarse_freq_comp(int, int)
12384 1.2472 libgnuradio-core.so.0.0.0 gr_fft_filter_ccc::work(int,
std::vector<void const*, std::allocator<
void const*> >&, std::vector<void*, std::allocator<void*> >&)
12080 1.2166 libm-2.5.so __ieee754_hypotf
11982 1.2067 libgnuradio-core.so.0.0.0
gr_ofdm_frame_acquisition::general_work(int, std::vector<int, std::al
locator >&, std::vector<void const*, std::allocator<void const*>

&,
std::vector<void*, std::allocator<void*

&)
11453 1.1535 libgnuradio-core.so.0.0.0
gr_fir_ccf_generic::filterN(std::complex, std::complex
const
, unsigned long)
10726 1.0802 libgnuradio-core.so.0.0.0 gr_multiply_cc::work(int,
std::vector<void const*, std::allocator<voi
d const*> >&, std::vector<void*, std::allocator<void*> >&)
10250 1.0323 libgnuradio-core.so.0.0.0
gr_peak_detector_fb::work(int,
std::vector<void const*, std::allocato
r<void const*> >&, std::vector<void*, std::allocator<void*> >&)
10135 1.0207 libfftw3f.so.3.1.2 t2fv_32

  • for profiling benchmark_ofdm_tx.py gives:

21303 10.8694 libpython2.4.so.1.0 (no symbols)
18889 9.6377 vmlinux-2.6.18-53.1.19.el5 mwait_idle
8283 4.2262 libgnuradio-core.so.0.0.0
gr_single_threaded_scheduler::main_loop()
6579 3.3568 libgnuradio-core.so.0.0.0
gr_multiply_const_cc::work(int,
std::vector<void const*, std::allocator<void const*> >
&, std::vector<void*, std::allocator<void*> >&)
6266 3.1971 vmlinux-2.6.18-53.1.19.el5 apic_timer_interrupt
5521 2.8170 ehci_hcd (no symbols)
5106 2.6052 libc-2.5.so memcpy
5011 2.5568 libc-2.5.so strcmp
4356 2.2226 libfftw3f.so.3.1.2 t2bv_64
3691 1.8833 libcrypto.so.0.9.8b (no symbols)
3471 1.7710 libgnuradio-core.so.0.0.0
gr_ofdm_mapper_bcv::work(int,
std::vector<void const*, std::allocator<void const*> >&,
std::vector<void*, std::allocator<void*> >&)
2769 1.4128 vmlinux-2.6.18-53.1.19.el5 schedule
2675 1.3649 oprofiled (no symbols)
2534 1.2929 _usrp1.so
usrp1_sink_c::copy_to_usrp_buffer(std::vector<void const*,
std::allocator<void const*>

&, int, int, int&, void*, int, int&)
2403 1.2261 libX11.so.6.2.0 (no symbols)
2212 1.1286 libgnuradio-core.so.0.0.0
gr_ofdm_cyclic_prefixer::work(int, std::vector<void const*,
std::allocator<void const*

&, std::vector<void*, std::allocator<void*> >&)
1978 1.0092 libgnuradio-core.so.0.0.0
gr_block_detail::input(unsigned
int)

Also .cleanup?

.cleanup is a just a label toward the bottom of the SIMD code.
It’s unlikely you’re getting any serious amount of hits there, it’s
not in a loop.

Eric

I guess I need to study the mapping from python code to c files and
assembly
codes.
What would you suggest doing, reading, etc in order to figure out this
mapping and skeleton of ofdm example?
My eventual goal is to firgure out which piece of code is taking most of
the
resources for ofdm.

Thanks!

Mikie