Profiling on ARM

Hi,

I want to profile my application running on the ARM core of the
BeagleBoard. I am using Code Sourcery's (free) GNU toolchain to
cross-compile my app. Would the best way to profile be GNU gprof,
which Code Sourcery provides in the toolchain? Are there other
(recommended) ways of profiling?

Also, currently I have my app compiled and tested on top of Ubuntu
running on the ARM. Since the intended target is an embedded platform,
would it be better to profile the app running on an embedded OS (like
Angstrom)?

Thanks & cheers,
Sunny

On 10 Feb 2009, at 04:14, sundeeepgupta@gmail.com wrote:

Hi,

I want to profile my application running on the ARM core on
BeagleBoard. I am using Code Sourcery's (free) GNU toolchain to cross-
compile my app. I wonder if the best way to profile would be using
GNUProf, which Code Sourcery provides in the toolchain? Are there
other (recommended) ways of profiling?

Also, currently I have my app compiled and tested on top of Ubuntu
running on the ARM. Since the intended target is an embedded platform,
would it be better to profile the app running on an embedded OS (like
Angstrom)?

The Angstrom demos have oprofile installed by default.

regards,

Koen

I saw some emails about oprofile having counter issues on ARMv7.
Does anyone have a good summary of the issue and how it impacts
oprofile on the BeagleBoard?

Philip

Without going into the details, here is a summary of the problem.

Under certain conditions, the PMU of the Cortex-A8 core (at least in
the r1pX revisions used on the BeagleBoard) gets into a bad state:
interrupts get disabled and oprofile stops collecting samples. If you
are just profiling a number-crunching application that makes few
system calls, you are unlikely to encounter it. On the other hand,
repeatedly calling the 'gettimeofday' function in a tight loop, for
example, triggers the problem almost instantly.

Best regards,
Siarhei Siamashka

On 10 Feb 2009, at 14:45, Siarhei Siamashka wrote:

for example repeatedly calling 'gettimeofday' function in a
tight loop triggers the problem almost instantly.

Doesn't gettimeofday kinda kill performance anyway?

regards,

Koen

This is just a test case to reproduce the problem. I don't feel much
relieved knowing that it takes a lot longer (dozens of seconds or even
several minutes instead of a fraction of a second) to break when
profiling real applications.

Best regards,
Siarhei Siamashka

Koen Kooi <koen@beagleboard.org> writes:

On 10 Feb 2009, at 14:45, Siarhei Siamashka wrote:

for example repeatedly calling 'gettimeofday' function in a
tight loop triggers the problem almost instantly.

Doesn't gettimeofday kinda kill performance anyway?

Ever done an strace on firefox?

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:

Hi,

Siarhei Siamashka wrote:

Under certain conditions, the PMU of the Cortex-A8 core (at least in
the r1pX revisions used on the BeagleBoard) gets into a bad state:
interrupts get disabled and oprofile stops collecting samples. If you
are just profiling a number-crunching application that makes few
system calls, you are unlikely to encounter it. On the other hand,
repeatedly calling the 'gettimeofday' function in a tight loop, for
example, triggers the problem almost instantly.

OMG, that makes oprofile quite useless for a lot of applications,
IMO. :(

a) Is there a way of detecting that the issue has occurred?

b) Is a workaround (e.g. kernel patch) possible to fix this issue reliably?

Regards
Robert

Hi,

Siarhei Siamashka wrote:

Under certain conditions, the PMU of the Cortex-A8 core (at least in
the r1pX revisions used on the BeagleBoard) gets into a bad state:
interrupts get disabled and oprofile stops collecting samples. If you
are just profiling a number-crunching application that makes few
system calls, you are unlikely to encounter it. On the other hand,
repeatedly calling the 'gettimeofday' function in a tight loop, for
example, triggers the problem almost instantly.

OMG, that makes oprofile quite useless for a lot of applications,
IMO. :(

a) Is there a way of detecting that the issue has occurred?

It is possible to check the PMU state periodically (the PMNC or CNTENS
registers, for example). If it unexpectedly changes to something else
(resets to zero), then the PMU is broken.
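
A minimal sketch of the periodic check described above, in C. The CP15 encodings used here (PMNC = c9, c12, 0 and CNTENS = c9, c12, 1) are the ARMv7-A ones. The struct, the function names and the choice to test only the PMNC enable bit (bit 0) are assumptions made for illustration; they are not taken from the attached patch. The register reads need privileged mode, so in practice this would live in kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of the two PMU registers mentioned above (hypothetical type). */
struct pmu_state {
    uint32_t pmnc;   /* Performance Monitor Control register */
    uint32_t cntens; /* Count Enable Set register */
};

#define PMNC_ENABLE 0x1u /* PMNC.E: global counter enable bit */

#if defined(__arm__) && defined(__GNUC__)
/* Privileged CP15 reads; these trap in user space unless user-mode
 * access to the PMU has been enabled. Only compiled for ARM targets. */
static inline struct pmu_state pmu_read_state(void)
{
    struct pmu_state s;
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r" (s.pmnc));
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 1" : "=r" (s.cntens));
    return s;
}
#endif

/* Returns true if the PMU looks like it was reset behind our back:
 * it was programmed as enabled, but the enable bits now read as zero. */
bool pmu_state_lost(struct pmu_state expected, struct pmu_state now)
{
    assert(expected.pmnc & PMNC_ENABLE); /* only meaningful while profiling */

    bool pmnc_cleared = (now.pmnc & PMNC_ENABLE) == 0;
    bool cntens_cleared = expected.cntens != 0 && now.cntens == 0;
    return pmnc_cleared || cntens_cleared;
}
```

A periodic timer could keep the values it last wrote as `expected`, call `pmu_read_state()` for `now`, and reprogram the PMU whenever `pmu_state_lost()` returns true.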

b) Is a workaround (e.g. kernel patch) possible to fix this issue reliably?

Well, thanks a lot for asking. Really. I just wanted to reply that no
practical workaround is available but then realized that I had
overlooked something simple :)

A patch with a workaround is attached. It is probably missing proper
locking/synchronization, which would need to be added, but it should
at least work in practice and seems to have almost no impact on the
profiling statistics (samples related to the activity of the
'watchdog' timer that monitors the PMU state get filtered out and are
not taken into account).

Testing and feedback are very much welcome.

Best regards,
Siarhei Siamashka

0001-ARM-OMAP-Cortex-A8-r1-PMU-bug-workaround-for-oprof.patch (6.16 KB)

I only took a quick look at your patch, so I can't say much about it.

When I first heard of that bug, I had another idea: why not collect
the active counters on a regular basis, accumulate the results and
clear the counters? I don't know if that fits well with oprofile, but
it would prevent any counter from overflowing (and so would prevent
the bug from occurring), provided the timer interrupt happens often
enough (I guess once per second is enough given the clock frequency
of the Cortex-A8). Does that make sense?
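
To put a rough number on "often enough", here is a small C sketch of the arithmetic and of the accumulate-and-clear idea. It assumes 32-bit hardware counters that increment once per CPU cycle and uses a 600 MHz core clock purely as an example figure; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32-bit counter wraps if it increments once per cycle. */
double seconds_until_wrap(double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return 4294967296.0 / cpu_hz; /* 2^32 counts */
}

/* 64-bit software total fed by periodic reads of a 32-bit counter. */
struct accum_counter {
    uint64_t total;
};

/* Called from a periodic timer with the value just read from the
 * hardware counter; the caller is expected to clear the hardware
 * counter right after reading it. */
void accum_collect(struct accum_counter *c, uint32_t hw_value)
{
    c->total += hw_value;
}
```

With the assumed 600 MHz clock, `seconds_until_wrap(600e6)` comes out at a little over 7 seconds, so a one-second collection interval would leave a comfortable margin for a cycle counter.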

Laurent

This works fine if all we need are cycle-precise timestamps (for use
with some kind of instrumentation at the beginning/end of the
interesting parts of the code). But the core of oprofile's
functionality is statistical sampling, which means that we actually
want interrupts to be generated, and lots of them. As the performance
counter is more likely to overflow in code which uses a lot of CPU
cycles (or whatever other event is being monitored), more interrupts
will be triggered in that code and recorded as oprofile samples.
Statistically, more samples will be collected for the addresses
close to the performance bottlenecks. That's the basic idea, a kind of
Monte Carlo method from mathematics.
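
The sampling idea can be illustrated with a toy simulation in C. This is not oprofile code: the two "regions", their cycle costs and the overflow period are hypothetical inputs, and the "interrupt" is just a counter reaching zero. Each overflow records one sample for whichever region was executing at that moment.

```c
#include <assert.h>
#include <stdint.h>

enum { REGION_HOT = 0, REGION_COLD = 1, REGION_COUNT = 2 };

/* Simulate a program that alternates between a hot and a cold region
 * while a cycle counter overflows every `period` cycles. Each overflow
 * adds one sample to the region that was running when it happened. */
void simulate_sampling(uint32_t hot_cycles, uint32_t cold_cycles,
                       uint32_t period, uint32_t iterations,
                       uint64_t samples[REGION_COUNT])
{
    assert(period > 0);

    const uint32_t cost[REGION_COUNT] = { hot_cycles, cold_cycles };
    uint32_t until_overflow = period; /* cycles left before next overflow */

    samples[REGION_HOT] = 0;
    samples[REGION_COLD] = 0;

    for (uint32_t i = 0; i < iterations; i++) {
        for (int r = 0; r < REGION_COUNT; r++) {
            uint32_t remaining = cost[r];
            /* Every overflow that falls inside this region is a sample. */
            while (remaining >= until_overflow) {
                remaining -= until_overflow;
                until_overflow = period;
                samples[r]++;
            }
            until_overflow -= remaining;
        }
    }
}
```

For example, with a hot region of 900 cycles and a cold region of 100 cycles per iteration, the hot region ends up with roughly nine times as many samples as the cold one, which is the proportion of time actually spent there.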

A usable workaround for the PMU-based oprofile driver should not skew
the statistics and should still provide relevant results. That's what
I'm trying to achieve with the workaround patch. If it fails at this
task and still adds noticeable unwanted 'noise' to the results, it
just has to be scrapped and a simple timer-based driver should be used
instead (fortunately, it can provide a sufficient sample collection
frequency).

If anybody could try some test profiling with and without the
workaround applied and compare the results for different test cases
and configurations, that would be very nice.

Best regards,
Siarhei Siamashka

I know this is an old thread about profiling on ARM. I tried the patch on kernel 2.6.32, but oprofile still doesn't work on the XM.
I also checked PMNC by using
.
.
.
asm volatile ("mrc p15, 0, %0, c9, c13, 2" : "=r" (count));

It is still zero. Has anyone solved this issue?