Question on cycle-based counter inside OMAP/ARM

On x86, there is a Time Stamp Counter (TSC) which can be read from user
mode to get the elapsed cycles easily. The TSC simply increases by
one on every cycle.

Is there a similar cycle counter inside OMAP or ARM? It would be very
useful for monitoring the performance of my application without needing
OProfile.

Thanks!

Samuel

There are the performance counters - I think you can enable user-land
access to them via the kernel, but I don't know how. I have only used
them without Linux, so I don't know how far its support goes (e.g. for
context switches).

There are 4 of them, and they can be programmed to count all manner of
quite interesting things, from cache misses to predicted branches to
the simple passage of time or instructions. There is also a simple
cycle counter which just counts up, and can be pre-scaled by 64 for
longer periods.
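
To put the "pre-scaled by 64 for longer periods" remark in numbers, here is a small sketch of how long the cycle counter runs before it wraps. It assumes the counter is 32 bits wide, as on the Cortex-A8; the function name and the 600 MHz clock used in the note below are only illustrative assumptions, not values from this thread.

```c
/* Seconds until a 32-bit cycle counter wraps around.
 * clock_hz:     CPU clock in Hz (assumed known by the caller)
 * divide_by_64: non-zero if the divide-by-64 pre-scaler is enabled
 * The function name is hypothetical, chosen for this sketch. */
static double ccnt_wrap_seconds(double clock_hz, int divide_by_64)
{
    const double counts = 4294967296.0;            /* 2^32 counter values */
    const double cycles_per_count = divide_by_64 ? 64.0 : 1.0;
    return counts * cycles_per_count / clock_hz;
}
```

With an assumed 600 MHz clock this gives roughly 7 seconds without the pre-scaler and roughly 458 seconds with it, so the pre-scaler trades resolution for range.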

See DDI0406B: "ARM Architecture Reference Manual, ARMv7-A and ARMv7-R
edition", Chapter C9 - Performance Monitors.

!Z

The performance counters are too heavy for most uses; they are meant
for detailed performance tuning.
If there were a counter similar to x86's TSC (increasing by 1 on every
machine cycle, like a watch's tick, and readable globally), it would be
very convenient, since the user only needs to read the TSC register
directly. The difference between two TSC values is simply the number of
cycles elapsed between the two readings.

Any more information?

Samuel

> Performance counter is too heavy for most usage and it is used for
> detailed performance tuning.

Err? It is? If you don't need detailed performance tuning use gettimeofday().

> If there is a similar counter like X86's TSC(increase 1 on every
> machine cycle,like watch's tick and readable globally), it will be
> very convenient. Since user only need to read the TSC register
> directly. 2 TSC value's difference is simply what's the elapse cycles
> between 2 TSC reading.

Well like i said in my reply, there is a simple cycle counter too.

> Any more information?

Try reading the manuals and using google.

Michael Zucchi <notzed@gmail.com> writes:

>> Performance counter is too heavy for most usage and it is used for
>> detailed performance tuning.

> Err? It is?

Of course not. The performance counters have no overhead at all.

> If you don't need detailed performance tuning use gettimeofday().

gettimeofday() is much more expensive. Reading the Cortex-A8 cycle
counters with an MRC instruction takes 50 cycles while a call to
gettimeofday() takes about 1000 cycles.

>> If there is a similar counter like X86's TSC(increase 1 on every
>> machine cycle,like watch's tick and readable globally), it will be
>> very convenient. Since user only need to read the TSC register
>> directly. 2 TSC value's difference is simply what's the elapse cycles
>> between 2 TSC reading.

> Well like i said in my reply, there is a simple cycle counter too.

>> Any more information?

> Try reading the manuals and using google.

The ARM ARM and Cortex-A8 TRM would be good reading.

Here is a patch to make the counters available directly from userspace:
http://git.mansr.com/?p=linux-omap;a=commitdiff;h=5170038

Thanks Måns Rullgård and Michael Zucchi!
Yes, gettimeofday() costs many more cycles, and its granularity might
be too coarse.
Could you tell me more about how to use the MRC instruction to read a
cycle counter? E.g. which counter, and how? A 50-cycle overhead looks
OK to me.

BTW, I must clarify that by "heavy" I meant that the usage steps are
not very easy: since I must talk to the PMU via a kernel module, it
needs more code and debugging than reading a register directly from
user space. I know the overhead of the PMU itself is light. :)

I would prefer a counter that lives outside the PMU, but if there is no
choice besides the PMU, I will try the PMU cycle counter. For the patch
enabling user-space access to the PMU, do I need to recompile the
kernel, or is it already upstream? Is there any usage example?

Thanks again!

Samuel

I also found some discussion at http://blog.gmane.org/gmane.linux.ports.arm.general/month=20080801
about reading CCNT on PXA processors.
It seems the PMU's CCNT is a nice choice, and the only one.
I would appreciate any guidance on reading CCNT based on your patch.

BTW, if I can't rebuild the kernel because I lack the customized
kernel source, is it possible to write a driver (.ko) to read CCNT,
and how?

Samuel

Hi.

Yes you can use the fixed function cycle counter CCNT in your OMAP application.

You could also program one of the programmable event counters to count cycles, so that you have multiple hardware sources, but in my experience whether this works depends on the OMAP silicon (http://e2e.ti.com/support/arm174_microprocessors/omap_applications_processors/f/42/p/38720/135485.aspx#135485).

Documentation:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/DDI0344J_cortex_a8_r3p2_trm.pdf

What you want is around page 210 (the mrc/mcr commands involved).

You basically need the following:
- program bit 0 in USEREN reg from the kernel to enable userland access to the counters
- program PMNC bit 0 to enable counter hardware
- program CNTENS bit 31 to enable CCNT counting
- potentially also program CNTENC bit 31 to disable CCNT counting
- unmask iPMU_IRQ on your OMAP
- read CCNT to get the current 32bit hardware value of the counter
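
The register steps above can be sketched in C roughly as follows. This is only an illustration: the macro and function names are made up for this sketch, the coprocessor encodings are the ones I understand the ARMv7 performance monitor registers to use (USEREN at c9, c14, 0; PMNC at c9, c12, 0; CNTENS at c9, c12, 1; CNTENC at c9, c12, 2; CCNT at c9, c13, 0), and the interrupt unmasking step is left out because it is OMAP-specific. The USEREN write only works in a privileged mode, so it would have to live in kernel code such as a module.

```c
#include <stdint.h>

/* Bit masks for the steps in the list; names are hypothetical. */
#define PMU_USEREN_EN  (1u << 0)   /* USEREN bit 0: allow user-mode access  */
#define PMU_PMNC_E     (1u << 0)   /* PMNC bit 0: enable the counter block  */
#define PMU_CNT_CCNT   (1u << 31)  /* CNTENS/CNTENC bit 31: cycle counter   */

#if defined(__arm__)
/* Privileged only: let user mode access the performance counters. */
static inline void pmu_allow_user_access(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c14, 0" :: "r"(PMU_USEREN_EN));
}

/* Turn on the counter hardware and start CCNT counting. */
static inline void pmu_enable_ccnt(void)
{
    uint32_t pmnc;
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmnc));
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmnc | PMU_PMNC_E));
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(PMU_CNT_CCNT));
}

/* Stop CCNT counting again. */
static inline void pmu_disable_ccnt(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 2" :: "r"(PMU_CNT_CCNT));
}

/* Read the current 32-bit value of CCNT. */
static inline uint32_t pmu_read_ccnt(void)
{
    uint32_t cc;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
    return cc;
}
#endif /* __arm__ */
```

The inline assembly is guarded by `__arm__` so the file still compiles on a non-ARM build host; only the masks are visible there.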

Gabi Voiculescu


Samuel <samuel.xu.tech@gmail.com> writes:

> Thanks Måns Rullgård and Michael Zucchi !
> Yes, gettimeofday() needs much more cycles and granularity might be
> too big.
> Could you share me more one how to use MRC instruction to read some
> cycle counter? e.g. which counter, how? 50 cycle overhead looks ok
> for me.
>
> BTW, I must clarify that "heavy" in my context is that :usage step is
> not very easy, since I must communicate PMU via kernel module, the
> usage mode is some how "heavy", which need more code and debugging
> than read register directly from user space..... I know the overhead
> of PMU is light. :)
>
> I prefer some counter not lived inside PMU, while if there isn't any
> choice beside PMU, I will try PMU cycle counter. For the patch of
> usage space visiting of PMU, need I re-compile kernel? or it is
> already up-streamed? Is there any usage example?

To use the cycle counter from userspace, apply the patch linked above,
enable the new config option and rebuild the kernel. In your app, use
these functions to access the counter:

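/* Enable the cycle counter: set bit 31 (CCNT) in CNTENS. */
static inline void ccnt_start(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u<<31));
}

/* Disable the cycle counter: set bit 31 (CCNT) in CNTENC. */
static inline void ccnt_stop(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 2" :: "r"(1u<<31));
}

/* Read the current 32-bit value of CCNT. */
static inline unsigned ccnt_read(void)
{
    unsigned cc;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
    return cc;
}

/* Stop the counter, then write PMNC = 5: bit 0 enables the counters,
   bit 2 resets CCNT to zero. Call ccnt_start() afterwards. */
static inline void ccnt_init(void)
{
    ccnt_stop();
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(5));
}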

If you don't stop the counter after you're done with it, oprofile will
be unhappy, should you wish to use it. Needless to say, using this
while oprofile is running is not a good idea.
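
As a usage note, here is one possible way to turn two readings into an elapsed cycle count. The helper name `ccnt_elapsed` is made up for this sketch and is not part of the patch; it only relies on CCNT being a 32-bit counter, so unsigned subtraction gives the right answer even if the counter wrapped once between the two readings.

```c
#include <stdint.h>

/* Cycles elapsed between two CCNT readings.
 * Unsigned 32-bit subtraction is modulo 2^32, so a single wrap of the
 * counter between 'start' and 'end' is handled automatically.
 *
 * Intended use together with the functions above (on the target):
 *     ccnt_init();
 *     ccnt_start();
 *     uint32_t t0 = ccnt_read();
 *     ... code being measured ...
 *     uint32_t cycles = ccnt_elapsed(t0, ccnt_read());
 *     ccnt_stop();
 */
static inline uint32_t ccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

This cannot detect more than one wrap, so the measured region should be shorter than one full counter period.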