Why does reading the DMTIMER clock take so long?

I got a BeagleBone for robotics, where low-latency operation is
important (especially if I want to control 40 pins). I found the
sysfs interface for controlling GPIO too slow (~2200 ns), so I resorted
to modifying the GPIO registers directly, which got the time down to
~200 ns. Then I wanted to read the clock via
clock_gettime(CLOCK_REALTIME), but that is also very slow (~1500 ns).
Again I resorted to reading the DMTIMER2 counter directly and got the
time down to about 100 ns. I still think this is a bit high. Can
someone enlighten me as to why it would take this long to read an
on-chip value?

volatile uint32_t *dmtimer2_regs = (uint32_t *)mmap(NULL, 0x1000,
PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x48040000);

while (true)
{
  uint32_t t0 = dmtimer2_regs[0x3c / 4];
  uint32_t t1 = dmtimer2_regs[0x3c / 4];
  // Typically prints 5 clock ticks (each clock tick is ~41 ns),
  // hence each timestamp read takes ~100 ns.
  cout << t1 - t0 << endl;
}

Actually, I think my estimate is off by a factor of 2. Five ticks at
~41 ns each is ~205 ns between successive reads, so the round-trip time
to read that value would be about 200 ns (and about 100 ns each way,
assuming equal times for the request and response). Why so long?
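
For reference, the arithmetic behind that estimate can be sketched as below. This assumes the timer is clocked at 24 MHz, which is what a ~41 ns tick implies; ticks_to_ns is a hypothetical helper name, not something from a library.

```cpp
#include <cstdint>

// Hypothetical helper: convert the difference between two 32-bit counter
// samples into nanoseconds.
// Assumption: the counter counts up at timer_hz (24 MHz here, i.e. one
// tick is 1/24 MHz = ~41.67 ns).
// Unsigned subtraction yields the correct delta even if the counter
// wrapped around once between the two samples.
constexpr double ticks_to_ns(uint32_t t0, uint32_t t1, double timer_hz = 24e6) {
    return static_cast<uint32_t>(t1 - t0) * (1e9 / timer_hz);
}
```

With t1 - t0 == 5 this gives roughly 208 ns between the two reads.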

Any more insight into this?

I found your post quite interesting so I tried it out myself. Two
back-to-back reads of the dmtimer "TCRR" register give me values that
usually differ by 5 (sometimes 6, occasionally 4). If these ticks are
41ns long then I too am seeing ~200ns between successive reads.
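
A rough way to collect that distribution of deltas (mostly 5, sometimes 6, occasionally 4) is sketched below. delta_histogram is a hypothetical helper; it takes the counter read as a callable so it can be tried without the hardware, and on the board the callable would simply return the TCRR value from the mapped registers.

```cpp
#include <cstdint>
#include <map>

// Hypothetical helper: perform two back-to-back counter reads `samples`
// times and count how often each tick delta occurs.
// read_counter is any callable returning the current 32-bit counter value.
template <typename ReadFn>
std::map<uint32_t, unsigned> delta_histogram(ReadFn read_counter, unsigned samples) {
    std::map<uint32_t, unsigned> hist;
    for (unsigned i = 0; i < samples; ++i) {
        uint32_t t0 = read_counter();
        uint32_t t1 = read_counter();
        // Unsigned subtraction stays correct across a counter wraparound.
        ++hist[static_cast<uint32_t>(t1 - t0)];
    }
    return hist;
}
```

The map insertion happens after both reads, so it does not inflate the measured delta, though it may disturb the caches between iterations.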

Looking at the assembler code generated by the C compiler, the code
looks good (i.e. just a few machine instructions to do the loads).
Surely we'd expect this to take a dozen or two clock cycles at
most (i.e. maybe 30-40 ns)...?

In reading the manual "SPRUH73C.PDF" I saw it mention that it will
automatically provide synchronized access to the time values (e.g. if
a carry from the lower word to the upper word is in progress). Surely
that's not responsible for this sort of delay?

Wow, I completely forgot about this thread until someone just emailed me with a question. Thanks for your interest.

The file I opened is “/dev/mem”

int fd = open("/dev/mem", O_RDWR | O_SYNC); // O_SYNC makes the mapped memory uncacheable

The file is a view of the entire 4 GB physical memory space. Obviously, you need to be root to open it.
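
Putting the open and the mmap together with some error checking might look like the sketch below. map_region is a hypothetical helper name; the base address 0x48040000 and the 0x3c register offset are the ones used earlier in this thread, and everything else is plain POSIX.

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: map `length` bytes starting at offset `phys_base`
// of the file at `path` (e.g. "/dev/mem") and return a pointer to 32-bit
// registers, or nullptr on failure.
volatile uint32_t* map_region(const char* path, off_t phys_base, size_t length) {
    int fd = open(path, O_RDWR | O_SYNC);  // O_SYNC as in the original open() call
    if (fd < 0) {
        return nullptr;  // e.g. not running as root
    }
    void* p = mmap(nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys_base);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) {
        return nullptr;
    }
    return static_cast<volatile uint32_t*>(p);
}
```

On the board this would be called as map_region("/dev/mem", 0x48040000, 0x1000), after which regs[0x3c / 4] reads the counter.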

I also have a conjecture about why reading the clock is slow. The DM timer isn’t part of the ARM architecture, so the CPU has to read it as an external device. As processors integrate more and more devices onto the same chip, packet-switched networks are increasingly used instead of dedicated wires, since a packet interface is general purpose and gets higher utilization. But packet switching increases latency.

I’d like to know why ARM didn’t include a clock as part of the architectural state. Sure, leaving it out can save some power, but practically all embedded devices need a clock, and reading an external one will use more power. At least one other computer architect agrees:

https://www.youtube.com/watch?v=J9kobkqAicU

At 22:30, Burton laments the lack of a user-readable clock on today’s processors.

Yale