BeagleBone Black GPIO a bit slow via /dev/mem and mmap?

Hi, folks,

I have been trying to bit bang an interface on the GPIO pins. I looked at a bunch of the tutorials on the web and managed to build something that worked. However, the speed seems to be a bit slower than I would expect.

My scope shows that I am running at about 2.95MHz (it also roughly matches the 2.78MHz result from http://chiragnagpal.com/examples.html). That seems slow. I would have expected closer to 25MHz, so I think I’m an order of magnitude off somewhere. The assembly code seems to be as expected, so my question is:

What is slowing things down?

Since the assembly is basically ldr/str in a chain, something must be stalling the pipeline, but I don’t know what.

Suggestions would be appreciated.

I’m running Debian Jessie Linux arm 4.1.6-ti-r11 #1 SMP PREEMPT Tue Aug 18 21:36:11 UTC 2015 armv7l GNU/Linux

Thanks.

The assembly code looks optimal at 4 instructions per toggle (I’m using clang):

`

.loc 2 80 2 @ gpi.c:80:2
ldr r1, [sp, #28]
str r0, [r1]
.loc 2 81 9 @ gpi.c:81:9
ldr r1, [sp, #32]
str r0, [r1]
.loc 2 82 2 @ gpi.c:82:2
ldr r1, [sp, #28]
str r0, [r1]
.loc 2 83 9 @ gpi.c:83:9
ldr r1, [sp, #32]
str r0, [r1]

`

The C Code I used to toggle the pin P9_23 (GPIO1_17):

`

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

#include <unistd.h>
#include <assert.h>

#define GPIO_OE 0x134
#define GPIO_SETDATAOUT 0x194
#define GPIO_CLEARDATAOUT 0x190

// Hunt these addresses down from ls -al /sys/devices/platform/ocp | grep gpio
// You can also pull them from the TI manual (spruh73l.pdf)
#define GPIO0_BASE 0x44E07000
#define GPIO1_BASE 0x4804C000
#define GPIO2_BASE 0x481AC000
#define GPIO3_BASE 0x481AE000

#define GPIO_SIZE 0x00001000

#define PIN_17 ((uint32_t)1<<17)

uint32_t ui32Base[] = {GPIO0_BASE, GPIO1_BASE, GPIO2_BASE, GPIO3_BASE};
uint8_t volatile * bbGPIOMap[] = {0, 0, 0, 0};

int main(int argc, char *argv[])
{
unsigned int ui;

uint32_t volatile * gpio_oe_addr = NULL;
uint32_t volatile * gpio_setdataout_addr = NULL;
uint32_t volatile * gpio_cleardataout_addr = NULL;

int fd = open("/dev/mem", O_RDWR);
assert(fd >= 0); // opening /dev/mem normally requires root

for(ui=0; ui<4; ++ui) {
bbGPIOMap[ui] = mmap(0, GPIO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, ui32Base[ui]);
assert(bbGPIOMap[ui] != MAP_FAILED);
}

gpio_oe_addr = (uint32_t volatile *)(bbGPIOMap[1] + GPIO_OE);
gpio_setdataout_addr = (uint32_t volatile *)(bbGPIOMap[1] + GPIO_SETDATAOUT);
gpio_cleardataout_addr = (uint32_t volatile *)(bbGPIOMap[1] + GPIO_CLEARDATAOUT);

*gpio_oe_addr = *gpio_oe_addr & ~PIN_17;

while(1) {
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
*gpio_setdataout_addr = PIN_17;
*gpio_cleardataout_addr = PIN_17;
}

//*(uint32_t volatile *)(bbGPIOMap[3] + GPIO_SETDATAOUT) = (uint32_t)1 << 14;
//*(uint32_t volatile *)(bbGPIOMap[3] + GPIO_CLEARDATAOUT) = (uint32_t)1 << 14;

close(fd);
return 0;
}

`

That is not optimal: loads are expensive (although I would have
thought the stack would be cached).

Optimal would be a sequence of successive stores to the set and clear
register, with the addresses pre-calculated and sitting in registers
(or with a base address and use the str instruction with an immediate
offset). You can probably coerce the C compiler into doing this for
you if you setup a struct or array pointer for the set/clear registers
instead of using the two volatile pointers. You might also need to
move the volatile on the variable definitions. I'm not a "C" guru,
but I think:

  uint32_t volatile * gpio_setdataout_addr

...is different from:

  volatile uint32_t * gpio_setdataout_addr

...you want the uint32_t to be volatile, not the pointer to it.

I'd still set this up more like:

  struct reg {
    uint32_t clr;   /* GPIO_CLEARDATAOUT, offset 0x190 */
    uint32_t set;   /* GPIO_SETDATAOUT, offset 0x194 */
  };

  volatile struct reg * gpio_reg = NULL;
  ...
  gpio_reg->set = PIN_17;
  gpio_reg->clr = PIN_17;

...but I write C code like a hardware guy who programs in VHDL. :slight_smile:
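A compilable take on the struct idea above, offered as a rough sketch rather than a tested driver. It assumes the offsets from the original post (GPIO_CLEARDATAOUT at 0x190, GPIO_SETDATAOUT at 0x194), which makes the two registers adjacent so a two-word struct can overlay them. The names `gpio_setclr`, `gpio_setclr_at` and `pulse_pin` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPIO_CLEARDATAOUT 0x190  /* offsets as given in the original post */
#define GPIO_SETDATAOUT   0x194

/* Hypothetical overlay for the two adjacent registers. */
struct gpio_setclr {
    uint32_t clr;   /* GPIO_CLEARDATAOUT */
    uint32_t set;   /* GPIO_SETDATAOUT   */
};

/* Compile-time check that the struct layout matches the register spacing. */
_Static_assert(offsetof(struct gpio_setclr, set) - offsetof(struct gpio_setclr, clr)
               == GPIO_SETDATAOUT - GPIO_CLEARDATAOUT,
               "set/clr spacing does not match the register map");

/* Return the overlay for a GPIO bank whose registers are mapped at `bank`. */
static volatile struct gpio_setclr *gpio_setclr_at(volatile uint8_t *bank)
{
    return (volatile struct gpio_setclr *)(bank + GPIO_CLEARDATAOUT);
}

/* Drive the pin(s) in `mask` high and then low, `count` times.
 * Only one base pointer is needed, so an optimizing compiler can keep it
 * in a register and address each field with an immediate offset. */
static void pulse_pin(volatile struct gpio_setclr *reg, uint32_t mask, unsigned count)
{
    while (count--) {
        reg->set = mask;
        reg->clr = mask;
    }
}
```

On the board this would be called with the mapped bank from the original program, for example `pulse_pin(gpio_setclr_at(bbGPIOMap[1]), PIN_17, 16)`. Whether the compiler actually emits back-to-back stores depends on the optimization level, so it is worth checking the generated assembly.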

Regardless, note that the maximum toggle rate of a GPIO pin using the
PRU is about 12.5 MHz, dictated by the ~40 ns required to execute a
write by the on-chip communication fabric (which means 80 ns period
for a high/low toggle of the pin). This assumes the CPU and the PRU
have similar bandwidth to the L4 fabric, which may or may not be the
case (but I suspect is true).

So it looks like you've got some room for improvement, but you're not
off by an order of magnitude or anything.
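
The arithmetic behind those figures, as a small sketch. The 40 ns per write is the PRU figure quoted above and 2.95 MHz is the scope measurement from the original post; the function names are made up for illustration.

```c
#include <assert.h>

/* Toggle frequency in MHz for a pin that needs two writes per period
 * (one set, one clear), each taking write_ns nanoseconds. */
static double toggle_rate_mhz(double write_ns)
{
    double period_ns = 2.0 * write_ns;
    return 1000.0 / period_ns;   /* a 1000 ns period corresponds to 1 MHz */
}

/* Inverse: time per write in ns implied by a measured toggle frequency. */
static double write_time_ns(double toggle_mhz)
{
    double period_ns = 1000.0 / toggle_mhz;
    return period_ns / 2.0;
}
```

With these numbers, `toggle_rate_mhz(40.0)` gives 12.5 MHz, and the measured 2.95 MHz works out to roughly 170 ns per write, a factor of about 4.2 from the PRU figure rather than 10.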

> Optimal would be a sequence of successive stores to the set and clear
> register, with the addresses pre-calculated and sitting in registers
> (or with a base address and use the str instruction with an immediate
> offset).

Good point. I can beat the compiler into submission if I want to, but it’s probably not worth the effort as I need to do other logic anyway.

> You might also need to
> move the volatile on the variable definitions. I’m not a “C” guru,
> but I think:
>
> uint32_t volatile * gpio_setdataout_addr
>
> …is different from:
>
> volatile uint32_t * gpio_setdataout_addr
>
> …you want the uint32_t to be volatile, not the pointer to it.

This is one of those C quirks. It’s actually a habit from “const” but it applies to “volatile” as well. Oddly, a qualifier applies to whatever is immediately to its left, except when it comes first in the declaration, in which case it applies to what is on its right. This gets people into trouble when they specify:

volatile uint32_t volatile * blah;

Which is actually a redundant specifier as opposed to what they wanted:

volatile uint32_t * volatile blah;

Not every compiler (especially embedded ones) will emit a warning on the duplicate declaration specifier. So, I personally always put them on the right to avoid the issue.
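
A small compile-time check of that rule, as a sketch that assumes a C11 compiler (`_Generic` and `_Static_assert`). The macro and variable names are made up for illustration. `_Generic` ignores the top-level qualifiers of its argument, so it classifies each pointer purely by what it points to.

```c
#include <assert.h>
#include <stdint.h>

/* 1 if the expression is a pointer to volatile uint32_t, else 0. */
#define POINTS_TO_VOLATILE_U32(p) \
    _Generic((p), volatile uint32_t *: 1, default: 0)

uint32_t fake_reg;  /* stand-in for a hardware register */

uint32_t volatile *qual_on_right = &fake_reg;       /* pointer to volatile data */
volatile uint32_t *qual_on_left  = &fake_reg;       /* exactly the same type */
uint32_t *volatile volatile_ptr  = &fake_reg;       /* volatile pointer, plain data */
volatile uint32_t *volatile volatile_both = &fake_reg;  /* both are volatile */

/* Both spellings mean "the uint32_t is volatile". */
_Static_assert(POINTS_TO_VOLATILE_U32(qual_on_right) == 1, "right-hand volatile");
_Static_assert(POINTS_TO_VOLATILE_U32(qual_on_left) == 1, "left-hand volatile");

/* A qualifier after the '*' applies to the pointer itself, not the data. */
_Static_assert(POINTS_TO_VOLATILE_U32(volatile_ptr) == 0, "data is not volatile");
_Static_assert(POINTS_TO_VOLATILE_U32(volatile_both) == 1, "data is volatile");
```

If any of these assumptions were wrong, the file would fail to compile, so building it is the whole test.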

> …but I write C code like a hardware guy who programs in VHDL. :slight_smile:

Not a damn thing wrong with that, thank you very much. Software-only people tend to be amazed that you can write a state machine that is readable in code. A very experienced programmer once told me: “Your code is the most straightforward code I have ever read.”

Of course, I have to condemn you as a dirty apostate for not using Verilog. :slight_smile: (Seriously, though, when is somebody going to produce an open-source VHDL simulator/compiler so that I can actually use it on projects?)

> Regardless, note that the maximum toggle rate of a GPIO pin using the
> PRU is about 12.5 MHz, dictated by the ~40 ns required to execute a
> write by the on-chip communication fabric (which means 80 ns period
> for a high/low toggle of the pin). This assumes the CPU and the PRU
> have similar bandwidth to the L4 fabric, which may or may not be the
> case (but I suspect is true).

Do you have a pointer to the reference manual for this (if not, don’t waste a lot of time, I’ll dig it out)? Given that this seems fundamental to understanding the architecture, I probably need to chase this down exactly.

Thanks for the advice. I appreciate your taking the time on this.

> This is one of those C quirks. It’s actually a habit from “const” but it applies to “volatile” as well. Oddly, a qualifier applies to whatever is immediately to its left, except when it comes first in the declaration, in which case it applies to what is on its right. This gets people into trouble when they specify:
>
> volatile uint32_t volatile * blah;
>
> Which is actually a redundant specifier as opposed to what they wanted:
>
> volatile uint32_t * volatile blah;
>
> Not every compiler (especially embedded ones) will emit a warning on the duplicate declaration specifier. So, I personally always put them on the right to avoid the issue.

The volatile keyword is often misunderstood. But I do not think the placement of the keyword matters much, as long as it is either immediately before or immediately after the type. So volatile int i and int volatile i would be the same.

I think the actual scope matters more, e.g. global versus local scope. But maybe I’m remembering wrongly, as I recall reading something to this effect years ago. Anyway, I find this link the best single resource for explaining what volatile is. And I’m not trying to start an argument or anything; I just like discussing programming in general.

http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword

Refer to Chapter 10 in the TRM, which is all about the interconnects
(on-chip buses). The GPIO live on L4_PER (except GPIO0, which is on
L4_WKUP) and are accessed via the L3S (L3 Slow clock domain). You can
also see how the PRU and CPU tie into the interconnect (both tie to
the L3F or Fast clock domain), and what peripherals they can access
(both PRU and CPU can access the GPIO, no surprise there!).

I haven't done real-world timing tests with the CPU, but the results
of my real-world tests with the PRU are documented in the code I wrote
for the Machinekit project:

https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/hal_pru_generic/pru_generic.p#L137-L165

Assuming the performance bottleneck is the actual GPIO (which seems
likely), you should see similar performance metrics to the PRU when
accessing the GPIO banks from the CPU.
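
For anyone who wants to check this from the CPU side without a scope, here is a rough timing sketch. It is an assumption-laden illustration, not something measured on a board: `ns_per_write` is a made-up name, it relies on POSIX `clock_gettime`, and the result includes loop overhead, so a scope reading remains the better reference.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Average time per register write, in ns, over `toggles` set/clear pairs.
 * `set` and `clr` would be the mapped GPIO_SETDATAOUT and
 * GPIO_CLEARDATAOUT addresses on real hardware. */
static double ns_per_write(volatile uint32_t *set, volatile uint32_t *clr,
                           uint32_t mask, unsigned long toggles)
{
    struct timespec t0, t1;
    unsigned long i;
    double elapsed_ns;

    if (toggles == 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < toggles; ++i) {
        *set = mask;
        *clr = mask;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
               + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (2.0 * (double)toggles);  /* two writes per toggle */
}
```

With the pointers from the original program this would be something like `ns_per_write(gpio_setdataout_addr, gpio_cleardataout_addr, PIN_17, 1000000UL)`. Very old glibc versions may need `-lrt` at link time.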

> I think the actual scope matters more, e.g. global versus local scope. But maybe I’m remembering wrongly, as I recall reading something to this effect years ago. Anyway, I find this link the best single resource for explaining what volatile is. And I’m not trying to start an argument or anything; I just like discussing programming in general.
>
> http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword

Um, that article is in violent agreement with me. :slight_smile:

So in that article I suppose Dan Saks is talking about function signatures, in C++ versus C. The explanations seem very contrived, and I’m not sure I’d consider much of the discussed code “good form”.

> On the other hand, in:
> int f(char *const p);
> the const qualifier is at the top level, so it is not part of the function’s signature. This function has the same signature as:
> int f(char *p);

This “form” or style of code is bad going by anything I’ve read. So . . .

int f(const char *p) or maybe int f(char const *p). Unless I misunderstood what Dan Saks is trying to say here, the article heading should have been “Asterisk placement”, and not top-level CV-qualifiers. Or maybe the point is still valid, but could have been avoided entirely by using “better” coding style. shrug