PRU Multiply and accumulate - how many cycles?

Lenny · April 16, 2014, 4:31pm

Hi,

I started using the Multiply-and-Accumulate module of the PRU of my Beaglebone Black to construct a real-time digital filter. But as soon as I insert an instruction of the following type into the code, with any data and any of the MAC registers (r25-r29), I get a non-negligible latency:
MOV r25,0x00000000
XOUT 0,r25,1

I would estimate that the XOUT or XIN instructions cost about 5 microseconds, which is close to 1024 PRU clock cycles. I have also observed other odd behaviour of the multiplier module: for example, it only multiplied the values in registers r28 and r29 properly when I executed an XOUT 0,r28,8 instruction before reading the result. Does anyone have experience with this module? Is it possible that I am using it wrong, that it has to be activated properly before being usable, or that the hardware is not working correctly?

Thanks for any help!
Lenny

Bas_Laarhoven1 · April 16, 2014, 4:36pm

Something is definitely wrong with your setup. I’m using the multiplier for my BeBoPr stepper driver and IIRC the instruction takes only one PRU cycle. There’s some overhead for set-up and to check the flags, but certainly nothing like you’re mentioning. – Bas
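For readers following along, the set-up overhead mentioned above might look roughly like the sketch below in multiply-only mode. It is not code from either poster. It assumes the register roles described in TI's PRU reference guide (r25 = mode/status, r28/r29 = operands, r26/r27 = product, XFR device ID 0 for the MAC), and the operand values are arbitrary examples. Please check it against the reference guide before relying on it.

```asm
// Multiply-only mode, minimal sketch (assumed register roles, see above)
MOV  r25, 0          // r25 bit 0 = 0 selects multiply-only mode
XOUT 0, r25, 1       // write the mode byte to the MAC (device ID 0)

MOV  r28, 1000       // first operand (example value)
MOV  r29, 2000       // second operand (example value)
MOV  r0, r0          // filler instruction: leave at least one cycle before reading back
XIN  0, r26, 8       // r26 = low 32 bits, r27 = high 32 bits of r28 * r29
```

The mode write only needs to happen once. After that, each multiplication costs the operand loads, the one-cycle gap and the XIN, which is nowhere near a microsecond-scale delay.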

Lenny · April 16, 2014, 4:55pm

Hi, thanks for the reply. I also noticed something else which might be related: to synchronize the two PRUs, one of them sends some data every 250 clock cycles with the command

XOUT 10,r13,4

which the other receives via

XIN 10,r13,4.

This works well. If, however, I mismatch the registers, that is, if I combine

XOUT 10,r13,4 on PRU0

with
XIN 10,r14,4 on PRU1

I still synchronize the two PRUs, even though I would expect both PRUs to simply stall here for 1024 cycles.

Just thought this might be related to the other XIN/XOUT problems with the multiplier.

Lenny · April 16, 2014, 11:59pm

Thanks for the help Bas!

I guess I found the problem:

In my code, I used PRU0 simply as a timer with the code

LOOP:
WAIT(250 cycles)
XOUT 14,r5,4 //transfer register r5 from PRU0 to PRU1
JMP LOOP

In the meantime, PRU1 did some tasks, including multiplication using XOUT/XIN 0,r25,1 and similar instructions, and finally should have stalled at the instruction
XIN 14,r5,4
in order to synchronize with PRU0.

However, if the timing is initially off, PRU0 can end up waiting for PRU1 while blocking the XCHG port with its XOUT 14,… command. If PRU1 then wants to retrieve the result of a multiplication, e.g. by executing XIN 0,r26,4, it has to wait until PRU0 releases the XCHG port. PRU0 in turn waits up to 1024 cycles for PRU1 to accept its XOUT request and keeps the port blocked the whole time, so PRU1 can never reach that section of its code in time. In this case the two PRUs block each other and the program runs about 1000 times slower!

Also, the checks that are performed to ensure a proper transfer through the XCHG port are quite basic: it seems to me that a successful transfer between two PRUs only requires one PRU that is willing to write (issuing XOUT 14,… ) and another that is willing to read (XIN 14,…). Neither the registers to be read or written nor the amount of data has to match between the two commands. If they don't match, I don't know what data is actually written, but at least neither PRU stalls for 1024 cycles.

Bas_Laarhoven1 · April 17, 2014, 8:14am

Ah, you didn't mention you were synchronizing the PRUs in your first mail!

Note that use of the MAC is not that obvious, and there is (was?) little documentation.
Some issues I found had to do with the overflow/carry flag. It seems to lag one cycle and it cannot be cleared without clearing the entire accumulator.

Now this information is more than a year old (11/2012), so maybe the silicon and/or documentation have been fixed in the meantime, but be prepared for some surprises!

-- Bas
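The carry flag behaviour described above concerns multiply-and-accumulate mode. A rough sketch of that sequence follows, again assuming the register roles from TI's PRU reference guide (r25 bit 0 = MAC mode, r25 bit 1 = carry flag, written as 1 to clear the accumulator together with the carry). The operand values are arbitrary, and none of this is taken from Bas's or Lenny's actual code.

```asm
// Multiply-and-accumulate mode, minimal sketch (assumed register roles, see above)
MOV  r25, 1          // bit 0 = 1: select multiply-and-accumulate mode
XOUT 0, r25, 1       // write the mode byte to the MAC (device ID 0)
MOV  r25, 3          // bit 1 = 1: clear accumulator and carry, keep MAC mode
XOUT 0, r25, 1

MOV  r28, 3          // first operand (example value)
MOV  r29, 4          // second operand (example value)
XOUT 0, r28, 8       // in this mode the XOUT of r28/r29 triggers acc += r28 * r29

XIN  0, r26, 8       // r26/r27 = low/high 32 bits of the accumulator
XIN  0, r25, 1       // r25 bit 1 = carry flag
```

Because the carry bit and the accumulator are cleared by the same write, this matches the observation that the flag cannot be reset on its own.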