Cool, thanks for the update, Carlos! I thought a little about your problem and was wondering whether you are really at the limit of the PRU's capability:

You say you have roughly 20kSps with 10bit resolution for four outputs, so roughly 20M updates of the output pins per second (20k x 1024). At the 200MHz PRU clock, that means you need roughly 10 instructions for changing the PWM outputs to their next value. What do you think of the following implementation? It is not optimal, but I think its resolution should be comparable to yours while being roughly 10x faster:

START:

XIN 10,r0,120 //fetches data from scratchpad0 into r0:r29 - each register holds 8x4 bits, where bits 0,4,8,12 define four consecutive values for output 0, bits 1,5,9,13 for output 1 and so on.

AND R30.b0, r0.b0, 0x0F //move the bits 0:3 of r0.b0 to r30.b0 (the other bits of R30.b0 are set to zero by the AND) - here it is assumed that your output pins correspond to r30[0:3]. That can easily be adapted to any other 4 consecutive bits by using an LSL instruction

LSR R30.b0,r0.b0,4 //move the bits 4:7 of r0.b0 to r30.b0 (the other bits of R30.b0 are set to zero by the LSR)

AND R30.b0, r0.b1, 15 //same for r0.b1

LSR R30.b0,r0.b1,4

//repeat this pair of instructions for all other bytes up to…

AND R30.b0, r29.b3, 15

LSR R30.b0,r29.b3,4

//up to here, 30*8=240 sets of 4 bits were written to the output

XIN 11,r0,120 //fetches data from scratchpad1

//repeat the write operations

XIN 12,r0,120 //fetches data from scratchpad2

//repeat the write operations

XIN 14,r0,120 //fetches data from other PRU

//repeat the write operations

JMP START

//here a total of 4*240=960 writes on 4 digital outputs r30[0:3] has been executed. So we have already about 9.9 bits resolution and - given an overhead of 5 cycles (4 XINs and the JMP) - an update rate of 200MHz*960/965 = roughly 199MHz. That is, the actual output rate will be roughly 200kSps (200MHz/965).
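To double-check the numbers above, here is a quick Python sketch of the arithmetic. It assumes the 200MHz PRU clock and one cycle per instruction; the function names are just made up for illustration.

```python
import math

PRU_CLOCK_HZ = 200_000_000  # assumed PRU clock (200 MHz), one instruction per cycle


def cycles_per_update(sample_rate_hz, resolution_bits):
    """Cycles available per pin update for a PWM with 2**bits steps per sample."""
    return PRU_CLOCK_HZ / (sample_rate_hz * 2 ** resolution_bits)


def unrolled_frame(writes=960, overhead_cycles=5):
    """Return (frame rate in Hz, resolution in bits) of the unrolled PRU0 loop."""
    frame_rate = PRU_CLOCK_HZ / (writes + overhead_cycles)
    return frame_rate, math.log2(writes)


# Budget of the original approach: 20 kSps at 10 bit
print(cycles_per_update(20_000, 10))

# Unrolled loop: 960 writes plus 4 XINs and 1 JMP
rate, bits = unrolled_frame()
print(round(rate), round(bits, 2))
```

The first number is the roughly 10 cycles per update of the original approach, the second pair is the frame rate and resolution of the unrolled loop.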

The 960 cycles leave plenty of time for PRU1 to perform 4 LBBO operations that are each 120 bytes wide: typically each one should take less than 80 cycles. So the code will look something like

LBBO from DRAM filling all registers r0:r29

XOUT 10,r0,120

LBBO from DRAM filling all registers r0:r29

XOUT 11,r0,120

LBBO from DRAM filling all registers r0:r29

XOUT 12,r0,120

LBBO from DRAM filling all registers r0:r29

XOUT 14,r0,120

//here the code will stall until PRU0 requests the data with the corresponding XIN operation.
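The data that PRU1 fetches with the LBBOs has to be laid out in the nibble order that PRU0 consumes (low nibble of r0.b0 first, then the high nibble, then r0.b1 and so on). A minimal host-side sketch in Python could look like the following. It assumes plain leading-edge PWM (each output is high for its first N slots) and ignores the extra weight of the slots before the XINs and the JMP; `pack_frame` is a hypothetical helper, not part of any library.

```python
SLOTS_PER_BANK = 240  # 30 registers * 8 nibbles per XIN
BANKS = 4             # XIN 10, 11, 12 and 14
TOTAL_SLOTS = SLOTS_PER_BANK * BANKS  # 960


def pack_frame(duties):
    """Pack four duty values (number of high slots, 0..960) into 4 blocks of 120 bytes.

    Bit ch of each nibble drives output ch (r30[0:3] in the PRU0 code).
    """
    if len(duties) != 4 or any(not 0 <= d <= TOTAL_SLOTS for d in duties):
        raise ValueError("need four duty values in 0..960")
    banks = [bytearray(SLOTS_PER_BANK // 2) for _ in range(BANKS)]
    for slot in range(TOTAL_SLOTS):
        nibble = 0
        for ch, duty in enumerate(duties):
            if slot < duty:  # output ch is high during its first `duty` slots
                nibble |= 1 << ch
        bank, n = divmod(slot, SLOTS_PER_BANK)
        # even slots go to the low nibble of a byte, odd slots to the high nibble
        banks[bank][n // 2] |= nibble << (4 * (n % 2))
    return [bytes(b) for b in banks]


frame = pack_frame([0, 1, 240, 960])
print([len(block) for block in frame], hex(frame[0][0]))
```

Each of the four returned blocks would then be copied to the memory location that the corresponding LBBO reads from.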

This is the simplest implementation. You can make it more sophisticated by inserting some logic into the PRU1 code that modulates the output data over several cycles to get a higher output resolution. But even in this implementation, the only irregularity comes from the XINs and the final jump operation in PRU0, which should be taken into account in the algorithm that generates the data which fills all of the registers (the value of register r29[28:31] counts twice if it's in the scratchpad, and threefold if it's in PRU1).

This is a complication, but it actually leads to a neat trick: if you want to invest some of the gained speed into higher resolution, you can increase the weight of any 4-bit datapoint by inserting a loop after it that just holds that particular output value for that number of cycles (you need to reserve a register for the counter in that case, but you will still gain in resolution). If one loop of the PRU0 code takes much more than 1024 cycles, you should also insert a loop of similar length into the PRU1 code so that PRU1 does not stall for more than 1024 cycles waiting for PRU0's XIN command.
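To illustrate how the data-generating algorithm could take the irregular slots into account, here is a rough Python sketch. It assumes the weights described above (the last slot of each scratchpad bank is held for 2 cycles, the last slot of the PRU1 bank for 3 cycles, all others for 1 cycle) and leading-edge PWM; `slot_weights` and `slots_for_duty` are hypothetical names.

```python
from itertools import accumulate

SLOTS_PER_BANK = 240
TOTAL_SLOTS = 4 * SLOTS_PER_BANK  # 960


def slot_weights():
    """Cycles each of the 960 slots stays on the pins in the unrolled PRU0 loop."""
    weights = [1] * TOTAL_SLOTS
    for bank in range(3):
        # last slot of banks 10, 11, 12 is held during the following XIN
        weights[SLOTS_PER_BANK * (bank + 1) - 1] = 2
    # last slot of bank 14 is held during JMP and the next XIN 10
    weights[TOTAL_SLOTS - 1] = 3
    return weights


def slots_for_duty(fraction):
    """Number of leading high slots whose high time is closest to `fraction` of the period."""
    # cumulative[k] = cycles the output is high if the first k slots are high
    cumulative = [0] + list(accumulate(slot_weights()))
    target = fraction * cumulative[-1]
    return min(range(len(cumulative)), key=lambda k: abs(cumulative[k] - target))


print(sum(slot_weights()), slots_for_duty(0.25))
```

The same table of weights can be extended if you insert hold loops after some datapoints: just put the loop length into the weight of that slot.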