Proposing the "Offloading to PRU" project idea for GSoC 2016

ZeekHuge · March 15, 2016, 8:05pm

Hello everyone, I am Zubeen Tolani (IRC nickname: ZeekHuge), an undergraduate student pursuing a degree in electronics and telecommunication. I wish to participate in GSoC 2016. I have worked on some simple projects such as LFRs (line-following robots), gesture-based robots, and robots for difficult terrain. I have worked with many micro-controllers, such as the ATmega8, ATmega16, PIC micro-controllers, and AT89S52, and with various sensors such as accelerometers and magnetometers. I am quite proficient in C and C++ and also have some experience with iOS and Android development.

The idea I will be proposing for GSoC:

“Offloading the SPI transaction to PRUs using the on-SoC SPI subsystem”

Inspiration for the project:

Any useful application of embedded systems involves the use of sensors. A large number of these sensors are essentially based on analog-to-digital conversion (ADC) or digital-to-analog conversion (DAC), for example gyroscopes, accelerometers, light sensors, and temperature sensors. Such sensors are called IIO (Industrial Input-Output) devices. Their sampling rates range from that of an on-SoC ADC up to a few Msps. In general, a device with a sampling rate above roughly 100k samples per second is considered a high-speed data device. A typical IIO device communicates over an SPI (Serial Peripheral Interface) bus or an I2C bus.

An SPI peripheral is essentially just a shift register, sometimes equipped with a FIFO (First In First Out) buffer and an interrupt mechanism, but the data transaction is still the responsibility of the operating system running on the MPU (Main Processing Unit).

The OS can use one of the following approaches to read data from the SPI bus:

Polling the device to see if the data is there:

Polling is like opening the door again and again to see whether someone has arrived. This method has the big disadvantage of being very CPU intensive: data transfer rates are much slower than CPU frequencies, so the CPU spends most of its time waiting and won't be able to perform any other task while polling the device.
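To make the cost of polling concrete, here is a minimal sketch in C. It does not touch real hardware: `struct sim_spi`, `sim_spi_ready()` and `sim_spi_read()` are hypothetical stand-ins for a peripheral that has one sample ready every `period` status checks, and `poll_samples()` simply counts how many status checks were wasted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulated SPI peripheral (NOT the real McSPI register
 * layout): a new sample becomes ready every `period` status checks. */
struct sim_spi {
    uint32_t period;   /* status checks between two samples */
    uint32_t counter;  /* status checks since the last sample was read */
    uint8_t  next;     /* value of the next sample */
};

/* Returns non-zero once a sample is available. */
static int sim_spi_ready(struct sim_spi *dev)
{
    dev->counter++;
    return dev->counter >= dev->period;
}

/* Reads the available sample and restarts the wait for the next one. */
static uint8_t sim_spi_read(struct sim_spi *dev)
{
    dev->counter = 0;
    return dev->next++;
}

/* Busy-wait polling: collects n samples into buf and returns the number
 * of status checks that found no data (i.e. wasted CPU work). */
uint32_t poll_samples(struct sim_spi *dev, uint8_t *buf, uint32_t n)
{
    uint32_t wasted = 0;

    assert(dev != 0 && buf != 0);
    for (uint32_t i = 0; i < n; i++) {
        while (!sim_spi_ready(dev))
            wasted++;              /* the CPU can do nothing else here */
        buf[i] = sim_spi_read(dev);
    }
    return wasted;
}
```

With `period` set to 100, each sample costs 99 fruitless checks, so the loop is almost entirely overhead. The real ratio depends on the CPU and SPI clocks and is not modelled here.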

Interrupt driven approach:

This approach isn’t adequate for a general-purpose OS like Linux. What Linux would do is clear the interrupt and schedule the task that reads the data for some later time. This would result in lost data samples and wouldn’t be real-time in nature. This approach isn’t appropriate even for an RTOS (Real-Time OS), as the interrupts would more or less saturate the MPU at high data transfer rates.

The BeagleBone Black’s AM335x SoC has two independent 32-bit RISC cores, called PRUs. The SoC also contains a McSPI subsystem. The McSPI subsystem, with a maximum clock frequency of 48 MHz and a 64-byte-deep FIFO buffer, can transfer data at a considerably good rate.

This GSoC project aims to use the two PRUs and the McSPI subsystem (instead of bit-banging through the PRUs) to perform the CPU-intensive data transactions over the SPI bus, leaving the MPU with plenty of time to perform heavy processing on this data, which is the kind of task a general-purpose OS is actually meant for.

What makes me think this can be achieved?

The AM335x SoC has various subsystems. The two relevant to this project are:

1. McSPI (Multichannel Serial Port Interface)

and

2. The PRU-ICSS (Programmable Real-Time Unit and Industrial Communication Subsystem)

1. The McSPI:

The McSPI has two SPI modules integrated into it, SPI0 and SPI1. The McSPI can thus be described as a general-purpose SPI module, capable of communicating with up to 4 slaves, or with a single master, external to the SoC.

The features of the subsystem relevant to this application are:

1. The maximum frequency of the SPI reference clock can be up to 48 MHz.
2. A 64-byte-deep FIFO buffer, with registers indicating the state of the buffer (such as buffer almost full and buffer almost empty).
3. Programmable SPI word length, giving flexibility to the subsystem.
4. End of Transfer Management Unit, allowing write-and-leave operations by the PRUs.

A simple calculation, for the ideal case only, shows that:

- The SPI can transfer about 48 * 10^6 bits per second.

- Considering each sample from the IIO device to be 8 bits, the SPI would support an IIO device with a sampling rate of 6 Msps.

Of course, that is a purely theoretical figure, and practical results would definitely be much lower. But achieving only 10% of this theoretical value would allow a sampling rate of 600 Ksps.

So, to stay on the conservative side, an achievable sample rate can be taken as about 300 Ksps.
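The arithmetic above can be checked with a short C sketch. The function names `spi_ideal_sps()` and `derate_sps()` are hypothetical helpers introduced only for this illustration. The numbers fed to them (48 MHz clock, 8-bit samples, 10% of the ideal rate) are the ones used in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal sample rate: one bit per SPI clock, no gaps between words. */
uint32_t spi_ideal_sps(uint32_t clk_hz, uint32_t bits_per_sample)
{
    assert(bits_per_sample > 0);
    return clk_hz / bits_per_sample;
}

/* Scale an ideal rate down to a given percentage of itself. */
uint32_t derate_sps(uint32_t ideal_sps, uint32_t percent)
{
    assert(percent <= 100);
    return (uint32_t)((uint64_t)ideal_sps * percent / 100u);
}
```

Calling `spi_ideal_sps(48000000u, 8u)` gives 6000000 (6 Msps), and `derate_sps(6000000u, 10u)` gives 600000 (600 Ksps). Chip-select handling, inter-word gaps and the sensor's own conversion time are ignored.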

2. The PRU-ICSS:

The PRU-ICSS subsystem is the one that contains the two PRUs, along with a rich memory arrangement. The RISC-based PRUs are highly optimized for hard timing constraints, and can execute most of their instructions in a single cycle.

Delving into the documents and other resources for the PRUs, we find that:

1. The ‘best case’ latency involved in reading from the McSPI-0 peripheral is 34 cycles.

2. The SBCO instruction, responsible for storing data from a register to a memory location, takes 2 cycles for 4 bytes.

3. The registers in the scratch-pad, being ‘broadside connected’, take only one cycle to write/swap all 30 registers.

Now, let's take the latency to read from McSPI-0 to be (considering a 4-byte data transfer):

latency = best-case latency to read from SPI (from point 1)
        + cycles used by the SBCO instruction
        + writing to a single register in the scratch-pad
        + some ballpark margin

i.e.,

latency (in cycles) = 34 + 2 + 1 + 163

latency = 200 cycles

The PRUs, being clocked at 200 MHz, have a cycle time of about 5 ns.

Hence 200 cycles = 1000 ns.

Hence a PRU can read data from the SPI at a rate of 4 samples (4 bytes of 8-bit samples) per 1000 ns.

That would be about 4 Msps.

Again, achieving only 10% of this theoretical value would allow a sampling rate of about 400 Ksps.

Thus, an achievable sampling rate would be about 400 Ksps.

So, even after such conservative calculations, the sampling rate is high enough to be useful.
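The PRU-side estimate can be reproduced in the same way. This is only a sketch of the arithmetic: `transfer_cycles()` and `pru_samples_per_sec()` are hypothetical helper names, and the inputs (34 + 2 + 1 cycles plus a 163-cycle margin, a 200 MHz PRU clock, 4 samples per transfer) are the figures assumed in this post, not measured values.

```c
#include <assert.h>
#include <stdint.h>

/* Total cycles budgeted for one 4-byte transfer from the McSPI. */
uint32_t transfer_cycles(uint32_t spi_read, uint32_t sbco,
                         uint32_t scratchpad, uint32_t margin)
{
    return spi_read + sbco + scratchpad + margin;
}

/* Samples per second for a PRU running at pru_hz that needs
 * cycles_per_transfer cycles to move samples_per_transfer samples. */
uint64_t pru_samples_per_sec(uint32_t pru_hz, uint32_t cycles_per_transfer,
                             uint32_t samples_per_transfer)
{
    assert(cycles_per_transfer > 0);
    return (uint64_t)pru_hz * samples_per_transfer / cycles_per_transfer;
}
```

With the numbers above, `transfer_cycles(34u, 2u, 1u, 163u)` is 200, and `pru_samples_per_sec(200000000u, 200u, 4u)` is 4000000, i.e. 4 Msps. 10% of that is the 400 Ksps quoted.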

Other techniques that can be used

pruss_remoteproc and libpru:

Last year’s (2015) GSoC project, PRU-Framework, by Shubhangi Gupta, allows the MPU to communicate easily with the PRUs. It was meant for the 3.8 kernel, but I have been able to get it working on the 4.1 kernel.

The PRU-Framework, with its “pruss_remoteproc” kernel module, provides the kernel-side back-end to perform operations such as booting, shutting down, and handling resources on either of the two PRU cores independently. pruss_remoteproc, along with virtio_rpmsg_bus, provides vring support (instead of being based only on rpmsg) to communicate seamlessly with the PRUs. The virtio_rpmsg_bus, being based on vrings, makes the PRUs appear like a peripheral on the PCI bus, i.e. the PRUs become a virtual PCI device. As a result, data can also be streamed from the PRUs to the host computer.

The same project provides a user-space abstraction in the form of a library called “libpru”. The library can be used from C, and allows us to perform operations such as loading, booting, and shutting down the PRUs. It can also handle interrupts from the PRUs.

This project requires a strong communication channel between the PRUs and the ARM host, which can be provided by the PRU-Framework.

How will it provide support for such a large number of devices?

(I need to think more on this; any suggestions are welcome.)

The idea is basically to implement the SPI driver on the PRU. The data would still be sent by the host computer, but the PRUs would perform the transactions.

The IIO driver, in the mainline Linux kernel, has been under development since 2009. Developed in the “staging/” directory, it already gives the kernel support for a large range of hardware.

Also, as IIO devices communicate over the SPI or I2C bus, they must be using the existing SPI and I2C drivers in Linux. That makes the communication path:

IIO driver <====> SPI driver <=======> IIO device

As of now, my thinking is to write a new driver that implements exactly the same interface as the one between the IIO driver and the SPI drivers. Rather than passing the data to the SPI subsystem and performing the transactions itself, it would simply pass the data to the PRUs, creating almost no CPU overhead, and the PRUs would then perform the transactions. So the communication would look like this:

IIO driver <=====> New driver (remoteproc based) <====> PRUs <======> SPI subsystem (present on the AM335x SoC) <===> IIO device
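The "same interface, different back-end" idea can be sketched in plain C. This is not the Linux SPI or IIO API: `struct spi_backend`, `mcspi_backend`, `pru_backend` and `iio_read_sample()` are hypothetical names, and both back-ends are simulated as a loopback. The point is only that the caller is unchanged when the back-end is swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the interface an IIO driver uses to reach
 * an SPI controller driver. */
struct spi_backend {
    const char *name;
    /* Full-duplex transfer; returns bytes transferred or a negative error. */
    int (*transfer)(const uint8_t *tx, uint8_t *rx, size_t len);
};

/* Simulated "CPU drives the McSPI directly" path (loopback). */
static int mcspi_transfer(const uint8_t *tx, uint8_t *rx, size_t len)
{
    for (size_t i = 0; i < len; i++)
        rx[i] = tx[i];
    return (int)len;
}

/* Simulated "hand the buffer to the PRUs" path (also a loopback here;
 * a real version would pass the buffer over the remoteproc channel). */
static int pru_transfer(const uint8_t *tx, uint8_t *rx, size_t len)
{
    for (size_t i = 0; i < len; i++)
        rx[i] = tx[i];
    return (int)len;
}

const struct spi_backend mcspi_backend = { "mcspi", mcspi_transfer };
const struct spi_backend pru_backend   = { "pru-offload", pru_transfer };

/* "IIO driver" side: identical code whichever back-end is plugged in. */
int iio_read_sample(const struct spi_backend *spi, uint8_t cmd,
                    uint8_t *sample)
{
    uint8_t tx[1] = { cmd };
    uint8_t rx[1] = { 0 };

    assert(spi != NULL && sample != NULL);
    if (spi->transfer(tx, rx, 1) != 1)
        return -1;
    *sample = rx[0];
    return 0;
}
```

Calling `iio_read_sample(&pru_backend, ...)` instead of `iio_read_sample(&mcspi_backend, ...)` is the only change the caller sees, which is the property the new driver would need to preserve.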

So, summarizing everything, the project would involve:

1. Writing firmware for one PRU to enable it to communicate with the on-SoC McSPI subsystem and manage transactions.
2. Writing firmware for the second PRU to manage data transfer from the PRU-ICSS to the ARM host computer.
3. Writing a loadable kernel module that would act as a layer between the mainline IIO drivers and the PRUs (let us call it the newDriver).

The IIO drivers would see the newDriver as the original SPI driver, and so would communicate all the data to it. The newDriver would then pass this data on to the PRUs. One of the two PRUs would manage this data transfer and tell the other PRU about the data. The other PRU would communicate with the SPI subsystem to send and receive data.
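One common way to hand samples from one PRU to the other (or to the host) is a single-producer, single-consumer ring buffer in shared memory. The sketch below is an assumption about how that hand-off could look, not part of the PRU-Framework: `struct ring`, `ring_put()`, `ring_get()` and `RING_SIZE` are hypothetical, and on real hardware the memory placement and ordering between cores would need care that is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u   /* hypothetical depth; must be a power of two */

/* Single-producer / single-consumer ring. `head` is written only by the
 * producer (the PRU talking to the McSPI) and `tail` only by the consumer
 * (the PRU forwarding data to the ARM host). */
struct ring {
    volatile uint32_t head;
    volatile uint32_t tail;
    uint8_t data[RING_SIZE];
};

/* Producer side: returns 1 on success, 0 if the ring is full. */
int ring_put(struct ring *r, uint8_t sample)
{
    assert(r != 0);
    if (r->head - r->tail == RING_SIZE)
        return 0;
    r->data[r->head & (RING_SIZE - 1u)] = sample;
    r->head++;                 /* publish only after the data is written */
    return 1;
}

/* Consumer side: returns 1 and stores a sample in *out, 0 if empty. */
int ring_get(struct ring *r, uint8_t *out)
{
    assert(r != 0 && out != 0);
    if (r->head == r->tail)
        return 0;
    *out = r->data[r->tail & (RING_SIZE - 1u)];
    r->tail++;
    return 1;
}
```

Because each index has exactly one writer, neither side needs a lock. A full ring simply makes `ring_put()` return 0, which is where samples would be dropped if the consumer falls behind.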

The above description might have faults and broken connections. I request all the mentors to correct me if I am wrong at any point.

Thank you

ZeekHuge (Zubeen Tolani)

HY0 · March 15, 2016, 8:58pm

Hi,

Can you explain why you are bouncing things off the PRUSS if you are using the
McSPI? The McSPI block is already reasonably self-contained and, coupled with
DMA, it has reasonably low overhead.

Akshay_Gahlot · March 26, 2017, 10:14pm

How can we implement a hardware interrupt on the BBB?