Kernel 5.10 performance issues on BBAI

bbai-test.c (3,6 KB)
revert_i940_workaround.patch (9,4 KB)
Hi guys,
I have developed a custom cape for the BBAI which mainly uses the PRU. The Linux part of my software processes data in a real-time thread and the result is then copied to a buffer in the shared RAM area of the PRU every few microseconds.

When I recently upgraded from kernel 4.14-ti-rt to kernel 5.10-ti-rt, I found that the RT thread could no longer generate the data fast enough (I had constant buffer underruns in the PRU).

Note: to make the following description easy to understand, I have attached the code for a small test program. Just compile it on your BBAI using gcc:

gcc -O3 -mcpu=cortex-a15 -Wall bbai-test.c -o bbai-test

An oscilloscope is required for tests 2 and 3. The CPU frequency was always set to 1.5 GHz. All tests can also be run with the standard kernel (i.e. without the real-time extension), but the measured times will then be larger!

When I started looking for the cause of the performance problems, I noticed the following:

Test 1: if clock_gettime() is called several times in succession under kernel 5.10, it returns identical values, i.e. the calculated time difference between two successive calls is 0 until it finally makes a big jump of e.g. 30.517 μs.
So the timer obviously only runs at 32.768 kHz (one tick is 1/32768 s ≈ 30.517 μs) and no longer with ns resolution as in the earlier kernel versions! Under 4.14, the time difference is never zero, but lies in the range of a few hundred ns.

sudo taskset -c 1 chrt --f 51 ./bbai-test -t 1
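For anyone who does not want to open the attachment, here is a minimal sketch of the idea behind test 1. It is not the code from bbai-test.c; the names delta_stats and sample_deltas are made up for illustration, and CLOCK_MONOTONIC is an assumption on my part (the attached program may use a different clock).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Summary of the deltas between successive clock_gettime() calls. */
struct delta_stats {
    unsigned zero_deltas;   /* consecutive pairs with identical timestamps */
    int64_t  max_delta_ns;  /* largest step seen between two calls */
};

/* Convert a timespec to nanoseconds. */
static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Call clock_gettime() n times back to back and summarise the deltas.
 * CLOCK_MONOTONIC is an assumption, not taken from the attached test. */
static struct delta_stats sample_deltas(unsigned n)
{
    struct delta_stats st = {0, 0};
    struct timespec prev, now;

    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &prev);
    for (unsigned i = 0; i < n; i++) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        int64_t d = ts_to_ns(&now) - ts_to_ns(&prev);
        if (d == 0)
            st.zero_deltas++;
        if (d > st.max_delta_ns)
            st.max_delta_ns = d;
        prev = now;
    }
    return st;
}
```

Going by the numbers above, a kernel with the coarse timer should report mostly zero deltas plus occasional steps of roughly 30517 ns, while a kernel with the ns-resolution counter should report no zero deltas at all.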

Test 2: the execution time of clock_gettime() has increased by a factor of 10 under kernel 5.10! To measure this, pin P8_14 is toggled before and after the call, and the pulse width is then measured with an oscilloscope. Under 4.19 the call takes approx. 0.25 μs, under 5.10 approx. 2.5 μs.

sudo taskset -c 1 chrt --f 51 ./bbai-test -t 2
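If no oscilloscope is at hand, the cost per call can also be estimated in software by timing a long run of calls and dividing by the count. This is only a rough sketch and not what bbai-test.c does (that one toggles P8_14); avg_gettime_cost_ns is a hypothetical helper and CLOCK_MONOTONIC is again an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Estimate the average cost of one clock_gettime() call in nanoseconds
 * by timing n back-to-back calls. The two bracketing calls add a small
 * constant error that becomes negligible for large n. */
static double avg_gettime_cost_ns(unsigned n)
{
    struct timespec start, end, scratch;

    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned i = 0; i < n; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / n;
}
```

Because the clock itself may only advance every 30.517 μs, n has to be large (say a million calls) so that the total run time is much longer than one tick; otherwise the result is dominated by quantisation.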

Test 3: clock_nanosleep() is also much slower! Measured again with the oscilloscope at pin P8_14, “sleep for 1 ns” takes 12 μs under kernel 4.19 and 30 μs under kernel 5.10, so it is 2.5 times slower.

sudo taskset -c 1 chrt --f 51 ./bbai-test -t 3
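The same averaging trick can be used for the "sleep for 1 ns" case. Again this is a sketch under assumptions, not the attached code: avg_sleep_latency_ns is a made-up name, CLOCK_MONOTONIC is assumed, and return values of clock_nanosleep() (e.g. EINTR) are ignored for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Average wall-clock time of a relative 1 ns clock_nanosleep() request,
 * in nanoseconds, taken over n iterations. */
static double avg_sleep_latency_ns(unsigned n)
{
    const struct timespec req = { .tv_sec = 0, .tv_nsec = 1 };
    struct timespec start, end;

    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned i = 0; i < n; i++)
        clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL); /* flags 0 = relative sleep */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / n;
}
```

Run it under the same taskset/chrt settings as the tests above, otherwise scheduler effects of the default policy will dominate the number. The oscilloscope measurement remains the more trustworthy one, since it does not depend on the clock that is under suspicion.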

Test 4: clock_nanosleep() usually involves task scheduling and context switching.
I therefore wanted to investigate how each kernel version handles context switching.
For this purpose I’ve used the following tools:

The result:

Kernel:                  4.19          5.10        Unit          Slowdown
thread-switch-condvar    9589.2        21850.0     [ns/switch]   x 2.3
thread-switch-pipe       14904.8       27107.1     [ns/switch]   x 1.8
thread-pipe-msgpersec    32881.04      18322.88    [iters/sec]   x 1.8

So, depending on what you do, the execution of various functions on the AM5729 under kernel 5.10 can slow down by a factor of 1.8 to 10!
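To illustrate what a pipe-based switch measurement looks like, here is a small ping-pong sketch. It is not one of the tools used for the table: it bounces a byte between two processes (via fork) rather than two threads, and pipe_pingpong_ns is a hypothetical name, so the absolute numbers will not match the rows above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Bounce one byte between parent and child over two pipes.
 * Returns the average time per one-way hand-off in ns, or -1.0 on error. */
static double pipe_pingpong_ns(unsigned rounds)
{
    int to_child[2], to_parent[2];
    char byte = 'x';
    struct timespec start, end;
    unsigned done = 0;

    assert(rounds > 0);
    if (pipe(to_child) != 0)
        return -1.0;
    if (pipe(to_parent) != 0) {
        close(to_child[0]); close(to_child[1]);
        return -1.0;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);
        return -1.0;
    }
    if (pid == 0) {
        /* Child: echo every byte back until the parent closes its end. */
        close(to_child[1]); close(to_parent[0]);
        while (read(to_child[0], &byte, 1) == 1)
            if (write(to_parent[1], &byte, 1) != 1)
                break;
        _exit(0);
    }

    /* Parent: send a byte, wait for the echo, repeat. */
    close(to_child[0]); close(to_parent[1]);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (; done < rounds; done++) {
        if (write(to_child[1], &byte, 1) != 1) break;
        if (read(to_parent[0], &byte, 1) != 1) break;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(to_child[1]);           /* EOF terminates the child's loop */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    if (done != rounds)
        return -1.0;

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (2.0 * rounds);   /* two hand-offs per round trip */
}
```

Pin it to a single core with taskset so that every hand-off really forces a switch on that core; without pinning the two processes may run on different cores and the result measures cross-core wake-ups instead.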

During further research, I came across the following patch, which was introduced in kernel 5.10.38:

But obviously the General Purpose Timers only run at 32.768 kHz and the query takes longer because they are connected via the L3/L4 interconnect bus!

That’s why I switched from the GP timers back to the COUNTER_REALTIME in kernel 5.10 as a test (attached patch), which solved the performance problems for me!
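To check which clocksource a running kernel actually selected, the standard sysfs entry can be read. A minimal sketch follows; read_current_clocksource is a hypothetical helper, and the names the kernel reports differ between kernel versions and boards, so I am not listing expected values for the BBAI here.

```c
#include <stdio.h>
#include <string.h>

/* Read the name of the active clocksource into buf (NUL-terminated,
 * trailing newline stripped). Returns 0 on success, -1 on failure. */
static int read_current_clocksource(char *buf, size_t len)
{
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
                    "current_clocksource", "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

On ARM systems the architected counter usually shows up as arch_sys_counter, so seeing a different name is a hint that a GP timer is in use. The neighbouring file available_clocksource lists the alternatives the kernel knows about.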

To give it a try:

git clone -b ti-linux-rt-5.10.y 
cd ./ti-linux-kernel-dev/ 
git checkout ti-linux-rt-5.10.y -b tmp 
patch -p1 < revert_i940_workaround.patch # apply the attached patch

Then copy the ./deploy/linux-image*.deb onto the BBAI and install it there…

But the question now is whether it would not generally be better to use the COUNTER_REALTIME by default and only use the workaround with the GP timers if the application does not allow a restart within 388 days?
And perhaps it would even be possible to switch to the GP timers via devicetree overlay?

Maybe my first post was a bit unclear and too long…
Because I just have a simple request: can all kernels >= 5.10 for the BeagleBone AI board please be switched back to use the COUNTER_REALTIME?