Encoding video while fully utilizing the CPU

Hi,

I'm trying to capture live video while fully utilizing the CPU.
I'm using gst-ti and it works great if I just run the pipeline without
the CPU being at 100%.
I want to run an algorithm on the frames that are captured as fast as
possible.
To do this I added an identity element which makes a copy of the frame
and hands it to a thread that analyzes the image.
At most one "algorithm thread" is running at any time (it's
basically just one thread that is fed a new copy when it's done
processing the previous one).
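
For anyone who wants to picture the handoff described above, here is a minimal sketch in C, not the actual code from this setup. All names (frame_slot, frame_slot_offer, FRAME_BYTES and so on) are made up for illustration, and the tiny frame size is a placeholder. The pipeline side only copies a frame when the analysis thread has asked for one, so it never blocks for longer than one memcpy.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define FRAME_BYTES 16 /* hypothetical; a real frame is much larger */

struct frame_slot {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    bool wanted;                      /* analysis thread asked for a frame */
    bool filled;                      /* data[] holds an unconsumed copy */
    unsigned char data[FRAME_BYTES];
};

#define FRAME_SLOT_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false, { 0 } }

/* Pipeline side (e.g. called per buffer from the identity element):
 * copy the frame only if one was requested. Returns true if copied. */
static bool frame_slot_offer(struct frame_slot *s, const unsigned char *src)
{
    bool copied = false;
    pthread_mutex_lock(&s->lock);
    if (s->wanted) {
        memcpy(s->data, src, FRAME_BYTES);
        s->wanted = false;
        s->filled = true;
        copied = true;
        pthread_cond_signal(&s->ready);
    }
    pthread_mutex_unlock(&s->lock);
    return copied;
}

/* Analysis side: signal that the previous frame is done. */
static void frame_slot_request(struct frame_slot *s)
{
    pthread_mutex_lock(&s->lock);
    s->wanted = true;
    pthread_mutex_unlock(&s->lock);
}

/* Analysis side: block until the requested copy arrives, then take it. */
static void frame_slot_wait(struct frame_slot *s, unsigned char *dst)
{
    pthread_mutex_lock(&s->lock);
    while (!s->filled)
        pthread_cond_wait(&s->ready, &s->lock);
    memcpy(dst, s->data, FRAME_BYTES);
    s->filled = false;
    pthread_mutex_unlock(&s->lock);
}
```

The analysis thread would loop over frame_slot_request(), frame_slot_wait() and the analysis itself. Frames that arrive while it is busy are simply not copied.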

When first starting out I soon hit the problem that because the
algorithm thread is utilizing 99% CPU, the encoding was dropping
frames a lot.
I then changed all threads to SCHED_FIFO priority except the algorithm
thread which uses SCHED_OTHER.
This created far better results!
The problem that I'm having is that sometimes (this can happen after a
couple of minutes) the encoder takes about 400 ms to encode a frame,
whereas it normally takes 40 ms.
I presume this is because the encoder is waiting for the DSP side to
finish; in the meantime the algorithm thread gets CPU time, so the
encode thread arrives too late and takes more time than it should.

av500 suggested not using SCHED_FIFO as it can be dangerous.

Any suggestions on what I should do to make the encoding have priority
no matter what, but still have the algorithm thread do its job?

Thanks!

What you have done is IMO generally correct for soft real-time processing.
If your setup has swap I suggest using mlock() or mlockall() in your
app so its threads never block waiting for a page to be swapped back in.
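
A minimal sketch of the mlockall() suggestion, assuming it is called once early in main() before the pipeline starts. The wrapper name lock_process_memory is made up. The call can fail without sufficient privileges or with a low RLIMIT_MEMLOCK, so the sketch just reports the errno value and lets the caller decide.

```c
#include <errno.h>
#include <sys/mman.h>

/* Lock all pages that are mapped now (MCL_CURRENT) and all pages mapped
 * later (MCL_FUTURE) into RAM. Returns 0 on success or an errno value
 * such as ENOMEM or EPERM when the memlock limit does not allow it. */
static int lock_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    return 0;
}
```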
If you have SCHED_FIFO you may want to look at what kernel threads you
are potentially pre-empting and adjust the priority level. You may be
creating a situation where the CPU you steal for user-space is needed
by kernel threads doing deferred interrupt work (softirqs, threaded IRQ
handlers) to queue frames; but at 400 ms this is
unlikely to be the case. Another possibility I can think of is that
filesystem commits are causing the delay. A write to flash can
sometimes require a flash block erase before the write can occur, and
this erase operation is very slow. You could eliminate this by writing
via the network.

Klaas wrote:

> Hi,
>
> I'm trying to capture live video while fully utilizing the CPU.
> I'm using gst-ti and it works great if I just run the pipeline without
> the CPU being at 100%.
> I want to run an algorithm on the frames that are captured as fast as
> possible.
> To do this I added an identity element which makes a copy of the frame
> and creates a thread that analyzes the image.
> There is at most only 1 thread of the "algorithm thread" running (it's
> basically just one thread that is being fed a new copy when it's done
> processing the previous one).
>
> When first starting out I soon hit the problem that because the
> algorithm thread is utilizing 99% CPU, the encoding was dropping
> frames a lot.

If your "analyze" thread analyzes a frame in a time < 1/fps,
then it will not use 100% cpu as there are only fps frames per second
to analyze, no? If it takes >= 1/fps, your setup cannot work anyway,
regardless of thread scheduling...

> I then changed all threads to SCHED_FIFO priority except the algorithm
> thread which uses SCHED_OTHER.
> This created far better results!
> The problem that I'm having is that sometimes (this can happen after a
> couple of minutes) the encoder takes about 400 ms to encode a frame,
> whereas it normally takes 40 ms.
> I presume this is because the encoder is waiting for the DSP side to
> finish; in the meantime the algorithm thread gets CPU time, so the
> encode thread arrives too late and takes more time than it should.
>
> av500 suggested to not use SCHED_FIFO as it can be dangerous.

He suggested not running SCHED_FIFO on threads that are not aware that
they are run in realtime mode, because you said you made all of gst
run as SCHED_FIFO... :)

It seems I wasn't very clear in my explanation.
The analyze thread doesn't run in the pipeline itself and takes about
2 seconds to analyze a frame.
It runs as fast as possible, so when it finishes it signals for a new
copy and then analyzes that copy.

Anyway after some more testing and reading up on scheduling I managed
to get the desired result.
I did switch from SCHED_FIFO to SCHED_OTHER and played with nice
levels of the threads.
This didn't get the desired result at first but after tuning the CFS a
bit it worked out great.

For people stumbling upon this, the parameters I tuned were:
- sched_min_granularity_ns: 500000 (defaulted to 4000000)
- sched_latency_ns: 500000 (defaulted to 20000000)
- sched_wakeup_granularity_ns: 2000000 (defaulted to 5000000)
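
If you would rather set these from your application than from a shell, a small helper like the one below could do it. This is only a sketch: write_tunable is a made-up name, and I'm assuming the tunables appear as files such as /proc/sys/kernel/sched_latency_ns, which depends on the kernel version and on scheduler debugging being enabled. Check where your kernel exposes them. Writing needs root.

```c
#include <errno.h>
#include <stdio.h>

/* Write a decimal value followed by a newline to a tunable file.
 * Returns 0 on success or an errno value on failure. */
static int write_tunable(const char *path, unsigned long long value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return errno;

    int rc = 0;
    if (fprintf(f, "%llu\n", value) < 0)
        rc = EIO;
    /* The write often only reaches the kernel on flush, so check fclose. */
    if (fclose(f) != 0 && rc == 0)
        rc = errno;
    return rc;
}
```

Usage would look like write_tunable("/proc/sys/kernel/sched_latency_ns", 500000), with the path being the assumption mentioned above.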

OK, further testing shows that the CFS tweaking might not be needed at
all.
My last kernel change (that included PREEMPT_RCU) seemed to cause the
largest problem.
I'll update this post once I find the final solution.

OK, my final solution for now is the following:

- schedule everything with SCHED_RR
- the long-running analyze thread gets scheduled with SCHED_BATCH and
gets a very high nice level (i.e. the lowest priority)

This works for now.
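
For people stumbling upon this, the scheme above can be sketched roughly as follows on Linux. The function names enter_round_robin and enter_batch are made up, and the priority and nice values are for the caller to choose; this is not the exact code from my application. Each thread calls the matching function on itself when it starts. Switching to SCHED_RR normally needs root, CAP_SYS_NICE, or an RLIMIT_RTPRIO allowance, so both functions return the errno value on failure.

```c
#define _GNU_SOURCE /* glibc only exposes SCHED_BATCH with this */
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3 /* Linux value, in case the libc header hides it */
#endif

/* Pipeline threads: move the calling thread to SCHED_RR.
 * On Linux, pid 0 means "the calling thread". */
static int enter_round_robin(int rt_priority)
{
    struct sched_param sp = { .sched_priority = rt_priority };
    return sched_setscheduler(0, SCHED_RR, &sp) == 0 ? 0 : errno;
}

/* Analysis thread: move the calling thread to SCHED_BATCH and raise its
 * nice value (19 is the lowest priority). On Linux the nice value is a
 * per-thread attribute, so this does not affect the pipeline threads. */
static int enter_batch(int nice_level)
{
    struct sched_param sp = { .sched_priority = 0 }; /* must be 0 here */
    if (sched_setscheduler(0, SCHED_BATCH, &sp) != 0)
        return errno;
    if (setpriority(PRIO_PROCESS, 0, nice_level) != 0)
        return errno;
    return 0;
}
```

A pipeline thread would call something like enter_round_robin(sched_get_priority_min(SCHED_RR)), and the analysis thread enter_batch(19).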