C66x resource scheduling

I wonder how sophisticated resource scheduling on the C66x DSP cores can be. For example, I estimated (from TI documentation) that my project's software would need the entire processing capacity of at least 2 cores to work in hard real-time. That still leaves a lot of untapped processing power that third-party plugins could use. Can I partition the workload on the DSPs like on a general-purpose CPU?

If you use TI’s RTOS, then yes.

Regards,
John

And if not? Can I partition the DSP cores using Linux?

You can operate them independently. You use Linux to load the DSP firmware for each core and then take them out of reset. The same goes for the Cortex-M4 cores and the PRUs.

Regards,
John

By firmware you mean C program code?

On Mon, 26 Jun 2017 04:16:23 -0700 (PDT), MDX <speedy1024@gmail.com>
declaimed the following:

> By firmware you mean C program code?

  Only loosely... Since one doesn't have to use C to produce it...

  The loadable binary image that is built for that processor. Depending
upon the task, it is just possible that the same source code might be
usable on each of the processors, as long as the build tools were commanded
to use the correct tool-chain (source -> C66x image vs source -> M4 image
vs source -> PRU image).

  These are absolute images; there is no OS with them (unless, as
mentioned, you build something like TI-RTOS into the image [or port FreeRTOS
to the processor]). Linux runs on the main application processor and is
used to load the images onto the ancillary processors, after which they run
independently.

I don't know what I can assume about the context of your question, so
please forgive me if I guess wrongly and explain things that are obvious to
you or, conversely, talk over your head; please clarify in your response.

You asked about coordination and partitioning of DSP coprocessors, which to
me implies a unified system that manages both the main CPU and the
coprocessors on a similar plane. This is not how Linux sees the
coprocessors: while the CPU cores are indeed managed as an ensemble by the
kernel, the coprocessors are seen as peripheral resources. The Linux kernel
may help with loading their binary firmware and with starting and stopping
them, but you need to write all the synchronization primitives yourself.
The DSPs (and PRUs, etc.) will be running their own code but will not have
the usual set of Linux system calls (unless you somehow implement them
yourself in your firmware). You can write the DSP/PRU code in their
respective assembly language, or in the C variant understood by the DSP/PRU
C compiler, which is different from the C compiler used for the main CPU.

The newer kernels provide more functionality via the RemoteProc subsystem,
but it's just a way of managing high-level stuff like firmware loading,
starting/stopping, and communication with the coprocessors. You still can't
manage the computational tasks homogeneously, e.g. moving them back and
forth between the CPU and coprocessors. Instead, you need to split up
processing tasks between the CPU and coprocessors, write the code, deploy
and start the binaries, and communicate and synchronize their operation.
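
To make the "load firmware, then start/stop" part concrete, here is a minimal sketch in C of driving one remote processor from Linux user space. It assumes a kernel that exposes the mainline remoteproc sysfs attributes `firmware` and `state` under a directory such as /sys/class/remoteproc/remoteprocN; vendor kernels may instead boot the DSPs automatically and not allow this. The function names, the instance index, and the firmware file name are hypothetical, not taken from this thread.

```c
#include <stdio.h>

/* Write a short string to one attribute file inside `dir`.
 * Returns 0 on success, -1 on any failure. */
static int write_attr(const char *dir, const char *attr, const char *value)
{
    char path[512];
    int n = snprintf(path, sizeof path, "%s/%s", dir, attr);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(value, f) >= 0;
    /* sysfs reports errors (e.g. "already running") when the buffered
     * data is actually written, so the fclose() result matters. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Select a firmware image for one remote processor and start it.
 * rproc_dir: e.g. "/sys/class/remoteproc/remoteproc2" (index is an
 *            assumption; it depends on the board and kernel).
 * fw_name:   file name the kernel looks up in its firmware search
 *            path, typically /lib/firmware (hypothetical name). */
int rproc_load_and_start(const char *rproc_dir, const char *fw_name)
{
    if (write_attr(rproc_dir, "firmware", fw_name) != 0)
        return -1;
    return write_attr(rproc_dir, "state", "start");
}

/* Stop the remote processor again. */
int rproc_stop(const char *rproc_dir)
{
    return write_attr(rproc_dir, "state", "stop");
}
```

A call would look like `rproc_load_and_start("/sys/class/remoteproc/remoteproc2", "my-dsp1-image.xe66")`. Nothing here schedules work on the DSP; it only selects the image and starts or stops the core, which is all the kernel does for you.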

Well, you guessed (almost) correctly, and your answer pretty much covers my list of requirements. Now I wonder if that "high-level communication" is enough to control resource sharing between the C66x software processing samples in programmed order, so that it stays real-time.

First you have to define what you mean by “real-time”. To most it means completing a task in a well-defined amount of time. As long as you don’t turn off interrupts for extended periods, there is no reason why the responsiveness shouldn’t be deterministic.

Coordinating between the DSPs is something you have to set up in your software. You can do this via shared memory, interrupts, etc.
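
As one possible shape for the shared-memory route, here is a minimal sketch of a single-producer/single-consumer ring of sample blocks, written in portable C11. It is only an illustration of the idea: the type and function names and the sizes are made up, it assumes exactly one writer and one reader, and on real hardware you would additionally have to place the structure in memory both cores can see and deal with cache coherency and whatever atomics the DSP tool-chain actually supports.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS    8u    /* must be a power of two (illustrative) */
#define BLOCK_SAMPLES 64u   /* samples per block (illustrative) */

typedef struct {
    uint32_t seq;                     /* position of the block in the stream */
    int16_t  samples[BLOCK_SAMPLES];
} block_t;

typedef struct {
    _Atomic uint32_t head;            /* written only by the producer */
    _Atomic uint32_t tail;            /* written only by the consumer */
    block_t slots[RING_SLOTS];
} ring_t;

void ring_init(ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: copy one block in. Returns false if the ring is full. */
bool ring_push(ring_t *r, const block_t *b)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1u)] = *b;
    /* Release: the block contents become visible before the new head. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side: copy one block out. Returns false if the ring is empty. */
bool ring_pop(ring_t *r, block_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SLOTS - 1u)];
    /* Release: the slot is free for reuse only after the copy is done. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Neither side ever blocks or takes a lock, so a full or empty ring shows up as a `false` return that the caller can handle within its own timing budget. An interrupt or mailbox could be added on top to wake the consumer instead of polling.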

BTW, the TI C compiler used for the DSPs is extremely good and in many cases is more efficient than programming in assembly. Only the most highly skilled DSP programmers will be able to achieve better performance in assembly than in C code.

Regards,
John

Well, the samples are part of a stream that must "return" to the output in order and with minimal latency, and I never planned to use asm.
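
If blocks of the stream end up on different cores, they can finish out of order, so something has to put them back in sequence before output. A minimal sketch of such a reorder stage in C follows; it assumes each block carries a sequence number, uses a single `int32_t` as a stand-in for a processed block, and is single-threaded as written. All names and the window size are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 4u   /* max blocks in flight; power of two (illustrative) */

typedef struct {
    uint32_t next_seq;         /* next sequence number allowed to leave */
    bool     ready[WINDOW];    /* has this slot's block been submitted? */
    int32_t  result[WINDOW];   /* stand-in for the processed block */
} reorder_t;

void reorder_init(reorder_t *ro)
{
    ro->next_seq = 0;
    for (uint32_t i = 0; i < WINDOW; i++) {
        ro->ready[i] = false;
        ro->result[i] = 0;
    }
}

/* Called when some core has finished block `seq`.
 * Returns false if `seq` is stale or too far ahead of the output. */
bool reorder_submit(reorder_t *ro, uint32_t seq, int32_t result)
{
    /* Unsigned subtraction also rejects already-released (stale) blocks. */
    if (seq - ro->next_seq >= WINDOW)
        return false;
    uint32_t i = seq % WINDOW;
    ro->result[i] = result;
    ro->ready[i] = true;
    return true;
}

/* Hand out the next block in stream order, if it has arrived yet. */
bool reorder_release(reorder_t *ro, int32_t *out)
{
    uint32_t i = ro->next_seq % WINDOW;
    if (!ro->ready[i])
        return false;
    *out = ro->result[i];
    ro->ready[i] = false;
    ro->next_seq++;
    return true;
}
```

The window size bounds the extra latency this stage can add: the output never waits on more than `WINDOW` blocks, and a block that is late stalls everything behind it, which is exactly the in-order requirement.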

So what is the maximum latency you can tolerate?

Remember, each DSP has 8 functional units (two multiply units and 6 ALUs) which means you can execute up to 8 instructions per cycle.

Here are some core benchmarks:

http://www.ti.com/lsds/ti/processors/technology/benchmarks/core-benchmarks.page

Regards,
John

The output must be a constant stream for any input, and a latency of 10 ms is considered critical.

Well, that depends on how complex your algorithm is, but in most cases, a 10 ms latency should not be a problem.
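
To put rough numbers on that, here is a small sketch in C of the budget arithmetic. The only figure taken from this thread is the 8 functional units per core; the sample rate, block size, and core clock used in the example afterwards are illustrative assumptions, so substitute your own. The function names are hypothetical.

```c
#include <stdint.h>

#define C66X_UNITS 8u   /* functional units per core, as stated above */

/* Time needed just to collect one block of samples, in microseconds
 * (rounded down). */
uint64_t block_period_us(uint32_t block_samples, uint32_t sample_rate_hz)
{
    return (uint64_t)block_samples * 1000000u / sample_rate_hz;
}

/* CPU cycles one core has available while one block is being collected. */
uint64_t cycles_per_block(uint64_t core_hz, uint32_t block_samples,
                          uint32_t sample_rate_hz)
{
    return core_hz * block_samples / sample_rate_hz;
}

/* Theoretical upper bound on instructions issued in `cycles` cycles if
 * every functional unit were busy every cycle (never reached in practice). */
uint64_t peak_issue_slots(uint64_t cycles)
{
    return cycles * C66X_UNITS;
}

/* Largest block size whose buffering delay alone fits in `budget_us`,
 * when `blocks_in_flight` blocks are buffered end to end
 * (e.g. 2 = one being filled, one being drained). */
uint32_t max_block_samples(uint32_t budget_us, uint32_t sample_rate_hz,
                           uint32_t blocks_in_flight)
{
    return (uint32_t)((uint64_t)budget_us * sample_rate_hz
                      / 1000000u / blocks_in_flight);
}
```

For example, assuming 48 kHz audio in 64-sample blocks on a 750 MHz core, `block_period_us(64, 48000)` gives 1333 us and `cycles_per_block(750000000, 64, 48000)` gives 1,000,000 cycles per block per core. With two blocks buffered, `max_block_samples(10000, 48000, 2)` says blocks up to 240 samples still fit a 10 ms budget before any processing time is counted.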

Regards,
John