image operations on DSP

Hi all,

I'm investigating use of the OMAP platform for some realtime computer
vision applications. My guess is that the CPU, even with the NEON
instructions, will probably be slower than the DSP for operations such
as absolute difference between images (approx. size 640x480 pixels
uint8), maximum detection, spatial moments calculation, and non-max
suppression. The questions are:

* would you recommend learning the DSP in order to implement such
algorithms?
* or maybe NEON is almost as fast?
* is there any hope that using GLSL to do this on the GPU would be as fast?
* is there any project that has already made some progress on this that
I could look at?

Thanks,
Andrew

Andrew Straw wrote:

Hi all,

I'm investigating use of the OMAP platform for some realtime computer
vision applications. My guess is that the CPU, even with the NEON
instructions, will probably be slower than the DSP for operations such
as absolute difference between images (approx. size 640x480 pixels
uint8), maximum detection, spatial moments calculation, and non-max
suppression.

Hi Andrew.

It's an effort/performance trade-off.

In a nutshell, the DSP can do image processing faster than the NEON unit,
but it's harder to get good performance out of it. NEON is much more
straightforward.

Long explanation:

For good performance out of the NEON unit you have to write assembler.
Unfortunately there is no way around this. The current GCC compiler does
a very bad job at optimizing NEON code, even if you use intrinsics. On
the other hand, once you've written the processing functions you're done.
No extra steps required.
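
For illustration only, here is a minimal sketch of the absolute-difference
operation from the question, written with NEON intrinsics from arm_neon.h
(vld1q_u8, vabdq_u8, vst1q_u8). The function name absdiff_u8 is made up for
this example, and the scalar loop is a fallback so the code also builds
without NEON. It shows what the operation looks like, not the hand-written
assembler recommended above, and it makes no claim about the speed GCC
will produce.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Absolute difference of two uint8 images: dst[i] = |a[i] - b[i]|.
 * n is the pixel count (640 * 480 = 307200 for the size in this thread).
 * absdiff_u8 is a hypothetical name used only for this sketch. */
void absdiff_u8(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 16 pixels per iteration; vabdq_u8 maps to the VABD.U8 instruction. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vabdq_u8(va, vb));
    }
#endif

    /* Scalar tail; on non-NEON builds this loop handles every pixel. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}
```

Typical use would be absdiff_u8(frame, background, diff, 640 * 480) with
three caller-owned buffers of that size.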

The DSP can be significantly faster. You write your algorithms in C with
DSP intrinsics. No need to learn assembler. Also, compared to NEON, the
DSP intrinsics are easier to use. TI even has a nice C library that
simulates the DSP intrinsics, so you can debug DSP code on a PC.
However, there are two important caveats for the DSP code:

1. For good performance you have to understand the optimizer comments in
the assembler-output and use them to tune your code. It takes a lot of
time to understand the architecture, learn how to distribute the
workload on the eight execution units and get a feeling for where the
practical performance limit is. Writing near-optimal DSP code is not
straightforward.

2. Even with optimized C you will only see about 1/4 (rule of thumb) of
the performance of the NEON unit. The DSP on the OMAP3 is very slow when
it has to access external memory. It flies when it accesses internal RAM
though.
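
As a rough sketch of the "C with DSP intrinsics" style described above,
the same absolute difference can be written on packed 32-bit words with
the TI C6000 intrinsic _subabs4, which takes the absolute difference of
four unsigned 8-bit lanes at once. The wrapper name subabs4, the function
absdiff_packed, and the host fallback are assumptions made for this
example; the fallback only mimics the intrinsic so the code can be tried
on a PC and is not TI's simulation library.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef _TMS320C6X
/* Building with the TI C6000 compiler: use the real intrinsic. */
#include <c6x.h>
#define subabs4(x, y) _subabs4((x), (y))
#else
/* Hypothetical host stand-in: per-byte |x - y| on four packed uint8 lanes. */
static uint32_t subabs4(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        int p = (int)((x >> (8 * lane)) & 0xFF);
        int q = (int)((y >> (8 * lane)) & 0xFF);
        r |= (uint32_t)(p > q ? p - q : q - p) << (8 * lane);
    }
    return r;
}
#endif

/* Absolute difference of two images stored as packed 32-bit words
 * (4 pixels per word). The restrict qualifiers tell the compiler the
 * buffers do not overlap, which matters when it schedules the loop. */
void absdiff_packed(const uint32_t *restrict a, const uint32_t *restrict b,
                    uint32_t *restrict dst, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = subabs4(a[i], b[i]);
}
```

For a 640x480 uint8 image nwords would be 640 * 480 / 4 = 76800. Tuning
this loop against the optimizer comments, as caveat 1 describes, is a
separate step that this sketch does not attempt.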

To see real benefits you have to program the EDMA controller and do all
memory transfers between main-memory and internal memory in parallel to
your pixel processing. This is hard to get right, but well worth it
because once you get it right you will see the full 8 DMIPS/MHz of DSP
performance and never a cache-miss that ruins performance.
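
The buffering pattern behind this can be sketched in portable C. This is
a simulation under stated assumptions, not EDMA code: dma_copy is a
made-up stand-in that uses a blocking memcpy where a real driver would
start an asynchronous transfer, TILE is an arbitrary tile size, and
process_tile is a placeholder kernel (pixel inversion). The point is only
the ping-pong structure: fetch tile t+1 while tile t is being processed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 1024 /* hypothetical tile size in bytes */

/* Stand-in for an EDMA transfer. On the DSP this would start an
 * asynchronous copy and return at once; here it is a blocking memcpy. */
static void dma_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Placeholder per-tile kernel: invert each pixel. */
static void process_tile(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(255 - in[i]);
}

void process_image(const uint8_t *src, uint8_t *dst, size_t n)
{
    /* On the DSP these buffers would be placed in internal RAM. */
    static uint8_t in_buf[2][TILE];
    static uint8_t out_buf[TILE];
    size_t ntiles = (n + TILE - 1) / TILE;

    if (n == 0)
        return;

    /* Prefetch the first tile. */
    dma_copy(in_buf[0], src, n < TILE ? n : TILE);

    for (size_t t = 0; t < ntiles; t++) {
        size_t off = t * TILE;
        size_t len = (n - off < TILE) ? n - off : TILE;
        int cur = (int)(t & 1);

        /* Start fetching the next tile into the other input buffer. */
        if (t + 1 < ntiles) {
            size_t noff = off + TILE;
            size_t nlen = (n - noff < TILE) ? n - noff : TILE;
            dma_copy(in_buf[cur ^ 1], src + noff, nlen);
        }

        /* Real code would wait here for tile t's transfer to complete. */
        process_tile(in_buf[cur], out_buf, len);

        /* Write the result back to main memory. With truly asynchronous
         * DMA the output buffer would need double buffering as well. */
        dma_copy(dst + off, out_buf, len);
    }
}
```

With real asynchronous transfers the copy of tile t+1 overlaps the
processing of tile t, so the kernel only ever touches internal memory.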

As an example: I'm currently working on a high-quality Bayer demosaic
function for computer vision. I have two DMA streams running non-stop
and I have to juggle ten different work buffers in internal RAM. I
can process 153 megapixel/second or 583 Mb of pixel data per second that
way (measured with a DSP clock of 360 MHz). The ARM can't even do a
memset that fast.
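
To put those figures in perspective, the quoted throughput and clock rate
imply a cycle budget per pixel. This is plain arithmetic on the numbers
above and assumes nothing beyond them:

```latex
\frac{360 \times 10^{6}\ \text{cycles/s}}{153 \times 10^{6}\ \text{pixels/s}}
  \approx 2.35\ \text{cycles per pixel}
```

So the whole demosaic, including all memory traffic, fits in a little
over two DSP cycles per pixel.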

For extreme pixel crunching you can use ARM and DSP in parallel of course.

* is there any hope that using GLSL to do this on the GPU would be as fast?
  

Sure! You can store your input data in textures, do some pixel shader
processing and read back the result via OpenGL. With some hacking you
may even be able to bypass the OpenGL image upload/download functions
and directly write into the memory regions. It shouldn't be too hard to
find the physical addresses.

* is there any project that has already made some progress on this that
I could look at?
  

I've written something on NEON pixel processing on my blog a couple of
weeks ago (I've mentioned that before, haven't I?). It's not much but a
start: hilbert-space.de. The ffmpeg source is also a wealth
of information. It contains lots of ARM-NEON assembler functions to
analyze and learn from.

Cheers,
  Nils Pipenbrinck

Btw:
<shameless plug>
I have four years of experience in C64x+ DSP image processing and I'm
available for contract and consulting work. If anyone is interested in
such services, just send me an email.
</shameless plug>