Use C6Runlib to do image processing?

Dear all,

I have recently been trying to do some image processing on the OMAP3530,
and I am trying to use C6Runlib to leverage the DSP.

However, when I run the basic FFT example program from C6Runlib,
the DSP seems to be slower than the GPP.

# ./cfft_arm
N=16,nTimes=100: 0.000854 s
N=32,nTimes=100: 0.002228 s
N=64,nTimes=100: 0.005158 s
N=128,nTimes=100: 0.012178 s
N=256,nTimes=100: 0.02759 s
N=512,nTimes=100: 0.062043 s
N=1024,nTimes=100: 0.138458 s
N=2048,nTimes=100: 0.305908 s
N=4096,nTimes=100: 0.671295 s
N=8192,nTimes=100: 1.46374 s
N=16384,nTimes=100: 3.14758 s

# ./cfft_dsp
N=16,nTimes=100: 0.031891 s
N=32,nTimes=100: 0.045929 s
N=64,nTimes=100: 0.079041 s
N=128,nTimes=100: 0.158692 s
N=256,nTimes=100: 0.335083 s
N=512,nTimes=100: 0.732422 s
N=1024,nTimes=100: 1.60788 s
N=2048,nTimes=100: 3.53708 s
N=4096,nTimes=100: 7.72754 s
N=8192,nTimes=100: 16.8759 s
N=16384,nTimes=100: 36.7068 s

My guess is that c6runlib doesn't use the 8 computation units on the
DSP simultaneously, or that memory access from the DSP takes a long time?

I am really confused by this result and hope someone can shed
some light on it.

Thanks a lot!

The original FFT code is as follows:

/* Code originally taken from the following URL:
     http://svn.arhuaco.org/svn/src/emqbit/tools/emqbit-bench/
*/

/*
* Authors:
* Jorge Victorino
* Andres Calderon andres.calderon@emqbit.com

Are you sure that the code is even executed on the DSP? The benchmarks
you took look almost identical between DSP and GPP.

Also note that the DSP can't do floating point arithmetic in hardware,
so you will only see good performance if you use fixed point arithmetic.
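To make "fixed point arithmetic" concrete, here is a minimal sketch of a Q15 multiply in plain C. The names `q15_t`, `q15_from_float` and `q15_mul` are made up for illustration and are not taken from the benchmark or from any TI library. The sketch does not saturate, so -1 * -1 overflows, and it relies on the right shift of a negative value being arithmetic, which is the usual compiler behaviour but is implementation-defined in C.

```c
#include <stdint.h>

/* Q15 format: real value = raw / 32768, representable range [-1, 1). */
typedef int16_t q15_t;

/* Convert a float in [-1, 1) to Q15 (illustrative helper, no saturation). */
static q15_t q15_from_float(float x)
{
    return (q15_t)(x * 32768.0f);
}

/* Convert a Q15 value back to float, e.g. for printing. */
static float q15_to_float(q15_t x)
{
    return (float)x / 32768.0f;
}

/* Multiply two Q15 numbers using integer operations only:
 * the 16x16 product is a 32-bit Q30 value, which is rounded
 * and shifted back down to Q15. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;    /* Q30 product */
    return (q15_t)((p + (1 << 14)) >> 15);  /* round to nearest, back to Q15 */
}
```

For example, `q15_mul(q15_from_float(0.5f), q15_from_float(0.25f))` returns the raw value 4096, which `q15_to_float` converts back to 0.125.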

Cheers,
    Nils

Hi Nils.

The figures are not the same. I must admit that is what I thought when I first looked at them, but in fact they are just badly formatted!

I think the fixed/float issue is the real problem as you say.

Cheers

Andy

Dear Nils,

Thanks for pointing out the fixed point thing.

Actually, I am really not sure what exactly the c6run compiler did in
this example. I think c6run is designed for beginners, so I will
explain a little about it, in case you have never used it.

The basic idea is that, on the host computer, c6run first compiles
cfft.c into cfft.lib, which can be executed on the DSP side of the target
machine. Then cfft.lib is built together with main_cfft.c to
produce an executable file, "cfft_dsp", which can be run on the target
machine. In this way the ARM can use the DSP to do the critical computation. That
should be what cfft_dsp did in the example. As for cfft_arm, it
just uses the ARM compiler to compile all the files. This is the reason
why cfft_arm and cfft_dsp look the same. What really confuses me is
that the result from the DSP is 10 times slower than from the ARM.

So I guess that to really use the DSP, I might have to use DSPLink. Do you have
any recommendation on where I can begin with the
communication between ARM and DSP? Actually, I have a little DSP
programming experience with CCStudio, and I found that it is completely
different when working on an ARM+DSP architecture.

The following is main_cfft.c, in case you want to see it.

/* Code originally taken from the following URL:
     http://svn.arhuaco.org/svn/src/emqbit/tools/emqbit-bench/
*/

/*
* Authors:
* Jorge Victorino
* Andres Calderon andres.calderon@emqbit.com

Ooops.. Right. The table was confusing.

For what it's worth, I did a benchmark of the 16x16 complex fixed-point
FFT (dsplib from TI) some months ago. Here are the results for the DSP
running at 360 MHz.

    8  0.58 µs
   16  0.30 µs
   32  0.58 µs
   64  0.96 µs
  128  2.14 µs
  256  4.03 µs
  512  9.11 µs
 1024  17.7 µs
 2048  40.3 µs
 4096  80.1 µs
 8192  0.86 ms
16384  10 ms
32767  21 ms
65536  44 ms

Just to give the OP an idea what performance to expect.

Cheers,
  Nils

> Actually, I am really not sure what exactly the c6run compiler did in
> this example. I think c6run is designed for beginners, so I will
> explain a little about it, in case you have never used it.

I have never used c6run. It looks like a nice project to get started with, though.

> the result from the DSP is 10 times slower than from the ARM.

That’s fixed point vs. floating point.

As a test, you could change the complex definition to use int instead of float and re-run the benchmark. The numeric results will be wrong due to overflows and the lack of scaling, but the performance difference should change drastically and will roughly reflect what you’ll get with fixed point.
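A minimal sketch of that experiment, assuming the benchmark keeps its samples in a struct of two floats. The benchmark's real type is not shown in this thread, so `real_t`, `complex_t`, `cmul` and the `USE_INT` switch are hypothetical names chosen for illustration.

```c
/* Build with -DUSE_INT to switch the element type from float to int.
 * USE_INT is a hypothetical macro, not part of the original benchmark. */
#ifdef USE_INT
typedef int   real_t;
#else
typedef float real_t;
#endif

/* Hypothetical complex type; the benchmark's actual definition may differ. */
typedef struct {
    real_t re;
    real_t im;
} complex_t;

/* Complex multiply, the core operation of an FFT butterfly.
 * With real_t = int there is no scaling, so real FFT data will overflow;
 * this is only useful for comparing execution time. */
static complex_t cmul(complex_t a, complex_t b)
{
    complex_t r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```

Keeping the element type behind a single typedef means the same source can be built both ways, so the float and int timings are measured on otherwise identical code.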

> So I guess that to really use the DSP, I might have to use DSPLink. Do you have
> any recommendation on where I can begin with the
> communication between ARM and DSP?

Well, this is the hard part that c6run should make easy… I would stick with c6run.

If you want to, you can take a look at my minimal DspLink example, which contains all the basic stuff you need to invoke DSP code from the ARM. You will have to rework all the makefiles because the paths are hard-coded for my system, but it may help you to get started:

I do some evil stuff in the code: resetting the DSP-MMU via /dev/mem, having cmem in-tree, etc. It works well, but it’s not supported by me. Use it at your own risk.

Cheers,
Nils

c6accel would be a better choice for image processing. c6run is a toy.

Dear Nils,

I changed the float type to int as you said, and the DSP sped up a
little, but it is still 2 times slower than the GPP.
Is there any other stupid mistake that I could be making?

Thank you for posting your DspLink example here.

root@overo:~# ./cfft_arm
N=16, nTimes=100: 0.000275 s
N=32, nTimes=100: 0.000854 s
N=64, nTimes=100: 0.001374 s
N=128, nTimes=100: 0.003083 s
N=256, nTimes=100: 0.00647 s
N=512, nTimes=100: 0.014038 s
N=1024, nTimes=100: 0.030578 s
N=2048, nTimes=100: 0.06839 s
N=4096, nTimes=100: 0.163361 s
N=8192, nTimes=100: 0.362976 s
N=16384,nTimes=100: 0.791535 s

root@overo:~# ./cfft_dsp
N=16, nTimes=100: 0.018402 s
N=32, nTimes=100: 0.018463 s
N=64, nTimes=100: 0.018555 s
N=128, nTimes=100: 0.022278 s
N=256, nTimes=100: 0.033722 s
N=512, nTimes=100: 0.045746 s
N=1024, nTimes=100: 0.070252 s
N=2048, nTimes=100: 0.125275 s
N=4096, nTimes=100: 0.247071 s
N=8192, nTimes=100: 0.623688 s
N=16384,nTimes=100: 1.7861 s
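One way to read these tables is to convert each total into an average time per call. The sketch below does that for a few of the rows above. The helper names are made up for illustration, and the idea that the nearly flat DSP times for small N reflect a fixed cost per ARM-to-DSP call is only an inference from the numbers, not something measured here.

```c
#include <stdio.h>

/* Totals for nTimes=100 calls, copied from the int-based runs above (seconds). */
#define N_ROWS 4
static const int    sizes[N_ROWS] = {16, 64, 1024, 16384};
static const double arm_s[N_ROWS] = {0.000275, 0.001374, 0.030578, 0.791535};
static const double dsp_s[N_ROWS] = {0.018402, 0.018555, 0.070252, 1.7861};

/* Average time of one call in microseconds, given the total for n_times calls. */
static double per_call_us(double total_s, int n_times)
{
    return total_s / n_times * 1e6;
}

/* Print per-call times for ARM and DSP and the DSP/ARM ratio for each row. */
static void print_summary(void)
{
    int i;
    for (i = 0; i < N_ROWS; i++) {
        printf("N=%-6d ARM %10.2f us  DSP %10.2f us  DSP/ARM %.1f\n",
               sizes[i],
               per_call_us(arm_s[i], 100),
               per_call_us(dsp_s[i], 100),
               dsp_s[i] / arm_s[i]);
    }
}
```

For N=16 and N=64 the DSP figures both come out at roughly 184 to 186 µs per call, while the ARM needs under 14 µs, so for small transforms the DSP time barely depends on N at all.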

Dear Koen,

I found out that to install c6accel I also need to have the TI DVSDK
installed.
I will try it, thanks.

Maybe. If you send me the object file, I can take a look to see if there is
something unwanted going on.

Otherwise: The FFT you're benchmarking is not optimized for the DSP. You
will get much better performance if you call the dsplib FFT from TI.

Cheers,
    Nils

Dear Nils,

Thanks. Can you read obj files?

https://docs.google.com/leaf?id=0BxxM4bSq7fUtMDI5ZTI1YTMtMjdiZC00N2JjLWJlZDEtZTY4NzFhYTcxNjQw&hl=en

Also, I am looking at a useful link, which seems to give the same
suggestion as you did. I am still trying to use the dsplib.
http://e2e.ti.com/support/dsp/omap_applications_processors/f/447/p/70317/255175.aspx

Thank you so much!