Hi,

I have code that requires around (10^6)*(10^6) floating point multiplications, and as many floating point additions, because I am computing the autocorrelation of 10^6 samples in a circular-shifting manner.

When I try to run my code on the BeagleBoard's ARM core (since I don't know how to use the DSP core), I get around 50*10^6 multiplications and additions in more than 5 minutes, which is far too slow for my entire requirement. My BB runs Angstrom and I am not able to use the hard FPU.

Can somebody suggest how I can speed up computation on the BB so that I can make a feasible system?

A snippet of my code is below. Please help me, I am stuck....

The maximum value of k = 1000000.

for (i = 1; i <= k; i++)
{
    sum = 0;
    for (j = 1; j <= k; j++)
    {
        sum = sum + (*(prdc_pulse_out_store + j)) * (*(prdc_pulse_out + j));
    }
    *(Rxy + i) = (sum / k);

    // circular shifting
    temp = *(prdc_pulse_out + k);
    for (j = k; j >= 2; j--)
    {
        *(prdc_pulse_out + j) = *(prdc_pulse_out + j - 1);
    }
    *(prdc_pulse_out + 1) = temp;
}

> Hi,
>
> I have code that requires around (10^6)*(10^6) floating point multiplications, and as many floating point additions, because I am computing the autocorrelation of 10^6 samples in a circular-shifting manner.
>
> When I try to run my code on the BeagleBoard's ARM core (since I don't know how to use the DSP core), I get around 50*10^6 multiplications and additions in more than 5 minutes, which is far too slow for my entire requirement. My BB runs Angstrom and I am not able to use the hard FPU.

Why? If you aren't using VFP or NEON, you will probably find that getting acceptable performance is simply not possible. Software floating point is deathly slow. Full stop.

> Can somebody suggest how I can speed up computation on the BB so that I can make a feasible system?

Use the hardware you have available to you. I would also recommend using a library like Eigen[1] (which I can't say enough good things about) if you are doing lots of linear algebra-like operations.

> A snippet of my code is below. Please help me, I am stuck....

Finally, it should go without saying that the best way to optimize is to choose a more efficient way to do the needed task. In this case, you are using a rather inefficient brute-force means of computing an autocorrelation (with an extremely inefficient implementation, at that). You should look into computing this via FFT[2]. This is generally far more efficient, and there are some very fast implementations optimized for ARM[3].

- Ben

[1] http://eigen.tuxfamily.org/index.php?title=Main_Page

[2] http://en.wikipedia.org/wiki/Autocorrelation#Efficient_computation

[3] http://elinux.org/BeagleBoard/GSoC/2010_Projects/FFTW#Project:_NEON_Support_for_FFTW

Thanks Ben for your invaluable suggestions…

The first thing I improved was the circular shift, using a base index instead of moving the data. I also skipped the iterations in the autocorrelation where one factor was zero. That reduced the time to some extent, but when there were no zeros the check just added overhead, so it was only a so-so improvement.

The reason I was not using the Wiener-Khinchin theorem was that I wanted the autocorrelation of 1 million samples, and to get the autocorrelation of 1000000 samples you first have to compute the FFT of that many samples and then apply the theorem.

I thought the FFT of so many samples would take too long. But I found that even a 1048576-point FFT on the BB does not take that much time, just a few seconds, and then applying Wiener-Khinchin I got the entire result in less than 1 minute, where it was earlier taking around 3 hours or more…

Well, I have not tried the NEON option yet, but a push from your side has definitely helped me to experiment, and the results have been good.

Thanks Ben once again.

regards

mohit