Problem in Speeding Up Floating Point Computations on Beagle Board

Hi,

My code needs around (1E+6)*(1E+6) floating point multiplications
and as many floating point additions, because I am computing an
autocorrelation over 1E+6 samples using circular shifting.

When I run my code on the Beagle Board's ARM core (since I don't
know how to use the DSP core), around 50E+6 multiplications and
additions take more than 5 minutes, which is far too slow for my
requirement. My BB runs Angstrom and I am not able to use the hard FPU.

Can somebody suggest how to speed up computation on the BB so
that I can build a feasible system?

A snippet of my code follows. Please help me, I am stuck.

The maximum value of k is 1000000.

  for (i = 1; i <= k; i++)
  {
      /* correlate the fixed copy with the currently shifted copy */
      sum = 0;
      for (j = 1; j <= k; j++)
      {
          sum = sum + (*(prdc_pulse_out_store + j)) * (*(prdc_pulse_out + j));
      }

      *(Rxy + i) = (sum / k);

      // circular shifting: rotate prdc_pulse_out right by one sample
      temp = *(prdc_pulse_out + k);
      for (j = k; j >= 2; j--)
      {
          *(prdc_pulse_out + j) = *(prdc_pulse_out + j - 1);
      }
      *(prdc_pulse_out + 1) = temp;
  }

> Hi,
>
> I have a code where i require around (10E+6)*(10E+6) number of
> floating point multiplication and that much number of floating point
> additions as I am doing auto correlation for my development for 10E+6
> samples in circular shifting manner
>
> When I try to run my code in beagle board on ARM Core (Since I don't
> know how to use DSP Core), I get around 50E+6 multiplications and
> additions in greater than 5 minutes which is very slow for my entire
> requirement. My BB uses Angstrom and I am not able to use hard FPU.

Why? If you aren't using VFP or NEON you will probably find that getting
acceptable performance is simply not possible. Software floating point
is deathly slow. Full stop.

> Can Some body suggest me How to speed up Computation speed on BB so
> that I can make a feasible system.

Use the hardware you have available to you. I would also recommend using
a library like Eigen[1] (which I can't say enough good things about) if
you are doing lots of linear algebra-like operations.

> My snippet of the code is as follows: Please help me I am stuck....

Finally, it should go without saying that the best way to optimize is to
choose a more efficient way to do the needed task. In this case, you are
using a rather inefficient brute-force means of computing an
autocorrelation (with an extremely inefficient implementation, at
that). You should look into computing this via FFT[2]. This is generally
far more efficient and there are some very fast implementations
optimized for ARM[3].

- Ben

[1] http://eigen.tuxfamily.org/index.php?title=Main_Page
[2] http://en.wikipedia.org/wiki/Autocorrelation#Efficient_computation
[3] http://elinux.org/BeagleBoard/GSoC/2010_Projects/FFTW#Project:_NEON_Support_for_FFTW

Thanks, Ben, for your invaluable suggestions.

The first thing I improved was the circular shift, by using a base index instead of moving the data. Then I skipped those iterations in the autocorrelation where one factor was zero. This reduced my time to some extent, but when there were no zeros the check only added overhead. It was a so-so effort.

The reason I was not using the Wiener-Khinchin theorem was that I wanted an autocorrelation of 1 million samples, and to compute the autocorrelation of 1000000 samples we first need to calculate the FFT of that many samples and then apply the theorem.
I thought an FFT of so many samples would take a long time. But I found that even a 1048576-point FFT on the BB does not take that much time, only a few seconds, and after applying Wiener-Khinchin I got the entire result in less than 1 minute, where it was earlier taking around 3 hours or more.

I have not tried the NEON option yet, but your push definitely helped me to experiment, and the results have been good.

Thanks once again, Ben.

regards

mohit