BeagleBoard Performance Issue

Hi,

I am trying to develop an image viewer application to view RGB and bitmap images.
My target is the BeagleBoard, and the OS is Angstrom.

I have two sets of source code (both versions use fixed-point arithmetic).

Version 1 : Pure ANSI C

This version uses C code only.
The makefile is given below:

OBJFILES = # objfiles.o
INCLUDE = -I./../Header
ABC = arm-none-linux-gnueabi-gcc
PQR = -march=armv7-a -mtune=cortex-a8
CFLAGS = -O3 -Wall $(INCLUDE) $(PQR)
HOME = IMViewer.so

$(HOME) : $(OBJFILES)
	$(ABC) -o $@ $^ $(CFLAGS) -fPIC -L. -shared
	$(ABC) -o IMView $(OBJFILES) -ldl -L. -lIMViewer
	install $(HOME) ../lib/
	mv IMView ../lib/
	rm -rf IMViewer.so
	@echo "C Version completed..."

%.o : %.c
	$(ABC) -c $(CFLAGS) $< -o $@

Version 2 : C + Neon Intrinsics

In this version I use NEON intrinsics wherever applicable,
so the resulting source is a mix of C and NEON intrinsics.
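
As an illustration of what mixing C and NEON intrinsics can look like, here is a minimal sketch. It is not taken from the actual IMViewer source: the function name brighten_row and the operation (a saturating add on a row of 8-bit pixels) are hypothetical. The intrinsics path is only compiled when the compiler defines the NEON macro (as GCC does with -mfpu=neon); otherwise the plain C loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1
#endif

/* Hypothetical kernel: dst[i] = min(src[i] + delta, 255) for n pixels. */
static void brighten_row(uint8_t *dst, const uint8_t *src, size_t n,
                         uint8_t delta)
{
    size_t i = 0;

#ifdef HAVE_NEON
    /* NEON path: process 16 pixels per iteration. */
    uint8x16_t vdelta = vdupq_n_u8(delta);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);        /* load 16 bytes       */
        vst1q_u8(dst + i, vqaddq_u8(v, vdelta)); /* saturating add, store */
    }
#endif

    /* Scalar path: the leftover pixels, or all of them without NEON. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)src[i] + delta;
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Both paths produce the same result, so the same function can be built for Version 1 and Version 2 and compared directly.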
The makefile used for compiling this version is given below:

OBJFILES = # objfiles.o
INCLUDE = -I./../Header
ABC = arm-none-linux-gnueabi-gcc
PQR = -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorizer-verbose=2 -flax-vector-conversions
CFLAGS = -O3 -Wall $(INCLUDE) $(PQR)
HOME = IMViewer.so

$(HOME) : $(OBJFILES)
	$(ABC) -o $@ $^ $(CFLAGS) -fPIC -L. -shared
	$(ABC) -o IMView $(OBJFILES) -ldl -L. -lIMViewer
	install $(HOME) ../lib/
	mv IMView ../lib/
	rm -rf IMViewer.so
	@echo "C Neon Version completed..."

%.o : %.c
	$(ABC) -c $(CFLAGS) $< -o $@

I hope that makes my setup clear.

In my IMViewer application I then measure the performance of both
versions.
The code fragment is given below:

#include <stdio.h>
#include <sys/time.h>

/* Prototype assumed here; IMViewer() comes from the IMViewer library. */
void IMViewer(void);

/* long long avoids overflow of tv_sec * 1000 on a 32-bit target. */
long long st = 0, et = 0;
struct timeval First, Last;

int main(int argc, char **argv)
{
  gettimeofday(&First, NULL);
  st = ((long long)First.tv_sec * 1000) + (First.tv_usec / 1000); /* time in milliseconds */

  IMViewer();

  gettimeofday(&Last, NULL);
  et = ((long long)Last.tv_sec * 1000) + (Last.tv_usec / 1000); /* time in milliseconds */
  printf("The effective time in milliseconds is %lld\n", (et - st));

  return 0;
}

This code fragment is common to both versions and measures the time
each one takes to complete.
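
A single run timed at millisecond resolution can hide a small difference between the two versions. The sketch below is one possible refinement, not the code actually used for the measurements: it keeps microseconds from gettimeofday() and reports the fastest of several runs. The helper names elapsed_us and best_of are hypothetical, and the function being timed is assumed to take no arguments and return nothing.

```c
#include <stdio.h>
#include <sys/time.h>

/* Elapsed time between two gettimeofday() samples, in microseconds. */
static long long elapsed_us(const struct timeval *start,
                            const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Call fn() `runs` times and return the fastest run in microseconds.
   Returns -1 if runs <= 0. (Hypothetical helper.) */
static long long best_of(void (*fn)(void), int runs)
{
    long long best = -1;
    int i;

    for (i = 0; i < runs; i++) {
        struct timeval t0, t1;
        long long us;

        gettimeofday(&t0, NULL);
        fn();
        gettimeofday(&t1, NULL);

        us = elapsed_us(&t0, &t1);
        if (best < 0 || us < best)
            best = us;
    }
    return best;
}
```

With a suitable prototype in scope, a call such as best_of(IMViewer, 10) would give one number per build that is less sensitive to scheduling noise and cold caches than a single measurement.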

Unfortunately, the performance of Version 2 is not good: it is close
to that of the C version. I cannot work out
what the problem is.

I have checked the following, and both are OK:

1. The OS kernel is NEON-enabled (OMAP 3530).
2. The assembly files generated from the NEON code contain the
NEON instructions corresponding to the intrinsics.
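
For the first check, one way to confirm what the kernel reports is to look for "neon" in the Features line of /proc/cpuinfo. The sketch below is a minimal example of that check under the assumption that the file uses the usual ARM Linux layout ("Features : ... neon ..."); the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `line` is a "Features" line that lists `flag` as a whole
   word, 0 otherwise. */
static int features_line_has(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = strchr(line, ':');

    if (strncmp(line, "Features", 8) != 0 || p == NULL)
        return 0;
    p++; /* skip the ':' */

    while (*p != '\0') {
        size_t len;

        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        len = strcspn(p, " \t\n"); /* length of the next word */
        if (len == n && strncmp(p, flag, n) == 0)
            return 1;
        p += len;
    }
    return 0;
}

/* Return 1 if /proc/cpuinfo reports NEON, 0 if it does not,
   -1 if the file cannot be opened. */
static int cpu_reports_neon(void)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (features_line_has(line, "neon")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

This only shows that the kernel exposes NEON to user space; it says nothing about whether the NEON code path is actually faster.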

The following questions remain:

1. Can I configure the L1 and L2 cache sizes in the OS kernel?
2. Is any hand-written assembly needed to enable the NEON
unit on the BeagleBoard?

Kindly look into my issue and please help me!

The toolchain's GCC version is 4.2.1.

Regards,
Dave