Slow image processing

Hi,

I have written a small program which reads an image from a USB camera and converts it to the HSV color space.
But the Beagle only manages 37 frames per second at 99% CPU load. Is this speed normal for the BeagleBoard? I thought the Beagle could be
much faster. The image is only 160x120 pixels.

Here is my program:

    #include <iostream>
    #include <cmath>
    #include <cstdint>
    #include <cstdlib>
    #include <ctime>
    #include "Webcam.h"
    #include <unistd.h>
    #include <sys/time.h>

    using namespace std;

    int main()
    {
        Webcam * cam = new Webcam("/dev/video0", 160, 120);
        unsigned char * converted = (unsigned char *) malloc(640*480*3); // larger than the 160x120 image needs

        int cy = 0;
        struct timeval time;
        int timeold = 0;
        uint8_t max = 0;
        uint8_t min = 255;
        uint8_t delta = 0;
        uint16_t val;
        uint16_t hue;
        uint16_t sat;
        uint8_t x1;
        uint8_t x2;
        uint8_t x3;

        while (true) {
            unsigned char * raw = cam->getImagePointer();

            // RGB to HSV conversion
            for (unsigned int x = 0; x < 160*120*3; x += 3) {
                bool cont = true;
                x1 = raw[x];
                x2 = raw[x+1];
                x3 = raw[x+2];

                max = fmax(x1, fmax(x2, x3));
                min = fmin(x1, fmin(x2, x3));
                delta = max - min;

                // hsv.val
                val = max;
                if (val == 0) {
                    converted[x]   = 0;
                    converted[x+1] = 0;
                    converted[x+2] = max;
                    cont = false;
                }

                // hsv.sat
                if (cont) {
                    sat = 255 * delta / val;
                    if (sat == 0) {
                        converted[x]   = 0;
                        converted[x+1] = 0;
                        converted[x+2] = max;
                        cont = false;
                    }
                }

                // hsv.hue
                if (cont) {
                    if (delta > 0) {
                        if (max == x1) {        // max = rgb.red
                            converted[x] = (uint8_t)(0 + 43 * (x2 - x3) / delta);
                        }
                        else if (max == x2) {   // max = rgb.green
                            converted[x] = (uint8_t)(85 + 43 * (x3 - x1) / delta);
                        }
                        else {                  // max = rgb.blue
                            converted[x] = (uint8_t)(171 + 43 * (x1 - x2) / delta);
                        }
                    }
                    else {
                        converted[x] = 0;
                    }

                    converted[x+1] = sat;
                    converted[x+2] = val;
                }
            }

            gettimeofday(&time, NULL);
            if (time.tv_sec > timeold + 1) {    // printed roughly every two seconds
                cout << "CPS: " << cy/2.0 << endl;
                timeold = time.tv_sec;
                cy = 0;
            }
            cy++;
        }

        cam->stopRead();
        free(converted);
    }

Regards, Joern

Hi,

I have no idea how fast it should be, but a lot more than
160x120 @ 27 fps I guess.

I would do the following (not sure they are 100% correct):

- Start by measuring how fast you can read from the camera; maybe this
is very slow for some reason (see the timing sketch below).

- I noticed that you are using fmax/fmin; these are floating-point
functions, but you are only working with integers. Use the integer versions,
see http://www.emerson.emory.edu/services/gcc/html/Min_and_Max.html

- What compiler options do you have? VERY IMPORTANT

There are probably 1000 more optimizations, but the above are the most
obvious ones.
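
For the first point, a rough sketch of how the capture alone could be timed (this assumes the Webcam class from your program, and that getImagePointer() waits for a fresh frame; if it only hands back a pointer to the last buffer, this measures little more than the call overhead):

#include <iostream>
#include <sys/time.h>
#include "Webcam.h"

using namespace std;

int main()
{
    Webcam * cam = new Webcam("/dev/video0", 160, 120);

    struct timeval t0, t1;
    const int frames = 200;                // grab 200 frames, no processing at all

    gettimeofday(&t0, NULL);
    for (int i = 0; i < frames; i++)
        (void) cam->getImagePointer();     // fetch a frame and throw it away
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    cout << "raw capture rate: " << frames / secs << " fps" << endl;

    cam->stopRead();
    return 0;
}

If that alone already tops out near your current frame rate, the bottleneck is the capture, not the conversion.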

Good Luck

arte wrote:

Hi,

I have no idea how fast it should be, but a lot more than
160x120 @ 27 fps I guess.

I would do the following (not sure they are 100% correct):

- Start by measuring how fast you can read from the camera; maybe this
is very slow for some reason.

I can verify that..

The fmin/fmax calls take the lion's share..

I changed those to a bunch of if's and restructured the code. That gave
roughly a factor 4.5 speedup.

If you need it even faster, get rid of the integer divisions. Since you
only divide by numbers between 1 and 255, you can create a little
table of reciprocals and use a multiply+shift instead of a divide. In
my tests this gave another factor 2.5 speedup. I got slightly different
results though (rounding issues..)

Here is what I came up with:

// warning: This is C, not C++ code..

#include <stdint.h>

static void RGB_to_HSV (uint8_t * __restrict rgb,
                        uint8_t * __restrict hsv)
{
  uint8_t max;
  uint8_t min;
  uint8_t delta;
  uint8_t x1 = rgb[0];
  uint8_t x2 = rgb[1];
  uint8_t x3 = rgb[2];

  hsv[0]=hsv[1]=0;

  min = max = x1;
  if (x2 > max) max = x2;
  if (x2 < min) min = x2;
  if (x3 > max) max = x3;
  if (x3 < min) min = x3;

  hsv[2] = max;
  if(!max) return;

  delta = max - min;
  hsv[1] = 255 * delta / max;

  if(!hsv[1]) return;
  
  if(delta>0)
  {
    if(max==x1) hsv[0] = (0 + 43 * (x2 - x3) / delta);
    else if(max==x2) hsv[0] = (85 + 43 * (x3 - x1) / delta);  /* B - R when green is max */
    else hsv[0] = (171 + 43 * (x1 - x2) / delta);
  }
}

void convert_faster (uint8_t * __restrict source,
                     uint8_t * __restrict dest, int n)
{
  int32_t i;
  for (i=0; i<n*3; i+=3) RGB_to_HSV (&source[i], &dest[i]);
}
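
For illustration, a minimal sketch of the reciprocal-table idea described above (not necessarily the exact variant that was benchmarked; because the reciprocals are truncated, the result can come out one lower than a true divide):

#include <stdint.h>

static uint32_t recip16[256];                    /* recip16[i] = 65536 / i */

static void init_recip16 (void)
{
  int i;
  recip16[0] = 0;                                /* never used: max == 0 returns early */
  for (i = 1; i < 256; i++)
    recip16[i] = 65536u / i;
}

/* n / d as one multiply and one shift instead of a divide */
static inline uint8_t div_by_table (uint32_t n, uint8_t d)
{
  return (uint8_t)((n * recip16[d]) >> 16);
}

Call init_recip16() once at startup; the saturation line then becomes hsv[1] = div_by_table(255 * delta, max). The three hue divides by delta could be handled the same way, but their numerators can be negative, so they would need a signed variant.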

Thanks for the replies.

arte wrote:

Hi,

I have no idea how fast it should be, but a lot more than
160x120 @ 27 fps I guess.

I would do the following (not sure they are 100% correct):

- Start by measuring how fast you can read from the camera; maybe this
is very slow for some reason.

- What compiler options do you have? VERY IMPORTANT

I can read very quickly from the camera, because I only get a pointer to the image. The image itself is updated 30 times per second.

Here are my compiler options for arm-angstrom-linux-gnueabi-g++:
-O3 -g3 -Wall -c -fmessage-length=0 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize

np: Your code is very fast compared to my version. Now I get ~185 fps.

Are there more possibilities to optimize the code? Maybe using NEON intrinsics?

Regards, Joern

Rath wrote:

np: Your code is very fast compared to my version. Now I get ~185 fps.
  

Nice...

Are there more possibilities to optimize the code? Maybe using NEON
intrinsics?
  

Sure.. Lots of possibilities.. As I wrote yesterday, getting rid of
the divides would be my next step.. That'll give you another factor 2.5
speedup.

NEON is possible as well of course.

If you take it seriously, you could also do the conversion on the DSP. With
some optimization something between 1000 and 2000 fps is possible. Maybe
even more.

Oh - and then there is the 3D accelerator that could do the job.

Cheers,
  Nils Pipenbrinck

NEON is possible as well of course.

I had to look up what NEON is and how to use it.
http://wiki.omap.com/index.php?title=Cortex_A8

If you take it seriously, you could also do the conversion on the DSP. With
some optimization something between 1000 and 2000 fps is possible. Maybe
even more.

The DSP:
http://elinux.org/BeagleBoard/DSP_Howto

Oh - and then there is the 3D accelerator that could do the job.

Couldn't really find anything that suggests you could use the 3D
accelerator as an aid to processing, only for video output. It
sounds like the aim of the app is to process and store the data, no?
So you wouldn't be displaying it on the fly... Maybe I just don't
understand.

Do you know of any projects that nicely exemplify coding for NEON, the
DSP, and the 3D accelerator? Thanks!

Andrew Jackman wrote:

Do you know of any projects that nicely exemplify coding for NEON, the
DSP, and the 3D accelerator? Thanks!
  

For NEON there isn't much out there. When I started I just had a list of
the NEON compiler intrinsics and started hacking. Reading compiler
output and talking to the folks on the IRC channel helped a lot. If you
already know SSE or AltiVec a bit, it's easy.

For code examples, this project is very nice:

     http://code.google.com/p/math-neon/

The ffmpeg source code is an excellent read as well.

For OpenGL: there are lots of OpenGL tutorials out there, and the
OpenGL ES 2.0 spec can be downloaded from the Khronos site. OpenGL ES 2.0
is really just a cleaned-up version of OpenGL with obscure stuff removed
and a streamlined API.

For the DSP.. well - read the TI documentation. Everything is documented
somewhere. The real problem is that there is too much of it; it's easy to
get overwhelmed by the sheer amount. Thousands of pages..

Unfortunately I can't help much with getting started. Four years ago, when I
started DSP coding for the DaVinci platform, someone from TI came to my
place, installed all the necessary stuff and helped me get started. On
the BeagleBoard I just compiled one of the examples from the linuxutils
package and started hacking. For me that was easy because I already knew
what I needed and what could be ignored.

Regarding the DSP code itself: the compiler compiles C and C++. You won't
get stellar performance if you don't optimize for the architecture, but
the code will run. The biggest challenge will be the DSP<->ARM communication.

When it comes to DSP optimization I found these PDFs to be essential:

Compiler User's Guide:

    http://focus.ti.com/lit/ug/spru187o/spru187o.pdf

C work-alikes for the compiler intrinsics (not complete!)

    http://www-s.ti.com/sc/psheets/spraa75/spraa75.zip

On Hand-Tuning and Optimization (must read!)

    http://focus.ti.com/lit/an/spra666/spra666.pdf

Instruction Set Reference:

    http://focus.ti.com/lit/ug/spru732h/spru732h.pdf

That's ... ehm.. 3000 pages or so of information...

Afterwards, check the EDMA3 user's guide (very important if you need high
performance) and the implementation details of the integer division (the
best piece of C64x+ code I've ever seen):

    http://focus.ti.com.cn/cn/lit/ug/spru966b/spru966b.pdf
    http://www.ti.com/litv/pdf/spra707

Don't let the complexity overwhelm you. Start simple and keep hacking.
It took me two years of hard work to really understand the architecture.

Good luck,
  Nils Pipenbrinck

np wrote:

Sure.. Lots of possibilities.. As I wrote yesterday, getting rid of
the divides would be my next step.. That'll give you another factor 2.5
speedup.
  
Can you please give me a short example of the calculation without the divide?
Is it right to create an array of floats representing 1/1 ... 1/255? But how should I continue from there?

And another question: I've started to play a bit with the NEON intrinsics. As an exercise, I've tried to implement RGB to greyscale conversion with NEON intrinsics, but I don't know how to carry on:

            The code which should use NEON:

            converted[cnt] = (59 * raw[x] + 30 * raw[x+1] + 11 * raw[x+2]) / 100;

            My try:

            uint16x4_t a = {raw[x], raw[x+1], raw[x+2], 0};
            const uint16x4_t cof = {59, 30, 11, 0};
            uint16x4_t result;

            // Multiply vectors
            result = vmul_u16(a, cof);

            // Multiply with 0.1 (= /100)
            // How?

            // Sum all elements of the result vector and save it in a scalar
            // How to solve?

Regards, Joern
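
For what it's worth, one way the two open steps could be sketched (vpadd_u16 adds neighbouring lanes pairwise, so applying it twice collapses the vector into a single sum, and the division by 100 is replaced by a multiply with 5243 and a shift by 19, which matches /100 exactly for sums up to 25500):

#include <arm_neon.h>
#include <stdint.h>

// Greyscale for one RGB pixel: (59*R + 30*G + 11*B) / 100 without a divide.
static inline uint8_t grey_neon (const uint8_t * p)
{
    uint16x4_t a         = { p[0], p[1], p[2], 0 };
    const uint16x4_t cof = { 59, 30, 11, 0 };

    uint16x4_t prod = vmul_u16 (a, cof);       // {59*R, 30*G, 11*B, 0}

    // Horizontal sum: pairwise-add twice, then pull lane 0 out as a scalar.
    uint16x4_t s = vpadd_u16 (prod, prod);     // {59R+30G, 11B, 59R+30G, 11B}
    s = vpadd_u16 (s, s);                      // lane 0 = 59R + 30G + 11B
    uint32_t sum = vget_lane_u16 (s, 0);       // at most 255*100 = 25500

    // Divide by 100 as a multiply + shift (5243 / 2^19 ~ 1/100).
    return (uint8_t)((sum * 5243u) >> 19);
}

Doing one pixel per call like this won't beat plain C by much; the real win comes from loading 8 or 16 pixels per iteration (vld3 de-interleaves RGB for you), but the same two tricks, pairwise adds and multiply+shift, carry over.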

Rath wrote:

np wrote:
  

Sure.. Lots of possibilities.. As I wrote yesterday, getting rid of
the divides would be my next step.. That'll give you another factor 2.5
speedup.
  
Can you please give me a short example of the calculation without the divide?
Is it right to create an array of floats representing 1/1 ... 1/255? But how should I continue from there?
  

See http://www.cs.uiowa.edu/~jones/bcd/divide.html for a decent explanation of fixed-point integer arithmetic.

Basically, make a table for your fractions 1/1 to 1/255. Use whatever precision you want; 16 bits is probably plenty.
Assuming the table stores your fractions with 16 fractional bits, X divided by Y then becomes something like

X = (X * DivTable[Y]) >> 16;

Alternatively, as you have two values to divide, each in a range of 256, you could just pre-compute a 256x256 2D array with the answers; then the result becomes:

X = DivTable[X][Y];

and in your specific example:

sat = 255 * delta / val;

do something like this:

static uint16_t sat_div_table[256][256]; // static so its contents aren't forgotten between calls

sat = sat_div_table[delta][val];

Your table will be 256x256x2 bytes = 128 KB, which isn't too bad.

init it with a loop like this:

int d, v;
for (d = 0; d < 256; d++) {
  for (v = 0; v < 256; v++) {
    sat_div_table[d][v] = v ? 255 * d / v : 0;  // guard the v == 0 entry
  }
}

I didn't analyse your code too closely and just went with the types you declared; if sat only needs to be 8 bits then your table can be half this size.

Obviously you initialise it once at the beginning of the program, and you need to make sure the table values persist for as long as you need them (static or global scope, or malloc the space on the heap).

Note that this may not be faster than calculating, as the table will blow the caches, so each lookup will most likely be a cache miss. You need to benchmark it in your app to see.

Alternatively, as you have two values to divide, each in a range of 256,
you could just pre-compute a 256x256 2D array with the answers; then the
result becomes:

X = DivTable[X][Y];

Wouldn't this ruin the cache?
A table of 64k is a lot bigger than the cache,
so table values will be fetched from the L2 cache, and that takes a
couple of cycles, so maybe it takes longer than the actual divide.
I think the A8 has a hardware divider.

Maybe

  delta = max - min;
  hsv[1] = 255 * delta / max;

would be best done with NEON, several pixels in parallel.
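
A rough sketch of how that could look (untested; it assumes the delta and max values have already been de-interleaved into vectors, that max is never zero, and the truncated result can differ by one from the integer divide): process four pixels per call using NEON's reciprocal estimate plus one Newton-Raphson refinement instead of dividing.

#include <arm_neon.h>
#include <stdint.h>

/* sat[i] = 255 * delta[i] / max[i] for four pixels at once */
static inline uint32x4_t sat4 (uint32x4_t delta, uint32x4_t max)
{
    float32x4_t fd = vcvtq_f32_u32 (delta);
    float32x4_t fm = vcvtq_f32_u32 (max);

    float32x4_t r = vrecpeq_f32 (fm);            /* ~8-bit estimate of 1/max */
    r = vmulq_f32 (r, vrecpsq_f32 (fm, r));      /* one Newton-Raphson step  */

    float32x4_t s = vmulq_f32 (vmulq_n_f32 (fd, 255.0f), r);
    return vcvtq_u32_f32 (s);                    /* truncate back to integer */
}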

Maybe you could use a lookup for some values and a division for the rest;
that would go easier on the cache.

  delta = max - min;
  hsv[1] = 255 * delta / max;

Since I guess the delta variable does not have a uniform distribution between
0 and 255 (because it's a difference between two values), you could do something like

if (delta > 32)
{
    hsv[1] = 255 * delta / max;
}
else
{
    hsv[1] = tab[delta][max];
}

The same thing could be done with the other division, but with another table.

Maybe a table of 50k or something?

Any clues about the latency for fetching the RGB values? Are they in some
shared memory?
If not, maybe the SDMA could be used to improve performance.