SGX libs with hardfp support?

Hi,

I have been maintaining the MeeGo adaptation for the BeagleBoard for quite some time now. MeeGo's
official ARM port recently switched to hardfp. Unfortunately, this broke all adaptations relying on the
SGX drivers available from TI (the N900 adaptation, which drove the hardfp switch, has its own SGX libs
made by Nokia, which unfortunately won't work on the Beagle).

So is there a chance to get SGX libs with hardfp support? Rumor has it that other OSs are also considering
switching to hardfp (e.g. Ubuntu), so the same need will arise there as well ...

Regards,
  Till

"Till Harbaum / Lists" <lists@harbaum.org> writes:

So is there a chance to get SGX libs with hardfp support?

arm-foo-blah-strip -R .ARM.attributes libsgx.so

Now you have a magical anyfp library, at least if none of the interfaces
use floating point parameters or return values.

On 13.06.2011 22:26, Måns Rullgård wrote:

"Till Harbaum / Lists" <lists@harbaum.org> writes:

So is there a chance to get SGX libs with hardfp support?

arm-foo-blah-strip -R .ARM.attributes libsgx.so

Now you have a magical anyfp library, at least if none of the interfaces
use floating point parameters or return values.

That is not really a solution. There is no support from TI for this and
it's a real problem. If major ARM Linux distributions are going hardfp,
then we need vendor-supported libraries as well.

-- Antti

I believe that the Angstrom distribution is compiled as hardfp and does include SGX libs.

Did you check that distribution already?

Greetings,

Han

I believe that the Angstrom distribution is compiled as hardfp and does
include SGX libs.

I do not think the distributed versions of Angstrom are hardfp. It is
possible for you to build a hardfp version of Angstrom yourself
though.

Philip

Antti Kaijanmäki wrote:

On 13.06.2011 22:26, Måns Rullgård wrote:

"Till Harbaum / Lists" <lists@harbaum.org> writes:

So is there a chance to get SGX libs with hardfp support?

arm-foo-blah-strip -R .ARM.attributes libsgx.so

Now you have a magical anyfp library, at least if none of the interfaces
use floating point parameters or return values.

That is not really a solution. There is no support from TI for this and
it's a real problem. If major ARM Linux distributions are going hardfp,
then we need vendor-supported libraries as well.

Why would they be going hardfp then if there is no support for it?
The only hardware with supported hardfp at the moment is the N900, no?

Who's "they"? In the MeeGo case it's Nokia and the N900 adaptation team that
switched to hardfp. Once they did that, they disabled all the builds for
softfp which the other community adaptations relied upon.

So if I insist on continuing to use softfp, I'd have to compile for softfp
myself (or maintain a separate softfp build in MeeGo OBS). I don't think
that's the way to go, as any problem introduced by this would need to be
solved by me alone. I'd prefer to use as much common MeeGo userland as
possible, just to take advantage of the fact that it has been through Nokia QA.

Till

The SGX libs are responsible for the GLES stuff, and I actually expect them to
use floats intensively. But I haven't verified that. May be worth a try ...

Thanks,
  Till

"Till Harbaum / Lists" <lists@harbaum.org> writes:

So compiling the exact same sources for a different ABI means no QA check from Nokia applies anymore *at all*? I thought being able to change the ABI was the whole point of that OBS thingie.

"Till Harbaum / Lists" <lists@harbaum.org> writes:

A softfp-compiled MeeGo isn't even MeeGo-compliant and must not use the name MeeGo.

Till

Hi,

arm-foo-blah-strip -R .ARM.attributes libsgx.so

Now you have a magical anyfp library, at least if none of the interfaces
use floating point parameters or return values.

OK, tried that (since someone else suggested that he had had success with it).
The libs load and the apps run, but neither polygons nor textures are visible,
which is easily explained by a mismatch in the way floats are passed between
the app and the libs. In the meantime, that person also found out that his setup
isn't really working ...

So there doesn't seem to be a way around hardfp compiled SGX libs here.

Anyone from TI listening? We need your support!

Till

If you compile in softfp mode you'll get something that works. Or do you only care about that arbitrary 'meego' label?

Hi,

If you compile in softfp mode you'll get something that works. Or do you only care about that arbitrary 'meego' label?

Going back to softfp has several disadvantages I'd like to avoid:

- It may trigger hidden bugs which I'd have to resolve
- I'd have to recompile all of MeeGo myself
- The resulting system would not be able to use programs from the repositories
- The resulting system would not make use of its hardware floating-point support

So softfp is an option I'd really like to avoid.

Till

Hi,

If you compile in softfp mode you'll get something that works. Or do you only care about that arbitrary 'meego' label?

Going back to softfp has several disadvantages I'd like to avoid:

[..]

- The resulting system would not make use of its hardware floating-point support

At this point I really need to tell you to do your homework better instead of spreading nonsense. With -mfpu={vfpv3-d16,neon} and -mfloat-abi=softfp you *will* get VFP and/or NEON instructions, and those *will* use the hardware blocks. It will just use a suboptimal calling convention that has no proven real-world benefit beyond synthetic benchmarks and povray.

Since you have the basics of the issue all wrong, is it possible that your whole quest for hardfp libs is wrong as well?

Hi,

"Till Harbaum / Lists" <lists@harbaum.org> writes:

Hi,

arm-foo-blah-strip -R .ARM.attributes libsgx.so

Now you have a magical anyfp library, at least if none of the interfaces
use floating point parameters or return values.

OK, tried that (since someone else suggested that he had had success with it).
The libs load and the apps run, but neither polygons nor textures are visible,
which is easily explained by a mismatch in the way floats are passed between
the app and the libs. In the meantime, that person also found out that his setup
isn't really working ...

So there doesn't seem to be a way around hardfp compiled SGX libs here.

You could easily create a few wrappers for those functions that need
them.

I can in theory provide a batch of hard-float SGX OpenGL libraries --
unsupported -- but probably only for OMAP4. (I might be able to get
the code drop that I have to build for OMAP3, or extract from TI a
similar code drop for the OMAP3 family -- we have uses for such a
thing as well -- but I'm afraid that's speculative at best.) More to
the point, I can help provide justification for the request.

While Måns is right that you could technically create hardfp/softfp
wrappers with a bit of assembly fancy dancing, that would have to be
done for all APIs (not just those which pass floating point parameters/
results) and would have terrible performance (especially on Cortex-A8,
where moving a VFP register to an ARM register stalls the entire ARM
for 20 cycles or so). That's because the softfp calling convention
permits the callee to smash essentially *all* FPU state, while the
hardfp convention is callee-save for most VFP/NEON registers (d8 and
up plus a subset of flags). So those wrappers would have to save all
FPU state that the hardfp API considers callee-save, whether or not
the called function uses the FPU at all -- unless, of course, you are
willing to run the OpenGL libraries through some sort of binary static
analysis in order to find which FPU state each API touches. Ouch!

And while Koen is right that the hardfp calling convention does not
yet have much in the way of benchmark support -- and is arguably
suboptimal if your floating-point operations are concentrated inside
innermost C functions -- I expect that will change as GCC gets better
at using the NEON unit for integer SIMD and vectorized load/store
operations. Especially on Cortex-A9 and later cores -- which don't
have the severe penalty for inter-pipeline transfers, and do have
dedicated lanes to memory for the NEON unit -- the compiler can
tighten up the execution of rather a lot of code by trampolining
structure fetches and stores through the NEON. If, that is, it can
schedule them appropriately to account for latencies to and from
memory as well as the (reduced but non-zero) latency of VFP<->ARM
transfers. The softfp ABI interferes with this by denying the
compiler the privilege of rescheduling NEON instructions across a
function call -- even one that doesn't actually use any floating
point. (Any function call to which the ABI applies, anyway; which
doesn't include static C functions, I think, but does include all C++
instance methods even if they get inlined -- if I remember the spec
correctly.)

I should be able to produce some benchmark data in support of this
argument in the next month or so. If you want to check it out for
yourself, I suggest that you use the system at https://github.com/mkedwards/crosstool-ng
to produce a pair of toolchain/userland environments which differ only
in the choice of float ABI and are otherwise adapted to your chip.
The arm-cortex_a8-linux-gnueabi sample is probably pretty close to
what you need, although you'll need to remove clips-core, xmlrpcpp,
and ltrace from the config, and you might need to fix up a few
download URLs. You might also look at arm-cortex_a9-linux-gnueabi,
which uses the hard-float ABI and the actively developed Linaro GCC
4.6 branch, to which parts of the autovectorized NEON load/store have
been ported from GCC mainline. The toolchain's default fpu is
specified in the config file -- those samples are built with
neon-d16-fp16, for reasons I can explain if you're interested -- but
you should be able to use -mfpu to get whatever tunings you like
(don't forget -ffast-math if you really want NEON floating point).

This system constructs a fairly complete sysroot, with busybox for
most of the command-line utilities, plus a debug-root overlay that
adds debug tools (very lightly tested). Drop it onto some storage you
can see from your BeagleBoard, bind-mount /dev, /dev/pts, /proc,
/sys, /tmp, and maybe /var, and chroot in. Add whatever benchmark you
like (cross-compiled with the same toolchain, of course) and run.
Repeat with alternate float ABIs, cpu tunings, toolchain patches,
etc. Let me know what you learn. ;-) (#linaro on irc.freenode.net
is a good place.)

Cheers,
- Michael

"Edwards, Michael" <m.k.edwards@gmail.com> writes:

While Måns is right that you could technically create hardfp/softfp
wrappers with a bit of assembly fancy dancing,

There is an even simpler way. Declaring all functions with floating-point
parameters or return values as variadic will force soft-float parameter
passing when calling these. See the AAPCS (IHI0042D) section 6.4.1:

  6.4.1 VFP and Base Standard Compatibility

  Code compiled for the VFP calling standard is compatible with the base
  standard (and vice-versa) if no floating-point or containerized vector
  arguments or results are used, or if the only routines that pass or
  return such values are variadic routines.

that would have to be done for all APIs (not just those which pass
floating point parameters/results) and would have terrible
performance (especially on Cortex-A8, where moving a VFP register to
an ARM register stalls the entire ARM for 20 cycles or so).

The performance would be no more terrible than that of a system built
with softfloat calls using the libraries unaltered, and the performance
of such systems is apparently adequate.

That's because the softfp calling convention permits the callee to
smash essentially *all* FPU state,

Where did you get that notion? There is nothing in the ARM ABI docs to
support it. In fact, the paragraph quoted above directly contradicts
your claim.

while the hardfp convention is callee-save for most VFP/NEON registers
(d8 and up plus a subset of flags).

D16-D31 are caller-saved.

So those wrappers would have to save all FPU state that the hardfp API
considers callee-save,

Which is _exactly the same_ as the softfp. The AAPCS defines the
caller/callee-saved aspects independently of parameter passing.

whether or not the called function uses the FPU at all -- unless, of
course, you are willing to run the OpenGL libraries through some sort
of binary static analysis in order to find which FPU state each API
touches. Ouch!

Nice straw man.

And while Koen is right that the hardfp calling convention does not
yet have much in the way of benchmark support

Are you implying there is some not yet benchmarked case where it
performs significantly better?

-- and is arguably suboptimal if your floating-point operations
are concentrated inside innermost C functions --

Using VFP register parameters (i.e. doing nothing) is never less
efficient than moving them to core registers (doing something).

I expect that will change as GCC gets better at using the NEON unit
for integer SIMD and vectorized load/store operations.

Are you saying increased use of NEON by gcc will make hardfp calls
slower?

Especially on Cortex-A9 and later cores -- which don't have the severe
penalty for inter-pipeline transfers,

The A9 and later indeed make the softfp calls less costly, reducing any
advantage hardfp might have (which is already small in benchmarks on A8).

and do have dedicated lanes to memory for the NEON unit

No core released to date, including the A15, has dedicated memory lanes
for NEON. All the Cortex-A* cores have a common load/store unit for all
types of instructions. Some can do multiple concurrent accesses, but
that's orthogonal to this discussion.

-- the compiler can tighten up the execution of rather a lot of code
by trampolining structure fetches and stores through the NEON.

Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.

If, that is, it can schedule them appropriately to account for
latencies to and from memory as well as the (reduced but non-zero)
latency of VFP<->ARM transfers.

The out-of-order issue on the A9 and later makes most such tricks unnecessary.

The softfp ABI interferes with this by denying the compiler the
privilege of rescheduling NEON instructions across a function call
-- even one that doesn't actually use any floating point.

To the extent scheduling across function calls is permitted by the C
standard, the manner of passing parameters has no bearing on such
optimisations.

(Any function call to which the ABI applies, anyway; which doesn't
include static C functions, I think, but does include all C++ instance
methods even if they get inlined -- if I remember the spec correctly.)

If a function is fully inlined, the compiler can of course do whatever
it pleases. That is the entire point of inlining.

I should be able to produce some benchmark data in support of this
argument in the next month or so.

You must have a unique approach to benchmarking if it produces results
contradicting everybody else's. Have you considered patenting your
methods?

(don't forget -ffast-math if you really want NEON floating point).

-ffast-math should only be used with extreme caution as it will give
vastly different results in many cases. Specifically, anything relying
on infinities or NaN values becomes unpredictable, and operations with
very large or very small numbers may lose precision.