"Edwards, Michael" <m.k.edwards@gmail.com> writes:
> While Måns is right that you could technically create hardfp/softfp
> wrappers with a bit of assembly fancy dancing,
There is an even simpler way. Declaring all functions with floating-point
parameters or return values as variadic will force soft-float parameter
passing when calling them. See the AAPCS (IHI0042D) section 6.4.1:
  6.4.1 VFP and Base Standard Compatibility

  Code compiled for the VFP calling standard is compatible with the base
  standard (and vice-versa) if no floating-point or containerized vector
  arguments or results are used, or if the only routines that pass or
  return such values are variadic routines.
> that would have to be done for all APIs (not just those which pass
> floating point parameters/results) and would have terrible
> performance (especially on Cortex-A8, where moving a VFP register to
> an ARM register stalls the entire ARM for 20 cycles or so).
The performance would be no more terrible than that of a system built
with softfloat calls using the libraries unaltered, and the performance
of such systems is apparently adequate.
> That's because the softfp calling convention permits the callee to
> smash essentially *all* FPU state,
Where did you get that notion? There is nothing in the ARM ABI docs to
support it. In fact, the paragraph quoted above directly contradicts
your claim.
> while the hardfp convention is callee-save for most VFP/NEON registers
> (d8 and up plus a subset of flags).
D16-D31 are caller-saved.
> So those wrappers would have to save all FPU state that the hardfp API
> considers callee-save,
Which is _exactly the same_ as with softfp. The AAPCS defines the
caller-/callee-saved register split independently of parameter passing.
> whether or not the called function uses the FPU at all -- unless, of
> course, you are willing to run the OpenGL libraries through some sort
> of binary static analysis in order to find which FPU state each API
> touches. Ouch!
Nice straw man.
> And while Koen is right that the hardfp calling convention does not
> yet have much in the way of benchmark support
Are you implying there is some not yet benchmarked case where it
performs significantly better?
> -- and is arguably sub-optimal if your floating-point operations are
> concentrated inside innermost C functions --
Using VFP register parameters (i.e. doing nothing) is never less
efficient than moving them to core registers (doing something).
> I expect that will change as GCC gets better at using the NEON unit
> for integer SIMD and vectorized load/store operations.
Are you saying increased use of NEON by gcc will make hardfp calls
slower?
> Especially on Cortex-A9 and later cores -- which don't have the severe
> penalty for inter-pipeline transfers,
The A9 and later indeed make the softfp calls less costly, reducing any
advantage hardfp might have (which is already small in benchmarks on A8).
> and do have dedicated lanes to memory for the NEON unit
No core released to date, including the A15, has dedicated memory lanes
for NEON. All the Cortex-A* cores have a common load/store unit for all
types of instructions. Some can do multiple concurrent accesses, but
that's orthogonal to this discussion.
> -- the compiler can tighten up the execution of rather a lot of code
> by trampolining structure fetches and stores through the NEON.
Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.
> If, that is, it can schedule them appropriately to account for
> latencies to and from memory as well as the (reduced but non-zero)
> latency of VFP<->ARM transfers.
The out-of-order issue on the A9 and later makes most such tricks unnecessary.
> The softfp ABI interferes with this by denying the compiler the
> privilege of rescheduling NEON instructions across a function call
> -- even one that doesn't actually use any floating point.
To the extent scheduling across function calls is permitted by the C
standard, the manner of passing parameters has no bearing on such
optimisations.
> (Any function call to which the ABI applies, anyway; which doesn't
> include static C functions, I think, but does include all C++ instance
> methods even if they get inlined -- if I remember the spec correctly.)
If a function is fully inlined, the compiler can of course do whatever
it pleases. That is the entire point of inlining.
> I should be able to produce some benchmark data in support of this
> argument in the next month or so.
You must have a unique approach to benchmarking if it produces results
contradicting everybody else's. Have you considered patenting your
methods?
> (don't forget -ffast-math if you really want NEON floating point).
-ffast-math should only be used with extreme caution as it will give
vastly different results in many cases. Specifically, anything relying
on infinities or NaN values becomes unpredictable, and operations with
very large or very small numbers may lose precision.