SGX libs with hardfp support?

"Edwards, Michael" <m.k.edwards@gmail.com> writes:

While Måns is right that you could technically create hardfp/softfp
wrappers with a bit of assembly fancy dancing,

There is an even simpler way. Declaring all functions with floating-point
parameters or return values as variadic will force soft-float parameter
passing when calling these. See the AAPCS (IHI0042D) section 6.4.1:

6.4.1 VFP and Base Standard Compatibility

Code compiled for the VFP calling standard is compatible with the base
standard (and vice-versa) if no floating-point or containerized vector
arguments or results are used, or if the only routines that pass or
return such values are variadic routines.

That's a good suggestion.

and do have dedicated lanes to memory for the NEON unit

No core released to date, including the A15, has dedicated memory lanes
for NEON. All the Cortex-A* cores have a common load/store unit for all
types of instructions. Some can do multiple concurrent accesses, but
that's orthogonal to this discussion.

Probably he wanted to say that the NEON unit in the Cortex-A8 can
load/store 128 bits of data per cycle when accessing L1 cache
*memory*, while ordinary ARM load/store instructions can't handle more
than 64 bits per cycle there. This makes sense in the context of this
discussion, because loading data into NEON/VFP registers directly,
without dragging it through ARM registers, is not a bad idea.

-- the compiler can tighten up the execution of rather a lot of code
by trampolining structure fetches and stores through the NEON.

Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.

My understanding is that it's exactly the other way around. Using
hardfp makes it possible to avoid going through ARM registers for
floating-point data, which otherwise might be needed in some cases for
the sole purpose of fulfilling ABI requirements. You are going a bit
overboard trying to argue with absolutely everything that Edwards has
posted :slight_smile:

As for NEON vs. LDM/STM: there is indeed no reason why, for example, a
NEON memcpy should be faster than LDM/STM for large memory buffers
that do not fit in the caches. But it still is the case on OMAP3,
along with some other memory-performance-related WTF questions.

If, that is, it can schedule them appropriately to account for
latencies to and from memory as well as the (reduced but non-zero)
latency of VFP<->ARM transfers.

The out of order issue on A9 and later makes most such tricks unnecessary.

The VFP/NEON unit in the A9 is still in-order.

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:

and do have dedicated lanes to memory for the NEON unit

No core released to date, including the A15, has dedicated memory lanes
for NEON. All the Cortex-A* cores have a common load/store unit for all
types of instructions. Some can do multiple concurrent accesses, but
that's orthogonal to this discussion.

Probably he wanted to say that the NEON unit in the Cortex-A8 can
load/store 128 bits of data per cycle when accessing L1 cache
*memory*, while ordinary ARM load/store instructions can't handle more
than 64 bits per cycle there. This makes sense in the context of this
discussion, because loading data into NEON/VFP registers directly,
without dragging it through ARM registers, is not a bad idea.

That has nothing to do with calling conventions.

-- the compiler can tighten up the execution of rather a lot of code
by trampolining structure fetches and stores through the NEON.

Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.

My understanding is that it's exactly the other way around. Using
hardfp makes it possible to avoid going through ARM registers for
floating-point data, which otherwise might be needed in some cases for
the sole purpose of fulfilling ABI requirements. You are going a bit
overboard trying to argue with absolutely everything that Edwards has
posted :slight_smile:

I think he is under the false impression that softfp doesn't have any
callee-saved registers. If that were the case, a leaf function would
avoid the tiny overhead of preserving d8-d15. I can't imagine any
situation where this would make a difference, even if it were true.

As for NEON vs. LDM/STM: there is indeed no reason why, for example, a
NEON memcpy should be faster than LDM/STM for large memory buffers
that do not fit in the caches. But it still is the case on OMAP3,
along with some other memory-performance-related WTF questions.

Using NEON for memcpy has the potential of being more efficient simply
because it has enough registers to hold several cache lines of data.
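
For illustration, an inner copy loop of the kind being described might
look like this with GCC NEON intrinsics (a minimal sketch: 16-byte
alignment and a length that is a multiple of 128 are assumed, and tail
handling and prefetch are omitted). Each iteration stages two 64-byte
cache lines' worth of data in q-registers before storing:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes, staging 128 bytes (two cache lines on A8/A9)
     * in NEON q-registers per iteration. */
    void neon_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i < len; i += 128) {
            uint8x16_t a = vld1q_u8(src + i);
            uint8x16_t b = vld1q_u8(src + i + 16);
            uint8x16_t c = vld1q_u8(src + i + 32);
            uint8x16_t d = vld1q_u8(src + i + 48);
            uint8x16_t e = vld1q_u8(src + i + 64);
            uint8x16_t f = vld1q_u8(src + i + 80);
            uint8x16_t g = vld1q_u8(src + i + 96);
            uint8x16_t h = vld1q_u8(src + i + 112);
            vst1q_u8(dst + i,       a);
            vst1q_u8(dst + i + 16,  b);
            vst1q_u8(dst + i + 32,  c);
            vst1q_u8(dst + i + 48,  d);
            vst1q_u8(dst + i + 64,  e);
            vst1q_u8(dst + i + 80,  f);
            vst1q_u8(dst + i + 96,  g);
            vst1q_u8(dst + i + 112, h);
        }
    }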

Michael seems to be arguing for loading things to NEON registers, then
transferring to ARM rather than loading directly to core registers,
which would be an entirely pointless thing to do.

If, that is, it can schedule them appropriately to account for
latencies to and from memory as well as the (reduced but non-zero)
latency of VFP<->ARM transfers.

The out of order issue on A9 and later makes most such tricks unnecessary.

The VFP/NEON unit in the A9 is still in-order.

The A9 issues normal loads out of order with other integer instructions,
meaning bouncing data through NEON is pointless.

[...]

I think he is under the false impression that softfp doesn't have any
callee-saved registers. If that were the case, a leaf function would
avoid the tiny overhead of preserving d8-d15. I can't imagine any
situation where this would make a difference, even if it were true.

The ARM AAPCS says this:

  Registers s16-s31 (d8-d15, q4-q7) must be preserved across
  subroutine calls; registers s0-s15 (d0-d7, q0-q3) do not need
  to be preserved (and can be used for passing arguments or
  returning results in standard procedure-call variants).
  Registers d16-d31 (q8-q15), if present, do not need to be
  preserved.

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/index.html

Laurent

Laurent Desnogues <laurent.desnogues@gmail.com> writes:

When the hardware has somewhat weird and unpredictable behavior, it
surely makes sense to try different ways of doing the same thing and
select whichever appears to provide better results in practice. But
normally holding several cache lines of data is not a very good use of
the registers. All accesses to memory are buffered, and the victim
buffer can hold multiple cache lines anyway. So if you are worried
about a potential negative impact of interleaving reads and writes to
SDRAM, that is supposed to be addressed already.

I don't know about the other Cortex-A8 based SoCs, but the Samsung
Hummingbird, for example, seems to be very well behaved and
predictable in everything related to memory performance. It only
requires some prefetching via PLD instructions, but that's enough to
fully utilize memory bandwidth in many cases, regardless of what kind
of instructions are actually used to access the memory. The OMAPs are
certainly more difficult and may need special tricks to avoid
unexpected performance losses.
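
To make the PLD point concrete: GCC's __builtin_prefetch() compiles to
a PLD instruction on ARM, so even a plain word-copy loop can be given
the kind of prefetching described here. A sketch; the prefetch
distance of 256 bytes is only a starting point and has to be tuned per
SoC:

    #include <stddef.h>
    #include <stdint.h>

    /* n_words is assumed to be a multiple of 8. */
    void copy_with_pld(uint32_t *dst, const uint32_t *src, size_t n_words)
    {
        for (size_t i = 0; i < n_words; i += 8) {
            __builtin_prefetch(src + i + 64);   /* PLD ~256 bytes ahead */
            for (size_t j = 0; j < 8; j++)
                dst[i + j] = src[i + j];
        }
    }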

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:

If we are speaking about the A8 (that's what is used in Beagleboards
after all), write-allocate is not enabled for it by default in the
Linux kernel, the last time I checked. And based on my old tests,
enabling write-allocate was a real performance disaster for the
OMAP3430 on memcpy-like workloads. The OMAP3630 was better, but still
suffered a measurable slowdown. And I could not find any real use case
where write-allocate showed a clear performance advantage. If you have
different results that prove write-allocate is useful on the A8, then
I'm definitely interested in that information.

The A9 is a bit of a different beast, and write-allocate is needed
there for SMP. Still, one of the things the OMAP4430 does quite well
is memset, so it does not seem to suffer from write-allocate at all.
Anyway, I'm still waiting for my Origen board to be delivered before
doing an in-depth comparison between OMAP4 and Exynos4 to get a better
understanding of what the ARM Cortex-A9 is actually capable of.

But even theoretically, one store buffer should be enough to eliminate
any needless line fills if the data is sequentially written to a
single destination buffer and if no other unrelated memory writes are
happening in the same inner loop.

> "Edwards, Michael" <m.k.edwa...@gmail.com> writes:

>> While Måns is right that you could technically create hardfp/softfp
>> wrappers with a bit of assembly fancy dancing,

> There is an even simpler way. Declaring all functions with floating-point
> parameters or return values as variadic will force soft-float parameter
> passing when calling these. See the AAPCS (IHI0042D) section 6.4.1:

> 6.4.1 VFP and Base Standard Compatibility

> Code compiled for the VFP calling standard is compatible with the base
> standard (and vice-versa) if no floating-point or containerized vector
> arguments or results are used, or if the only routines that pass or
> return such values are variadic routines.

That's a good suggestion.

Except for the fact that it would require changing the Khronos header
files against which the client code builds ... but yes, this is
probably the best available solution if one is stuck with softfp
shared libraries and one has full source code for the caller. Thanks,
Måns, you may have saved my bacon on our OMAP3-ish platform, where I
don't have source code and am not likely to get it.
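
To make the mechanism concrete: the AAPCS marshals a variadic routine
per the base standard even under the VFP (hardfp) variant, so a
hardfp-compiled caller will put the float argument in a core register,
which is exactly where a softfp-built library expects it. A sketch
with a made-up entry point; in practice the change would go into the
Khronos headers:

    /* The softfp-built library's real prototype (hypothetical name):
     *     void pvrSetBias(float bias);
     * In the header seen by hardfp-compiled callers, declare it
     * variadic instead.  The named parameter keeps its type (default
     * argument promotion only applies to arguments matching the
     * "..."), but because the call is now variadic the caller uses
     * base-standard marshalling: 'bias' travels in r0, not s0.      */
    void pvrSetBias(float bias, ...);

    static void example(void)
    {
        pvrSetBias(0.5f);   /* passed the way the softfp library expects */
    }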

>> and do have dedicated lanes to memory for the NEON unit

> No core released to date, including the A15, has dedicated memory lanes
> for NEON. All the Cortex-A* cores have a common load/store unit for all
> types of instructions. Some can do multiple concurrent accesses, but
> that's orthogonal to this discussion.

Probably he wanted to say that the NEON unit in the Cortex-A8 can
load/store 128 bits of data per cycle when accessing L1 cache
*memory*, while ordinary ARM load/store instructions can't handle more
than 64 bits per cycle there. This makes sense in the context of this
discussion, because loading data into NEON/VFP registers directly,
without dragging it through ARM registers, is not a bad idea.

That's close to what I meant. The load/store path *to main memory* is
indeed shared. But within the cache hierarchy, at least on the
Cortex-A8, ARM and NEON take separate paths. And that's a good thing,
because the ARM stalls on an L1 miss, and it would be rather bad if it
had to wait for a big NEON transfer to complete before it could fill
from L2. Moreover, the only way to get "streaming" performance
(back-to-back AXI burst transactions) on uncacheable regions is by
using the NEON. That's almost impossible to determine from the TRM,
but it's there, in the Cortex-A8 TRM on the Arm Developer site;
compare against the LDM/STM section of the same document.

On the A8, the NEON bypasses the L1 cache, and has a dedicated lane
(probably the wrong word, sorry) into the L2 cache -- or for
uncacheable mappings, *past* the L2 per se to its AXI scheduler. See
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/ch08s02s02.html
. In addition, NEON load/store operations can be issued in parallel
with integer code, and there can be as many as 12 NEON reads
outstanding in L2 -- vs. the maximum of 4 total cache line refills and
evictions. So if you are moving data around without doing non-SIMD
operations on it, and without branching based on its contents, you can
do so without polluting L1 cache, or contending with L1 misses that
hit in L2.

There will be some contention between NEON-side loads and ARM-side L2
misses, but even that is negligible if you issue a preload early
enough (which you should do anyway for fetches that you suspect will
miss L2, because the compiler schedules loads based on the assumption
of an L1 hit; an L1 miss stalls the ARM side until it's satisfied).
Preloads do not have any effect if you miss in TLB, and they don't
force premature evictions from L1 cache (they only load as far as
L2). And the contention on the write side is negligible thanks to the
write allocation mechanism, except insofar as you may approach
saturation of the AXI interface due to the total rate of L2 evictions/
linefills and cache-bypassing traffic -- in which case,
congratulations! Your code is well tuned and operates at the maximum
rate that the path to main memory permits.

If you are fetching data from an uncacheable region, using the NEON to
trampoline into a cacheable region should be a *huge* win. Remember,
an L1 miss stalls the ARM side, and the only way to get data into L1
is to fetch and miss. If you want it to hit in L2, you have to use
the NEON to put it there, by fetching up to 128 bytes at a go from the
uncacheable region (e. g., VLDM r1,{d16-d31}) and storing it to a
cacheable buffer (i. e., only as far as L2, since you write it again
and again without an eviction). You want to limit fetches from the
ARM side to cacheable regions; otherwise every LDM is a round-trip to
AXI.
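
A sketch of that trampoline, written with GCC extended asm so that the
VLDM/VSTM pair is explicit (it assumes -mfpu=neon and 16-byte-aligned
pointers; the same primitive works in either direction, uncacheable
source into a cacheable bounce buffer, or cacheable staging buffer out
to an uncacheable destination):

    /* Move one 128-byte chunk through d16-d31 so the ARM side never
     * issues a load against the uncacheable region. */
    static inline void neon_move_128(void *dst, const void *src)
    {
        __asm__ volatile(
            "vldm %[s], {d16-d31}\n\t"   /* one long burst in  */
            "vstm %[d], {d16-d31}\n\t"   /* one long burst out */
            :
            : [s] "r" (src), [d] "r" (dst)
            : "memory",
              "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23",
              "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31");
    }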

The store story is similar. You want the equivalent of the x86's
magic "fill buffers" -- which avoid the read-modify-write penalty when
writing whole cache lines' worth of data through uncacheable write-
combining mappings, but only if you use cache-bypassing SSE2 writes.
To get it, you need to write from the ARM to cacheable memory, then
load that data to NEON registers and store from there. That pushes up
to two whole cache lines' worth of data at a time down to the L2
controller, which queues the write without blocking the NEON. (This
is the only way to get an AXI burst longer than 4 128-bit transactions
without using the preload engine.)

One more nice thing about doing your bulk data transfers this way,
instead of monkeying with some DMA unit (which you probably can't do
in userland anyway), is that there are no explicit cache operations to
deal with. You don't have to worry about data stalling in L1, because
the NEON loads do peek data *from* L1 even though they don't load data
*to* L1. (Not unless you turn on the L1NEON bit in the Auxiliary
Control Register, which you don't want to do unless you have no L2
cache, in which case you have a whole different set of problems.)

The Cortex-A9 is a whole different animal, with out-of-order issue on
the ARM side and two automatic prefetch mechanisms (based on detection
of miss patterns at L1 and, in MPCore only, at L2). It also has a far
less detailed TRM, so I can't begin to analyze its memory hierarchy.
Given that the L2 cache has been hived off to an external unit, and
the penalty for transfers between the ARM and NEON units has been
greatly decreased, I would guess that the NEON goes through the L1
just like the ARM. That changes the game a little -- the NEON
transfers to/from cacheable memory can now cause eviction of the ARM's
working set from L1 -- but in practice that's probably a wash. The
basic premise (that you want to do your noncacheable transactions in
big bursts, feasible only from the NEON side) still holds.

>> -- the compiler can tighten up the execution of rather a lot of code
>> by trampolining structure fetches and stores through the NEON.

> Do you have any numbers to back this up? I don't see how going through
> NEON registers would be faster than direct LDM/STM on any core.

My understanding is that it's exactly the other way around. Using
hardfp makes it possible to avoid going through ARM registers for
floating-point data, which otherwise might be needed in some cases for
the sole purpose of fulfilling ABI requirements. You are going a bit
overboard trying to argue with absolutely everything that Edwards has
posted :slight_smile:

Not just for floating point data, but for SIMD integer data as well,
or really anything you want -- as long as you frame it as a
"Homogeneous Aggregate of containerized vectors". That's an extra 64
bytes of structure that you can pass in, and let the callee decide
whether and when to spill a copy to a cache-line-aligned buffer (so
that it can then fetch the lot to the ARM L1 -- which might as well be
registers, as far as memory latency is concerned -- in one L1 miss).
Or you can do actual float/SIMD operations with the data, and return a
healthy chunk in registers, without ever touching memory. (To be
precise, per the AAPCS, you can pass in two 32-byte chunks as
"Homogeneous Aggregates with a Base Type of 128-bit containerized
vectors with four Elements", and return one.)
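
In C with NEON intrinsics, the kind of aggregate being described looks
something like the sketch below (names are illustrative). Under the
VFP (hardfp) variant of the AAPCS, a structure of four 128-bit vectors
is a Homogeneous Aggregate, so it is passed in Q0-Q3 and returned the
same way; under softfp the same 64 bytes go through core registers and
the stack:

    #include <arm_neon.h>

    /* Homogeneous Aggregate with a Base Type of 128-bit containerized
     * vectors, four elements: 64 bytes, Q0-Q3 under hardfp. */
    typedef struct {
        float32x4_t v[4];
    } qblock;

    /* 'in' arrives in q0-q3 and the result leaves the same way; the
     * block never has to touch memory on its way through. */
    qblock double_block(qblock in)
    {
        qblock out;
        for (int i = 0; i < 4; i++)
            out.v[i] = vaddq_f32(in.v[i], in.v[i]);
        return out;
    }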

The point is not really to have "more registers"; the integer
"registers" are just names anyway, and the L1 cache is almost as
close. Nor is it to pass floating point values to and from public
function calls cheaply; that's worth almost nothing on system scale.
Even in code that uses no floating point or SIMD whatever, there are
potentially big gains from:

  * postponing the transfer of up to 64 bytes of operands from the VFP/
NEON bank to the integer side, allowing more time for pending NEON
operations (especially structure loads) to complete;

  * omitting the transfer from NEON to ARM entirely, if the operands
turn out to be unneeded (or simply written elsewhere in memory without
needing to be touched by the ARM);

  * returning up to 32 bytes of results in the VFP/NEON register bank,
possibly from an address that missed in L2, without stalling to wait
for a pending load to complete;

  * preserving an additional 32 bytes of VFP/NEON state across
functions that don't need big operands or return values, if you are
willing to alter their function signatures to do so (at zero run-time
cost, if you're systematic about it);

  * and, if you really do have to move those operands to the ARM,
doing so explicitly and efficiently (by spilling the whole block to a
cache-line-aligned buffer in L2, fetching it back into L1 with a
single load, and filling the delay with some other useful work)
instead of in the worst way possible (by transferring them from VFP to
ARM registers, 4 bytes at a time, before entering the function).

As for NEON vs. LDM/STM: there is indeed no reason why, for example, a
NEON memcpy should be faster than LDM/STM for large memory buffers
that do not fit in the caches. But it still is the case on OMAP3,
along with some other memory-performance-related WTF questions.

I hope I've clarified this a bit above. But don't take my word for
it; these techniques are almost exactly the same as those described in
Intel's cheat sheet in the Intel® Media SDK documentation, except that
there is no need for the equivalent of "fill buffers" /
"write combining buffers" because VLDM/VSTM can move 128 bytes at a
time. (It's probable that the right micro-optimization is to work in
64-byte chunks and pipeline more deeply; I haven't benchmarked yet.)

>> If, that is, it can schedule them appropriately to account for
>> latencies to and from memory as well as the (reduced but non-zero)
>> latency of VFP<->ARM transfers.

> The out of order issue on A9 and later makes most such tricks unnecessary.

The VFP/NEON unit in the A9 is still in-order.

True but mostly irrelevant. If your code is at all tight, and your
working set doesn't fit into L2 cache, all the mere arithmetic
pipelines should be stalled most of the time. The name of the game is
to race as quickly as possible from one fetch from an uncacheable /
unpredictable address to the next that depends on it, and to get as
high an interleave among such fetch chains as possible. If your
working set isn't larger than L2 cache, why bother thinking about
performance at all? Your algorithm could be O(n^3) and coded by
banana slugs, and it would still get the job done.

Cheers,
- Michael

The first half of my response is in the earlier reply to Siarhei;
please read it first. That reply included at least one stupid error;
of course a "Homogeneous Aggregate with a Base Type of 128-bit
containerized vectors with four Elements" fills Q0-Q3 (64 bytes), so
you can preserve all 64 bytes across a non-floating-point-using API,
or pass a 64-byte return value, not just 32. Please chalk that, and
any equally trivial thinkos, up to my being a bit short on sleep. You
should still flame me for major thinkos. :slight_smile:

Otherwise, this just attempts to answer those of Måns's critiques that
Siarhei didn't follow up on.

The performance would be no more terrible than that of a system built
with softfloat calls using the libraries unaltered, and the performance
of such systems is apparently adequate.

Not for my purposes, it's not; but then I write a lot of heavily
templatized C++ code, and am willing to go through some fairly ugly
contortions to tighten it up. However, my statement about having to
wrap function calls with no floating point parameters was quite wrong,
and I retract it unconditionally.

> That's because the softfp calling convention permits the callee to
> smash essentially *all* FPU state,

Where did you get that notion. There is nothing in the ARM ABI docs to
support it. In fact, the paragraph quoted above directly contradicts
your claim.

You're absolutely right. Q4-Q7 are just as callee-save under the
softfp ABI as they are under the hardfp ABI. The only additional
*explicit* state that the official hardfp convention allows one to
preserve -- not trivially, but with some effort -- is Q0-Q3. (That
can be done by systematically altering your otherwise non-floating-
point-using APIs.)

I've been so focused on the latency issues associated with *implicit*
state (not stalling for pending NEON loads into Q0-Q3 to complete, on
the way either in or out) that I haven't bothered to look at the
AAPCS in a while. I've been repeating a misremembered, false
statement about the callee-save register set. Thank you for
correcting me.

> while the hardfp convention is callee-save for most VFP/NEON registers
> (d8 and up plus a subset of flags).

D16-D31 are caller-saved.

Mmm, so they are. This is another thing I was misremembering.
Largely because I don't permit userland code to use them. I work on
embedded systems where I control how all the code is compiled, and I
compile for a neon-d16-fp16 model that doesn't correspond to any real
hardware. I intend to reserve the upper half of the VFP/NEON register
bank for use in-kernel, so I can trampoline data moves through D16-D31
without having to save userland's content and restore it afterwards.
(Not because saving and restoring them is expensive, but because it
would have to be done from a place in the kernel where the FPU context-
save thingy is handy. I'd rather just use Q8-Q15 as scratch registers
anywhere in the kernel I want to, with nothing to save/restore but the
FPSCR.)

So the fact that D16-D31 are caller-saved per the AAPCS -- which is
silly, if you ask me! Any callee that needs them is macroscopic
anyway! -- got mixed up in my brain with the fact that the kernel
would have to save them before use when the userland process is
preempted. Mea culpa.

> So those wrappers would have to save all FPU state that the hardfp API
> considers callee-save,

Which is _exactly the same_ as the softfp. The AAPCS defines the
caller/callee-saved aspects independently of parameter passing.

Actually, the wrappers don't have to save callee-save FPU state,
because the functions they call will. Mea culpa again.

> whether or not the called function uses the FPU at all -- unless, of
> course, you are willing to run the OpenGL libraries through some sort
> of binary static analysis in order to find which FPU state each API
> touches. Ouch!

Nice straw man.

And quite wrong, as written. I'd need to do that if I wanted to
"preserve" Q0-Q3 by subterfuge across calls to non-floating-point-
using APIs, without actually saving and restoring them. There are
APIs for which it may be worth doing this, but the OpenGL interfaces
aren't among them.

> And while Koen is right that the hardfp calling convention does not
> yet have much in the way of benchmark support

Are you implying there is some not yet benchmarked case where it
performs significantly better?

Oh yes. Presently, only when combined with APIs that sling structures
opaquely as composite types, and code that uses NEON intrinsics to
load and store them. But I am expecting those techniques to become
common inside template libraries within the next couple of years. And
even in some non-template libraries; you might take a look at the NEON
specializations inside Cairo and libjpeg-turbo, and extrapolate those
to the hard-float case.

> -- and is arguably sub-optimal if your floating-point operations are
> concentrated inside innermost C functions --

Using VFP register parameters (i.e. doing nothing) is never less
efficient than moving them to core registers (doing something).

On the contrary; hardfp can definitely be a net loss on real code.
Consider cases where the outer function slings structures with mixed
integers and floats, and the inner function does the actual floating
point arithmetic. The hardfp convention requires the caller to
transfer floating point parameters into VFP registers before entering
the function, rather than leaving them in integer registers (where
they can be put for free, because they are already in L1). Even if I
give you that the callee wants to do VFP arithmetic on those operands,
the compiler won't know not to schedule that arithmetic as the first
instruction after the function preamble, stalling the NEON until the
transfer completes. If it comes in via the integer side, the compiler
has all the latency information when compiling the callee, and may
well produce code that runs faster in practice.

That's probably a trivial effect; but at least on Cortex-A8, there are
others that hit some code bases much harder. What if the callee does
no arithmetic, but passes the argument to a variadic function? Or the
callee returns a value fetched from memory, which happens to be
floating point, and the caller turns around and sticks it into an
otherwise integer-filled structure? Either way, you take the full hit
of the transfer to D0 and back to the integer side, for nothing. Both
of these are actually quite common cases in template expansions. So
the hard-float convention isn't a panacea, and has to be treated with
some care when combined with compile-time polymorphism.

> I expect that will change as GCC gets better at using the NEON unit
> for integer SIMD and vectorized load/store operations.

Are you saying increased use of NEON by gcc will make hardfp calls
slower?

The reverse; but I can understand your reading my contorted syntax
that way. I expect that GCC will get better at using the NEON unit
for non-floating-point purposes. That will make it worthwhile for
core libraries, from eglibc and libstdc++ on up, to adapt their
internal calling conventions to permit the sort of "stupid
rescheduling tricks" that win when building hardfp.

You may say that it shouldn't matter for APIs that aren't "publicly
visible", and that no human-readable API should do stupid things like
pass an opaque operand in Q0-Q3 and return it unchanged as its return
value (still in Q0-Q3). But in practice, "publicly visible" includes
any symbol exposed by a shared library, and in the case of libstdc++
that means not the standards-based template APIs but the base classes
underneath. Take a look at Boost for an idea of what template wizards
will do to transform a (relatively) human-friendly API into the ABI
nastiness that makes for high performance.

> Especially on Cortex-A9 and later cores -- which don't have the severe
> penalty for inter-pipeline transfers,

The A9 and later indeed make the softfp calls less costly, reducing any
advantage hardfp might have (which is already small in benchmarks on A8).

Even the idea that A9 is less friendly *overall* to hardfp than A8 is
debatable, at the current level of compiler implementation. But I
will refrain from debating it without data. :wink:

> and do have dedicated lanes to memory for the NEON unit

No core released to date, including the A15, has dedicated memory lanes
for NEON. All the Cortex-A* cores have a common load/store unit for all
types of instructions. Some can do multiple concurrent accesses, but
that's orthogonal to this discussion.

Sorry, not "dedicated lanes to memory"; I seem to recall that's true
of one of the non-ARM implementations of ARMv7+NEON, but you're
correct that it is untrue of both A9 and A15. (Snapdragon?
Hummingbird? I forget, and could be totally wrong.) What I should
have said is that the A9 and later cores have enough throughput to the
next layer of the memory hierarchy that it probably can't be saturated
without a higher interleave than they can achieve with cache refills/
evictions alone. But now that I look back at the details of the A8,
that's probably true of it as well. So let's rephrase that to "memory
throughput that is effectively dedicated to the NEON, because there's
no other way to use it".

> -- the compiler can tighten up the execution of rather a lot of code
> by trampolining structure fetches and stores through the NEON.

Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.

I will produce those numbers within the month, or admit defeat. :stuck_out_tongue:
Seriously, I'd better be able to substantiate this by mid-July or so,
or my team is going to have to rethink certain aspects of one of its
current development efforts.

> If, that is, it can schedule them appropriately to account for
> latencies to and from memory as well as the (reduced but non-zero)
> latency of VFP<->ARM transfers.

The out of order issue on A9 and later makes most such tricks unnecessary.

Er, no. Out of order issue helps reduce bubbles in the ALU for math-
intensive loads whose working set fits in cache. It doesn't do a
whole lot for the sort of systems I build. In some ways it's a
defect, because it makes it harder to control interleave patterns.
But this is not something I lose sleep over; there's all sorts of
other traffic going through my SoC's memory arbiter, and the ARM has
to settle for the leftovers. So it has to sprint as hard as it can
during the occasional, more or less periodic, relatively uncontended
intervals.

> The softfp ABI interferes with this by denying the compiler the
> privilege of rescheduling NEON instructions across a function call
> -- even one that doesn't actually use any floating point.

To the extent scheduling across function calls is permitted by the C
standard, the manner of passing parameters has no bearing on such
optimisations.

OK, I admit that I'm planning to cheat here. I'm going to keep state
that the compiler would otherwise allocate to the callee-save
registers in Q0-Q3, and keep passing this block into and back out of
mostly non-floating-point-using APIs, which effectively makes it
callee-save state that doesn't wind up being touched by the callee.
When combined with the neon-d16-fp16 model, this should induce the
compiler to use Q4-Q7 as its NEON working set. Since it knows this
range is callee-save, it's safe to schedule loads with ample provision
for cache miss latency, even if it has to move them across function/
method calls.

GCC may not be smart (stupid?) enough to do this yet. If not, I guess
I have some hacking to do. But as far as I can tell there's nothing
in the AAPCS that says I shouldn't do this; and in fact I shouldn't
have to cajole the compiler this hard. There really ought to be a way
to annotate a pointer variable to tell the compiler that fetches
through it will probably miss cache, and should be scheduled
accordingly. That should induce the use of Q4-Q7 where appropriate.

Anyhow, you're right that this is not a virtue of hardfp per se. The
only incremental effect of hardfp is to enable one layer of code to
use Q0-Q3 as quasi-callee-save state, with the cooperation of the API
of the layer beneath. Which may prove useful in the hands of the
Boosties, or even of a relative amateur like myself; but that's sort
of orthogonal to the set-the-compiler-free argument, because the
compiler can't know that the callee keeps its hands off Q0-Q3 and thus
it's legit to move a fetch (or store) across such a function call. It
can only do that with official callee-save registers, which are the
same for both ABI variants.

> (Any function call to which the ABI applies, anyway; which doesn't
> include static C functions, I think, but does include all C++ instance
> methods even if they get inlined -- if I remember the spec correctly.)

If a function is fully inlined, the compiler can of course do whatever
it pleases. That is the entire point of inlining.

I think it's a little subtler than that in C++; but I am no language
lawyer. Suffice it to say that what the compiler does *in practice*
appears to be heavily influenced by whether there is any way for the
method to be called through a "publicly visible" symbol. I have seen
my binaries get a lot tighter when I ensured that a given instance
method was not merely "inline" but also "private", and thus the
compiler could verify that nobody could ever take its address. In a
few cases I even introduced a trivial wrapper, so that I could take
the address of the wrapper (indirectly via the likes of sigc::bind and
std::tr1::mem_fn) rather than of the "inline private" instance
method. And that change -- even though it did not affect the inline
private method's definition in any way -- affected the sites where it
was inlined! Strange are the ways of the C++ gods.

> I should be able to produce some benchmark data in support of this
> argument in the next month or so.

You must have a unique approach to benchmarking if it produces results
contradicting everybody else's. Have you considered patenting your
methods?

I suppose I could probably patent some of these techniques, in some
legal regimes. I wouldn't necessarily even have to prove they work
first. But in my experience the patent application process is such a
giant pain in the @$$ that I'd rather save it for things that are not
merely clever but revenue-generating. You see a way to make money off
this, *you* patent it.

In any case, the whole point of running my own benchmarks is to
produce results that "contradict" everyone else's. That's because
"everyone else" tries to make hardfp solve the wrong problem (how fast
you can make the wheels spin on a toy working set) instead of the
right problem (how fast you can get hold of a new working set in
response to an unpredictable external stimulus). (In actuality, I
think there are plenty of people, some far more skilled than, I
working on the right problem; but not a lot of that work is being done
in public view.)

> (don't forget -ffast-math if you really want NEON floating point).

-ffast-math should only be used with extreme caution as it will give
vastly different results in many cases. Specifically, anything relying
on infinities or NaN values becomes unpredictable, and operations with
very large or very small numbers may lose precision.

Agreed completely. I just wanted to make it clear that the fact that
I have baked -mfpu=neon-d16-fp16 into the toolchain I use does not
mean that it will generate NEON floating point by default. (Lest
someone else should benchmark it relative to a toolchain that has been
foolishly altered to use -ffast-math by default ... that way lies
madness.)
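
A small illustration of the breakage Måns describes (exact behaviour
is compiler- and version-dependent; this is just the classic NaN
self-comparison idiom):

    #include <stdio.h>

    int main(void)
    {
        volatile double zero = 0.0;
        double x = zero / zero;          /* NaN, computed at run time */
        /* Built normally this prints "NaN"; with -ffast-math the
         * compiler is allowed to assume x == x always holds and may
         * fold the test away, printing "not NaN?!". */
        if (x != x)
            printf("NaN\n");
        else
            printf("not NaN?!\n");
        return 0;
    }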

Cheers,
- Michael

"Edwards, Michael" <m.k.edwards@gmail.com> writes:

The performance would be no more terrible than that of a system built
with softfloat calls using the libraries unaltered, and the performance
of such systems is apparently adequate.

Not for my purposes, it's not; but then I write a lot of heavily
templatized C++ code, and am willing to go through some fairly ugly
contortions to tighten it up. However, my statement about having to
wrap function calls with no floating point parameters was quite wrong,
and I retract it unconditionally.

> That's because the softfp calling convention permits the callee to
> smash essentially *all* FPU state,

Where did you get that notion. There is nothing in the ARM ABI docs to
support it. In fact, the paragraph quoted above directly contradicts
your claim.

You're absolutely right. Q4-Q7 are just as callee-save under the
softfp ABI as they are under the hardfp ABI. The only additional
*explicit* state that the official hardfp convention allows one to
preserve -- not trivially, but with some effort -- is Q0-Q3. (That
can be done by systematically altering your otherwise non-floating-
point-using APIs.)

I fail to make sense of that paragraph. D0-D7 are call-clobbered, no
exceptions. If they are not used for arguments, the callee may still
use them as scratch registers.

> while the hardfp convention is callee-save for most VFP/NEON registers
> (d8 and up plus a subset of flags).

D16-D31 are caller-saved.

Mmm, so they are. This is another thing I was misremembering.
Largely because I don't permit userland code to use them.

So you've invented your own, crippled ABI, then complain about
performance. Clever.

I work on embedded systems where I control how all the code is
compiled, and I compile for a neon-d16-fp16 model that doesn't
correspond to any real hardware.

Any NEON implementation is required to have the full set of 32 D
registers. If you allow NEON, there is no point in restricting the
number of registers. (For pure VFP code, doing so allows the same code
to be used on both full and reduced register set implementations, at a
slight performance cost.)

I intend to reserve the upper half of the VFP/NEON register bank for
use in-kernel, so I can trampoline data moves through D16-D31 without
having to save userland's content and restore it afterwards. (Not
because saving and restoring them is expensive, but because it would
have to be done from a place in the kernel where the FPU context-save
thingy is handy. I'd rather just use Q8-Q15 as scratch registers
anywhere in the kernel I want to, with nothing to save/restore but the
FPSCR.)

I can't imagine the cost of stealing these registers from heavy
float/simd users being compensated by a few minor savings in the kernel.

> And while Koen is right that the hardfp calling convention does not
> yet have much in the way of benchmark support

Are you implying there is some not yet benchmarked case where it
performs significantly better?

Oh yes. Presently, only when combined with APIs that sling structures
opaquely as composite types, and code that uses NEON intrinsics to
load and store them.

Sounds like poor API design.

But I am expecting those techniques to become common inside template
libraries within the next couple of years.

If you are right, that's yet another reason to avoid such libraries.

And even in some non-template libraries; you might take a look at the
NEON specializations inside Cairo and libjpeg-turbo, and extrapolate
those to the hard-float case.

I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.

> -- and is arguably sub-optimal if your floating-point operations are
> concentrated inside innermost C functions --

Using VFP register parameters (i.e. doing nothing) is never less
efficient than moving them to core registers (doing something).

On the contrary; hardfp can definitely be a net loss on real code.
Consider cases where the outer function slings structures with mixed
integers and floats, and the inner function does the actual floating
point arithmetic. The hardfp convention requires the caller to
transfer floating point parameters into VFP registers before entering
the function, rather than leaving them in integer registers (where
they can be put for free, because they are already in L1).

Sounds like that API really ought to be passing a pointer to a struct,
not passing the struct by value.

That's probably a trivial effect; but at least on Cortex-A8, there are
others that hit some code bases much harder. What if the callee does
no arithmetic, but passes the argument to a variadic function? Or the
callee returns a value fetched from memory, which happens to be
floating point, and the caller turns around and sticks it into an
otherwise integer-filled structure? Either way, you take the full hit
of the transfer to D0 and back to the integer side, for nothing.

You seem to be missing something about how structs are actually
represented at the backend of a compiler.

> I expect that will change as GCC gets better at using the NEON unit
> for integer SIMD and vectorized load/store operations.

Are you saying increased use of NEON by gcc will make hardfp calls
slower?

The reverse; but I can understand your reading my contorted syntax
that way. I expect that GCC will get better at using the NEON unit
for non-floating-point purposes. That will make it worthwhile for
core libraries, from eglibc and libstdc++ on up, to adapt their
internal calling conventions to permit the sort of "stupid
rescheduling tricks" that win when building hardfp.

You may say that it shouldn't matter for APIs that aren't "publicly
visible", and that no human-readable API should do stupid things like
pass an opaque operand in Q0-Q3 and return it unchanged as its return
value (still in Q0-Q3).

Such a constraint cannot be expressed in a C API (nor a C++ one AFAIK).
To make that work, you'd have to either:

1. Change the ABI spec.
2. Teach the compiler extended semantics about specific functions in the
   same way it already recognises many standard library calls.
3. Write all code by hand in assembler with no standard calling
   conventions at all.

None of these seem particularly compelling, nor likely to happen.

> Especially on Cortex-A9 and later cores -- which don't have the severe
> penalty for inter-pipeline transfers,

The A9 and later indeed make the softfp calls less costly, reducing any
advantage hardfp might have (which is already small in benchmarks on A8).

Even the idea that A9 is less friendly *overall* to hardfp than A8 is
debatable, at the current level of compiler implementation.

The A9 is not in any way "less friendly" to hardfp. It is, however,
less hostile to softfp.

> -- the compiler can tighten up the execution of rather a lot of code
> by trampolining structure fetches and stores through the NEON.

Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.

I will produce those numbers within the month, or admit defeat. :stuck_out_tongue:
Seriously, I'd better be able to substantiate this by mid-July or so,
or my team is going to have to rethink certain aspects of one of its
current development efforts.

I'm glad I'm not invested in that effort.

> If, that is, it can schedule them appropriately to account for
> latencies to and from memory as well as the (reduced but non-zero)
> latency of VFP<->ARM transfers.

The out of order issue on A9 and later makes most such tricks unnecessary.

Er, no. Out of order issue helps reduce bubbles in the ALU for math-
intensive loads whose working set fits in cache.

Out of order issue potentially allows a load to be issued sooner than it
appears in the instruction stream, thus hiding some of the latency
whether it hits L1 or not.

> The softfp ABI interferes with this by denying the compiler the
> privilege of rescheduling NEON instructions across a function call
> -- even one that doesn't actually use any floating point.

To the extent scheduling across function calls is permitted by the C
standard, the manner of passing parameters has no bearing on such
optimisations.

OK, I admit that I'm planning to cheat here. I'm going to keep state
that the compiler would otherwise allocate to the callee-save
registers in Q0-Q3, and keep passing this block into and back out of
mostly non-floating-point-using APIs, which effectively makes it
callee-save state that doesn't wind up being touched by the callee.

So you've modified the ABI again.

When combined with the neon-d16-fp16 model, this should induce the
compiler to use Q4-Q7 as its NEON working set. Since it knows this
range is callee-save, it's safe to schedule loads with ample provision
for cache miss latency, even if it has to move them across function/
method calls.

So you've reduced the number of NEON registers from 32 to 8, and you're
hoping this will somehow improve performance. The mind boggles.

> (Any function call to which the ABI applies, anyway; which doesn't
> include static C functions, I think, but does include all C++ instance
> methods even if they get inlined -- if I remember the spec correctly.)

If a function is fully inlined, the compiler can of course do whatever
it pleases. That is the entire point of inlining.

I think it's a little subtler than that in C++; but I am no language
lawyer. Suffice it to say that what the compiler does *in practice*
appears to be heavily influenced by whether there is any way for the
method to be called through a "publicly visible" symbol.

A function identifiable as a symbol, public or not, is by definition not
inlined. It is perfectly legal for the compiler to inline some or all
calls to a function while still producing a symbol with a valid entry
point for it. If this happens, this symbol must of course behave
according to ABI rules. For the inlined "calls", there is no ABI-level
call, and thus calling conventions no longer apply.

In summary, you have created your own ABI that reserves most of the
VFP/NEON registers for special uses that conflict with how AAPCS/VFP
passes floating-point arguments to functions. You then use this as the
foundation for a series of contradictory arguments for and/or against
the hardfp ABI over softfp.

"Edwards, Michael" <m.k.edwa...@gmail.com> writes:
>> Where did you get that notion. There is nothing in the ARM ABI docs to
>> support it. In fact, the paragraph quoted above directly contradicts
>> your claim.

> You're absolutely right. Q4-Q7 are just as callee-save under the
> softfp ABI as they are under the hardfp ABI. The only additional
> *explicit* state that the official hardfp convention allows one to
> preserve -- not trivially, but with some effort -- is Q0-Q3. (That
> can be done by systematically altering your otherwise non-floating-
> point-using APIs.)

I fail to make sense of that paragraph. D0-D7 are call-clobbered, no
exceptions. If they are not used for arguments, the callee may still
use them as scratch registers.

As I wrote later in that message, I intend to adjust some of my inner
APIs to take an extra argument that maps to Q0-Q3 / D0-D7, and return
its value as their return value. That should have the effect of
leaving those registers untouched across the function / method call.
And if the callee has use for them, it can always save and restore
them. The compiler won't really know about this convention, so it
won't reschedule loads to Q0-Q3 across the function call; but
otherwise you're right, I'm adapting the ABI to my needs without
modifying the compiler.
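
In other words, something along these lines (a sketch; whether this
pays off depends on the callee really leaving Q0-Q3 alone, and, as
noted above, the compiler does not know about the convention, so it
only keeps the values live rather than enabling rescheduling):

    #include <arm_neon.h>
    #include <stddef.h>

    typedef struct { float32x4_t v[4]; } qstate;   /* rides in q0-q3 */

    struct counter { size_t n; };          /* stand-in for real state */

    static void bump(struct counter *c) { c->n++; }

    /* An otherwise integer-only API, widened to thread the NEON state
     * through: the block arrives in q0-q3 and is handed straight back
     * in q0-q3, so the caller's 64 bytes survive the call without
     * being spilled to memory. */
    qstate counter_bump(qstate live, struct counter *c)
    {
        bump(c);          /* the real work, all on the integer side */
        return live;      /* q0-q3 pass straight through            */
    }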

>> D16-D31 are caller-saved.

> Mmm, so they are. This is another thing I was misremembering.
> Largely because I don't permit userland code to use them.

So you've invented your own, crippled ABI, then complain about
performance. Clever.

Yes, isn't it? But I'm not complaining at all; it's already doing
good things for my system's performance, and I haven't even done the
kernel work yet. For reasons that aren't yet apparent to me, a system
compiled uniformly with this hardfp neon-d16-fp16 ABI appears to
slightly outperform the identical code compiled for the regular hardfp
neon-fp16 model.

The only rational reason for this that I've been able to come up with
is that I haven't altered inline assembly to not use D16-D31, and when
the compiler doesn't use them for C/C++ code, it doesn't have to save
and restore them around these inline assembly blocks. This may or may
not be correct; I haven't had time to investigate it yet. But no, I'm
not complaining. :slight_smile:

> I work on embedded systems where I control how all the code is
> compiled, and I compile for a neon-d16-fp16 model that doesn't
> correspond to any real hardware.

Any NEON implementation is required to have the full set of 32 D
registers. If you allow NEON, there is no point in restricting the
number of registers. (For pure VFP code, doing so allows the same code
to be used on both full and reduced register set implementations, at a
slight performance cost.)

As I think I explained, the point of restricting the number of
registers used in userland code is to leave them free for use in
kernel code, without save/restore overhead (other than FPSCR -- and if
the kernel code doesn't use FCMP, it doesn't even need to save/restore
FPSCR, except during a context switch). This obviously doesn't work
if you don't have complete control over every line of code in your
system, because any userland process that is compiled for a normal 32-
register NEON model is in for an unpleasant surprise. But that
complete control is one of the few advantages I do have on an embedded
platform, and I'm workin' it for all it's worth. :wink:

This raises another commonly misunderstood point. It can actually be
advantageous to compile most userland code without NEON, even for
memcpy/strcpy. That's because the kernel doesn't have to save/restore
FPU state on context switch for processes that have not touched the
FPU since the last context switch. If you choose to tune your kernel
this way, the VFP/NEON unit will be disabled on exit from the context
switch path. The first FPU instruction issued from userland will
generate an illegal instruction trap, which the kernel will catch; it
will restore the process's FPU context and reissue (or emulate) the
instruction that trapped. I think that in recent kernels you can opt
out of this "lazy restore" mechanism -- either at kernel configuration
time or per-process -- and if you use a NEONized memcpy, you probably
should.

It can also be an advantage to have 16 rather than 32 VFP registers,
because you have half as much context to save and restore. However,
no NEON implementation of which I'm aware can be told to trap on
access to D16-D31. So if your hardware has the "full" VFP/NEON
register set, you have to save/restore the full set, even for
processes whose code is compiled for a vfpv3-d16 model -- because
that's not part of the ABI contract. Unless, of course, you control
the compilation of every line of code on your embedded system, in which
case you can do what you want. (You still have to audit the assembly
code throughout your system for use of the upper half of the VFP bank;
I plan to run for a while with 0xdeadbeef tell-tales, verified in the
context switch code path, to catch whatever my static analysis
misses.)

> I intend to reserve the upper half of the VFP/NEON register bank for
> use in-kernel, so I can trampoline data moves through D16-D31 without
> having to save userland's content and restore it afterwards. (Not
> because saving and restoring them is expensive, but because it would
> have to be done from a place in the kernel where the FPU context- save
> thingy is handy. I'd rather just use Q8-Q15 as scratch registers
> anywhere in the kernel I want to, with nothing to save/restore but the
> FPSCR.)

I can't imagine the cost of stealing these registers from heavy
float/simd users being compensated by a few minor savings in the kernel.

Well, I tried to explain the part about keeping save/restore overhead
down. I can add a couple of things: unlike ARM-side
"registers" (which are really just labels in the instruction stream,
and are allocated from a larger pool of physical registers), NEON
registers are locked to real hardware locations. So if the kernel
needs to spill userland's values from D16-D31 in order to use them for
bulk data moves, the store operation is going to stall waiting for the
completion of any outstanding userland-initiated pipeline activity
involving them. And on the return to userland, the load operation
that restores their contents will have to complete before the user
process can really get going again.

This is part of why modern processors of the x86/x86_64 architecture
have FXSAVE/FXRSTOR. These operations spill not just the contents of
the visible floating-point registers but also internal pipeline state,
so they don't have to stall for all in-progress operations to
complete. They also don't really spill all the way to main memory
unless all the shadow FPU contexts have been allocated. A lot of this
is neither architecturally visible nor particularly well documented,
but Intel and AMD have gone to truly amazing lengths to optimize their
processors for real-world workloads, including the sort of frequent
context switches among a small set of processes that are typical of
desktop OSes. (Intel in particular learned this the hard way; was
anyone else here around for the i860?)

So for short trips into and out of kernel whose main job is to move a
few hundred bytes of data from here to there, it ought to be a
substantial win to be able to trampoline through Q8-Q15 without the
overheads, visible and invisible, of a save/restore cycle. This goes
double if I'm going to use NEON instructions from within ISRs to move
data into and out of uncacheable memory. (That's exactly what we do
today on our x86 SoC -- with MOVNTDQ(A) substituted for VLDM/VSTM, of
course -- to work around a silicon erratum which requires us to flush
an architecturally invisible buffer in the chip's DRAM arbiter.)

Perhaps someone else could try rephrasing in language Måns might find
more enlightening -- or correcting me if I'm wrong, which is always
possible. Otherwise, I guess we're going to have to wait until the
benchmarks are in. Obviously, if reserving D16-D31 for kernel use
doesn't prove to be a win in our full system, we won't do it. But my
measure of "win" may be different from yours. I don't care about
maximizing the idle fraction of CPU; I care about making my system's
UI as responsive and jitter-free as possible, even though the bulk of
the SoC's throughput to DRAM is occupied by video capture/encode/
decode/display traffic.

>> Are you implying there is some not yet benchmarked case where it
>> performs significantly better?

> Oh yes. Presently, only when combined with APIs that sling structures
> opaquely as composite types, and code that uses NEON intrinsics to
> load and store them.

Sounds like poor API design.

Hey, I'd love to have an official ABI in which I get to choose,
function by function, whether Q0-Q3 are parameter-passing/scratch
registers or callee-save. Failing that, I am making do with what I do
have, which is a kludge that I can hide behind some C++ template
magic. In no way do I consider this a shining beacon of API design;
but for embedded work, I'll take an adequately documented, somewhat
ugly, screaming fast API over an elegant but slow one every time.
YMMV.

> But I am expecting those techniques to become common inside template
> libraries within the next couple of years.

If you are right, that's yet another reason to avoid such libraries.

This is the crux of the matter, isn't it? I don't begin to understand
most of the techniques at work inside GCC, let alone G++ or Boost, but
I am quite content to use them. And, when necessary, to learn how to
abuse them for fun and profit.

> And even in some non-template libraries; you might take a look at the
> NEON specializations inside Cairo and libjpeg-turbo, and extrapolate
> those to the hard-float case.

I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.

As I wrote, "extrapolate these to the hard-float case". If you look
at the code a bit, perhaps you can see the potential benefit of
refactoring libjpeg-turbo so that jsimd_idct_ifast_neon() is written
using compiler intrinsics rather than raw assembly, and letting the
compiler handle register allocation and load/store latencies? And of
rewriting idct_helper and transpose_4x4 as inline functions, operating
on the 8x8 block of 16-bit coefficients -- i. e., a 128-byte chunk of
data passed by value? That's exactly what the datatypes defined in
AAPCS are for.
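
The flavour of that refactor, sketched with intrinsics (purely
illustrative, not the actual libjpeg-turbo code, and the real kernels
obviously do far more than this):

    #include <arm_neon.h>

    /* One 8x8 block of 16-bit coefficients: 128 bytes, eight
     * q-registers' worth, held as a value type so that inline helpers
     * can keep it entirely in registers. */
    typedef struct {
        int16x8_t row[8];
    } coef_block;

    static inline coef_block load_block(const int16_t *p)
    {
        coef_block b;
        for (int i = 0; i < 8; i++)
            b.row[i] = vld1q_s16(p + 8 * i);
        return b;
    }

    /* Example inline helper operating on the whole block by value:
     * arithmetic shift right by 3 on every row. */
    static inline coef_block descale_block(coef_block b)
    {
        for (int i = 0; i < 8; i++)
            b.row[i] = vshrq_n_s16(b.row[i], 3);
        return b;
    }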

In this particular instance it wouldn't make any difference if we were
to pass operands by value into the innermost "publicly visible"
function, simply because they are too large (128 bytes). But a more
extensive refactor would permit this function to be inlined, and that
would definitely tighten things up. Point being, I didn't mean to say
that these techniques (multiple-cache-line-sized loads/stores, use of
containerized vector datatypes and pass-by-value) were already in use
in these libraries. I meant what I said, which is that I expect them
to become common in libraries where they are worth the effort, which
will include some non-template libraries.
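To make the libjpeg-turbo example concrete, here is the rough shape I
have in mind -- a sketch only, with made-up names, not the actual
libjpeg-turbo code:

    #include <arm_neon.h>

    /* An 8x8 block of 16-bit coefficients as a by-value aggregate of
       containerized vectors -- the kind of datatype AAPCS describes.
       (At 128 bytes it is too large to travel entirely in registers,
       as noted above.) */
    typedef struct { int16x8_t row[8]; } coef_block_t;

    /* transpose_4x4 rewritten as an inline helper over the top-left
       quadrant, leaving register allocation and scheduling to the
       compiler */
    static inline coef_block_t transpose_4x4(coef_block_t b)
    {
        int16x4x2_t t01 = vtrn_s16(vget_low_s16(b.row[0]), vget_low_s16(b.row[1]));
        int16x4x2_t t23 = vtrn_s16(vget_low_s16(b.row[2]), vget_low_s16(b.row[3]));
        int32x2x2_t u02 = vtrn_s32(vreinterpret_s32_s16(t01.val[0]),
                                   vreinterpret_s32_s16(t23.val[0]));
        int32x2x2_t u13 = vtrn_s32(vreinterpret_s32_s16(t01.val[1]),
                                   vreinterpret_s32_s16(t23.val[1]));
        b.row[0] = vcombine_s16(vreinterpret_s16_s32(u02.val[0]), vget_high_s16(b.row[0]));
        b.row[1] = vcombine_s16(vreinterpret_s16_s32(u13.val[0]), vget_high_s16(b.row[1]));
        b.row[2] = vcombine_s16(vreinterpret_s16_s32(u02.val[1]), vget_high_s16(b.row[2]));
        b.row[3] = vcombine_s16(vreinterpret_s16_s32(u13.val[1]), vget_high_s16(b.row[3]));
        return b;
    }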

>> Using VFP register parameters (i.e. doing nothing) is never less
>> efficient than moving them to core registers (doing something).

> On the contrary; hardfp can definitely be a net lose on real code.
> Consider cases where the outer function slings structures with mixed
> integers and floats, and the inner function does the actual floating
> point arithmetic. The hardfp convention requires the caller to
> transfer floating point parameters into VFP registers before entering
> the function, rather than leaving them in integer registers (where
> they can be put for free, because they are already in L1).

Sounds like that API really ought to be passing a pointer to a struct,
not passing the struct by value.

The inner function doesn't know anything about the struct; it operates
on bare floats/doubles. The outer function slings mixed structures,
and as soon as it touches them at all, it has them in L1. Under the
softfp convention, the outer function can pull the floating point
operands of the inner function into integer registers any time it's
convenient, maybe as part of an LDM that pulls in some integer/pointer
elements of the same struct. Then they just need to be spilled out
onto the stack for the function call, either in the caller (for
operands beyond the first 4 words' worth) or in the callee (typically
in the function preamble).

The callee loads them into VFP registers; at hardware level, this
happens via a lookaside to L1, so it's basically free as far as memory
traffic goes. As long as enough useful work can be scheduled in the
callee to cover the VLD latency, it's all good. That's one reason why
conventional benchmarks of hardfp vs. softfp don't show any benefit on
real code. (Who writes code that has inner loops over publicly
visible APIs in which both caller and callee do floating point or SIMD
arithmetic on the same values -- and thus produces a noticeable
pipeline stall from spilling a computed parameter out of the VFP bank
and then back in? One doing arithmetic, and the other doing loads/
stores, simply doesn't count.)

Back to the specific example I cited: Under the hardfp convention,
the floating point operands have to get moved over to the VFP side
before the function call, which would involve two VMOVs per 64-bit
operand. That's stupid, so instead it gets done by a spill to stack
followed by a VLD, or by a separate load from the original structure.
This may no longer be in L1, of course, so there's an opportunity for
the compiler to screw up; a well-written compiler shouldn't. So
basically, there's going to be a VLD from stack either right before or
right after the branch. The net effect is almost certainly trivial --
as I said -- but either hardfp or softfp could be a (slight) win.

> That's probably a trivial effect; but at least on Cortex-A8, there are
> others that hit some code bases much harder. What if the callee does
> no arithmetic, but passes the argument to a variadic function? Or the
> callee returns a value fetched from memory, which happens to be
> floating point, and the caller turns around and sticks it into an
> otherwise integer-filled structure? Either way, you take the full hit
> of the transfer to D0 and back to the integer side, for nothing.

You seem to be missing something about how structs are actually
represented at the backend of a compiler.

Educate me. I say I have a double X in a struct in memory, which I
want to pass to non-variadic function A, which then passes it to
variadic function B. The hardfp convention requires that I pull X
into D0 before branching to A, which has to move it from D0 to r0+r1
before passing it to B. What about "how structs are actually
represented at the backend of a compiler" saves me from the overhead
of this maneuver, relative to the softfp convention (in which X is in
r0+r1 for the call to A and needn't be touched before A calls B)?

In the second example in that paragraph, I call function C, which
returns a double Y (fetched from memory, not computed). I want to
stick this in a struct along with integer J and pointer Q. The hardfp
convention requires that Y be returned in D0, and to get it into the
struct I may need to issue three separate stores (STR, VSTR, STR --
assuming Y is between J and Q and I'm exploiting address post-
increment). In the softfp convention, Y will be returned in r0+r1,
and all I have to do is shuffle it into appropriate registers and
issue one STM.
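
In code shape, the two examples above look roughly like this (purely
illustrative; every name here is made up):

    #include <stdio.h>

    struct mixed { int j; double x; void *q; };   /* the struct "in memory" */

    /* second example: C returns a double that it fetched from memory */
    extern double fetch_y(void);

    /* first example: A is non-variadic, B (printf) is variadic.  Under
       hardfp, x arrives in D0 and has to be moved to core registers (or
       spilled) for the variadic call; under softfp it is already there. */
    void log_x(const char *tag, double x)
    {
        printf("%s = %f\n", tag, x);
    }

    void report(const struct mixed *m)
    {
        log_x("x", m->x);
    }

    void fill(struct mixed *m, int j, void *q)
    {
        m->j = j;
        m->x = fetch_y();   /* hardfp: comes back in D0, stored with VSTR;
                               softfp: comes back in r0+r1 next to j and q */
        m->q = q;
    }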

This is, of course, all small stuff. All that I'm trying to show is
that one shouldn't look for system-wide wins from the hardfp ABI in
the "obvious" places, because 1) real code doesn't often do things
that cause softfp to lose significantly, and 2) real code does often
do things that cause hardfp to lose slightly. To make hardfp win, you
have to exploit its "invisible" benefits, which are mostly about
covering memory latencies by using Q0-Q3 to pass values into and out
of functions that are *still in-flight* as cache-line-sized memory
transactions.

>> Are you saying increased use of NEON by gcc will make hardfp calls
>> slower?

> The reverse; but I can understand your reading my contorted syntax
> that way. I expect that GCC will get better at using the NEON unit
> for non-floating-point purposes. That will make it worthwhile for
> core libraries, from eglibc and libstdc++ on up, to adapt their
> internal calling conventions to permit the sort of "stupid
> rescheduling tricks" that win when building hardfp.

> You may say that it shouldn't matter for APIs that aren't "publicly
> visible", and that no human-readable API should do stupid things like
> pass an opaque operand in Q0-Q3 and return it unchanged as its return
> value (still in Q0-Q3).

Such a constraint cannot be expressed in a C API (nor a C++ one AFAIK).
To make that work, you'd have to either:

1. Change the ABI spec.
2. Teach the compiler extended semantics about specific functions in the
same way it already recognises many standard library calls.
3. Write all code by hand in assembler with no standard calling
conventions at all.

None of these seem particularly compelling, nor likely to happen.

4. None of the above. Simply change the calling conventions on your
inner functions from
    int myfunc(char* p, double x)
to
    c64byte_t myfunc(c64byte_t blob, int* result, char* p, double x)
and replace each "return r;" with "*result = r; return blob;". Call
sites change from
    n = myfunc(q, y);
to
    blob = myfunc(blob, &n, q, y);

This is only useful if you want to reorder -- by hand; the compiler
won't do it for you -- a fetch to "blob", from after the call to
myfunc() to before it. But that's exactly what I want to do a lot of
the time, because the real return value of myfunc() needs to be stuck
into a data structure that isn't in cache. So I want to go ahead and
schedule the fetch of this data structure into Q0-Q3 before the call
to myfunc(); execute the body of myfunc() while the fetch is still in
flight; and update the data structure before storing it right back.
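
To make the trick concrete, here is a rough sketch; the type and
function names are mine, and whether the compiler really keeps "blob"
resident in Q0-Q3 across the call is exactly the thing the benchmarks
have to confirm:

    #include <arm_neon.h>
    #include <stdint.h>

    /* A 64-byte homogeneous aggregate of containerized vectors; under
       the VFP variant of AAPCS such an aggregate is a candidate for
       being passed in Q0-Q3 and returned in Q0-Q3. */
    typedef struct { uint32x4_t q0, q1, q2, q3; } c64byte_t;

    /* The inner function does its real work on p/x and threads "blob"
       through untouched, so the caller's in-flight fetch stays live. */
    c64byte_t myfunc(c64byte_t blob, int *result, char *p, double x)
    {
        *result = (int)x + p[0];          /* stand-in for the real work */
        return blob;
    }

    void caller(uint32_t *cold, char *q, double y)
    {
        c64byte_t blob;
        int n;

        /* start the fetch of the cold structure before the call... */
        blob.q0 = vld1q_u32(cold);
        blob.q1 = vld1q_u32(cold + 4);
        blob.q2 = vld1q_u32(cold + 8);
        blob.q3 = vld1q_u32(cold + 12);

        /* ...let the body of myfunc() run while it is still in flight... */
        blob = myfunc(blob, &n, q, y);

        /* ...then fold in the real result and store straight back. */
        blob.q0 = vsetq_lane_u32((uint32_t)n, blob.q0, 0);
        vst1q_u32(cold,      blob.q0);
        vst1q_u32(cold + 4,  blob.q1);
        vst1q_u32(cold + 8,  blob.q2);
        vst1q_u32(cold + 12, blob.q3);
    }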

In many cases, I want to bypass the cache hierarchy entirely in both
directions, because that data structure probably won't be touched
again until after it has aged out of L2 anyway. So the fetch and
store of "blob" are done via NEON intrinsics through a pointer that
lies in an uncacheable mapping. Currently this is another constraint
that cannot be expressed in a C or C++ API; but I don't intend to let
that stop me, either.

>> The A9 and later indeed make the softfp calls less costly, reducing any
>> advantage hardfp might have (which is already small in benchmarks on A8).

> Even the idea that A9 is less friendly *overall* to hardfp than A8 is
> debatable, at the current level of compiler implementation.

The A9 is not in any way "less friendly" to hardfp. It is, however,
less hostile to softfp.

Its cache hierarchy is different, in ways that are not fully described
in the TRM. Its automatic prefetch mechanism is also still somewhat
unproven, especially on the load patterns I care about. I consider it
debatable that it is either "less friendly" to hardfp or "less
hostile" to softfp in any way that matters. But as I said before, I
don't really wish to debate it without data.

>> Do you have any numbers to back this up? I don't see how going through
>> NEON registers would be faster than direct LDM/STM on any core.

> I will produce those numbers within the month, or admit defeat. :stuck_out_tongue:
> Seriously, I'd better be able to substantiate this by mid-July or so,
> or my team is going to have to rethink certain aspects of one of its
> current development efforts.

I'm glad I'm not invested in that effort.

On this I suppose we agree. You have an admirable track record as a
coder, and clearly also a deep understanding of some aspects of the
OMAP chip series. But you seem awfully sure that your bag of tricks
contains all the tricks that matter. That attitude gets tiresome
after a while.

>> The out of order issue on A9 and later makes most such tricks unnecessary.

> Er, no. Out of order issue helps reduce bubbles in the ALU for math-
> intensive loads whose working set fits in cache.

Out of order issue potentially allows a load to be issued sooner than it
appears in the instruction stream, thus hiding some of the latency
whether it hits L1 or not.

As I understand it, the A9's out-of-order execution capabilities are
not on the scale that would be needed to cover latency to DRAM. I'm
aware of how speculative loads and stride-detection-based auto-
prefetching work, and they certainly have their uses. But as much as
I would like to believe that trampolining loads through the NEON will
be unnecessary on the A9, my experience with the much more extensive
out-of-order capabilities of server-class 64-bit architectures leads
me to believe otherwise.

>> To the extent scheduling across function calls is permitted by the C
>> standard, the manner of passing parameters has no bearing on such
>> optimisations.

> OK, I admit that I'm planning to cheat here. I'm going to keep state
> that the compiler would otherwise allocate to the callee-save
> registers in Q0-Q3, and keep passing this block into and back out of
> mostly non-floating-point-using APIs, which effectively makes it
> callee-save state that doesn't wind up being touched by the callee.

So you've modified the ABI again.

You can call it that if you like. Unlike actually "modifying the
ABI", this doesn't involve any change to the compiler. So I like to
think that I'm modifying a layer between the human-visible API and the
actual ABI, in much the way that C++ iostreams and templates like
Glib::ustring::compose() do.

> When combined with the neon-d16-fp16 model, this should induce the
> compiler to use Q4-Q7 as its NEON working set. Since it knows this
> range is callee-save, it's safe to schedule loads with ample provision
> for cache miss latency, even if it has to move them across function/
> method calls.

So you've reduced the number of NEON registers from 32 to 8, and you're
hoping this will somehow improve performance. The mind boggles.

Now who's waving around straw men? The load patterns that I'm worried
about don't often use NEON for algorithms that need 32 8-byte
registers. Yes, having that full bank of registers makes libjpeg-
turbo's iDCT more compact; but I don't much care, because JPEG decode
latency is not the most critical thing in my system.

Back-of-the-envelope calculations say that the single most critical
resource in *my* system is DRAM bandwidth, and that I will need to go
to quite a bit of effort to keep from frittering it away with word-
sized loads from uncacheable regions and read-modify-write cycles on
partially clobbered cache lines. Until I have benchmarks that say
differently, I'm going to focus on altering the CPU behavior to use
the memory interface efficiently rather than the other way around.
From that perspective, NEON registers are mostly placeholders for in-
flight memory transactions, and I hope to allocate them where they
will do the most good.

>> If a function is fully inlined, the compiler can of course do whatever
>> it pleases. That is the entire point of inlining.

> I think it's a little subtler than that in C++; but I am no language
> lawyer. Suffice it to say that what the compiler does *in practice*
> appears to be heavily influenced by whether there is any way for the
> method to be called through a "publicly visible" symbol.

A function identifiable as a symbol, public or not, is by definition not
inlined. It is perfectly legal for the compiler to inline some or all
calls to a function while still producing a symbol with a valid entry
point for it. If this happens, this symbol must of course behave
according to ABI rules. For the inlined "calls", there is no ABI-level
call, and thus calling conventions no longer apply.

Have you ever written a pure-header C++ library? I have. The rules
about what constitutes a "publicly visible" symbol are actually quite
intricate when they crop up, not in a "library" .o file, but in one or
more of the application-level .o files compiled against the same set
of headers. The compiler has to apply the same rules to produce
equivalent implementations of the same method in each .o, so whichever
one winds up surviving the link step can fill in for all the others.
Liberal application of __attribute__((always_inline)) helps; but this
does mix strangely with std::tr1::mem_fn, whose implementation I find
quite opaque.

In short: maybe the compiler is free to disregard the ABI on anything
it chooses to (or is forced to) inline. But that doesn't necessarily
mean that it finds every possible ABI-breaking optimization without
some hints from the library programmer. Compiler writers are human
too, and can't be expected to think of all the stupid things people
like me want to coerce the compiler into doing.

In summary, you have created your own ABI that reserves most of the
VFP/NEON registers for special uses that conflict with how AAPCS/VFP
passes floating-point arguments to functions. You then use this as
foundation for a series of contradicting arguments for and/or against
the hardfp ABI over softfp.

My own ABI? Not really. More like my own target CPU model, and my
own techniques for wringing performance out of the hardfp ABI;
although there's really nothing original in them. I stand on the
shoulders of giants.

Contradicting arguments? I don't think so, except insofar as I was in
error on a couple of points the first time around, and tried to
correct that after you helpfully pointed out the error. If there
remain contradictions, please do point them out, and I'll attempt to
resolve them.

For and against hardfp? Yes, because hardfp does have retrograde
cases, and you have to work pretty hard to get much value out of it.
Still, I think the game is worth the candle, and I intend to prove
it. Thanks for stimulating me to articulate how.

Cheers,
- Michael

I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.

As I wrote, "extrapolate these to the hard-float case". If you look
at the code a bit, perhaps you can see the potential benefit of
refactoring libjpeg-turbo so that jsimd_idct_ifast_neon() is written
using compiler intrinsics rather than raw assembly, and letting the
compiler handle register allocation and load/store latencies? And of
rewriting idct_helper and transpose_4x4 as inline functions, operating
on the 8x8 block of 16-bit coefficients -- i. e., a 128-byte chunk of
data passed by value? That's exactly what the datatypes defined in
AAPCS are for.

Yeah, this sounds great in theory, and this is what the compiler
people want us to believe. But the reality is rather disappointing:
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725 (bug 43725 – Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics)

In many cases, I want to bypass the cache hierarchy entirely in both
directions, because that data structure probably won't be touched
again until after it has aged out of L2 anyway. So the fetch and
store of "blob" are done via NEON intrinsics through a pointer that
lies in an uncacheable mapping. Currently this is another constraint
that cannot be expressed in a C or C++ API; but I don't intend to let
that stop me, either.

Why would you want to read uncached memory? That's already a huge
performance loss. For example, there is "shadow framebuffer" in
xf86-video-fbdev driver, which exists specifically to get more or less
reasonable performance when attempting to read pixel data back.
Moreover, you can easily enable write-through caching for the
framebuffer on OMAP3 systems, which can be used instead of the shadow
framebuffer with some really good performance results.

So you've reduced the number of NEON registers from 32 to 8, and you're
hoping this will somehow improve performance. The mind boggles.

Now who's waving around straw men? The load patterns that I'm worried
about don't often use NEON for algorithms that need 32 8-byte
registers. Yes, having that full bank of registers makes libjpeg-
turbo's iDCT more compact; but I don't much care, because JPEG decode
latency is not the most critical thing in my system.

If you don't care about having any real NEON optimizations in your
system (for JPEG or anything else), then it's surely your choice. It's
the great freedom of open source, etc. But I seriously doubt that
anyone else would be interested :slight_smile:

Your post was very verbose and I'm sorry for not replying to the rest
of it. At least it looks like you can find the relevant documentation,
read it and (mis)interpret it somehow :wink: The question remains whether
you can actually use all of this information in practice to your
advantage. And if you find some really good performance tricks with
the hardfp, ARM or VFP/NEON code, then I would be surely very
interested to look at the compilable examples and benchmark numbers.

"Edwards, Michael" <m.k.edwards@gmail.com> writes:

"Edwards, Michael" <m.k.edwa...@gmail.com> writes:
>> Where did you get that notion. There is nothing in the ARM ABI docs to
>> support it. In fact, the paragraph quoted above directly contradicts
>> your claim.

> You're absolutely right. Q4-Q7 are just as callee-save under the
> softfp ABI as they are under the hardfp ABI. The only additional
> *explicit* state that the official hardfp convention allows one to
> preserve -- not trivially, but with some effort -- is Q0-Q3. (That
> can be done by systematically altering your otherwise non-floating-
> point-using APIs.)

I fail to make sense of that paragraph. D0-D7 are call-clobbered, no
exceptions. If they are not used for arguments, the callee may still
use them as scratch registers.

As I wrote later in that message, I intend to adjust some of my inner
APIs to take an extra argument that maps to Q0-Q3 / D0-D7, and return
its value as their return value. That should have the effect of
leaving those registers untouched across the function / method call.
And if the callee has use for them, it can always save and restore
them. The compiler won't really know about this convention, so it
won't reschedule loads to Q0-Q3 across the function call; but
otherwise you're right, I'm adapting the ABI to my needs without
modifying the compiler.

If the compiler doesn't know your functions are required to preserve
q0-q3, it will have to assume they are clobbered by a call.

> I work on embedded systems where I control how all the code is
> compiled, and I compile for a neon-d16-fp16 model that doesn't
> correspond to any real hardware.

Any NEON implementation is required to have the full set of 32 D
registers. If you allow NEON, there is no point in restricting the
number of registers. (For pure VFP code, doing so allows the same code
to be used on both full and reduced register set implementations, at a
slight performance cost.)

As I think I explained, the point of restricting the number of
registers used in userland code is to leave them free for use in
kernel code,

So either you are right and every kernel developer I've ever heard of is
wrong, or there is nothing significant to be gained from using NEON in
kernel (outside a few isolated areas like RAID checksumming and some
crypto functions, as was recently discussed).

This raises another commonly misunderstood point. It can actually be
advantageous to compile most userland code without NEON, even for
memcpy/strcpy. That's because the kernel doesn't have to save/restore
FPU state on context switch for processes that have not touched the
FPU since the last context switch. If you choose to tune your kernel
this way, the VFP/NEON unit will be disabled on exit from the context
switch path. The first FPU instruction issued from userland will
generate an illegal instruction trap, which the kernel will catch; it
will restore the process's FPU context and reissue (or emulate) the
instruction that trapped. I think that in recent kernels you can opt
out of this "lazy restore" mechanism -- either at kernel configuration
time or per-process -- and if you use a NEONized memcpy, you probably
should.

It can also be an advantage to have 16 rather than 32 VFP registers,
because you have half as much context to save and restore. However,
no NEON implementation of which I'm aware can be told to trap on
access to D16-D31. So if your hardware has the "full" VFP/NEON
register set, you have to save/restore the full set, even for
processes whose code is compiled for a vfpv3-d16 model -- because
that's not part of the ABI contract. Unless, of course, you control
the compilation of every line of code on your embedded system, in which
case you can do what you want. (You still have to audit the assembly
code throughout your system for use of the upper half of the VFP bank;
I plan to run for a while with 0xdeadbeef tell-tales, verified in the
context switch code path, to catch whatever my static analysis
misses.)
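
The tell-tale part is nothing fancier than something like this (an
illustrative sketch, not the check that will actually live in the
context switch path):

    /* Poison the upper half of the VFP/NEON bank with a recognizable
       pattern; any code that touches D16-D31 behind our backs leaves
       evidence for the context-switch-path check to spot.  Requires
       building with a NEON-enabled -mfpu. */
    static inline void poison_upper_neon(void)
    {
        static const unsigned long long telltale[16] = {
            [0 ... 15] = 0xdeadbeefdeadbeefULL
        };
        __asm__ volatile("vldm %0, {d16-d31}"
                         :
                         : "r"(telltale)
                         : "q8", "q9", "q10", "q11",
                           "q12", "q13", "q14", "q15");
    }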

Are you saying everybody else is imagining their code running orders of
magnitude faster with NEON than without?

> I intend to reserve the upper half of the VFP/NEON register bank for
> use in-kernel, so I can trampoline data moves through D16-D31 without
> having to save userland's content and restore it afterwards. (Not
> because saving and restoring them is expensive, but because it would
> have to be done from a place in the kernel where the FPU context- save
> thingy is handy. I'd rather just use Q8-Q15 as scratch registers
> anywhere in the kernel I want to, with nothing to save/restore but the
> FPSCR.)

I can't imagine the cost of stealing these registers from heavy
float/simd users being compensated by a few minor savings in the kernel.

Well, I tried to explain the part about keeping save/restore overhead
down. I can add a couple of things: unlike ARM-side
"registers" (which are really just labels in the instruction stream,
and are allocated from a larger pool of physical registers),

The A9 and up use register renaming from a larger pool. The A8 is fully
in-order and thus has no need for this.

NEON registers are locked to real hardware locations.

On the A15 NEON registers are allocated from the same pool as core
registers.

So if the kernel needs to spill userland's values from D16-D31 in
order to use them for bulk data moves, the store operation is going to
stall waiting for the completion of any outstanding userland-initiated
pipeline activity involving them. And on the return to userland, the
load operation that restores their contents will have to complete
before the user process can really get going again.

On a context switch it is sometimes necessary to stall in order for any
potential exceptions to be taken in the correct context. Once a store
has cleared all such checks, there is no need to block waiting for it to
hit actual RAM/cache.

So for short trips into and out of kernel

Short trips into the kernel are generally considered murder for
performance for a number of other reasons, even when the kernel does not
touch the VFP context at all.

Perhaps someone else could try rephrasing in language Måns might find
more enlightening -- or correcting me if I'm wrong, which is always
possible. Otherwise, I guess we're going to have to wait until the
benchmarks are in. Obviously, if reserving D16-D31 for kernel use
doesn't prove to be a win in our full system, we won't do it. But my
measure of "win" may be different from yours. I don't care about
maximizing the idle fraction of CPU; I care about making my system's
UI as responsive and jitter-free as possible, even though the bulk of
the SoC's throughput to DRAM is occupied by video capture/encode/
decode/display traffic.

The memory system is probably the weakest point in the A8. Having more
registers often means doing fewer loads and stores, which translates
directly into higher throughput.

> And even in some non-template libraries; you might take a look at the
> NEON specializations inside Cairo and libjpeg-turbo, and extrapolate
> those to the hard-float case.

I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.

As I wrote, "extrapolate these to the hard-float case".

I still don't understand what you meant by that.

If you look at the code a bit, perhaps you can see the potential
benefit of refactoring libjpeg-turbo so that jsimd_idct_ifast_neon()
is written using compiler intrinsics rather than raw assembly, and
letting the compiler handle register allocation and load/store
latencies?

No compiler has ever beaten me at either of those tasks, not even come
close.

And of rewriting idct_helper and transpose_4x4 as inline functions,
operating on the 8x8 block of 16-bit coefficients -- i. e., a 128-byte
chunk of data passed by value?

The existing hand-written IDCT is close to as fast as it can possibly be
done without sacrificing precision. Introducing hundreds of ways for
the compiler to screw up is not going to make it any faster.

>> Using VFP register parameters (i.e. doing nothing) is never less
>> efficient than moving them to core registers (doing something).

> On the contrary; hardfp can definitely be a net lose on real code.
> Consider cases where the outer function slings structures with mixed
> integers and floats, and the inner function does the actual floating
> point arithmetic. The hardfp convention requires the caller to
> transfer floating point parameters into VFP registers before entering
> the function, rather than leaving them in integer registers (where
> they can be put for free, because they are already in L1).

Sounds like that API really ought to be passing a pointer to a struct,
not passing the struct by value.

The inner function doesn't know anything about the struct; it operates
on bare floats/doubles. The outer function slings mixed structures,

Can you please provide an accurate technical definition of what it
entails to "sling mixed structures"?

and as soon as it touches them at all, it has them in L1. Under the
softfp convention, the outer function can pull the floating point
operands of the inner function into integer registers any time it's
convenient, maybe as part of an LDM that pulls in some integer/pointer
elements of the same struct. Then they just need to be spilled out
onto the stack for the function call, either in the caller (for
operands beyond the first 4 words' worth) or in the callee (typically
in the function preamble).

If using hardfp, the caller can load the values directly into VFP
registers at any convenient time, and there is nothing further to be
done. That cannot be slower in any way.

The callee loads them into VFP registers; at hardware level, this
happens via a lookaside to L1, so it's basically free as far as memory
traffic goes.

Having the values already in registers is also free.

As long as enough useful work can be scheduled in the callee to cover
the VLD latency, it's all good.

If the values are loaded by the caller, there is possibly more room to
schedule the loads efficiently.

Back to the specific example I cited: Under the hardfp convention,
the floating point operands have to get moved over to the VFP side
before the function call,

Or loaded directly there with VLDR or VLDM.

which would involve two VMOVs per 64-bit operand.

A single VMOV can move two 32-bit core registers into one VFP D
register, i.e. one VMOV per 64-bit value, and that's only needed if for
some weird reason the float values were sitting in core registers.
Under a hardfp ABI, there is rarely any reason for them to be there,
rather they'd be loaded directly from memory or transferred from
wherever some prior floating-point computation placed the result.

That's stupid, so instead it gets done by a spill to stack followed by
a VLD, or by a separate load from the original structure. This may no
longer be in L1, of course, so there's an opportunity for the compiler
to screw up; a well-written compiler shouldn't. So basically, there's
going to be a VLD from stack either right before or right after the
branch.

The only time there will necessarily be a stack access is if passing
more arguments than fit in registers. For hardfp, that's 8 double
precision or 16 single precision values in addition to any integer or
pointer values. Having functions with that many arguments is rare
indeed. On the other hand, if using softfloat calls, only 2 double (4
single) float values may be passed in registers, and one or more
arguments are likely to end up on the stack. To summarise, a hardfloat
call looks like this:

1. Load arguments to VFP registers
2. Call function

A softfloat call looks like this:

1. Load values to registers
2. Store values on stack
3. Call function
4. Load values from stack to registers

You are saying 4 steps are more efficient than 2.

> That's probably a trivial effect; but at least on Cortex-A8, there are
> others that hit some code bases much harder. What if the callee does
> no arithmetic, but passes the argument to a variadic function? Or the
> callee returns a value fetched from memory, which happens to be
> floating point, and the caller turns around and sticks it into an
> otherwise integer-filled structure? Either way, you take the full hit
> of the transfer to D0 and back to the integer side, for nothing.

You seem to be missing something about how structs are actually
represented at the backend of a compiler.

Educate me. I say I have a double X in a struct in memory, which I
want to pass to non-variadic function A, which then passes it to
variadic function B. The hardfp convention requires that I pull X
into D0 before branching to A, which has to move it from D0 to r0+r1
before passing it to B. What about "how structs are actually
represented at the backend of a compiler" saves me from the overhead
of this maneuver, relative to the softfp convention (in which X is in
r0+r1 for the call to A and needn't be touched before A calls B)?

This situation is hardly common (printf calls for debugging aside), and
certainly shouldn't be in any performance-critical code.

In the second example in that paragraph, I call function C, which
returns a double Y (fetched from memory, not computed). I want to
stick this in a struct along with integer J and pointer Q. The hardfp
convention requires that Y be returned in D0, and to get it into the
struct I may need to issue three separate stores (STR, VSTR, STR --
assuming Y is between J and Q and I'm exploiting address post-
increment). In the softfp convention, Y will be returned in r0+r1,
and all I have to do is shuffle it into appropriate registers and
issue one STM.

This is again a fairly contrived example, and a bad one at that. Any
sane person would order that struct as {int, int, double} to minimise
padding. This would allow storing the int values using strd or stm and
the double with vstr, which sequence takes no longer than an stm with
more registers.

This is, of course, all small stuff. All that I'm trying to show is
that one shouldn't look for system-wide wins from the hardfp ABI in
the "obvious" places, because 1) real code doesn't often do things
that cause softfp to lose significantly, and 2) real code does often
do things that cause hardfp to lose slightly. To make hardfp win, you
have to exploit its "invisible" benefits, which are mostly about
covering memory latencies by using Q0-Q3 to pass values into and out
of functions that are *still in-flight* as cache-line-sized memory
transactions.

Now you are arguing for hardfp again. A paragraph ago you were going to
great lengths to find examples where it could theoretically make things
slower.

>> Are you saying increased use of NEON by gcc will make hardfp calls
>> slower?

> The reverse; but I can understand your reading my contorted syntax
> that way. I expect that GCC will get better at using the NEON unit
> for non-floating-point purposes. That will make it worthwhile for
> core libraries, from eglibc and libstdc++ on up, to adapt their
> internal calling conventions to permit the sort of "stupid
> rescheduling tricks" that win when building hardfp.

> You may say that it shouldn't matter for APIs that aren't "publicly
> visible", and that no human-readable API should do stupid things like
> pass an opaque operand in Q0-Q3 and return it unchanged as its return
> value (still in Q0-Q3).

Such a constraint cannot be expressed in a C API (nor a C++ one AFAIK).
To make that work, you'd have to either:

1. Change the ABI spec.
2. Teach the compiler extended semantics about specific functions in the
same way it already recognises many standard library calls.
3. Write all code by hand in assembler with no standard calling
conventions at all.

None of these seem particularly compelling, nor likely to happen.

4. None of the above. Simply change the calling conventions on your
inner functions from
    int myfunc(char* p, double x)
to
    c64byte_t myfunc(c64byte_t blob, int* result, char* p, double x)
and replace each "return r;" with "*result = r; return blob;". Call
sites change from
    n = myfunc(q, y);
to
    blob = myfunc(blob, &n, q, y);

This is only useful if you want to reorder -- by hand; the compiler
won't do it for you -- a fetch to "blob", from after the call to
myfunc() to before it. But that's exactly what I want to do a lot of
the time, because the real return value of myfunc() needs to be stuck
into a data structure that isn't in cache. So I want to go ahead and
schedule the fetch of this data structure into Q0-Q3 before the call
to myfunc(); execute the body of myfunc() while the fetch is still in
flight; and update the data structure before storing it right back.

For this to work, the compiler must know that q0-q3 are preserved by the
call, or it will save and restore these registers if they hold live
values, which defeats the purpose of doing the loads early.

>> The A9 and later indeed make the softfp calls less costly, reducing any
>> advantage hardfp might have (which is already small in benchmarks on A8).

> Even the idea that A9 is less friendly *overall* to hardfp than A8 is
> debatable, at the current level of compiler implementation.

The A9 is not in any way "less friendly" to hardfp. It is, however,
less hostile to softfp.

Its cache hierarchy is different,

I don't see how that is relevant whatsoever to the floating-point
calling convention.

>> Do you have any numbers to back this up? I don't see how going through
>> NEON registers would be faster than direct LDM/STM on any core.

> I will produce those numbers within the month, or admit defeat. :stuck_out_tongue:
> Seriously, I'd better be able to substantiate this by mid-July or so,
> or my team is going to have to rethink certain aspects of one of its
> current development efforts.

I'm glad I'm not invested in that effort.

On this I suppose we agree. You have an admirable track record as a
coder, and clearly also a deep understanding of some aspects of the
OMAP chip series. But you seem awfully sure that your bag of tricks
contains all the tricks that matter. That attitude gets tiresome
after a while.

I get suspicious when someone claims that everybody else is doing it
all wrong without showing any hard data to prove it. That is all.

> When combined with the neon-d16-fp16 model, this should induce the
> compiler to use Q4-Q7 as its NEON working set. Since it knows this
> range is callee-save, it's safe to schedule loads with ample provision
> for cache miss latency, even if it has to move them across function/
> method calls.

So you've reduced the number of NEON registers from 32 to 8, and you're
hoping this will somehow improve performance. The mind boggles.

Back-of-the-envelope calculations say that the single most critical
resource in *my* system is DRAM bandwidth,

Then you should probably not be using an OMAP chip at all.

Apologies in advance for the probable HTML nastiness; I am reading this through Google Groups, and I see no obvious way to de-HTML this in the new interface.

I haven’t looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.

As I wrote, “extrapolate these to the hard-float case”. If you look
at the code a bit, perhaps you can see the potential benefit of
refactoring libjpeg-turbo so that jsimd_idct_ifast_neon() is written
using compiler intrinsics rather than raw assembly, and letting the
compiler handle register allocation and load/store latencies? And of
rewriting idct_helper and transpose_4x4 as inline functions, operating
on the 8x8 block of 16-bit coefficients – i. e., a 128-byte chunk of
data passed by value? That’s exactly what the datatypes defined in
AAPCS are for.

Yeah, this sounds great in theory, and this is what the compiler
people want us to believe. But the reality is rather disappointing:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

Linaro GCC 4.6.x seems to do a much better job than the 4.5.x series at eliminating temporary variables and grouping registers sensibly for spilling to stack. Still a little buggy in places (https://bugs.launchpad.net/gcc-linaro/+bug/803232), but I have great hopes for it.

In many cases, I want to bypass the cache hierarchy entirely in both
directions, because that data structure probably won’t be touched
again until after it has aged out of L2 anyway. So the fetch and
store of “blob” are done via NEON intrinsics through a pointer that
lies in an uncacheable mapping. Currently this is another constraint
that cannot be expressed in a C or C++ API; but I don’t intend to let
that stop me, either.

Why would you want to read uncached memory? That’s already a huge
performance loss. For example, there is “shadow framebuffer” in
xf86-video-fbdev driver, which exists specifically to get more or less
reasonable performance when attempting to read pixel data back.
Moreover, you can easily enable write-through caching for the
framebuffer on OMAP3 systems, which can be used instead of the shadow
framebuffer with some really good performance results.

It doesn’t have to be a performance loss system-wide, which is what I care about. The data has to get out of cache for the GPU to be able to use it – or an on-chip DSP core, or an H.264 encode block, or whatever. And even when you’re just talking about CPU algorithms, when the data isn’t in cache – as is inevitably the case sometimes when your working set is larger than cache – you’ve got to get it in somehow. You can let the cache controller do the work for you, or you can make a conscious distinction between the “hot set” and the broader working set, and access the latter through an uncacheable mapping to keep it from evicting the former from cache. The ARMv7-A+NEON is extraordinarily well suited to the explicit strategy, if you put a bit of work into it.
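
As a sketch of the explicit strategy (how the uncacheable mapping gets set up is out of scope here; assume a driver hands it to us, and every name below is made up):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Stream a large, cold data set through NEON registers via a pointer
       that (by assumption) refers to an uncacheable mapping, while the
       "hot set" stays in ordinary cached memory.  Illustrative only. */
    void accumulate_cold(uint32_t *hot, const uint32_t *cold_uncached, size_t n)
    {
        uint32x4_t acc = vld1q_u32(hot);      /* hot set: cached as usual */
        size_t i;
        for (i = 0; i + 4 <= n; i += 4)
            acc = vaddq_u32(acc, vld1q_u32(cold_uncached + i)); /* never pollutes cache */
        vst1q_u32(hot, acc);
    }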

So you’ve reduced the number of NEON registers from 32 to 8, and you’re
hoping this will somehow improve performance. The mind boggles.

Now who’s waving around straw men? The load patterns that I’m worried
about don’t often use NEON for algorithms that need 32 8-byte
registers. Yes, having that full bank of registers makes libjpeg-
turbo’s iDCT more compact; but I don’t much care, because JPEG decode
latency is not the most critical thing in my system.

If you don’t care about having any real NEON optimizations in your
system (for JPEG or anything else), then it’s surely your choice. It’s
the great freedom of open source, etc. But I seriously doubt that
anyone else would be interested :slight_smile:

Well, I ran a little trial. I converted the libjpeg-turbo trunk implementation of 8x8 iDCT from NEON assembly into NEON compiler intrinsics, and let GCC do the register / memory management. I have some initial benchmark results; they could be totally wrong (as the implementation almost certainly contains errors), but they feel about right based on an inspection of the compiler-generated assembly code.

Even as sloppy as 4.5.x is at managing the NEON register pool, it only loses about 10% on decompression throughput relative to the hand-coded assembly version. And compiling for my 16-register NEON model only loses 25% relative to that. Even if GCC 4.6.x didn’t recoup part or all of that performance loss – which I am quite confident that it will – it would be worth it in exchange for the ability to move data around quickly in kernel code. Our system does not exist primarily to decompress JPEGs.

Your post was very verbose and I’m sorry for not replying to the rest
of it. At least it looks like you can find the relevant documentation,
read it and (mis)interpret it somehow :wink: The question remains whether
you can actually use all of this information in practice to your
advantage. And if you find some really good performance tricks with
the hardfp, ARM or VFP/NEON code, then I would be surely very
interested to look at the compilable examples and benchmark numbers.

If you are interested, please do take a look at the example at patches/libjpeg-turbo/trunk/0001-Implement-jsimd_idct_ifast-using-NEON-intrinsics.patch in my crosstool-ng tree on GitHub (https://github.com/mkedwards/crosstool-ng). It probably has bugs of the sign-flip / off-by-one sort; libjpeg-turbo's unit test coverage doesn't seem to extend to the 8x8 iDCT, and I haven't done much more than compile it and run "pro forma" benchmarks yet. I would be grateful for any help you care to provide with testing, bug fixing, and benchmarking.


Best regards,
Siarhei Siamashka

Cheers,

- Michael

>> I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
>> and colourspace conversions. Nowhere are floats or simd vectors passed
>> by value to a function, at least not where it matters for performance.
>
> As I wrote, "extrapolate these to the hard-float case". If you look
> at the code a bit, perhaps you can see the potential benefit of
> refactoring libjpeg-turbo so that jsimd_idct_ifast_neon() is written
> using compiler intrinsics rather than raw assembly, and letting the
> compiler handle register allocation and load/store latencies? And of
> rewriting idct_helper and transpose_4x4 as inline functions, operating
> on the 8x8 block of 16-bit coefficients -- i. e., a 128-byte chunk of
> data passed by value? That's exactly what the datatypes defined in
> AAPCS are for.

Yeah, this sounds great in theory, and this is what the compiler
people want us to believe. But the reality is rather disappointing:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725 (bug 43725 – Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics)

Linaro GCC 4.6.x seems to do a much better job than the 4.5.x series
at eliminating temporary variables and grouping registers sensibly for
spilling to stack.

Does it work much better for you when compiling the test code from
that gcc bugreport? I could not see any improvements with
gcc-linaro-4.6-2011.06-0.tar.bz2 myself. Or do you have some other
code examples which show great progress?

I know that computers have already beaten humans at playing chess
(even though it took them quite a long time to achieve this). Maybe
one day they will learn how to schedule code for processor pipelines
better than human software developers can. But at the current rate,
it does not seem likely to happen any time soon.

Still a little buggy in places
(https://bugs.launchpad.net/gcc-linaro/+bug/803232), but I have great hopes
for it.

It does not look "a little buggy" to me. You managed to hit a compiler
bug immediately, after just a single attempt at implementing
something not totally trivial with NEON intrinsics. That's more like
an impressive 1/1 failure rate :frowning:

> In many cases, I want to bypass the cache hierarchy entirely in both
> directions, because that data structure probably won't be touched
> again until after it has aged out of L2 anyway. So the fetch and
> store of "blob" are done via NEON intrinsics through a pointer that
> lies in an uncacheable mapping. Currently this is another constraint
> that cannot be expressed in a C or C++ API; but I don't intend to let
> that stop me, either.

Why would you want to read uncached memory? That's already a huge
performance loss. For example, there is "shadow framebuffer" in
xf86-video-fbdev driver, which exists specifically to get more or less
reasonable performance when attempting to read pixel data back.
Moreover, you can easily enable write-through caching for the
framebuffer on OMAP3 systems, which can be used instead of the shadow
framebuffer with some really good performance results.

It doesn't have to be a performance loss *system-wide*, which is what I care
about. The data has to get out of cache for the GPU to be able to use it --
or an on-chip DSP core, or an H.264 encode block, or whatever.

The write-through cache is supposed to ensure that the data also
reaches memory (almost) immediately after it gets modified in cache.
In the other cases, cache flush/invalidate operations can be used to
synchronize the content of CPU cache and memory. Android people seem
to be advocating the use of cached framebuffers too:
    http://www.kandroid.org/online-pdk/guide/display_drivers.html
And for OMAP3 it is possible to have multiple framebuffer planes
composited together by the display controller. GPU can potentially
handle its own overlay, while reserving the GFX plane entirely for the
CPU.

I don't know anything about OMAP3 DSP or H.264 block. This is the area
where the other people definitely have a lot more experience.

And even when you're just talking about CPU algorithms, when the data isn't in cache
-- as is inevitably the case sometimes when your working set is larger than
cache -- you've got to get it in somehow. You can let the cache controller
do the work for you, or you can make a conscious distinction between the
"hot set" and the broader working set, and access the latter through an
uncacheable mapping to keep it from evicting the former from cache.

This is something where I would prefer benchmark results. Something
like the following code can be a good start:
    http://lists.freedesktop.org/archives/pixman/attachments/20110404/89d0c373/attachment.c

NEON in newer Cortex-A8 processors can be indeed used for performing
fast copying of data for the "cached->cached" or "uncached->cached"
cases even without explicit prefetch via PLD instructions. But it's
not a silver bullet and some other limitations apply. Still, it makes a
perfect implementation for a memcpy function which needs to work on
OMAP3630/DM3730.
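
Just to illustrate the kind of copy loop I mean (a simplified sketch,
not the actual pixman or memcpy code):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 64 bytes per iteration through Q registers, no PLD; a toy
       illustration of the copy strategy, not tuned code */
    void neon_copy64(uint8_t *dst, const uint8_t *src, size_t bytes)
    {
        size_t i;
        for (i = 0; i + 64 <= bytes; i += 64) {
            uint8x16_t a = vld1q_u8(src + i);
            uint8x16_t b = vld1q_u8(src + i + 16);
            uint8x16_t c = vld1q_u8(src + i + 32);
            uint8x16_t d = vld1q_u8(src + i + 48);
            vst1q_u8(dst + i,      a);
            vst1q_u8(dst + i + 16, b);
            vst1q_u8(dst + i + 32, c);
            vst1q_u8(dst + i + 48, d);
        }
    }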

The ARMv7-A+NEON is extraordinarily well suited to the explicit strategy, if you
put a bit of work into it.

It's not really ARMv7-A+NEON in general, but specifically ARM
Cortex-A8 processors of revision 2 or newer (those which do not
require the L1NEON workaround). The other NEON-capable ARM processors may
be less suited for this strategy.

>> So you've reduced the number of NEON registers from 32 to 8, and you're
>> hoping this will somehow improve performance. The mind boggles.
>
> Now who's waving around straw men? The load patterns that I'm worried
> about don't often use NEON for algorithms that need 32 8-byte
> registers. Yes, having that full bank of registers makes libjpeg-
> turbo's iDCT more compact; but I don't much care, because JPEG decode
> latency is not the most critical thing in my system.

If you don't care about having any real NEON optimizations in your
system (for JPEG or anything else), then it's surely your choice. It's
the great freedom of open source, etc. But I seriously doubt that
anyone else would be interested :slight_smile:

Well, I ran a little trial. I converted the libjpeg-turbo trunk
implementation of 8x8 iDCT from NEON assembly into NEON compiler intrinsics,
and let GCC do the register / memory management. I have some initial
benchmark results; they could be totally wrong (as the implementation almost
certainly contains errors), but they feel about right based on an inspection
of the compiler-generated assembly code.

Even as sloppy as 4.5.x is at managing the NEON register pool, it only loses
about 10% on decompression throughput relative to the hand-coded assembly
version.

I can't verify these numbers because none of the gcc versions that I
have is able to compile your code (because of that neon intrinsics bug
that you have already reported to linaro).

Was the 10% loss measured for JPEG decoding performance overall, or for
the iDCT function alone? But in any case, a 10% performance loss is already
bad enough, especially considering that your variant basically
directly converts NEON instructions to the corresponding intrinsics.
In this case gcc does not even need to do much work on its own, and the
"register / memory management" should be pretty simple and
straightforward for it.

And compiling for my 16-register NEON model only loses 25%
relative to that.

Do you have some gcc patches for this 16-register NEON model available
somewhere?

Even if GCC 4.6.x didn't recoup part or all of that
performance loss -- which I am quite confident that it will -- it would be
worth it in exchange for the ability to move data around quickly in kernel
code. Our system does not exist primarily to decompress JPEGs.

The performance loss on decoding JPEGs is clearly bad. Also you
sacrifice the possibility of running almost all the existing ARM NEON
code available around. And all of this is needed to gain what?

Discussion around the libjpeg-turbo iDCT implementation using NEON intrinsics has moved here: https://bugzilla.mozilla.org/show_bug.cgi?id=496298 .

My gcc patches are here: https://github.com/mkedwards/crosstool-ng/tree/master/patches/gcc/linaro-4.6-bzr . They’re in the context of a version of crosstool-ng that I’ve adapted to build a Linaro-based toolchain and a reasonably complete sysroot environment. It’s a little ragged around the edges, but if you would like help getting it to work for you, drop in on #linaro (I’m often active there when I’m at work).

I’ll try to follow up on the rest of this later, but at the moment I’m somewhat preoccupied with capturing the metrics of memory traffic that I want to use for further benchmarking.

Cheers,

- Michael

Discussion around the libjpeg-turbo iDCT implementation using NEON
intrinsics has moved here:
https://bugzilla.mozilla.org/show_bug.cgi?id=496298 (bug 496298 - Implement ARM NEON optimized IDCT for JPEG decoding).

But why there? And why not directly contact upstream via
libjpeg-turbo-devel mailing list or libjpeg-turbo issue tracker? Also
I hope that you are not planning to go after Chromium or some other
libjpeg-turbo users as the next step...

My gcc patches are here:
https://github.com/mkedwards/crosstool-ng/tree/master/patches/gcc/linaro-4.6-bzr
. They're in the context of a version of crosstool-ng that I've adapted to
build a Linaro-based toolchain and a reasonably complete sysroot
environment. It's a little ragged around the edges, but if you would like
help getting it to work for you, drop in on #linaro (I'm often active there
when I'm at work).

Thanks for confirming that you are not just after hardfp calling
conventions, but also want your own custom ABI. And in order to make
it less problematic for you, now the whole world has to switch to
using intrinsics.

I must say that I don't like it. And I hope that this neon-d16 variant
never gets accepted into upstream gcc. Diversity and freedom of
choice are good, but not for things like ABIs and standards.

Wow, it must be exciting in your world. So many conspiracies!

Discussion around the libjpeg-turbo iDCT implementation using NEON
intrinsics has moved here:
https://bugzilla.mozilla.org/show_bug.cgi?id=496298 .

But why there? And why not directly contact upstream via
libjpeg-turbo-devel mailing list or libjpeg-turbo issue tracker? Also
I hope that you are not planning to go after Chromium or some other
libjpeg-turbo users as the next step…

“Go after”? I ran across the Mozilla “bug”, which currently appears stalled on iOS issues, and gave the nice folks there a heads-up that the NEON-intrinsic version existed – just in case it would be useful to them in one way or another. I asked some questions about how they measure the accuracy of JPEG decoding. I got courteous and helpful replies from DRC and Joe Drew. No part of this is meant in opposition to libjpeg-turbo or its developers, whom I respect and whose efforts I value, on coding and on seeking adoption of that code. If I considered it an “issue” with libjpeg-turbo – which I don’t, as the code in libjpeg-turbo is already quite satisfactory – I would of course take it up there.

My gcc patches are here:
https://github.com/mkedwards/crosstool-ng/tree/master/patches/gcc/linaro-4.6-bzr
. They’re in the context of a version of crosstool-ng that I’ve adapted to
build a Linaro-based toolchain and a reasonably complete sysroot
environment. It’s a little ragged around the edges, but if you would like
help getting it to work for you, drop in on #linaro (I’m often active there
when I’m at work).

Thanks for confirming that you are not just after hardfp calling
conventions, but also want your own custom ABI. And in order to make
it less problematic for you, now the whole world has to switch to
using intrinsics.

It’s not an ABI, it’s a target model; we can and do mix it with code compiled for the full NEON register set and for vfpv3-d16. And I don’t expect anyone else in the world ever to use it – not unless they need the same combination of performance characteristics in kernel and in userland we do, and control the compilation of every line of code in the system, as we do. We may never use it ourselves, because converting all the existing NEON assembly in the system into compiler intrinsics is going to be a pain in the keister, and the potential benefits are quite unproven. It’s just an option I’m keeping in my back pocket, and using by default in our prototype system just so I get an early warning if it causes problems.

I must say that I don’t like it. And I hope that this neon-d16 variant
never gets accepted into upstream gcc. Diversity and freedom of
choice are good, but not for things like ABIs and standards.

Er, it's a six-line patch, including documentation. I don't give a flip whether it goes into anyone else's gcc, now or ever. If you find the idea of being able to use NEON load/store operations in the kernel with no save/restore overhead appealing, give it a try! If you don't, then don't. But please, let's take this off the BeagleBoard list; no one but you and me, and maybe Måns, cares.