"Edwards, Michael" <m.k.edwa...@gmail.com> writes:
>> Where did you get that notion? There is nothing in the ARM ABI docs to
>> support it. In fact, the paragraph quoted above directly contradicts
>> your claim.
> You're absolutely right. Q4-Q7 are just as callee-save under the
> softfp ABI as they are under the hardfp ABI. The only additional
> *explicit* state that the official hardfp convention allows one to
> preserve -- not trivially, but with some effort -- is Q0-Q3. (That
> can be done by systematically altering your otherwise non-floating-
> point-using APIs.)
I fail to make sense of that paragraph. D0-D7 are call-clobbered, no
exceptions. If they are not used for arguments, the callee may still
use them as scratch registers.
As I wrote later in that message, I intend to adjust some of my inner
APIs to take an extra argument that maps to Q0-Q3 / D0-D7, and return
its value as their return value. That should have the effect of
leaving those registers untouched across the function / method call.
And if the callee has use for them, it can always save and restore
them. The compiler won't really know about this convention, so it
won't reschedule loads to Q0-Q3 across the function call; but
otherwise you're right, I'm adapting the ABI to my needs without
modifying the compiler.
>> D16-D31 are caller-saved.
> Mmm, so they are. This is another thing I was misremembering.
> Largely because I don't permit userland code to use them.
So you've invented your own, crippled ABI, then complain about
performance. Clever.
Yes, isn't it? But I'm not complaining at all; it's already doing
good things for my system's performance, and I haven't even done the
kernel work yet. For reasons that aren't yet apparent to me, a system
compiled uniformly with this hardfp neon-d16-fp16 ABI appears to
slightly outperform the identical code compiled for the regular hardfp
neon-fp16 model.
The only plausible explanation I've been able to come up with
is that I haven't altered the inline assembly to avoid D16-D31, and when
the compiler doesn't use them for C/C++ code, it doesn't have to save
and restore them around these inline assembly blocks. This may or may
not be correct; I haven't had time to investigate it yet. But no, I'm
not complaining.
> I work on embedded systems where I control how all the code is
> compiled, and I compile for a neon-d16-fp16 model that doesn't
> correspond to any real hardware.
Any NEON implementation is required to have the full set of 32 D
registers. If you allow NEON, there is no point in restricting the
number of registers. (For pure VFP code, doing so allows the same code
to be used on both full and reduced register set implementations, at a
slight performance cost.)
As I think I explained, the point of restricting the number of
registers used in userland code is to leave them free for use in
kernel code, without save/restore overhead (other than FPSCR -- and if
the kernel code doesn't use FCMP, it doesn't even need to save/restore
FPSCR, except during a context switch). This obviously doesn't work
if you don't have complete control over every line of code in your
system, because any userland process that is compiled for a normal 32-
register NEON model is in for an unpleasant surprise. But that
complete control is one of the few advantages I do have on an embedded
platform, and I'm workin' it for all it's worth.
This raises another commonly misunderstood point. It can actually be
advantageous to compile most userland code without NEON, even for
memcpy/strcpy. That's because the kernel doesn't have to save/restore
FPU state on context switch for processes that have not touched the
FPU since the last context switch. If you choose to tune your kernel
this way, the VFP/NEON unit will be disabled on exit from the context
switch path. The first FPU instruction issued from userland will
generate an illegal instruction trap, which the kernel will catch; it
will restore the process's FPU context and reissue (or emulate) the
instruction that trapped. I think that in recent kernels you can opt
out of this "lazy restore" mechanism -- either at kernel configuration
time or per-process -- and if you use a NEONized memcpy, you probably
should.
It can also be an advantage to have 16 rather than 32 VFP registers,
because you have half as much context to save and restore. However,
no NEON implementation of which I'm aware can be told to trap on
access to D16-D31. So if your hardware has the "full" VFP/NEON
register set, you have to save/restore the full set, even for
processes whose code is compiled for a vfpv3-d16 model -- because
that's not part of the ABI contract. Unless, of course, you control
the compilation of every line of code on your embedded system, in which
case you can do what you want. (You still have to audit the assembly
code throughout your system for use of the upper half of the VFP bank;
I plan to run for a while with 0xdeadbeef tell-tales, verified in the
context switch code path, to catch whatever my static analysis
misses.)
> I intend to reserve the upper half of the VFP/NEON register bank for
> use in-kernel, so I can trampoline data moves through D16-D31 without
> having to save userland's content and restore it afterwards. (Not
> because saving and restoring them is expensive, but because it would
> have to be done from a place in the kernel where the FPU context-save
> thingy is handy. I'd rather just use Q8-Q15 as scratch registers
> anywhere in the kernel I want to, with nothing to save/restore but the
> FPSCR.)
I can't imagine the cost of stealing these registers from heavy
float/simd users being compensated by a few minor savings in the kernel.
Well, I tried to explain the part about keeping save/restore overhead
down. I can add a couple of things: unlike ARM-side
"registers" (which are really just labels in the instruction stream,
and are allocated from a larger pool of physical registers), NEON
registers are locked to real hardware locations. So if the kernel
needs to spill userland's values from D16-D31 in order to use them for
bulk data moves, the store operation is going to stall waiting for the
completion of any outstanding userland-initiated pipeline activity
involving them. And on the return to userland, the load operation
that restores their contents will have to complete before the user
process can really get going again.
This is part of why modern processors of the x86/x86_64 architecture
have FXSAVE/FXRSTOR. These operations spill not just the contents of
the visible floating-point registers but also internal pipeline state,
so they don't have to stall for all in-progress operations to
complete. They also don't really spill all the way to main memory
unless all the shadow FPU contexts have been allocated. A lot of this
is neither architecturally visible nor particularly well documented,
but Intel and AMD have gone to truly amazing lengths to optimize their
processors for real-world workloads, including the sort of frequent
context switches among a small set of processes that are typical of
desktop OSes. (Intel in particular learned this the hard way; was
anyone else here around for the i860?)
So for short trips into and out of the kernel whose main job is to move a
few hundred bytes of data from here to there, it ought to be a
substantial win to be able to trampoline through Q8-Q15 without the
overheads, visible and invisible, of a save/restore cycle. This goes
double if I'm going to use NEON instructions from within ISRs to move
data into and out of uncacheable memory. (That's exactly what we do
today on our x86 SoC -- with MOVNTDQ(A) substituted for VLDM/VSTM, of
course -- to work around a silicon erratum which requires us to flush
an architecturally invisible buffer in the chip's DRAM arbiter.)
Perhaps someone else could try rephrasing in language Måns might find
more enlightening -- or correcting me if I'm wrong, which is always
possible. Otherwise, I guess we're going to have to wait until the
benchmarks are in. Obviously, if reserving D16-D31 for kernel use
doesn't prove to be a win in our full system, we won't do it. But my
measure of "win" may be different from yours. I don't care about
maximizing the idle fraction of CPU; I care about making my system's
UI as responsive and jitter-free as possible, even though the bulk of
the SoC's throughput to DRAM is occupied by video capture/encode/
decode/display traffic.
>> Are you implying there is some not yet benchmarked case where it
>> performs significantly better?
> Oh yes. Presently, only when combined with APIs that sling structures
> opaquely as composite types, and code that uses NEON intrinsics to
> load and store them.
Sounds like poor API design.
Hey, I'd love to have an official ABI in which I get to choose,
function by function, whether Q0-Q3 are parameter-passing/scratch
registers or callee-save. Failing that, I am making do with what I do
have, which is a kludge that I can hide behind some C++ template
magic. In no way do I consider this a shining beacon of API design;
but for embedded work, I'll take an adequately documented, somewhat
ugly, screaming fast API over an elegant but slow one every time.
YMMV.
> But I am expecting those techniques to become common inside template
> libraries within the next couple of years.
If you are right, that's yet another reason to avoid such libraries.
This is the crux of the matter, isn't it? I don't begin to understand
most of the techniques at work inside GCC, let alone G++ or Boost, but
I am quite content to use them. And, when necessary, to learn how to
abuse them for fun and profit.
> And even in some non-template libraries; you might take a look at the
> NEON specializations inside Cairo and libjpeg-turbo, and extrapolate
> those to the hard-float case.
I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.
As I wrote, "extrapolate those to the hard-float case". If you look
at the code a bit, perhaps you can see the potential benefit of
refactoring libjpeg-turbo so that jsimd_idct_ifast_neon() is written
using compiler intrinsics rather than raw assembly, and letting the
compiler handle register allocation and load/store latencies? And of
rewriting idct_helper and transpose_4x4 as inline functions, operating
on the 8x8 block of 16-bit coefficients -- i.e., a 128-byte chunk of
data passed by value? That's exactly what the datatypes defined in
AAPCS are for.
In this particular instance it wouldn't make any difference if we were
to pass operands by value into the innermost "publicly visible"
function, simply because they are too large (128 bytes). But a more
extensive refactor would permit this function to be inlined, and that
would definitely tighten things up. Point being, I didn't mean to say
that these techniques (multiple-cache-line-sized loads/stores, use of
containerized vector datatypes and pass-by-value) were already in use
in these libraries. I meant what I said, which is that I expect them
to become common in libraries where they are worth the effort, which
will include some non-template libraries.
>> Using VFP register parameters (i.e. doing nothing) is never less
>> efficient than moving them to core registers (doing something).
> On the contrary; hardfp can definitely be a net lose on real code.
> Consider cases where the outer function slings structures with mixed
> integers and floats, and the inner function does the actual floating
> point arithmetic. The hardfp convention requires the caller to
> transfer floating point parameters into VFP registers before entering
> the function, rather than leaving them in integer registers (where
> they can be put for free, because they are already in L1).
Sounds like that API really ought to be passing a pointer to a struct,
not passing the struct by value.
The inner function doesn't know anything about the struct; it operates
on bare floats/doubles. The outer function slings mixed structures,
and as soon as it touches them at all, it has them in L1. Under the
softfp convention, the outer function can pull the floating point
operands of the inner function into integer registers any time it's
convenient, maybe as part of an LDM that pulls in some integer/pointer
elements of the same struct. Then they just need to be spilled out
onto the stack for the function call, either in the caller (for
operands beyond the first 4 words' worth) or in the callee (typically
in the function preamble).
The callee loads them into VFP registers; at hardware level, this
happens via a lookaside to L1, so it's basically free as far as memory
traffic goes. As long as enough useful work can be scheduled in the
callee to cover the VLD latency, it's all good. That's one reason why
conventional benchmarks of hardfp vs. softfp don't show any benefit on
real code. (Who writes code that has inner loops over publicly
visible APIs in which both caller and callee do floating point or SIMD
arithmetic on the same values -- and thus produces a noticeable
pipeline stall from spilling a computed parameter out of the VFP bank
and then back in? One doing arithmetic, and the other doing loads/
stores, simply doesn't count.)
Back to the specific example I cited: Under the hardfp convention,
the floating point operands have to get moved over to the VFP side
before the function call, which would involve two VMOVs per 64-bit
operand. That's stupid, so instead it gets done by a spill to stack
followed by a VLD, or by a separate load from the original structure.
This may no longer be in L1, of course, so there's an opportunity for
the compiler to screw up; a well-written compiler shouldn't. So
basically, there's going to be a VLD from stack either right before or
right after the branch. The net effect is almost certainly trivial --
as I said -- but either hardfp or softfp could be a (slight) win.
> That's probably a trivial effect; but at least on Cortex-A8, there are
> others that hit some code bases much harder. What if the callee does
> no arithmetic, but passes the argument to a variadic function? Or the
> callee returns a value fetched from memory, which happens to be
> floating point, and the caller turns around and sticks it into an
> otherwise integer-filled structure? Either way, you take the full hit
> of the transfer to D0 and back to the integer side, for nothing.
You seem to be missing something about how structs are actually
represented at the backend of a compiler.
Educate me. I say I have a double X in a struct in memory, which I
want to pass to non-variadic function A, which then passes it to
variadic function B. The hardfp convention requires that I pull X
into D0 before branching to A, which has to move it from D0 to r0+r1
before passing it to B. What about "how structs are actually
represented at the backend of a compiler" saves me from the overhead
of this maneuver, relative to the softfp convention (in which X is in
r0+r1 for the call to A and needn't be touched before A calls B)?
In the second example in that paragraph, I call function C, which
returns a double Y (fetched from memory, not computed). I want to
stick this in a struct along with integer J and pointer Q. The hardfp
convention requires that Y be returned in D0, and to get it into the
struct I may need to issue three separate stores (STR, VSTR, STR --
assuming Y is between J and Q and I'm exploiting address post-
increment). In the softfp convention, Y will be returned in r0+r1,
and all I have to do is shuffle it into appropriate registers and
issue one STM.
This is, of course, all small stuff. All that I'm trying to show is
that one shouldn't look for system-wide wins from the hardfp ABI in
the "obvious" places, because 1) real code doesn't often do things
that cause softfp to lose significantly, and 2) real code does often
do things that cause hardfp to lose slightly. To make hardfp win, you
have to exploit its "invisible" benefits, which are mostly about
covering memory latencies by using Q0-Q3 to pass values into and out
of functions that are *still in-flight* as cache-line-sized memory
transactions.
>> Are you saying increased use of NEON by gcc will make hardfp calls
>> slower?
> The reverse; but I can understand your reading my contorted syntax
> that way. I expect that GCC will get better at using the NEON unit
> for non-floating-point purposes. That will make it worthwhile for
> core libraries, from eglibc and libstdc++ on up, to adapt their
> internal calling conventions to permit the sort of "stupid
> rescheduling tricks" that win when building hardfp.
> You may say that it shouldn't matter for APIs that aren't "publicly
> visible", and that no human-readable API should do stupid things like
> pass an opaque operand in Q0-Q3 and return it unchanged as its return
> value (still in Q0-Q3).
Such a constraint cannot be expressed in a C API (nor a C++ one AFAIK).
To make that work, you'd have to either:
1. Change the ABI spec.
2. Teach the compiler extended semantics about specific functions in the
same way it already recognises many standard library calls.
3. Write all code by hand in assembler with no standard calling
conventions at all.
None of these seem particularly compelling, nor likely to happen.
4. None of the above. Simply change the calling conventions on your
inner functions from

    int myfunc(char* p, double x)

to

    c64byte_t myfunc(c64byte_t blob, int* result, char* p, double x)

and replace each "return r;" with "*result = r; return blob;". Call
sites change from

    n = myfunc(q, y);

to

    blob = myfunc(blob, &n, q, y);
This is only useful if you want to reorder -- by hand; the compiler
won't do it for you -- a fetch to "blob", from after the call to
myfunc() to before it. But that's exactly what I want to do a lot of
the time, because the real return value of myfunc() needs to be stuck
into a data structure that isn't in cache. So I want to go ahead and
schedule the fetch of this data structure into Q0-Q3 before the call
to myfunc(); execute the body of myfunc() while the fetch is still in
flight; and update the data structure before storing it right back.
In many cases, I want to bypass the cache hierarchy entirely in both
directions, because that data structure probably won't be touched
again until after it has aged out of L2 anyway. So the fetch and
store of "blob" are done via NEON intrinsics through a pointer that
lies in an uncacheable mapping. Currently this is another constraint
that cannot be expressed in a C or C++ API; but I don't intend to let
that stop me, either.
>> The A9 and later indeed make the softfp calls less costly, reducing any
>> advantage hardfp might have (which is already small in benchmarks on A8).
> Even the idea that A9 is less friendly *overall* to hardfp than A8 is
> debatable, at the current level of compiler implementation.
The A9 is not in any way "less friendly" to hardfp. It is, however,
less hostile to softfp.
Its cache hierarchy is different, in ways that are not fully described
in the TRM. Its automatic prefetch mechanism is also still somewhat
unproven, especially on the load patterns I care about. I consider it
debatable that it is either "less friendly" to hardfp or "less
hostile" to softfp in any way that matters. But as I said before, I
don't really wish to debate it without data.
>> Do you have any numbers to back this up? I don't see how going through
>> NEON registers would be faster than direct LDM/STM on any core.
> I will produce those numbers within the month, or admit defeat.
> Seriously, I'd better be able to substantiate this by mid-July or so,
> or my team is going to have to rethink certain aspects of one of its
> current development efforts.
I'm glad I'm not invested in that effort.
On this I suppose we agree. You have an admirable track record as a
coder, and clearly also a deep understanding of some aspects of the
OMAP chip series. But you seem awfully sure that your bag of tricks
contains all the tricks that matter. That attitude gets tiresome
after a while.
>> The out of order issue on A9 and later makes most such tricks unnecessary.
> Er, no. Out of order issue helps reduce bubbles in the ALU for math-
> intensive loads whose working set fits in cache.
Out of order issue potentially allows a load to be issued sooner than it
appears in the instruction stream, thus hiding some of the latency
whether it hits L1 or not.
As I understand it, the A9's out-of-order execution capabilities are
not on the scale that would be needed to cover latency to DRAM. I'm
aware of how speculative loads and stride-detection-based auto-
prefetching work, and they certainly have their uses. But as much as
I would like to believe that trampolining loads through the NEON will
be unnecessary on the A9, my experience with the much more extensive
out-of-order capabilities of server-class 64-bit architectures leads
me to believe otherwise.
>> To the extent scheduling across function calls is permitted by the C
>> standard, the manner of passing parameters has no bearing on such
>> optimisations.
> OK, I admit that I'm planning to cheat here. I'm going to keep state
> that the compiler would otherwise allocate to the callee-save
> registers in Q0-Q3, and keep passing this block into and back out of
> mostly non-floating-point-using APIs, which effectively makes it
> callee-save state that doesn't wind up being touched by the callee.
So you've modified the ABI again.
You can call it that if you like. Unlike actually "modifying the
ABI", this doesn't involve any change to the compiler. So I like to
think that I'm modifying a layer between the human-visible API and the
actual ABI, in much the way that C++ iostreams and templates like
Glib::ustring::compose() do.
> When combined with the neon-d16-fp16 model, this should induce the
> compiler to use Q4-Q7 as its NEON working set. Since it knows this
> range is callee-save, it's safe to schedule loads with ample provision
> for cache miss latency, even if it has to move them across function/
> method calls.
So you've reduced the number of NEON registers from 32 to 8, and you're
hoping this will somehow improve performance. The mind boggles.
Now who's waving around straw men? The load patterns that I'm worried
about don't often use NEON for algorithms that need 32 8-byte
registers. Yes, having that full bank of registers makes libjpeg-
turbo's iDCT more compact; but I don't much care, because JPEG decode
latency is not the most critical thing in my system.
Back-of-the-envelope calculations say that the single most critical
resource in *my* system is DRAM bandwidth, and that I will need to go
to quite a bit of effort to keep from frittering it away with word-
sized loads from uncacheable regions and read-modify-write cycles on
partially clobbered cache lines. Until I have benchmarks that say
differently, I'm going to focus on altering the CPU behavior to use
the memory interface efficiently rather than the other way around.
From that perspective, NEON registers are mostly placeholders for in-
flight memory transactions, and I hope to allocate them where they
will do the most good.
>> If a function is fully inlined, the compiler can of course do whatever
>> it pleases. That is the entire point of inlining.
> I think it's a little subtler than that in C++; but I am no language
> lawyer. Suffice it to say that what the compiler does *in practice*
> appears to be heavily influenced by whether there is any way for the
> method to be called through a "publicly visible" symbol.
A function identifiable as a symbol, public or not, is by definition not
inlined. It is perfectly legal for the compiler to inline some or all
calls to a function while still producing a symbol with a valid entry
point for it. If this happens, this symbol must of course behave
according to ABI rules. For the inlined "calls", there is no ABI-level
call, and thus calling conventions no longer apply.
Have you ever written a pure-header C++ library? I have. The rules
about what constitutes a "publicly visible" symbol are actually quite
intricate when they crop up, not in a "library" .o file, but in one or
more of the application-level .o files compiled against the same set
of headers. The compiler has to apply the same rules to produce
equivalent implementations of the same method in each .o, so whichever
one winds up surviving the link step can fill in for all the others.
Liberal application of __attribute__((always_inline)) helps; but this
does mix strangely with std::tr1::mem_fn, whose implementation I find
quite opaque.
In short: maybe the compiler is free to disregard the ABI on anything
it chooses to (or is forced to) inline. But that doesn't necessarily
mean that it finds every possible ABI-breaking optimization without
some hints from the library programmer. Compiler writers are human
too, and can't be expected to think of all the stupid things people
like me want to coerce the compiler into doing.
In summary, you have created your own ABI that reserves most of the
VFP/NEON registers for special uses that conflict with how AAPCS/VFP
passes floating-point arguments to functions. You then use this as
the foundation for a series of contradictory arguments for and/or
against the hardfp ABI over softfp.
My own ABI? Not really. More like my own target CPU model, and my
own techniques for wringing performance out of the hardfp ABI;
although there's really nothing original in them. I stand on the
shoulders of giants.
Contradicting arguments? I don't think so, except insofar as I was in
error on a couple of points the first time around, and tried to
correct that after you helpfully pointed out the error. If there
remain contradictions, please do point them out, and I'll attempt to
resolve them.
For and against hardfp? Yes, because hardfp does have retrograde
cases, and you have to work pretty hard to get much value out of it.
Still, I think the game is worth the candle, and I intend to prove
it. Thanks for stimulating me to articulate how.
Cheers,
- Michael