> I guess you'd probably have to do it the way it's done for
> the DSP (which might be what you say, I don't know :).
Hm, I haven't thought about that. I'll check what's available through
cmemk.
cmemk uses ioremap_nocache()/ioremap_cached() to "allocate" memory and
then remap_pfn_range() to map it into userspace. remap_pfn_range() uses
only 4KB pages.
I suspect that these 1280 translations cost too much (provided that my
calculations are right and most of the processing time is taken by
memory accesses).
Luckily, Linux has support for supersection mappings through
ioremap_cached(). However, it is enabled only for physical addresses
above 4GB (pfn >= 0x100000), so I patched it to allow <4GB mappings:
--- a/arch/arm/mm/ioremap.c 2009-08-24 18:50:19.000000000 +0300
+++ b/arch/arm/mm/ioremap.c 2009-09-22 22:10:45.000000000 +0300
@@ -299,7 +299,7 @@ __arm_ioremap_pfn(unsigned long pfn, uns
 #ifndef CONFIG_SMP
         if (DOMAIN_IO == 0 &&
             (((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
-               cpu_is_xsc3()) && pfn >= 0x100000 &&
+               cpu_is_xsc3()) &&
                !((__pfn_to_phys(pfn) | size | addr) & ~SUPERSECTION_MASK)) {
                 area->flags |= VM_ARM_SECTION_MAPPING;
                 err = remap_area_supersections(addr, pfn, size, type);
I also wrote a simple module that loops over 16MB of memory, accessing
locations with a pointer increment of 4KB. Something like:

    for (i = 0; i < N_ITERATIONS; i++) {
        p = m;                                  /* start of the 16MB buffer */
        for (j = 0; j < N_PAGES_IN_16MB; j++) {
            temp = *p;                          /* one read per 4KB page */
            p += PAGE_SIZE / sizeof(uint32_t);
        }
    }
I ran it on memory allocated with vmalloc() and with ioremap_cached().
The results were:

    vmalloc:        1.41 ms per iteration
    ioremap_cached: 1.22 ms per iteration
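
For anyone who wants to try the same access pattern without building a module, here is a rough userspace analogue of the loop above. It is only a sketch, not the module I ran: the function names and PAGE_SIZE_BYTES are made up for illustration, and a malloc()ed buffer in userspace will be mapped with 4KB pages, so it corresponds to the vmalloc() case at best.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define PAGE_SIZE_BYTES 4096u   /* assumption: 4KB small pages */

/* Read one 32-bit word per page. The sum is returned (and the pointer is
 * volatile) so that the compiler cannot drop the reads. */
static uint32_t touch_pages(const volatile uint32_t *m, size_t n_pages)
{
    const volatile uint32_t *p = m;
    uint32_t sum = 0;
    size_t j;

    for (j = 0; j < n_pages; j++) {
        sum += *p;
        p += PAGE_SIZE_BYTES / sizeof(uint32_t);
    }
    return sum;
}

/* Average wall-clock time of one pass over the buffer, in milliseconds. */
static double ms_per_iteration(const volatile uint32_t *m, size_t n_pages,
                               unsigned n_iterations)
{
    struct timespec t0, t1;
    double ns;
    unsigned i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < n_iterations; i++)
        (void)touch_pages(m, n_pages);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / 1e6 / (double)n_iterations;
}
```

Calling ms_per_iteration(buf, 4096, N) on a 16MB buffer gives a number comparable in spirit to the ones above, but the absolute values will depend on the board, the clock and the cache state.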
Just to clarify, on each iteration I make 4096 memory accesses, with
successive accesses being 4096B apart. This causes constant TLB
thrashing, as the DTLB has only 32 entries.
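
To make the thrashing argument concrete, here is a small sketch that counts how many TLB entries a 16MB region needs for each ARM mapping size. The 32-entry DTLB figure is the one quoted above; the helper names are just for illustration.

```c
#include <stddef.h>

enum {
    SMALL_PAGE   = 4 * 1024,         /* 4KB small page   */
    SECTION      = 1024 * 1024,      /* 1MB section      */
    SUPERSECTION = 16 * 1024 * 1024  /* 16MB supersection */
};

#define DTLB_ENTRIES 32u  /* DTLB size as stated above */

/* Number of translations needed to cover region_bytes when each
 * translation maps map_bytes (rounded up). */
static size_t tlb_entries_needed(size_t region_bytes, size_t map_bytes)
{
    return (region_bytes + map_bytes - 1) / map_bytes;
}

/* Non-zero if the whole region can stay resident in the DTLB. */
static int fits_in_dtlb(size_t region_bytes, size_t map_bytes)
{
    return tlb_entries_needed(region_bytes, map_bytes) <= DTLB_ENTRIES;
}
```

With 4KB pages the 16MB buffer needs 4096 entries, far more than the DTLB holds, so every access in the loop misses. With 1MB sections it needs 16 entries, and with a single supersection just one.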
It seems that the cost of TLB thrashing alone is not too big (~0.2ms
for 4096 misses). However, in my image processing code I have
non-linear access patterns, and I try to make use of PLD instructions.
PLD does nothing when the address results in a TLB miss, so the
non-working PLDs add further cost to my algorithm.
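
For reference, the ~0.2ms figure can be turned into a per-miss cost. This is only back-of-the-envelope arithmetic: it assumes that the whole difference between the two runs is due to DTLB misses and that the ioremap_cached() run takes essentially none. The function name is made up.

```c
/* Estimated cost of one TLB miss in nanoseconds, given the per-iteration
 * times (in ms) of the run with misses and the run without, and the
 * number of accesses per iteration. */
static double ns_per_miss(double with_misses_ms, double without_misses_ms,
                          unsigned accesses)
{
    return (with_misses_ms - without_misses_ms) * 1e6 / (double)accesses;
}
```

ns_per_miss(1.41, 1.22, 4096) works out to roughly 46 ns per miss for the numbers above.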
The next thing to try is to port my stuff to kernelspace and use
ioremap_cached(). Poor me... How I wish there was a remap_pfn_range()
that could map supersections. Maybe I should try to write one, but I
don't feel like touching the advanced parts of the Linux VMM.
Ivan