Mapping memory with supersections

Hi all,

Do you know whether there's a way to map a large physically
contiguous memory area using supersections (16MB) instead of standard
4KB pages? I'd like to map it into a process's address space.

I think that Linux uses (super)section mappings for kernel memory
(0xC0000000 - 0xC8000000) (at least this was the case for the ARM926 on
DaVinci), but is it possible to create similar mappings for userspace?

Ivan

> Do you know whether there's a way to map a large physically
> contiguous memory area using supersections (16MB) instead of standard
> 4KB pages? I'd like to map it into a process's address space.

That'd only be possible if hugetlbfs were supported by the ARM kernel,
and I don't think it is.
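
(If it were supported, the userspace side would just be an mmap() of a
file on a hugetlbfs mount. A minimal sketch, assuming hugetlbfs mounted
at /mnt/huge and enough huge pages reserved -- hypothetical here, since
ARM doesn't provide it:)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (16UL * 1024 * 1024)	/* must be a multiple of the huge page size */

int main(void)
{
	void *p;
	int fd;

	fd = open("/mnt/huge/buf", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;

	/* the kernel would back this mapping with huge pages, not 4KB PTEs */
	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return 1;
	}

	/* ... use the 16MB region ... */

	munmap(p, LEN);
	close(fd);
	return 0;
}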

> I think that Linux uses (super)section mappings for kernel memory
> (0xC0000000 - 0xC8000000) (at least this was the case for the ARM926 on
> DaVinci), but is it possible to create similar mappings for userspace?

Small clarification: on ARM926 only sections are available.

Laurent

So I guess my only option would be to do my work in kernelspace,
allocating a large portion of memory at boot time (with
bigphysarea?), right?

My goal is to reduce translation table walks for this memory.

Ivan

> So I guess my only option would be to do my work in kernelspace,
> allocating a large portion of memory at boot time (with
> bigphysarea?), right?

I guess you'd probably have to do it the way it's done for
the DSP (which might be what you say, I don't know :).

> My goal is to reduce translation table walks for this memory.

Do you really have an issue with that? Can't you rearrange
your code and data so that the number of TLB misses is
reduced, instead of reducing the memory available to all
processes except yours?

Laurent

> So I guess my only option would be to do my work in kernelspace,
> allocating a large portion of memory at boot time (with
> bigphysarea?), right?

> I guess you'd probably have to do it the way it's done for
> the DSP (which might be what you say, I don't know :).

Hm, I haven't thought about that. I'll check what's available through
cmemk.

> My goal is to reduce translation table walks for this memory.

> Do you really have an issue with that? Can't you rearrange
> your code and data so that the number of TLB misses is
> reduced, instead of reducing the memory available to all
> processes except yours?

Well, I don't have a compelling reason to use supersections, but it
would be nice if I could try them. I'm writing an image processing
algorithm and I'd like to see how translation table walks hamper
performance. Sections would probably be good enough for me, since my
actual memory requirement is ~5MB.

I access all of the 5MB -- some parts linearly and other parts
in a non-linear way (I'm doing distortion correction). With 4KB
pages, 5MB of memory needs at least 1280 translations (and probably
more, if the TLB can't accommodate my non-linear access pattern).

I suspect that these 1280 translations cost too much (given that my
calculations are not heavy and most of the processing time is spent on
memory accesses).
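
(For what it's worth, the rearrangement you suggest would amount to
something like tiling the correction so that each tile only touches a
few source pages. A rough sketch with made-up names, not my actual
code, assuming the image dimensions are multiples of the tile size:)

#include <stdint.h>

/* 32x32 output pixels per tile; small enough that the source pages
 * touched per tile can stay resident in the 32-entry DTLB */
#define TILE 32

static void remap_tiled(uint8_t *dst, const uint8_t *src, int width, int height,
			int (*map_x)(int x, int y), int (*map_y)(int x, int y))
{
	int tx, ty, x, y;

	for (ty = 0; ty < height; ty += TILE)
		for (tx = 0; tx < width; tx += TILE)
			for (y = ty; y < ty + TILE; y++)
				for (x = tx; x < tx + TILE; x++)
					/* non-linear read, but confined to
					   the tile's neighbourhood in src */
					dst[y * width + x] =
						src[map_y(x, y) * width + map_x(x, y)];
}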

Ivan

> I guess you'd probably have to do it the way it's done for
> the DSP (which might be what you say, I don't know :).

> Hm, I haven't thought about that. I'll check what's available through
> cmemk.

cmemk uses ioremap_uncached()/ioremap_cached() to "allocate" memory and
then remap_pfn_range() to map it into userspace. remap_pfn_range() uses
only 4KB pages.
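
(For reference, the usual pattern looks roughly like the sketch below --
generic names, not the actual cmemk source:)

#include <linux/fs.h>
#include <linux/mm.h>

/* generic driver mmap handler, sketching the pattern: vma->vm_pgoff is
 * taken as the starting physical page frame, and remap_pfn_range()
 * installs ordinary 4KB PTEs for the whole range */
static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			    size, vma->vm_page_prot))
		return -EAGAIN;

	return 0;
}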

> I suspect that these 1280 translations cost too much (given that my
> calculations are not heavy and most of the processing time is spent on
> memory accesses).

Luckily, Linux has support for supersection mappings through
ioremap_cached(). However, it is only enabled for physical addresses
above 4GB (the pfn >= 0x100000 check), so I patched it to allow <4GB
mappings:
--- a/arch/arm/mm/ioremap.c	2009-08-24 18:50:19.000000000 +0300
+++ b/arch/arm/mm/ioremap.c	2009-09-22 22:10:45.000000000 +0300
@@ -299,7 +299,7 @@ __arm_ioremap_pfn(unsigned long pfn, uns
 #ifndef CONFIG_SMP
 	if (DOMAIN_IO == 0 &&
 	    (((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
-	      cpu_is_xsc3()) && pfn >= 0x100000 &&
+	      cpu_is_xsc3()) &&
 	   !((__pfn_to_phys(pfn) | size | addr) & ~SUPERSECTION_MASK)) {
 		area->flags |= VM_ARM_SECTION_MAPPING;
 		err = remap_area_supersections(addr, pfn, size, type);
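
(With the patch, the kernel-side usage is then just a plain
ioremap_cached() of a 16MB-aligned, 16MB-sized physical region. A
minimal sketch -- the physical address below is a made-up placeholder
for RAM reserved at boot:)

#include <linux/module.h>
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/io.h>

#define BUF_PHYS	0x84000000UL	/* hypothetical boot-reserved RAM, 16MB aligned */
#define BUF_SIZE	(16UL << 20)	/* 16MB: one supersection */

static void __iomem *buf;

static int __init supersec_init(void)
{
	/* with the patch above, this can be backed by a supersection */
	buf = ioremap_cached(BUF_PHYS, BUF_SIZE);
	if (!buf)
		return -ENOMEM;
	return 0;
}

static void __exit supersec_exit(void)
{
	iounmap(buf);
}

module_init(supersec_init);
module_exit(supersec_exit);
MODULE_LICENSE("GPL");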

I also wrote a simple module that loops over 16MB of memory, reading
one word every 4KB. Something like:

volatile uint32_t *p, temp;
uint32_t *m;	/* the 16MB buffer (vmalloc() or ioremap_cached()) */
int i, j;

for (i = 0; i < N_ITERATIONS; i++) {
    p = m;
    for (j = 0; j < N_PAGES_IN_16MB; j++) {
        temp = *p;                          /* one read per 4KB page */
        p += PAGE_SIZE / sizeof(uint32_t);  /* step to the next page */
    }
}
I ran it with memory allocated with vmalloc() and with ioremap_cached().
The results were:
vmalloc: 1.41 ms per iteration
ioremap_cached: 1.22 ms per iteration

Just to clarify, on each iteration I make 4096 memory accesses, with
successive accesses being 4096B apart. This causes constant TLB
thrashing, as the DTLB has only 32 entries.

It seems that the cost of TLB thrashing alone is not too big (~0.2ms
for 4096 misses, i.e. on the order of 50 ns per miss). However, in my
image processing code I have non-linear access patterns and I try to
make use of PLD instructions, and PLD doesn't work when the address
results in a TLB miss. These non-working PLDs add further cost to my
algorithm.
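
(The prefetching looks roughly like the sketch below -- the helper and
its call site are made-up illustrations; GCC's __builtin_prefetch emits
a PLD on ARM:)

#include <stdint.h>

/* made-up helper: on ARM, GCC turns this hint into a PLD instruction */
static inline void prefetch_pixels(const void *addr)
{
	__builtin_prefetch(addr, 0, 0);	/* read access, streaming */
}

/* issue the hint for the next (non-linear) source address while reading
 * the current one; if that address misses in the DTLB the PLD is simply
 * dropped, so the later real load still pays the full miss cost */
static uint32_t load_with_prefetch(const uint32_t *src,
				   unsigned long cur, unsigned long next)
{
	prefetch_pixels(&src[next]);
	return src[cur];
}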

The next thing to try is to port my stuff to kernelspace and use
ioremap_cached(). Poor me... How I wish there were a remap_pfn_range()
that could map supersections. Maybe I should try to write one, but I
don't feel like touching the advanced parts of the Linux VMM.

Ivan