Ubuntu 10.10 issues

Phil · November 21, 2010, 7:40am

I've been working with the Ubuntu 10.10 image obtained via the link on
http://elinux.org/BeagleBoardUbuntu on my -xM and having a problem
that I'm curious if there are currently any workarounds for or if
others are even having this problem. The issue appears to be very
heavy CPU utilization (to the point that it appears the board has
locked up... but it will come back to life several minutes later) when
there is 'heavy' i/o activity (several megabytes a second) which
appears to be related to syslog being busy logging errors. When I
look at the logs, I see a couple of things going on:

1) kernel: smsc95xx usb1: kevent 2 may have been dropped (hundreds of
messages per second... despite the fact that $RepeatedMsgReduction is
on in rsyslog.conf)
2) numerous page allocation failures with backtraces

So far, I've run into this when transferring large files via sftp,
when recording via MythTV, and when playing back MythTV recordings.
When I've noticed this occurring I've run top and the board shows 400+
meg being used as cache (which has me scratching my head... isn't that
effectively free memory?), a load average of 10+, and cpu utilization
for si at well over 50% (often 99%). I'm running with no swap and a
USB hard drive for /var and /tmp and the data files being read/written
to (this problem was far worse even sooner when I was attempting to
run entirely from flash). Any ideas are appreciated and I'm happy to
collect any additional data that would be helpful.

Thanks,
Phil

Phil · November 21, 2010, 8:08pm

I was able to get a single instance of this failure captured in the
log files that is a little more manageable (about an 800k syslog) over
a 6 minute period which was approximately how long the system went
completely unresponsive. It appears that problem starts with the page
allocation failures, which results in packets being dropped while the
kernel is busy trying to free memory, it's also possible that all this
logging activity snowballs dropping more packets... that it then needs
to log as well. The problem finally seems to resolve itself when all
productive activity has essentially halted and the system is able to
get caught up with the log messages and then performance and
responsiveness return to normal. Again, while this instance is
occurring with MythTV (only the backend is running on the -xM), I've
seen the same issue when sending large files via sftp and the thing
both activities seem to have in common is they are over long
timeframes (minutes to hours) reading several megabytes a second via
Ethernet and writing it out to local disk (USB or microSD). Here's a
bit more detail as to what is being logged:

1) kernel: smsc95xx usb1: kevent 2 may have been dropped (hundreds of
messages per second... despite the fact that $RepeatedMsgReduction is

These messages begin occurring after the page allocation errors
below. 3500 occurrences of these messages.

on in rsyslog.conf)
2) numerous page allocation failures with backtraces

There were 90 page allocation failures, here is the first one:

Nov 21 14:09:19 cortex kernel: [50205.769378] __alloc_pages_slowpath:
308 callbacks suppressed
Nov 21 14:09:19 cortex kernel: [50205.769409] mythfilldatabas: page
allocation failure. order:3, mode:0x4020
Nov 21 14:09:19 cortex kernel: [50205.769470] [<c00512a0>]
(unwind_backtrace+0x0/0x100) from [<c06386c0>] (dump_stack+0x18/0x1c)
Nov 21 14:09:19 cortex kernel: [50205.769500] [<c06386c0>] (dump_stack
+0x18/0x1c) from [<c011a7f4>] (__alloc_pages_nodemask+0x60c/0x69c)
Nov 21 14:09:19 cortex kernel: [50205.769531] [<c011a7f4>]
(__alloc_pages_nodemask+0x60c/0x69c) from [<c011a8a0>]
(__get_free_pages+0x1c/0x44)
Nov 21 14:09:19 cortex kernel: [50205.769531] [<c011a8a0>]
(__get_free_pages+0x1c/0x44) from [<c0147818>] (__kmalloc_track_caller
+0x174/0x208)
Nov 21 14:09:19 cortex kernel: [50205.769561] [<c0147818>]
(__kmalloc_track_caller+0x174/0x208) from [<c052ef24>] (__alloc_skb
+0x5c/0xe8)
Nov 21 14:09:19 cortex kernel: [50205.769592] [<c052ef24>] (__alloc_skb
+0x5c/0xe8) from [<c044e714>] (rx_submit+0x2c/0x260)
Nov 21 14:09:19 cortex kernel: [50205.769622] [<c044e714>] (rx_submit
+0x2c/0x260) from [<c044eea8>] (rx_complete+0x1b4/0x1b8)
Nov 21 14:09:19 cortex kernel: [50205.769653] [<c044eea8>] (rx_complete
+0x1b4/0x1b8) from [<c045e188>] (usb_hcd_giveback_urb+0x6c/0x100)
Nov 21 14:09:19 cortex kernel: [50205.769683] [<c045e188>]
(usb_hcd_giveback_urb+0x6c/0x100) from [<c047320c>] (ehci_urb_done
+0xa8/0xc8)
Nov 21 14:09:19 cortex kernel: [50205.769683] [<c047320c>]
(ehci_urb_done+0xa8/0xc8) from [<c04734b4>] (qh_completions
+0x288/0x53c)
Nov 21 14:09:19 cortex kernel: [50205.769714] [<c04734b4>]
(qh_completions+0x288/0x53c) from [<c0475268>] (ehci_work+0x9c/0x109c)
Nov 21 14:09:19 cortex kernel: [50205.769714] [<c0475268>] (ehci_work
+0x9c/0x109c) from [<c0479230>] (ehci_irq+0x314/0x390)
Nov 21 14:09:19 cortex kernel: [50205.769744] [<c0479230>] (ehci_irq
+0x314/0x390) from [<c045d288>] (usb_hcd_irq+0x40/0xa4)
Nov 21 14:09:19 cortex kernel: [50205.769775] [<c045d288>] (usb_hcd_irq
+0x40/0xa4) from [<c00e4708>] (handle_IRQ_event+0x78/0x238)
Nov 21 14:09:19 cortex kernel: [50205.769805] [<c00e4708>]
(handle_IRQ_event+0x78/0x238) from [<c00e6de8>] (handle_level_irq
+0xb0/0x174)
Nov 21 14:09:19 cortex kernel: [50205.769836] [<c00e6de8>]
(handle_level_irq+0xb0/0x174) from [<c004a04c>] (asm_do_IRQ+0x4c/0x98)
Nov 21 14:09:19 cortex kernel: [50205.769866] [<c004a04c>] (asm_do_IRQ
+0x4c/0x98) from [<c063bb6c>] (__irq_svc+0xac/0x144)
Nov 21 14:09:19 cortex kernel: [50205.769866] Exception
stack(0xcfd81c58 to 0xcfd81ca0)
Nov 21 14:09:19 cortex kernel: [50205.769897]
1c40: 00000001
c1658040
Nov 21 14:09:19 cortex kernel: [50205.769897] 1c60: c1658040 c1658040
a0000013 00000000 c1658040 00000001 00000001 cfd81e18
Nov 21 14:09:19 cortex kernel: [50205.769927] 1c80: d4776cac cfd81ccc
c1658034 cfd81ca0 c007f774 c007e0e4 60000013 ffffffff
Nov 21 14:09:19 cortex kernel: [50205.769958] [<c063bb6c>] (__irq_svc
+0xac/0x144) from [<c007e0e4>] (__wake_up_sync_key+0x54/0x84)
Nov 21 14:09:19 cortex kernel: [50205.769958] [<c007e0e4>]
(__wake_up_sync_key+0x54/0x84) from [<c007e128>] (__wake_up_sync
+0x14/0x18)
Nov 21 14:09:19 cortex kernel: [50205.769989] [<c007e128>]
(__wake_up_sync+0x14/0x18) from [<c05bda70>] (unix_write_space+0x4c/
0x98)
Nov 21 14:09:19 cortex kernel: [50205.770019] [<c05bda70>]
(unix_write_space+0x4c/0x98) from [<c0529670>] (sock_wfree+0x74/0x7c)
Nov 21 14:09:19 cortex kernel: [50205.770050] [<c0529670>] (sock_wfree
+0x74/0x7c) from [<c052dfec>] (skb_release_head_state+0x8c/0x134)
Nov 21 14:09:19 cortex kernel: [50205.770050] [<c052dfec>]
(skb_release_head_state+0x8c/0x134) from [<c052dcec>] (__kfree_skb
+0x14/0xcc)
Nov 21 14:09:19 cortex kernel: [50205.770080] [<c052dcec>] (__kfree_skb
+0x14/0xcc) from [<c052de34>] (kfree_skb+0x44/0xcc)
Nov 21 14:09:19 cortex kernel: [50205.770111] [<c052de34>] (kfree_skb
+0x44/0xcc) from [<c05bd0f0>] (unix_stream_recvmsg+0x2bc/0x6a8)
Nov 21 14:09:19 cortex kernel: [50205.770111] [<c05bd0f0>]
(unix_stream_recvmsg+0x2bc/0x6a8) from [<c0524fd8>] (sock_aio_read
+0x174/0x17c)
Nov 21 14:09:19 cortex kernel: [50205.770141] [<c0524fd8>]
(sock_aio_read+0x174/0x17c) from [<c0151578>] (do_sync_read+0xb4/0xf4)
Nov 21 14:09:19 cortex kernel: [50205.770172] [<c0151578>]
(do_sync_read+0xb4/0xf4) from [<c01524e4>] (vfs_read+0x1b0/0x1b8)
Nov 21 14:09:19 cortex kernel: [50205.770202] [<c01524e4>] (vfs_read
+0x1b0/0x1b8) from [<c01525bc>] (sys_read+0x44/0x74)
Nov 21 14:09:19 cortex kernel: [50205.770202] [<c01525bc>] (sys_read
+0x44/0x74) from [<c004aae0>] (ret_fast_syscall+0x0/0x30)
Nov 21 14:09:19 cortex kernel: [50205.770233] Mem-info:
Nov 21 14:09:19 cortex kernel: [50205.770233] Normal per-cpu:
Nov 21 14:09:19 cortex kernel: [50205.770233] CPU 0: hi: 186,
btch: 31 usd: 207
Nov 21 14:09:19 cortex kernel: [50205.770263] active_anon:2758
inactive_anon:8787 isolated_anon:0
Nov 21 14:09:19 cortex kernel: [50205.770263] active_file:49967
inactive_file:24828 isolated_file:0
Nov 21 14:09:19 cortex kernel: [50205.770263] unevictable:0 dirty:330
writeback:0 unstable:0
Nov 21 14:09:19 cortex kernel: [50205.770294] free:28320
slab_reclaimable:4700 slab_unreclaimable:1447
Nov 21 14:09:19 cortex kernel: [50205.770294] mapped:5846 shmem:157
pagetables:320 bounce:0
Nov 21 14:09:19 cortex kernel: [50205.770324] Normal free:113280kB min:
2884kB low:3604kB high:4324kB active_anon:11032kB inactive_anon:
35148kB active_file:199868kB inactive_file:99312kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:0kB
dirty:1320kB writeback:0kB mapped:23384kB shmem:628kB slab_reclaimable:
18800kB slab_unreclaimable:5788kB kernel_stack:1960kB pagetables:
1280kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
Nov 21 14:09:19 cortex kernel: [50205.770355] lowmem_reserve: 0 0
Nov 21 14:09:19 cortex kernel: [50205.770385] Normal: 446*4kB 7305*8kB
3306*16kB 5*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 113280kB
Nov 21 14:09:19 cortex kernel: [50205.770416] 74955 total pagecache
pages
Nov 21 14:09:19 cortex kernel: [50205.770416] 0 pages in swap cache
Nov 21 14:09:19 cortex kernel: [50205.770446] Swap cache stats: add 0,
delete 0, find 0/0
Nov 21 14:09:19 cortex kernel: [50205.770446] Free swap = 0kB
Nov 21 14:09:19 cortex kernel: [50205.770446] Total swap = 0kB
Nov 21 14:09:19 cortex kernel: [50205.786865] 131072 pages of RAM
Nov 21 14:09:19 cortex kernel: [50205.786865] 28877 free pages
Nov 21 14:09:19 cortex kernel: [50205.786895] 8735 reserved pages
Nov 21 14:09:19 cortex kernel: [50205.786895] 5641 slab pages
Nov 21 14:09:19 cortex kernel: [50205.786895] 50406 pages shared
Nov 21 14:09:19 cortex kernel: [50205.786895] 0 pages swap cached

Any ideas are appreciated,
Phil

RobertCNelson · November 23, 2010, 12:58am

Hi Phil,

I'm seeing that same smsc95xx error message on one of my xM test box..
(with 2.6.35 and 2.6.36)

does 2.6.37-rc3 help for you?

install-me: (assuming maverick) http://rcn-ee.net/deb/maverick/v2.6.37-rc3-dl1/

Does it help to add a swap file? it looks like it's oom'ing too..

Regards,

Phil · November 23, 2010, 6:11am

Robert,

Hi Phil,

I'm seeing that same smsc95xx error message on one of my xM test box..
(with 2.6.35 and 2.6.36)

does 2.6.37-rc3 help for you?

install-me: (assuming maverick) http://rcn-ee.net/deb/maverick/v2.6.37-rc3-dl1/

I'm currently running 2.6.35.7-l6 and want to get a handle on what
I've just discovered (below) before I start changing things since it
seems like I'm getting a bit closer to understanding at least part of
what's going on.

Does it help to add a swap file? it looks like it's oom'ing too..

I was prepared to deny I was oom until I caught this with dstat a few
minutes ago:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---
system-- ------memory-usage-----
usr sys idl wai hiq siq| read writ| recv send| in out | int csw

used buff cach free

20 10 52 8 0 10| 569k 1305k|1430k 1749k| 0 0 |2750 4001

67.2M 5360k 401M 4444k

18 10 51 10 0 12| 460k 1482k|1450k 1736k| 0 0 |2727 4020

67.9M 5400k 400M 4688k

46 18 8 14 0 13|1409k 1465k|1450k 357k| 0 0 |2172 3821

80.4M 5756k 374M 18.3M

61 24 0 2 0 13|1516k 1506k|1428k 692B| 0 0 |2067 2893

381M 1264k 92.0M 3996k

8 1 0 0 0 90| 728k 1120k|1153k 1535B| 0 0 |2485 2230

394M 676k 78.9M 4504k missed 7 ticks

1 1 0 0 0 98|1533k 392k| 366k 468B| 0 0 |2016 751

402M 748k 66.7M 9076k missed 10 ticks

37 14 27 11 0 10|2934k 1338k|1319k 896B| 0 0 |2037 3305

64.2M 1516k 93.3M 319M

16 6 66 8 0 4|2458B 1434k|1427k 720B| 0 0 |1562 3749

64.3M 1624k 107M 305M

It appears that mythbackend *is* spiking periodically (at the time
free memory dropped to 18.3M I was showing 4 mythbackend processes in
top with 3 of them using ~120M each, before things settled back down
and the system showed a single mythbackend at 17M) This rather
surprised me but it is at least making sense as to why I'd be seeing
allocation errors. I really am running out of memory... strangely,
that makes me feel better.

It seems like I'm riding the edge of what I can do with the
combination of a 5400 RPM drive and the -xM (at least until the clock
speed can get bumped back up to 1GHz which will probably help to a
certain extent) so I'm trying not to throw any additional work at the
board and drive, such as swap, since I fear I'm already perilously
close to driving right off the edge of the cliff from a performance
standpoint.

Regards,

--
Robert Nelsonhttp://www.rcn-ee.com/

Thanks,
Phil

Phil · November 23, 2010, 6:42am

In my excitement of actually capturing a more detailed picture of this
problem, I mistakenly said free memory dropped when it actually
jumped. But you can see things got a little weird after that with
cache dropping like a stone and si jumping... which I believe is a
snapshot of the intermittent scenario that's been causing me
heartburn. Now to figure out what's causing it...

Thanks,
Phil

RobertCNelson · November 24, 2010, 1:29am

Actually i can give you that speed bump back to 1Ghz, i've only tested
it over the weekend and my xM's running about 115F under full load:

You'll have to build the kernel yourself, using this easy script:

https://code.launchpad.net/~beagleboard-kernel/+junk/2.6-stable

in "patch.sh" uncomment line 105

http://bazaar.launchpad.net/~beagleboard-kernel/%2Bjunk/2.6-stable/annotate/head:/patch.sh#L105

#patch -s -p1 <
"${DIR}/patches/beagle/0001-omap-beagle-set-core-voltage-to-1.35-for-xM.patch"

Then, drop the setenv mpurate 800 in boot.scr with that new uImage and
modules generated..

I figure another week or two of hard testing before i enable it by default..

BTW, My test beagle's smsc95xx survived the weekend with no errors
with 2.6.37-rc2, so something's wrong with 2.6.35/36 with that
driver..

Regards,

Phil · November 24, 2010, 6:39am

Actually i can give you that speed bump back to 1Ghz, i've only tested
it over the weekend and my xM's running about 115F under full load:

You'll have to build the kernel yourself, using this easy script:

https://code.launchpad.net/~beagleboard-kernel/+junk/2.6-stable

in "patch.sh" uncomment line 105

http://bazaar.launchpad.net/~beagleboard-kernel/%2Bjunk/2.6-stable/an…

#patch -s -p1 <
"${DIR}/patches/beagle/0001-omap-beagle-set-core-voltage-to-1.35-for-xM.pat ch"

Then, drop the setenv mpurate 800 in boot.scr with that new uImage and
modules generated..

I figure another week or two of hard testing before i enable it by default..

BTW, My test beagle's smsc95xx survived the weekend with no errors
with 2.6.37-rc2, so something's wrong with 2.6.35/36 with that
driver..

That's great news and I'll be checking that out soon... it looks like
in some situations, I'm definitely running out of CPU cycles.

That also brings to mind a somewhat related question (at least re: the
kernel and performance), when I turn off video via:
echo 0 > /sys/devices/platform/omapdss/display0/enabled
echo 0 > /sys/devices/platform/omapdss/display1/enabled
... do you know if that is reducing the amount of bandwidth used on
the memory bus that can be used as needed by the USB interface etc?
I'm unclear as to how much circuitry is actually getting shut down by
turning off the display.

Regards,

--
Robert Nelsonhttp://www.rcn-ee.com/

Thanks,
Phil

Andrew_Burgess · November 24, 2010, 11:37am

it would be interesting to write a little program that tried
to show the memory bandwidth changing. maybe just a memory
test that printed it's speed? or something less cache friendly.

Phil · November 24, 2010, 7:00pm

To try to get some sort of ballbark comparison, I installed sysbench
and ran the memory test (sysbench --test=memory --memory-block-
size=10M --memory-total-size=50G run) and while it didn't quite do
what I asked it to (it only transfered 2G despite the 50G setting), it
consistently reported ~825M/s with the display turned on and ~925M/s
with the display turned off (the numbers bounced around by +/-5 meg
between runs on an otherwise idle system). If these numbers are
correct, in addition to the power savings, you can get at a bit more
bandwidth available for other purposes if you don't need video out.

pdedross · January 18, 2011, 11:40am

Has anyone had any luck in fixing this problem? I having exactly the
same problem on my beagle-xm-rev-b which is also running mythbackend.
I am running 2.6.35.9-l9 (http://rcn-ee.net/deb/maverick/v2.6.35.9-l9)
that came with the prebuilt maverick build (http://rcn-ee.net/deb/
rootfs/maverick/ubuntu-10.10-r3-minimal-armel.tar.7z) , with the clock
limited to 800Mhz.

Which kernel version do I need to use to be able to safely up the
mpurate to 1Ghz? (Or is the "set-core-voltage" patch still not enabled
by default? I had trouble finding the reference to this patch in the
latest patch.sh, has it been superseeded by a new patch?)

Is that latest thinking that the problem is caused by oom, leading to
this massive number of log entries, which then saturates the CPU?

Many thanks, Paul

RobertCNelson · January 18, 2011, 2:00pm

Hi Paul,

Yeah there's better stuff that's hit linux-omap lately for 720 on c4,
and 800/1000 for the xM so i've been merging that..

But right now, I'm tracking down a usb issue with the smsc lan chip..

smsc95xx 1-2.1:1.0: usb1: kevent 2 may have been dropped

It can overload dmesg pretty quick:

http://rcn-ee.homeip.net:81/dl/gcc/sys/dmesg-168887-2.6.37-x1-beagle-xma-512mb.log

Regards,

pdedross · January 19, 2011, 10:02am

Thanks Rob,

Is there anything I can do to help? I would be happy to test any
candiate releases that attempt fix the issue as I can readily recreate
it. Do you have a suspicion as to where the problem might be (driver
for the smsc95: smsc95xx.c? or higher up the stack, perhaps
usbnet.c?), I'm not an experienced kernel developer, but I don't
having a look to see what I can find.

Thanks, Paul

pdedross · January 19, 2011, 10:07am

Following on, I've just noticed the following comment above the
function that is issuing the "smsc95xx 1-2.1:1.0: usb1: kevent 2 may
have been dropped" error message (usbnet_defer_kevent):

/* some work can't be done in tasklets, so we use keventd

pdedross · February 13, 2011, 1:18pm

These guys are also experiencing the problem, and have some some good
investigation work. It sounds like the smsc95 driver might be
allocating too much memory, which results in the system running short
of larger pages. They suggest increasing /proc/sys/vm/min_free_kbytes
value to 4096 and setting vm.min_free_kbytes = 4096 in sysctl.conf as
a quick and dirty work-around for the problem. I'll give it a go
tonight a report the results. Has anyone else had any joy in taming
this problem.

Thanks, Paul

RobertCNelson · February 13, 2011, 5:29pm

Hi Paul,

Tried this on my panda..

voodoo@panda-a1-1gb:~$ cat /proc/sys/vm/min_free_kbytes
4096
voodoo@panda-a1-1gb:~$ cat /etc/sysctl.conf | grep free
vm.min_free_kbytes = 4096

Still fighting the smsc95xx.

voodoo@panda-a1-1gb:~$ dmesg | tail
[ 4861.514007] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.520812] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.527679] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.534454] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.541198] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.547912] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.554626] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.561370] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.568084] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped
[ 4861.574798] smsc95xx 1-1.1:1.0: usb1: kevent 2 may have been dropped

Regards,

Aaron1 · February 13, 2011, 11:11pm

I've been running into the same problem, with myth-backend dropping
packets and logs getting distressingly large. I tried the
min_free_bytes trick, and, like Robert, it didn't help at 4096.
*However*, at min_free_bytes=8192, my problems seem to have
disappeared, or at least gotten significantly better.

I was getting the "kevent 2 may have been dropped" messages every few
seconds. I changed the setting to 8192 in the middle of a recording
and so far it's been over an hour of HD recording with no logs of that
sort at all. I'll keep an eye on this over the next few days, but
setting min_free_bytes to 8192 seems promising to me.

aaron

RobertCNelson · February 14, 2011, 12:12pm

Thanks Aaron,

Looks like 8192 did the trick, 13 hours into a gcc build/regression
test, no kevent's on my panda.. Normally it would fire one every
couple of minutes..

Regards,

Phil · February 16, 2011, 4:26am

Aaron,

Thanks for the tip! I made the same change on my -xM and while I'm
not certain the kevent issue is completely resolved (I saw some kevent
log messages after the change), it has at the very least dramatically
improved my situation. Not a single case of the board becoming
unresponsive in about 2 days.

Thanks again,
Phil

Ivan_Lisenkov · March 13, 2011, 12:33pm

Hi everyone,

Any progress with this issue?

I'm suffering from this on high loads both ethernet and usb-storage
(about 2.3 MBps download to usb disk). Increased min_free_bytes to
30000, but eventually the message appers again.

I'm using 2.6.37.3-x4 #1 SMP Wed Mar 9 22:37:01 UTC 2011 armv7l GNU/
Linux