Serial hang: Testing Paul's patches

Testing the patches

http://www.pwsan.com/omap/gptimer_workaround_3.tar.gz
http://www.pwsan.com/omap/read_die_ids.patch

boot log outputs:

<7>OMAP_TAP_IDCODE 0x0b7ae02f REV 0 HAWKEYE 0xb7ae MANF 017
<7>OMAP_TAP_DIE_ID_0: 0x00000000
<7>OMAP_TAP_DIE_ID_1: 0x00000000 DEV_REV: 0
<7>OMAP_TAP_DIE_ID_2: 0x00000000
<7>OMAP_TAP_DIE_ID_3: 0x00000000
<7>OMAP_TAP_PROD_ID_0: 0x00000000 DEV_TYPE: 0

This board has the serial hang without patches applied.

I tried patch series 3 in two kernel configurations: without and with CONFIG_DEBUG_LL enabled. My default is without debug ll. I then wondered why the OMAP_TAP output does not show up on the console at boot, only in dmesg, so for a second try I enabled debug ll.

Result:

- With the older patch series *2* yesterday I got a lot of "Timer workaround" outputs while typing at the serial console (and after some time the serial hang).

- With this patch series (*3*) I don't see any "*** GPTIMER missed match interrupt!" outputs, regardless of whether debug ll is enabled (not sure if I missed anything or if this is intended).

- *Without* debug ll enabled I get the serial hang (with patch series 3 applied) in < 10min doing something like in the attachment.

- *With* debug ll enabled, I couldn't get the serial hang within ~20min doing similar stuff like in the attachment. Maybe it simply takes longer to hang with debug ll enabled. Or I was lucky.

Most probably I did something wrong here, sorry then. But maybe it helps.

Anyway, many thanks to Paul for looking at this!

Dirk

serial_hang_test.txt (40.2 KB)

uImage with those patches applied:

http://amethyst.openembedded.net/~koen/beagleboard/demo/uImage-2.6.26-r53-beagleboard.bin


Hello Dirk,

Result:

- *Without* debug ll enabled I get serial hang (with patch series 3 applied)
in < 10min doing something like in attachment

- *With* debug ll enabled, I couldn't get serial hang within ~20min doing
similar stuff like in attachment. Maybe it simply will take longer with debug
ll enabled to hang. Or I was lucky.

Most probably I did something wrong here, sorry then. But maybe it helps.

It does help, very much. Dirk, when your Beagle serial hangs again, could
you please send a Sysrq-q (break + q on serial console) and E-mail me the
GPTIMER register dump at the bottom? It will look something like this:

<3>GPT TCRR: ffff9c66
GPT TCRR: ffff9c66
<3>GPT TMAT: ffffbfff
GPT TMAT: ffffbfff
<3>GPT TISR: 00000000
GPT TISR: 00000000
<3>GPT TIER: 00000003
GPT TIER: 00000003
<3>GPT TCLR: 00000041
GPT TCLR: 00000041
<3>GPT TOCR: 00000000
GPT TOCR: 00000000
<3>GPT TOWR: 00000000
GPT TOWR: 00000000

thank you for the TAP data and the testing help,

- Paul

after 4 hours and 32 minutes:

GPT TCRR: 20a06241
GPT TMAT: ffffbfff
GPT TISR: 00000000
GPT TIER: 00000003
GPT TCLR: 00000041
GPT TOCR: 00000000
GPT TOWR: 00000000

regards,

Koen
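For anyone who wants to read these dumps without the TRM open, here is a minimal Python sketch that parses the "GPT <reg>: <hex>" lines and names the bits that are set. The bit names (ST, AR, PRE, CE for TCLR; MAT, OVF, TCAR for TIER/TISR) are my assumption about the OMAP GPTIMER register layout and should be checked against the TRM; the helper names are made up for illustration and are not part of Paul's patches.

```python
import re

# Assumed bit positions (verify against the OMAP TRM before relying on them).
TCLR_BITS = {0: "ST", 1: "AR", 5: "PRE", 6: "CE"}
IRQ_BITS = {0: "MAT", 1: "OVF", 2: "TCAR"}  # shared by TIER and TISR

# Accepts lines with or without a printk level prefix such as "<3>".
LINE_RE = re.compile(r"^(?:<\d>)?GPT (\w+): ([0-9a-fA-F]{8})$")


def parse_gpt_dump(text):
    """Return {register name: integer value} for every GPT line in text."""
    regs = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def set_bits(value, names):
    """List the names of all known bits that are set in value."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


# Koen's dump from the message above.
DUMP = """
GPT TCRR: 20a06241
GPT TMAT: ffffbfff
GPT TISR: 00000000
GPT TIER: 00000003
GPT TCLR: 00000041
GPT TOCR: 00000000
GPT TOWR: 00000000
"""

regs = parse_gpt_dump(DUMP)
print("TCLR set:", set_bits(regs["TCLR"], TCLR_BITS))
print("TIER set:", set_bits(regs["TIER"], IRQ_BITS))
print("TISR set:", set_bits(regs["TISR"], IRQ_BITS))
```

If the assumed layout is right, this dump reads as: timer started with compare enabled and auto-reload off, match and overflow interrupts enabled, and no interrupt pending in TISR.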

After Paul explained to me that I have to use 'ctrl-a f q' in minicom to send Sysrq-q (thanks!), I get the serial hang after ~27min with debug ll disabled:

-- cut --
root@beagleboard:~# SysRq : Show Pending Timers
Timer List Version: v0.3
HRTIMER_MAX_CLOCK_BASES: 2
now at 1724156066894 nsecs

cpu: 0
  clock 0:
   .index: 0
   .resolution: 1 nsecs
   .get_time: ktime_get_real
   .offset: 1216686567818359375 nsecs
active timers:
  clock 1:
   .index: 1
   .resolution: 1 nsecs
   .get_time: ktime_get
   .offset: 0 nsecs
active timers:
  #0: <c03f3d98>, hrtimer_wakeup, S:01, do_nanosleep, xkbd/1580
  # expires at 1382835683593 nsecs [in -341320383301 nsecs]
  #1: <c03f3d98>, hrtimer_wakeup, S:01, do_nanosleep, ipaq-sleep/1571
  # expires at 1383926635742 nsecs [in -340229431152 nsecs]
  #2: <c03f3d98>, tick_sched_timer, S:01, tick_nohz_restart_sched_tick, swapper/0
  # expires at 1656406250000 nsecs [in -67749816894 nsecs]
  #3: <c03f3d98>, it_real_fn, S:01, do_setitimer, Xfbdev/1498
  # expires at 1656428782958 nsecs [in -67727283936 nsecs]
   .expires_next : 1382835683593 nsecs
   .hres_active : 1
   .nr_events : 146275
   .nohz_mode : 2
   .idle_tick : 1382828125000 nsecs
   .tick_stopped : 0
   .idle_jiffies : 138601
   .idle_calls : 686178
   .idle_sleeps : 668180
   .idle_entrytime : 1722605987548 nsecs
   .idle_waketime : 1656403564453 nsecs
   .idle_exittime : 1656403747558 nsecs
   .idle_sleeptime : 1663773041876 nsecs
   .last_jiffies : 173619
   .next_jiffies : 173620
   .idle_expires : 1382835937500 nsecs
jiffies: 173619

Tick Device: mode: 1
Clock Event Device: gp timer
  max_delta_ns: 2147483647
  min_delta_ns: 30517
  mult: 140737
  shift: 32
  mode: 3
  next_event: 1382835683593 nsecs
  set_next_event: omap2_gp_timer_set_next_event
  set_mode: omap2_gp_timer_set_mode
  event_handler: hrtimer_interrupt

GPT TCRR: 00aabe07
GPT TMAT: ffffbfff
GPT TISR: 00000000
GPT TIER: 00000003
GPT TCLR: 00000041
GPT TOCR: 00000000
GPT TOWR: 00000000
-- cut --

No "*** GPTIMER missed match interrupt!", though.

Doing 'ctrl-a f q' several times, "now at XXX nsecs" and GPT TCRR: ZZZZ still increase. The other GPT values stay the same.

Dirk
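A rough arithmetic cross-check of the numbers in the dump above, as a Python sketch. It assumes the usual clockevents conversion ticks = (ns * mult) >> shift and that the GPTIMER runs from the 32 kHz clock, which the mult/shift pair seems to confirm. The variable and function names are mine, chosen only for illustration.

```python
NSEC_PER_SEC = 1_000_000_000

# Values copied from the "Clock Event Device: gp timer" section of the dump.
mult, shift = 140737, 32
now_ns = 1724156066894
expires_next_ns = 1382835683593
tcrr = 0x00AABE07


def ns_to_ticks(ns):
    """Convert nanoseconds to timer ticks, clockevents style (assumed)."""
    return (ns * mult) >> shift


# Timer rate implied by mult/shift; comes out just below 32768 Hz.
rate_hz = mult * NSEC_PER_SEC / 2**shift

# How long the next hrtimer event has been overdue at the time of the dump.
overdue_ns = now_ns - expires_next_ns

# Time the counter has been running since it was last at zero,
# assuming a nominal 32768 Hz clock.
tcrr_seconds = tcrr / 32768

print(f"implied timer rate : {rate_hz:.3f} Hz")
print(f"event overdue by   : {overdue_ns / NSEC_PER_SEC:.2f} s")
print(f"TCRR counts for    : {tcrr_seconds:.2f} s")
```

The overdue value matches the "[in -341320383301 nsecs]" printed for timer #0, so the event is more than five minutes late. Under the 32768 Hz assumption, TCRR corresponds to roughly the same amount of time, which would fit a counter that wrapped around at about the moment the event was due without the interrupt being handled. That is only a reading of the numbers, not a confirmed diagnosis.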

With all the hangs I see that only TCRR differs; TMAT, TIER and TCLR are
always the same when hanging.

regards,

Koen
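The observation that only TCRR differs can be checked mechanically against the two hang dumps posted in this thread. A small Python sketch follows; the dictionary layout and names are illustrative, and the values are copied from Koen's and Dirk's dumps.

```python
# Register values at hang time, copied from the dumps in this thread.
dumps = {
    "koen": {
        "TCRR": 0x20A06241, "TMAT": 0xFFFFBFFF, "TISR": 0x00000000,
        "TIER": 0x00000003, "TCLR": 0x00000041, "TOCR": 0x00000000,
        "TOWR": 0x00000000,
    },
    "dirk": {
        "TCRR": 0x00AABE07, "TMAT": 0xFFFFBFFF, "TISR": 0x00000000,
        "TIER": 0x00000003, "TCLR": 0x00000041, "TOCR": 0x00000000,
        "TOWR": 0x00000000,
    },
}

# A register "varies" if the dumps do not all agree on its value.
register_names = sorted(next(iter(dumps.values())))
varying = [
    name for name in register_names
    if len({regs[name] for regs in dumps.values()}) > 1
]

print("registers that differ between hangs:", varying)
```

More dumps can be added to the dictionary as they come in; the comparison does not depend on how many there are.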

Yesterday, Philip had this with Koen's image:

*** GPTIMER missed match interrupt! last load: ffff76e1
------------[ cut here ]------------
WARNING: at arch/arm/mach-omap2/timer-gp.c:73 omap2_gp_timer_interrupt+0x44/0x78()
Modules linked in: pegasus ipv6
[<c0037b88>] (dump_stack+0x0/0x14) from [<c0053de0>] (warn_on_slowpath+0x4c/0x68)
[<c0053d94>] (warn_on_slowpath+0x0/0x68) from [<c003db74>] (omap2_gp_timer_interrupt+0x44/0x78)
  r6:00000000 r5:c04059f8 r4:00000002
[<c003db30>] (omap2_gp_timer_interrupt+0x0/0x78) from [<c0079dac>] (handle_IRQ_event+0x3c/0x74)
  r5:00000000 r4:c03ff300
[<c0079d70>] (handle_IRQ_event+0x0/0x74) from [<c007b67c>] (handle_level_irq+0xd4/0xf0)
  r7:00000ab6 r6:00000000 r5:00000025 r4:c04086ec
[<c007b5a8>] (handle_level_irq+0x0/0xf0) from [<c0033048>] (__exception_text_start+0x48/0x64)
  r5:c04086ec r4:00000025
[<c0033000>] (__exception_text_start+0x0/0x64) from [<c00336b0>] (__irq_svc+0x30/0x80)
Exception stack(0xc03fbf08 to 0xc03fbf50)
bf00: a0000013 e671ecb1 220cb6b1 00000000 198e134f a0000013
bf20: 00000000 00000ab6 19254d38 00000ab5 0000004a c03fbfa4 c03fbee8 c03fbf50
bf40: c0069908 c006fdbc 60000013 ffffffff

  r7:00000ab6 r6:00000000 r5:d8200000 r4:ffffffff
[<c006fa60>] (tick_nohz_stop_sched_tick+0x0/0x390) from [<c0034a78>] (cpu_idle+0x44/0x80)
[<c0034a34>] (cpu_idle+0x0/0x80) from [<c0318f20>] (rest_init+0x58/0x6c)
  r5:c042804c r4:c043fe8c
[<c0318ec8>] (rest_init+0x0/0x6c) from [<c0008b64>] (start_kernel+0x24c/0x2a4)
[<c0008918>] (start_kernel+0x0/0x2a4) from [<80008034>] (0x80008034)
---[ end trace 4118c6862fc8eec1 ]---

<snip>

Khasim reports that switching from 32k timer to MPU timer eliminates
the hang.

regards,

Koen

Update from Khasim about 32k timer weirdness:

http://www.beagleboard.org/irclogs/index.php?date=2008-08-05#T16:27:58

Dirk