Custom Board Kernel Boot - Get "Unable to handle kernel paging request at virtual..." Occasionally

Hi,

We just got our re-worked custom boards with the 5728 back from the fab house, and on one of the boards, on 1 out of 6 power-ups, I got “… Unable to handle kernel paging request at virtual address 00002248 … and at ffffffec.” I think this occurs while Debian is being mounted, since on a successful boot we normally see Debian messages after the “vpe: couldn’t get firmware” line. I’m not sure whether this is a kernel crash, a kernel panic, or a kernel “Oops.”

Following is the console log:

Starting kernel …

[ 0.064925] /cpus/cpu@0 missing clock-frequency property
[ 0.064945] /cpus/cpu@1 missing clock-frequency property
[ 1.233031] dra7-pcie 51000000.pcie_rc: link is not up
[ 1.599305] omap_hsmmc 480b4000.mmc: no pinctrl state for sdr25 mode
[ 1.605719] omap_hsmmc 480b4000.mmc: no pinctrl state for sdr12 mode
[ 1.701868] omap_voltage_late_init: Voltage driver support not added
[ 2.008531] rtc-ds1307 2-006f: hctosys: unable to read the hardware clock
Loading, please wait…
[ 2.792302] mmc2: error -22 whilst initialising SDIO card
[ 3.231127] remoteproc0: failed to load am57xx-pru1_0-fw
[ 3.242998] remoteproc0: request_firmware failed: -2
[ 3.248093] pru-rproc 4b234000.pru0: rproc_boot failed
[ 3.258661] remoteproc0: failed to load am57xx-pru1_1-fw
[ 3.264779] remoteproc0: request_firmware failed: -2
[ 3.269886] pru-rproc 4b238000.pru1: rproc_boot failed
[ 3.291919] remoteproc0: failed to load am57xx-pru2_0-fw
[ 3.299798] remoteproc0: request_firmware failed: -2
[ 3.304910] pru-rproc 4b2b4000.pru0: rproc_boot failed
[ 3.328397] remoteproc0: failed to load am57xx-pru2_1-fw
[ 3.339743] remoteproc0: request_firmware failed: -2
[ 3.344842] pru-rproc 4b2b8000.pru1: rproc_boot failed
rootfs: clean, 43747/232320 files, 390493/967040 blocks
[ 6.063694] remoteproc0: failed to load dra7-ipu1-fw.xem4
[ 6.069850] remoteproc1: failed to load dra7-ipu2-fw.xem4
[ 6.075464] remoteproc2: failed to load dra7-dsp1-fw.xe66
[ 6.081735] remoteproc3: failed to load dra7-dsp2-fw.xe66
[ 6.730836] pixcir_ts 4-005c: pixcir_set_power_mode: can’t read reg 0x33 : -121
[ 6.738239] pixcir_ts 4-005c: Failed to set IDLE mode
[ 7.021310] vpe 489d0000.vpe: couldn’t get firmware

************************************** LOOK HERE *********************************************************************************
[ 9.999478] Unable to handle kernel paging request at virtual address 00002248

[ 0.006772] pgd = c0004000

[ 10.009402] [00002248] *pgd=00000000

[ 10.013025] Internal error: Oops: 17 [#1] SMP ARM

[ 10.017747] Modules linked in: snd_soc_simple_card etnaviv snd_soc_omap_hdmi_audio ftdi_sio usbseris

[ 10.057143] CPU: 1 PID: 6 Comm: kworker/u4:0 Not tainted 4.4.110-ti-r142 #9
[ 10.064128] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 10.070251] Workqueue: events_unbound flush_to_ldisc
[ 10.075238] task: ee16a080 ti: ee188000 task.ti: ee188000

[ 10.080657] PC is at n_tty_receive_buf_common+0x84/0xa68

[ 10.085991] LR is at down_read+0x1c/0x4c
[ 10.089926] pc : [] lr : [] psr: 200f0013
[ 10.089926] sp : ee189e18 ip : ee189e00 fp : ee189e84
[ 10.101449] r10: ed4b7c00 r9 : ee03d000 r8 : ee03d014
[ 10.106691] r7 : ed4b7c00 r6 : ed4b7d84 r5 : ed4b7c80 r4 : c0af0c30
[ 10.113241] r3 : 00002000 r2 : 00000000 r1 : ee69e0a0 r0 : 00000000
[ 10.119793] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 10.126954] Control: 10c5387d Table: ad53c06a DAC: 00000051
[ 10.132720] Process kworker/u4:0 (pid: 6, stack limit = 0xee188218)
[ 10.139009] Stack: (0xee189e18 to 0xee18a000)
[ 10.143380] 9e00: ed4b7c80 ee189e28
[ 10.151589] 9e20: c00867a4 00020001 ed4b7d84 55555556 ee16a44c c1011a48 c0af0c30 c100d300
[ 10.159798] 9e40: 00002000 00000000 00000000 ee69e0a0 00000003 00000000 ee189e8c ee69e000
[ 10.168008] 9e60: 00000001 ee03d004 ed4b7c00 ee03d014 ee03d000 c0676d4c ee189e9c ee189e88
[ 10.184425] 9ea0: c0aac9a0 ee5b09c0 ee189ed4 ee03d004 ee0b5f80 ee03fc00 00000000 ee022b00

[ 10.192635] 9ec0: c10e5fd8 ee022b05 ee189f14 ee189ed8 c005fee4 c067a968 ee188000 c0061030
[ 10.200846] 9ee0: ee189efc 00000000 c0061030 ee0b5f80 ee0b5f98 ee03fc00 00000088 ee03fc14
[ 10.209055] 9f00: ee188000 ee03fc00 ee189f54 ee189f18 c0060284 c005fd9c c1033e04 c0d48010
[ 10.217264] 9f20: ee188000 c10e5bb9 00000000 00000000 ee0b7400 ee188000 ee0b5f80 c0060224
[ 10.225473] 9f40: 00000000 00000000 ee189fac ee189f58 c00664a8 c0060230 00000000 2dd1a000
[ 10.233683] 9f60: ee0b5f80 00000000 00000000 ee189f6c ee189f6c 00000000 00000000 ee189f7c
[ 10.241892] 9f80: ee189f7c dc8ba66e c00747b4 ee0b7400 c0066390 00000000 00000000 00000000
[ 10.250101] 9fa0: 00000000 ee189fb0 c0010ef8 c006639c 00000000 00000000 00000000 00000000
[ 10.258310] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 10.266518] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 08000000
[ 10.274732] [] (n_tty_receive_buf_common) from [] (n_tty_receive_buf2+0x24/0x2c)
[ 10.283903] [] (n_tty_receive_buf2) from [] (flush_to_ldisc+0xf0/0x13c)
[ 10.292290] [] (flush_to_ldisc) from [] (process_one_work+0x154/0x494)
[ 10.300588] [] (process_one_work) from [] (worker_thread+0x60/0x548)
[ 10.308712] [] (worker_thread) from [] (kthread+0x118/0x130)
[ 10.316140] [] (kthread) from [] (ret_from_fork+0x14/0x3c)
[ 10.323392] Code: e50b2030 e50b3034 eb10e2f3 e51b3044 (e5930248)
[ 10.329557] ---[ end trace 406ed7c803d6bcc7 ]---

[ 10.335007] Unable to handle kernel paging request at virtual address ffffffec

Somebody reported a similar issue here:

https://e2e.ti.com/support/arm/sitara_arm/f/791/t/657207.

In that E2E thread, TI suggested upgrading to a later SDK, but it wasn’t 100% clear to me whether the original poster actually tried upgrading. One of the TI employees suggested that upgrading from the 4.4.32 kernel to the 4.9.59 kernel could be related to the crash.

Our console image consists of the following:

debian@BeagleBoard-X15:/$ uname -r
4.4.110-ti-r142

debian@BeagleBoard-X15:/$ cat /etc/debian_version
8.10

debian@BeagleBoard-X15:/$ cat /etc/dogtag
BeagleBoard.org Debian Image 2018-01-01

I have some basic questions related to this issue:

  1. Have you seen this before on the BB-X15? Was it an intermittent problem? If you’ve seen this, what’s the cause and resolution?

  2. Does the kernel message, [ 10.017747] Modules linked in: snd_soc_simple_card etnaviv snd_soc_omap_hdmi_audio ftdi_sio usbseris,
    mean that the problem is contained or encountered in only these modules?

  3. Once my kernel crashes/panics/OOPS’s, is there a good way to get further information about the crash, post-mortem? For instance, does the kernel still update files in the filesystem which can be examined on the SD card after the crash?

Thanks in advance!

Jeff

I cross-posted in the TI E2E thread referenced above, since someone else encountered this issue.

One thing which comes to mind is, even though our custom board is based on the BB-X15/572xEVM, I have not yet “tuned”/re-computed the IO delays for u-boot-spl for our custom board. Rather, I’m re-using the default IO delays from the TI pinmux tool (or from the u-boot for the BB-X15).

The TI support engineer indicated that “weird things could happen, in certain cases if you don’t tune the IO delay values for your custom board.”

I wonder if this is the next best place to look?

Hi Jeff,

What EEPROM value did you program? And did you mirror the memory from
the x15 design?

Regards,

Hi Robert,

We’ve copied the 572x EVM, reva3 schematic as closely as possible, but then added some peripherals.

Our custom board uses four Micron MT41K256 512 MiB DDR3 chips. I believe this part is on Gerald’s BOM for the BB-X15.

Right now our EEPROM is blank. My strategy, for the time being, has been to ignore the blank EEPROM and hard-code the board-type test so that it always returns true for our custom board.

Then I added extra conditionals, where board type is tested, to all/most of the routines in board.c.

When the routines in board.c test positive for our custom board type, we “mostly” follow the same path that was followed for the BB-X15 in an “older version” of the 2017.01 u-boot/u-boot-spl. The major difference is we’re loading our own arrays for pad configuration and iodelay in recalibrate_iodelay in board.c. These arrays were obtained from the TI pinmux tool and pinmux design for our board.

I have not tuned the IOdelay values from the pinmux tool to account for any differences in timing between the address and data lines to the DDR3 on the BB-X15 vs. our custom board.

void emif_get_dmm_regs(const struct dmm_lisa_map_regs **dmm_lisa_regs)
{
	/* ... */

	/* Custom board re-uses the BB-X15 LISA map */
	if (board_is_am572x_custom())
		*dmm_lisa_regs = &beagle_x15_lisa_regs;

	/* ... */
}

Thanks!!

Hey Robert,

The EEPROM is mostly board ID, serial number, and Ethernet MAC addresses right? There isn’t anything which is DRAM-specific in EEPROM, right?

My understanding is the EEPROM contains the board ID which u-boot-spl and u-boot use to customize the configurations for each specific board… I understand that most of the board-specific changes are contained in the board.c file, but I’m wondering if there are other critical configurations in the /arch/arm/mach-omap2 directory. I also added some configurations for our custom board type to the Kconfig in the mach-omap2 directory. These look like menu options for a “make menuconfig” interface, like the kernel’s??

Have asked the TI folks on the above-referenced E2E post if there’s a DDR3 test tool which can be run for an extended test.

The guy on E2E who’s having a similar issue implied that his firmware engineers already tuned the IO delays. Maybe it’s something else??

I wonder who the people are who actually derive the IO delays (e.g. the GDELAY equations) and tune them for boards like the BB-X15… How does that process work?? Maybe that’s in TI’s SPRAC44 app note… I still wonder, how do you know when you’ve hit the sweet spot in terms of the optimum values??

This doesn’t appear to happen very frequently, but it’s probably prudent to get a handle on what makes it crop up…

Hi Jeff,

That's correct. The board ID from the EEPROM is used in the SPL to
select the matching DRAM configuration for the known board. In your
case you've hard-coded it to go down the x15 path. Since you mirrored
the x15 DRAM layout, this 'should' work.

Regards,

We got our magic numbers directly from TI.

Thus, I'm not sure how they were generated...

Regards,

Thanks Robert!

Will try to keep everyone updated on here or reference developments on the “parallel” E2E thread.

Regards, jeff

One of the guys on TI E2E, whose team is looking into this issue for their custom board, noted some things from their console logs. I’ve pasted that here. I’m not sure if this rings a bell for anyone or provides additional insight into DDR3 timing, etc.

"
.
.
My colleague went back through the logs (interns are great resources!). Here’s what he found. We seem to have two failure modes.

I went through the log files of 5 failures that display “Unable to handle kernel paging request” error message.

The system first displays the message “PC is at n_tty_receive_buf_common+0x7c/0xa60”. Sometimes, a few lines later in the same startup, it also displays “PC is at kthread_data+0x10/0x18”; other times it displays only the n_tty_receive_buf_common message.

Here is what I can see from the log files: sometimes the system displays “Unable to handle kernel paging request” at 00002248 and also at ffffffec. Whenever the address is 00002248, the PC is at n_tty_receive_buf_common, and when the address is ffffffec, the PC is at kthread_data.

The pgd is always pgd = c0003000 for both error messages.
*pgd differs between the two cases but is the same across all the failures.
*pgd=80000080004003
*pgd=80000080007003

.
.
.
"