Possible eMMC firmware bug or hw issue - recent Seeed Studio BBBs with 6.1.x kernels

Anyone seeing eMMC problems with recent Seeed Studio BeagleBone Blacks?

I have a very tentative hypothesis: BBBs running the 6.1.x kernel branch (6.1.38-bone23, to be specific) are triggering a firmware bug on Kingston MK2704 eMMCs with firmware revision 0x0100000000000000, which crashes the eMMC’s internal controller and sometimes bricks the device entirely (write failures).

I’ll be doing more testing in the coming days, but if this rings a bell with anyone, please say…

Shell commands to get eMMC version info:

for i in name manfid hwrev fwrev serial rev ; do echo -n "eMMC $i: " ; cat /sys/block/mmcblk1/device/$i ; done

Cheers,

Tim.

Still having more failures with these, and not getting very far with the investigations.

Our client still hasn’t seen any failures with the 4.19 kernel. However, most of the hardware running 4.19 so far is from an older production batch (albeit with the same eMMC firmware etc.), so the batch might be a factor too.

I’ve noted that the MK2704 (Kingston EMMC04G-MK27) parts are reporting fast wear rates in comparison to the previous M6704 (Kingston EMMC04G-M627), e.g. a life_time of 0x03 (which I believe is 20% - 30% of rated life used) after only a week or two of light use. In comparison, the M6704s all report 0x0 (which I believe means the estimate is not defined, rather than 0% - 10%; that range would be reported as 0x01).
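
In case it helps anyone reading their own values, here is a minimal sketch of decoding the two life_time values. It assumes the JEDEC eMMC 5.x encoding for the life-time estimate fields (0x00 not defined, 0x01 to 0x0A in 10% steps of life used, 0x0B exceeded), under which 0x03 reads as 20%-30% used. The function name decode_life_est is made up for illustration, and mmcblk1 is assumed to be the eMMC, as in the commands above.

```shell
#!/bin/sh
# Decode one eMMC life-time estimate value (e.g. 0x03) into a range.
# Assumes the JEDEC eMMC 5.x encoding; decode_life_est is a made-up name.
decode_life_est() {
    n=$(( $1 ))    # shell arithmetic accepts 0x.. hex constants
    if [ "$n" -eq 0 ]; then
        echo "not defined"
    elif [ "$n" -le 10 ]; then
        echo "$(( (n - 1) * 10 ))%-$(( n * 10 ))% used"
    elif [ "$n" -eq 11 ]; then
        echo "exceeded estimated life"
    else
        echo "reserved"
    fi
}

# The sysfs life_time attribute holds two values (type A and type B).
f=/sys/block/mmcblk1/device/life_time
if [ -r "$f" ]; then
    read -r est_a est_b < "$f"
    echo "life_time type A: $(decode_life_est "$est_a")"
    echo "life_time type B: $(decode_life_est "$est_b")"
fi
```

On a board without a readable life_time attribute the script prints nothing; the function can still be called by hand, e.g. decode_life_est 0x03.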

In case it’s useful for anyone, I’m using this one-liner to gather info:

( echo "BBB hostname $HOSTNAME" ; echo -n 'BBB version ' ;  hexdump -e '8/1 "%c"' /sys/bus/i2c/devices/0-0050/eeprom -s 12 -n 4 ; echo ; echo -n 'BBB serial ' ; hexdump -e '8/1 "%c"' /sys/bus/i2c/devices/0-0050/eeprom -s 16 -n 12 ; echo ;  for i in name date rev hwrev fwrev oemid manfid life_time serial ; do echo -n "eMMC $i " ; cat /sys/block/mmcblk1/device/$i ; done ) | column -t

I’ve got two of these now, mostly for CAN testing, but they are in my ci…

*************************************************
cat /tmp/eMMC.log
eMMC name: MK2704
eMMC date: 02/2023
eMMC rev: 0x7
eMMC hwrev: 0x0
eMMC fwrev: 0x0100000000000000
eMMC oemid: 0x0100
eMMC manfid: 0x000070
eMMC life_time: 0x01 0x01
eMMC serial: 0x5a13c9d5
*************************************************
cat /boot/uEnv.txt
uname_r=5.10.168-ti-r73
cmdline=coherent_pool=1M net.ifnames=0 rng_core.default_quality=100
enable_uboot_overlays=1
disable_uboot_overlay_video=1

Regards,

We saw the same problem when the Micron eMMC was changed to Kingston around 2015. We still have a big carton full of BBBs with bricked eMMCs. Have not had the problem lately. Running Linux 4.14.
/Sven

Thanks. Currently it could be either a bad batch or a kernel/firmware interaction.

My client has a few EMMC04G-MK27 in the field (although most in the past year or two are M627).

The MK27 in the field have been running 4.19.x (most recently 4.19.232-bone75 for the past few years).

New production has been on 6.1.38-bone23 since September, and the latest batch has been seeing high failure rates (~50%). However, one of the MK27s in the field was recently upgraded from 4.19.232-bone75 to 6.1.38-bone23, and it failed within a couple of weeks (first entry below).

These are about half of the total failures (and are the ones that I have in front of me):

$ egrep -h 'eMMC.*(date|serial)' bad-eMMCs/*
eMMC  date       10/2021             
eMMC  serial     0x524c0752          
eMMC  date       02/2023             
eMMC  serial     0x52d2768c          
eMMC  date       02/2023             
eMMC  serial     0x52125caa          
eMMC  date       02/2023             
eMMC  serial     0x5a13da1b          
eMMC  date       02/2023             
eMMC  serial     0x5a13d14a          
eMMC  date       02/2023             
eMMC  serial     0x52d2751b          
eMMC  date       02/2023             
eMMC  serial     0x5a926deb          
eMMC  date       02/2023             
eMMC  serial     0x5153c7b7          
eMMC  date       02/2023             
eMMC  serial     0x51d3c42a          
eMMC  date       02/2023             
eMMC  serial     0x51d3bd7a          
eMMC  date       02/2023             
eMMC  serial     0x51d3c4d7
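
For anyone collecting similar per-board logs, a rough sketch of tallying failures by eMMC manufacture date from that same output format. The function name tally_emmc_dates is made up, and it assumes each log has lines of the form "eMMC date MM/YYYY" (with or without a trailing colon on "date"), as in the one-liner output above.

```shell
#!/bin/sh
# Count "eMMC date" lines per manufacture date.
# Reads the files given as arguments, or stdin if there are none.
# tally_emmc_dates is a made-up helper name.
tally_emmc_dates() {
    awk '$1 == "eMMC" && $2 ~ /^date:?$/ { count[$3]++ }
         END { for (d in count) print count[d], d }' "$@" | sort -k2,2
}
```

Usage would be something like tally_emmc_dates bad-eMMCs/* ; for the eleven units listed above it should report 10 for 02/2023 and 1 for 10/2021.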

Sort of a one-month update… it’s been running 24/7…

*************************************************
cat /tmp/eMMC.txt
eMMC name: MK2704
eMMC date: 02/2023
eMMC rev: 0x7
eMMC hwrev: 0x0
eMMC fwrev: 0x0100000000000000
eMMC oemid: 0x0100
eMMC manfid: 0x000070
eMMC life_time: 0x02 0x01
eMMC serial: 0x5a13c9d5
*************************************************
cat /boot/uEnv.txt
uname_r=6.1.46-ti-r19
cmdline=coherent_pool=1M net.ifnames=0 rng_core.default_quality=100
enable_uboot_overlays=1
disable_uboot_overlay_video=1
uboot_overlay_pru=AM335X-PRU-UIO-00A0.dtbo
*************************************************
[39-am335x-bbb: 6.1.46-ti-r19 (up 1 hour, 48 minutes)]

reboot   system boot  6.1.46-ti-r19    Thu Jan 11 13:12   still running
reboot   system boot  6.1.46-ti-r19    Tue Jan  9 10:33 - 13:12 (2+02:38)
reboot   system boot  6.1.46-ti-r19    Tue Jan  9 09:55 - 10:09  (00:14)
reboot   system boot  6.1.46-ti-r19    Fri Jan  5 10:54 - 09:54 (3+23:00)
reboot   system boot  6.1.46-ti-r19    Fri Jan  5 10:24 - 10:54  (00:30)
reboot   system boot  6.1.46-ti-r19    Fri Jan  5 10:18 - 10:24  (00:05)
reboot   system boot  6.1.69-bone26    Fri Jan  5 10:13 - 10:18  (00:04)
reboot   system boot  6.2.16-bone17    Fri Jan  5 10:06 - 10:13  (00:07)
reboot   system boot  6.3.13-bone27    Fri Jan  5 10:02 - 10:06  (00:04)
reboot   system boot  6.4.16-bone19    Fri Jan  5 09:55 - 10:01  (00:05)
reboot   system boot  6.4.16-bone19    Fri Jan  5 09:54 - 09:55  (00:01)
reboot   system boot  6.4.16-bone19    Fri Jan  5 09:53 - 09:53  (00:00)
reboot   system boot  6.4.16-bone18    Thu Jan  4 11:52 - 09:52  (22:00)
reboot   system boot  6.4.16-bone18    Thu Jan  4 11:49 - 11:51  (00:02)
reboot   system boot  6.5.13-bone13    Wed Jan  3 14:37 - 10:50  (20:12)
reboot   system boot  6.5.13-bone13    Wed Jan  3 14:30 - 14:37  (00:06)
reboot   system boot  6.1.46-ti-r18    Fri Dec 29 15:48 - 14:30 (4+22:42)
reboot   system boot  5.10.168-ti-r75  Fri Dec 29 14:41 - 15:47  (01:06)
reboot   system boot  5.10.168-ti-r75  Fri Dec 29 14:37 - 14:40  (00:03)
reboot   system boot  5.10.168-ti-r75  Fri Dec 29 14:04 - 14:37  (00:33)
reboot   system boot  5.10.168-ti-r74  Sun Dec 17 15:53 - 14:03 (11+22:10)
reboot   system boot  5.10.168-ti-r73  Thu Nov  9 18:25 - 15:53 (37+21:27)

Regards,

Thanks Rob,

Our client returned 20 Beaglebones with failed eMMCs, and it sounds like it’s being investigated by Kingston. I’d love to know what the root cause is with this one. I’ll let you know if/when I get any more updates.

Best Regards,

Tim.

Do you know for a fact that the eMMC actually did fail? Smells like a pile of BS to get their money back, and you fell for it. If they had 20 failures, that means every BB shipped from that lot would be having an issue.

Might be their code is wearing it out, that could be the real cause.

Yes. Some within hours, but most within a month. The same build on all other eMMC makes and models has been fine for years. Logging and data recording go to SD card, so eMMC wear is not significant.

Could also be poor mfg practice, we have not had any trouble with the G&H boards…

Those might not even be legally branded BB products. AliExpress / Amazon / eBay are loaded with “dealers”. Unless you have a chain of custody that is spotless, it’s more than likely a reject or whatever, passed off as the “real thing” by online peddlers. So much of the stuff from Amazon and AliExpress is JUNK. They raid the trash dumpster, or the manufacturer peddles the rejects on the online marketplaces.

No, it’s a fun problem with Kingston EMMC04G-MK27 on boards produced by Seeed for BeagleBoard.org … A generic eMMC Linux enhancement in v6.1.x is doing something that conflicts with Kingston’s eMMC.

Right now the best option is to stay on v5.10.x till it’s fully resolved by Kingston (I’m also waiting while they test)…

The Seeed Studio BBBs were purchased through local distribution (Farnell/Element14).

Seeed Studio have been taking the issue seriously and handling it in a professional manner.

It’s worth remembering that a very large quantity of the world’s high end electronics is made in China. Whilst a lot of shoddy stuff is definitely also made there, countries which at times in the past had a reputation for cheaply produced, poor quality mass-manufactured goods include the US (19th century), Germany (19th c), Japan (early to mid 20th c), and S Korea (late 20th c). Over time, standards improve, laws are strengthened, and production of poor quality and counterfeit products moves elsewhere to newly industrialising countries.

As things currently stand, my experience has been that there are a wide range of companies and individuals. Some with good intent, others less so. Some highly competent, and some not. This has also been my experience with all of the other countries I’ve worked in, or with.

When I had a quick look into this a few months ago, I noticed a patch which had been contributed to the mainline kernel by a Samsung dev which tightened up some timings (polling, I think?) to improve eMMC performance. This was not a hardware-platform-specific change; it applied to the Linux mmc subsystem, which is used across all Linux platforms for eMMC and SD card access on all hardware controller types which allow direct mmc I/O (i.e. pretty much everything except USB storage adapters).

Was this the change that you were referring to? Depending on the outcome of Kingston’s investigations I suppose it might be worth looking to see if a quirk could be added to conditionally back this out when the impacted eMMCs are detected? IIRC the performance enhancement went in some time between 5.15 and 6.1.

Cheers,

Tim.

soo… when looking up where the mmc layer has been setting the quirks… i see…

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/core/quirks.h?id=f1738a1f816233e6dfc2407f24a31d596643fd90

mmc: core: disable TRIM on Kingston EMMC04G-M627
It seems that Kingston EMMC04G-M627 despite advertising TRIM support does
not work when the core is trying to use REQ_OP_WRITE_ZEROES.

We are seeing I/O errors in OpenWrt under 6.1 on Zyxel NBG7815 that we did
not previously have and tracked it down to REQ_OP_WRITE_ZEROES.

Trying to use fstrim seems to also throw errors like:
[93010.835112] I/O error, dev loop0, sector 16902 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2

Disabling TRIM makes the error go away, so lets add a quirk for this eMMC
to disable TRIM.

Wonder how close EMMC04G-MK27 and EMMC04G-M627 are?
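
While that question is open, one low-risk thing to look at on an affected board is whether the block layer is currently advertising discard and write-zeroes for the eMMC at all. This is only a sketch: it assumes the eMMC is mmcblk1, and the helper name queue_limit_status is made up. It reads the standard block-queue sysfs limits discard_max_bytes and write_zeroes_max_bytes, where 0 means the operation will not be issued to the device. It says nothing about whether the MK27 actually needs the same quirk as the M627.

```shell
#!/bin/sh
# Report whether a block-queue limit is zero (operation off) or not.
# $1: queue directory, $2: attribute name.
# queue_limit_status is a made-up helper name.
queue_limit_status() {
    f="$1/$2"
    if [ ! -r "$f" ]; then
        echo "$2: unknown (no $f)"
        return 0
    fi
    val=$(cat "$f")
    if [ "$val" -eq 0 ]; then
        echo "$2: off"
    else
        echo "$2: on (max $val bytes)"
    fi
}

# Assumes the eMMC is mmcblk1, as elsewhere in this thread.
q=/sys/block/mmcblk1/queue
queue_limit_status "$q" discard_max_bytes
queue_limit_status "$q" write_zeroes_max_bytes
```

Comparing the output on a 5.10.x boot and a 6.1.x boot of the same board might show whether the two kernels treat the part differently.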