Possible eMMC firmware bug or hw issue - recent Seeed Studio BBBs with 6.1.x kernels

TimSmall · November 28, 2023, 5:42pm

Anyone seeing eMMC problems with recent Seeed Studio BeagleBone Blacks?

I have a very tentative hypothesis that BBBs running the 6.1.x kernel branch (6.1.38-bone23, to be specific) are triggering a firmware bug on Kingston MK2704 eMMCs with firmware revision 0x0100000000000000, which makes the eMMC’s internal controller crash and sometimes bricks the eMMC entirely (write failures).

I’ll be doing more testing in the coming days, but if this rings a bell with anyone, please say…

Shell commands to get eMMC version info:

for i in name manfid hwrev fwrev serial rev ; do echo -n "eMMC $i: " ; cat /sys/block/mmcblk1/device/$i ; done

Cheers,

Tim.

TimSmall · December 5, 2023, 3:43pm

Still having more failures with these, and not getting very far with the investigations.

Our client still hasn’t seen any failures with the 4.19 kernel. However, until now most hardware running the 4.19 kernel has come from an older production batch (albeit with the same eMMC firmware etc.), so the batch might be relevant too.

I’ve noted that the MK2704 (Kingston EMMC04G-MK27) parts are reporting fast wear rates in comparison to the previous M6704 (Kingston EMMC04G-M627), e.g. 0x03 (which I believe is 20% - 30% of estimated life used in the JEDEC encoding) after only a week or two of light use. In comparison, the M6704s all report 0x0 (which I had read as 0% - 10%, although JEDEC lists 0x00 as "not defined" and 0x01 as 0% - 10%).

In case it’s useful for anyone, I’m using this one-liner to gather info:

( echo "BBB hostname $HOSTNAME" ; echo -n 'BBB version ' ;  hexdump -e '8/1 "%c"' /sys/bus/i2c/devices/0-0050/eeprom -s 12 -n 4 ; echo ; echo -n 'BBB serial ' ; hexdump -e '8/1 "%c"' /sys/bus/i2c/devices/0-0050/eeprom -s 16 -n 12 ; echo ;  for i in name date rev hwrev fwrev oemid manfid life_time serial ; do echo -n "eMMC $i " ; cat /sys/block/mmcblk1/device/$i ; done ) | column -t
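The life_time values printed by the one-liner can be turned into percentage bands. This is a minimal sketch, assuming the JEDEC eMMC 5.x encoding for the two life-time estimate bytes (0x00 not defined, 0x01 to 0x0A in 10% steps, 0x0B exceeded) and assuming the eMMC is mmcblk1 as above; the function name decode_life_time is made up for illustration.

```shell
#!/bin/sh
# Decode one eMMC life-time estimate byte into the band assumed from the
# JEDEC eMMC 5.x encoding: 0x01 = 0%-10% used ... 0x0A = 90%-100% used.
decode_life_time() {
    val=$(( $1 ))            # accepts hex input such as 0x03
    if [ "$val" -eq 0 ]; then
        echo "not defined"
    elif [ "$val" -ge 1 ] && [ "$val" -le 10 ]; then
        echo "$(( (val - 1) * 10 ))%-$(( val * 10 ))% used"
    elif [ "$val" -eq 11 ]; then
        echo "exceeded estimated life"
    else
        echo "reserved"
    fi
}

# The sysfs file holds two estimates (type A and type B) on one line.
LIFE_TIME_FILE=/sys/block/mmcblk1/device/life_time
if [ -r "$LIFE_TIME_FILE" ]; then
    read -r type_a type_b < "$LIFE_TIME_FILE"
    echo "type A: $(decode_life_time "$type_a")"
    echo "type B: $(decode_life_time "$type_b")"
fi
```

On a board where the file is missing (older kernels or eMMCs that do not report it), the guarded part simply prints nothing.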

RobertCNelson · December 6, 2023, 12:36am

I’ve got two of these now, mostly for CAN testing, but they are in my CI…

*************************************************
cat /tmp/eMMC.log
eMMC name: MK2704
eMMC date: 02/2023
eMMC rev: 0x7
eMMC hwrev: 0x0
eMMC fwrev: 0x0100000000000000
eMMC oemid: 0x0100
eMMC manfid: 0x000070
eMMC life_time: 0x01 0x01
eMMC serial: 0x5a13c9d5
*************************************************
cat /boot/uEnv.txt
uname_r=5.10.168-ti-r73
cmdline=coherent_pool=1M net.ifnames=0 rng_core.default_quality=100
enable_uboot_overlays=1
disable_uboot_overlay_video=1

Regards,

Sven_Norinder · December 6, 2023, 7:26am

We saw the same problem when the Micron eMMC was changed to Kingston around 2015. We still have a big carton full of BBBs with bricked eMMCs. We have not had the problem lately. Running Linux 4.14.
/Sven

TimSmall · December 6, 2023, 8:37am

Thanks. Currently it could be either a bad batch or a kernel/firmware interaction.

My client has a few EMMC04G-MK27 in the field (although most in the past year or two are M627).

The MK27 in the field have been running 4.19.x (most recently 4.19.232-bone75 for the past few years).

New production has been on 6.1.38-bone23 since September, and the latest batch has been seeing high failure rates (~50%). However, one of the MK27s in the field was recently upgraded from 4.19.232-bone75 to the 6.1.38-bone23 kernel, and it failed within a couple of weeks (first entry below):

These are about half of the total failures (and are the ones that I have in front of me):

$ egrep -h 'eMMC.*(date|serial)' bad-eMMCs/*
eMMC  date       10/2021             
eMMC  serial     0x524c0752          
eMMC  date       02/2023             
eMMC  serial     0x52d2768c          
eMMC  date       02/2023             
eMMC  serial     0x52125caa          
eMMC  date       02/2023             
eMMC  serial     0x5a13da1b          
eMMC  date       02/2023             
eMMC  serial     0x5a13d14a          
eMMC  date       02/2023             
eMMC  serial     0x52d2751b          
eMMC  date       02/2023             
eMMC  serial     0x5a926deb          
eMMC  date       02/2023             
eMMC  serial     0x5153c7b7          
eMMC  date       02/2023             
eMMC  serial     0x51d3c42a          
eMMC  date       02/2023             
eMMC  serial     0x51d3bd7a          
eMMC  date       02/2023             
eMMC  serial     0x51d3c4d7

RobertCNelson · January 11, 2024, 9:11pm

Sort of a one-month update… It’s been running 24/7…

*************************************************
cat /tmp/eMMC.txt
eMMC name: MK2704
eMMC date: 02/2023
eMMC rev: 0x7
eMMC hwrev: 0x0
eMMC fwrev: 0x0100000000000000
eMMC oemid: 0x0100
eMMC manfid: 0x000070
eMMC life_time: 0x02 0x01
eMMC serial: 0x5a13c9d5
*************************************************
cat /boot/uEnv.txt
uname_r=6.1.46-ti-r19
cmdline=coherent_pool=1M net.ifnames=0 rng_core.default_quality=100
enable_uboot_overlays=1
disable_uboot_overlay_video=1
uboot_overlay_pru=AM335X-PRU-UIO-00A0.dtbo
*************************************************

[39-am335x-bbb: 6.1.46-ti-r19 (up 1 hour, 48 minutes)]

reboot   system boot  6.1.46-ti-r19    Thu Jan 11 13:12   still running
reboot   system boot  6.1.46-ti-r19    Tue Jan  9 10:33 - 13:12 (2+02:38)
reboot   system boot  6.1.46-ti-r19    Tue Jan  9 09:55 - 10:09  (00:14)
reboot   system boot  6.1.46-ti-r19    Fri Jan  5 10:54 - 09:54 (3+23:00)
reboot   system boot  6.1.46-ti-r19    Fri Jan  5 10:24 - 10:54  (00:30)
reboot   system boot  6.1.46-ti-r19    Fri Jan  5 10:18 - 10:24  (00:05)
reboot   system boot  6.1.69-bone26    Fri Jan  5 10:13 - 10:18  (00:04)
reboot   system boot  6.2.16-bone17    Fri Jan  5 10:06 - 10:13  (00:07)
reboot   system boot  6.3.13-bone27    Fri Jan  5 10:02 - 10:06  (00:04)
reboot   system boot  6.4.16-bone19    Fri Jan  5 09:55 - 10:01  (00:05)
reboot   system boot  6.4.16-bone19    Fri Jan  5 09:54 - 09:55  (00:01)
reboot   system boot  6.4.16-bone19    Fri Jan  5 09:53 - 09:53  (00:00)
reboot   system boot  6.4.16-bone18    Thu Jan  4 11:52 - 09:52  (22:00)
reboot   system boot  6.4.16-bone18    Thu Jan  4 11:49 - 11:51  (00:02)
reboot   system boot  6.5.13-bone13    Wed Jan  3 14:37 - 10:50  (20:12)
reboot   system boot  6.5.13-bone13    Wed Jan  3 14:30 - 14:37  (00:06)
reboot   system boot  6.1.46-ti-r18    Fri Dec 29 15:48 - 14:30 (4+22:42)
reboot   system boot  5.10.168-ti-r75  Fri Dec 29 14:41 - 15:47  (01:06)
reboot   system boot  5.10.168-ti-r75  Fri Dec 29 14:37 - 14:40  (00:03)
reboot   system boot  5.10.168-ti-r75  Fri Dec 29 14:04 - 14:37  (00:33)
reboot   system boot  5.10.168-ti-r74  Sun Dec 17 15:53 - 14:03 (11+22:10)
reboot   system boot  5.10.168-ti-r73  Thu Nov  9 18:25 - 15:53 (37+21:27)
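
Readings like the life_time lines in these reports are easier to compare across weeks if they are logged with a timestamp and the running kernel. A rough sketch, assuming the eMMC is mmcblk1; the function name log_life_time and the log path are hypothetical, and the log should live somewhere persistent that is not on the eMMC under test.

```shell
#!/bin/sh
# Append one line "UTC-timestamp kernel-release life_time" to a log file
# so that wear readings can be compared over time.
log_life_time() {
    life_time_file=$1        # e.g. /sys/block/mmcblk1/device/life_time
    log_file=$2              # hypothetical log location
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" \
        "$(cat "$life_time_file")" >> "$log_file"
}

# Only log when the attribute is actually present on this system.
if [ -r /sys/block/mmcblk1/device/life_time ]; then
    log_life_time /sys/block/mmcblk1/device/life_time /tmp/eMMC-wear.log
fi
```

Running this from a daily cron job or systemd timer would give one line per day; /tmp is used here only as a placeholder and is normally cleared on reboot.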

Regards,

TimSmall · March 25, 2024, 10:29am

Thanks Rob,

Our client returned 20 Beaglebones with failed eMMCs, and it sounds like it’s being investigated by Kingston. I’d love to know what the root cause is with this one. I’ll let you know if/when I get any more updates.

Best Regards,

Tim.

foxsquirrel · March 25, 2024, 11:32am

Do you know for a fact that the eMMC actually did fail? Smells like a pile of BS to get their money back and you fell for it. If they had 20 failures, that means every BB shipped from that lot would be having an issue.

Might be their code is wearing it out, that could be the real cause.

TimSmall · March 25, 2024, 11:49am

Yes. Some failed within hours, but most within a month. The same build on all other eMMC makes and models has been fine for years. Logging and data recording go to the SD card, so eMMC wear is not significant.

foxsquirrel · March 25, 2024, 2:44pm

Could also be poor mfg practice, we have not had any trouble with the G&H boards…

Those might not even be legally branded BB products. AliExpress / Amazon / eBay are loaded with “dealers”. Unless you have a chain of custody that is spotless, it’s more than likely a reject or whatever, passed off as the “real thing” by online peddlers. So much of the stuff from Amazon and AliExpress is JUNK. They raid the trash dumpster, or the mfg peddles the rejects on the online marketplaces.

RobertCNelson · March 25, 2024, 5:33pm

No it’s a fun problem with Kingston EMMC04G-MK27 on boards produced by Seeed for BeagleBoard.org … A generic eMMC Linux enhancement in v6.1.x has done something that is conflicting with Kingston’s eMMC.

Right now the best option is to stay on v5.10.x until it’s fully resolved by Kingston (I’m also waiting while they test)…

TimSmall · March 25, 2024, 6:24pm

The Seeed Studio BBBs were purchased through local distribution (Farnell/Element14).

Seeed Studio have been taking the issue seriously and handling it in a professional manner.

It’s worth remembering that a very large quantity of the world’s high-end electronics is made in China. Whilst a lot of shoddy stuff is definitely also made there, at times in the past countries which had a reputation for cheaply produced, poor-quality mass-manufactured goods included the US (19th c), Germany (19th c), Japan (early to mid 20th c), and S Korea (late 20th c). Over time, standards improve, laws are strengthened, and production of poor-quality and counterfeit products moves elsewhere to newly industrialising countries.

As things currently stand, my experience has been that there are a wide range of companies and individuals. Some with good intent, others less so. Some highly competent, and some not. This has also been my experience with all of the other countries I’ve worked in, or with.

TimSmall · March 26, 2024, 8:57am

When I had a quick look into this a few months ago, I noticed a patch which had been contributed to the mainline kernel by a Samsung dev which tightened up some (polling, I think?) timings to improve eMMC performance. This was not a hardware-platform-specific change, but applied to the Linux mmc subsystem, which is used across all Linux platforms for eMMC and SD card access for all hardware controller types which allow direct mmc I/O (i.e. pretty much everything except USB storage adapters).

Was this the change that you were referring to? Depending on the outcome of Kingston’s investigations I suppose it might be worth looking to see if a quirk could be added to conditionally back this out when the impacted eMMCs are detected? IIRC the performance enhancement went in some time between 5.15 and 6.1.

Cheers,

Tim.

RobertCNelson · March 26, 2024, 1:24pm

soo… when looking up where the mmc layer has been setting the quirks… i see…

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/core/quirks.h?id=f1738a1f816233e6dfc2407f24a31d596643fd90

mmc: core: disable TRIM on Kingston EMMC04G-M627
It seems that Kingston EMMC04G-M627 despite advertising TRIM support does
not work when the core is trying to use REQ_OP_WRITE_ZEROES.

We are seeing I/O errors in OpenWrt under 6.1 on Zyxel NBG7815 that we did
not previously have and tracked it down to REQ_OP_WRITE_ZEROES.

Trying to use fstrim seems to also throw errors like:
[93010.835112] I/O error, dev loop0, sector 16902 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2

Disabling TRIM makes the error go away, so lets add a quirk for this eMMC
to disable TRIM.

Wonder how close EMMC04G-MK27 and EMMC04G-M627 are?
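
One way to see what the kernel currently offers for a given device is to read the block-layer limits in sysfs. A minimal sketch, assuming the eMMC is mmcblk1; report_discard is a made-up helper name, and the values a quirked or unquirked MK27 would show have not been checked here.

```shell
#!/bin/sh
# Print the discard (TRIM) and write-zeroes limits the block layer exposes
# for one device. A value of 0 means the operation is not offered.
report_discard() {
    queue_dir=$1             # e.g. /sys/block/mmcblk1/queue
    for attr in discard_max_bytes discard_granularity write_zeroes_max_bytes; do
        if [ -r "$queue_dir/$attr" ]; then
            echo "$attr $(cat "$queue_dir/$attr")"
        else
            echo "$attr n/a"
        fi
    done
}

# Only query the real device when it exists on this system.
if [ -d /sys/block/mmcblk1/queue ]; then
    report_discard /sys/block/mmcblk1/queue
fi
```

Comparing this output between a v5.10.x and a v6.1.x boot on the same board might show whether the two kernels treat the part differently.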

ppBBB · July 6, 2024, 5:21pm

Hi all,
I suspect I have the same issue.
BBBs with the Debian 12 minimal image, kernel 6.1.
BBBs bought from Digi-Key.
4 BBBs seem to have damaged eMMCs after approx. 3 months.
When trying to reboot, the system hangs.
Booting from SD card and using beagle-flasher stops with I/O errors.
BBB hostname bb02
BBB version 00C0
BBB serial 2241SBB10782
eMMC name MK2704
eMMC date 07/2022
eMMC rev 0x7
eMMC hwrev 0x0
eMMC fwrev 0x0100000000000000
eMMC oemid 0x0100
eMMC manfid 0x000070
eMMC life_time 0x01 0x01
eMMC pre_eol_info 0x01
eMMC serial 0x52a3b992

At the moment the systems are running from SD card (for 2 days so far) and seem to work.
Do you expect the same issues, with the SD cards failing too?
Should I downgrade to kernel 5.10 or go back to Debian 11.8?
Any experience?

RobertCNelson · July 6, 2024, 5:22pm

Yes, stay on v5.10.x; don’t run anything on the eMMC with v6.1.x… (I’m still wearing out eMMCs in my long-term tests of the mmc quirk patches I’m testing)…
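
A quick guard that a deployment script could run before touching the eMMC, sketched under the assumption that "v5.10.x" means a release string starting with 5.10. as reported by uname -r; is_5_10_kernel is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed when a kernel release string belongs to the 5.10.x series.
is_5_10_kernel() {
    case $1 in
        5.10.*) return 0 ;;
        *)      return 1 ;;
    esac
}

release=$(uname -r)
if is_5_10_kernel "$release"; then
    echo "kernel $release: on the 5.10.x series"
else
    echo "kernel $release: not 5.10.x, avoid using the eMMC for now"
fi
```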

Regards,

ppBBB · July 6, 2024, 5:24pm

Thank you for your fast answer.
Is it an issue with the eMMC itself or with a bug in newer kernels?

RobertCNelson · July 6, 2024, 5:26pm

Newer 6.1.x-era kernels have newer sdhci/eMMC optimizations… The older eMMCs used on the BBB have proper quirks in mainline, but the newer MK2704 is not responding to the existing quirks the same way previous generations did… I lose the drive in a few weeks… So I’m testing newer disable options: run for a few weeks, then either lose the eMMC (and need another BBB) or start another test…

Regards,

ppBBB · July 6, 2024, 5:42pm

Ok.
Does that mean the BBBs are bricked, or are only the eMMCs killed, with no way to reactivate them?
Do you think the BBBs are still OK when running from SD card?
Are you expecting the same issue with SD cards?
Is it possible to get info about the state of the eMMC? The above feedback from the eMMC shows no issues (eMMC life_time 0x01 0x01, eMMC pre_eol_info 0x01).
Do you expect any issues with kernel 5.15 and Debian 11.x?
Do you think we have to replace all BBBs to prevent any further issues?

RobertCNelson · July 6, 2024, 5:46pm

And after a few weeks, those show long-term wear values. In our standard image (ext4), they are wearing out way too fast… Weeks instead of years…

The fix is to get the correct set of mmc quirks such that the kernel will stop wearing them out prematurely. If you must run later kernels, make sure the eMMC doesn’t have an active partition, or at least that it isn’t mounted or used.
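
Whether anything on the eMMC is currently mounted can be checked from the mounts table. A small sketch, assuming the eMMC is mmcblk1 and that mounts appear under their /dev/mmcblk1* names; a root filesystem mounted directly by the kernel may show up as /dev/root instead, which this would miss. The helper name mounted_from is made up.

```shell
#!/bin/sh
# List "device mountpoint" pairs for a block device and its partitions,
# reading a mounts table in /proc/mounts format.
mounted_from() {
    dev=$1                           # e.g. mmcblk1
    mounts_file=${2:-/proc/mounts}   # overridable for testing
    awk -v dev="/dev/$dev" 'index($1, dev) == 1 { print $1, $2 }' "$mounts_file"
}

if [ -r /proc/mounts ]; then
    in_use=$(mounted_from mmcblk1)
    if [ -n "$in_use" ]; then
        echo "eMMC is mounted:"
        echo "$in_use"
    else
        echo "no mmcblk1 mounts found"
    fi
fi
```

The prefix match is deliberately loose, so it also reports the mmcblk1boot0/boot1 devices if they were ever mounted.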

Regards,