init-eMMC-flasher mmcqd allocation failures

Using a script derived from init-eMMC-flasher-v3.sh to rsync from a loopback-mounted file system to a newly created ext4 file system on the eMMC, I periodically (in roughly 5% of attempts) get the failures below. This has been replicated with kernels 4.1.16-ti-rt-r44 and 4.4.9-ti-r25 (both from pre-built console images) and on multiple BeagleBone Black revision C devices. The effect is that some files on the resulting partition contain nulls where there should be data.

Note that this is writing to the eMMC, not to an SD card, and the script process is running as init. (The source image is sometimes on an SD card, and sometimes on an NFS-mounted root partition.)

The behavior and context are very much like the following topic, but the kernel is much more recent and the device is fed by a 5VDC 10A supply, so voltage sags should not be the cause.

https://groups.google.com/d/topic/beagleboard/C__ixv3bBBI/discussion

Can anybody provide a clue as to why this might happen, and whether something can be done to prevent it?

Thanks.

Peter

Captured diagnostics over serial console running 4.4.9-ti-r25:

No idea what the problem might be, but I’d probably try dd instead.

Thanks; that’s the approach I’m going with.

For subject completeness . . . You could even use dd to stuff the bootloaders into the MBR, and then tar to extract over a rootfs. But, this is slightly more complex. Personally I tend to like the less complex options. dd also has the advantage of a 1:1 bit copy where tar . . . yeah not so sure about that.

Right. . . using tar would also require using fdisk, or other suitable partitioning tool.

My hypothesis is that the underlying cause is a true failure that’s tickled by running rsync in the shell script init process that provisions the device. The images I’m installing have a huge number of very small files, and a naive interpretation of the kernel messages suggests the kernel can’t handle the allocation associated with a new file without releasing memory that it can’t release at that point. Either the failure isn’t communicated back to rsync, or rsync ignores it and just leaves the file contents nulled. Detecting the failure after provisioning completes is difficult.
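One way the nulled files could be detected after the fact is to compare every source file against its copy on the eMMC once rsync has finished. This is only a sketch under assumptions: the function name verify_tree and the mount points in the usage comment are hypothetical, it is not part of init-eMMC-flasher-v3.sh, and file names containing newlines are not handled.

```sh
#!/bin/sh
# verify_tree SRC DST
# Compare every regular file under SRC with the file at the same relative
# path under DST, byte for byte. Prints one line per mismatch (or missing
# file) and returns non-zero if any were found.
verify_tree() {
    src=$1
    dst=$2
    status=0
    # Scratch file holding the relative paths of all source files.
    list=$(mktemp) || return 2
    (cd "$src" && find . -type f) > "$list"
    while IFS= read -r rel; do
        # cmp -s is silent; it exits non-zero on a difference or a missing file.
        if ! cmp -s "$src/$rel" "$dst/$rel"; then
            echo "MISMATCH: $rel"
            status=1
        fi
    done < "$list"
    rm -f "$list"
    return $status
}

# Hypothetical usage (mount points are assumptions):
#   verify_tree /mnt/source-image /mnt/emmc-rootfs || echo "flash failed" >&2
```

For the comparison to mean much, the destination probably needs to be re-read from the eMMC rather than served from the page cache, for example by unmounting and remounting it before the check.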

Since I’m currently using pre-built Debian file systems/kernels/packages I can’t try to isolate and fix the true issue, so my hope is that moving to an installation technique that avoids the write-many-small-files behavior will sidestep the issue until I have a chance to switch to Yocto and get better control over what the system does. If the problem recurs, I still have the option of running md5sum over the entire partition to see whether a given dd invocation was successful, something I couldn’t do with tar or rsync.
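The md5sum check on a dd-written partition could look roughly like the following. It is a sketch, not a tested provisioning step: the function name verify_image and the device path in the usage comment are hypothetical, and it assumes the image is no larger than the target, so only the first image-sized run of bytes on the target is hashed.

```sh
#!/bin/sh
# verify_image IMAGE TARGET
# Succeeds only if the MD5 of IMAGE equals the MD5 of the first
# size-of-IMAGE bytes of TARGET (a block device or a regular file).
verify_image() {
    image=$1
    target=$2
    # Byte count of the image; tr strips any padding some wc builds emit.
    size=$(wc -c < "$image" | tr -d ' ') || return 2
    want=$(md5sum < "$image" | cut -d' ' -f1)
    # Hash only as many bytes of the target as the image contains.
    got=$(head -c "$size" "$target" | md5sum | cut -d' ' -f1)
    [ "$want" = "$got" ]
}

# Hypothetical usage (image name and device path are assumptions):
#   dd if=rootfs.img of=/dev/mmcblk1p1 bs=1M conv=fsync
#   verify_image rootfs.img /dev/mmcblk1p1 || echo "dd verification failed" >&2
```

As with any read-back check, the result is only meaningful if the data actually comes from the eMMC and not from cached pages.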

Peter

I’ve also run into this once during an apt-get upgrade, leaving the system in a pretty hosed state. I managed to recover it but it required a lot of effort. Although evidently rare, this is clearly a very serious issue.

The direct cause is edma_prep_slave_sg() failing to allocate memory for the struct edma_desc. I don’t know whether the kernel is genuinely out of memory or simply cannot free memory up immediately (the allocation is done with GFP_ATOMIC), and whether this is because memory is filled with a backlog of writes or because a leak of some sort is going on.

Instead of deferring until memory is available, or even just proceeding without DMA, the omap_hsmmc driver immediately fails the request with an error, thus pretty much guaranteeing loss of data or even filesystem corruption. I personally think this is completely unacceptable behaviour of a block driver.
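Until the driver behaviour changes, a provisioning script could at least refuse to declare success when the kernel log shows an allocation failure. The sketch below reads log text on stdin and fails if a pattern is found. The function name kernel_log_clean is hypothetical, and the pattern string is an assumption about how the kernel words these warnings; it should be adjusted to match the messages actually seen on the console.

```sh
#!/bin/sh
# ASSUMPTION: illustrative pattern only; replace it with text taken from
# the real console output of a failed run.
ALLOC_FAIL_PATTERN='page allocation failure'

# kernel_log_clean
# Reads kernel log text on stdin. Returns 0 if no line matches
# ALLOC_FAIL_PATTERN, 1 if at least one line does.
kernel_log_clean() {
    if grep -q -e "$ALLOC_FAIL_PATTERN"; then
        return 1
    fi
    return 0
}

# Hypothetical usage at the end of a flashing script:
#   dmesg | kernel_log_clean || echo "allocation failure seen during flashing" >&2
```

This only catches failures the kernel reports; it is a guard on top of, not a replacement for, verifying the written data.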

I’ve been meaning to pursue this matter on the linux-mmc and/or linux-omap lists, but since it was a single isolated incident and I have lots of other stuff to do, I haven’t been able to find the time and motivation yet.

Matthijs

I agree with you 100%. Completely irresponsible and unacceptable. The problem is . . . ‘open source’, or in this case I think the term ‘open sores’ applies.

No idea if either of you ‘flash’ many devices at once. But it would be good to be 100% sure where, and why this issue occurs.

I have not experienced this issue at all. But quite honestly, I think I’ve flashed the eMMC of only 3 different boards, a few times each. In each case rsync worked flawlessly, and the whole copy operation was done in well under 2 minutes.

The image I’ve used: BBB-blank-debian-8.5-console-armhf-2016-06-19-2gb.img.xz

Powered by USB even, but from a known-good, solid USB 3.0 port. When I check the PMIC registers, it says it is set to 1300mA. So I have to assume that helps.

I do not know what else to add for the procedure I use. I use . . .

william@eee-pc:~$ uname -r
3.2.0-4-686-pae

to dd the image above to an sdcard. Then it has just been a simple matter of booting the given board with the sdcard inserted . . .

But anyway, I would not rule out power related issues in any case. It’s not looking likely, but not impossible.

This is very useful information; thank you.

At this time I dd the entire image at provisioning time and run the upgradable application from a loop-mounted image, so no update path involves writing many small files. However, I intend to transition very soon to Yocto, with package updates that are likely to make the problem visible again. At that time I’ll actually be able to control the kernel configuration and source better than under Debian, so if the problem is still present I’ll see if I can propose a fix.

Peter