NAND and ECC question (first post)

Hi,

My name is Tom Pickard and I'm working on using the Beagle Board to
run some SDR (not GNURadio) applications. I've been experimenting with
running Angstrom from an SD card, but now I'm interested in moving all
the code into NAND since we aren't too hot on using SD for our system.

Following the guide at http://elinux.org/BeagleBoardNAND , I loaded
x-loader, u-boot and the kernel into NAND from the u-boot prompt. I am
still working on getting the file system right, but what I'm wondering
is how the ECC stuff works with NAND. I know that you specify HW/SW
ECC when writing the NAND from u-boot, but what about when writing
from within Linux?

How robust is the HW/SW ECC?

Thanks for your help.

pickard@gmail.com said the following on 03/03/2009 06:31 PM:

> Hi,
>
> My name is Tom Pickard and I'm working on using the Beagle Board to
> run some SDR (not GNURadio) applications. I've been experimenting with
> running Angstrom from an SD card, but now I'm interested in moving all
> the code into NAND since we aren't too hot on using SD for our system.
>
> Following the guide at http://elinux.org/BeagleBoardNAND , I loaded
> x-loader, u-boot and the kernel into NAND from the u-boot prompt. I am
> still working on getting the file system right, but what I'm wondering
> is how the ECC stuff works with NAND. I know that you specify HW/SW
> ECC when writing the NAND from u-boot, but what about when writing
> from within Linux?
>
> How robust is the HW/SW ECC?
>
> Thanks for your help.

Let's get some concepts straight:
a) What is ECC? A Hamming code: it can detect up to 2-bit errors and
correct up to 1-bit errors.
b) On NAND flash, the geometry is as follows:
1 device is divided into n blocks,
1 block is divided into m pages,
and 1 page is divided into a data area and a spare area (a.k.a. out-of-band
or OOB area).
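To make the geometry concrete, here is a toy C sketch with illustrative numbers for a typical 128MB large-page SLC part; the constants are assumptions for illustration, not taken from any particular datasheet:

```c
#include <stdint.h>

/* Illustrative geometry only -- resembles a common 128MB large-page
 * SLC NAND, but check your part's datasheet for real numbers. */
enum {
    PAGE_DATA       = 2048,  /* data area per page, in bytes */
    PAGE_OOB        = 64,    /* spare/OOB area per page, in bytes */
    PAGES_PER_BLOCK = 64,
    BLOCKS          = 1024,
};

/* Total data capacity of the device. */
static uint64_t nand_data_bytes(void)
{
    return (uint64_t)BLOCKS * PAGES_PER_BLOCK * PAGE_DATA;
}

/* Total OOB capacity (shared by ECC bytes, bad-block markers, etc.). */
static uint64_t nand_oob_bytes(void)
{
    return (uint64_t)BLOCKS * PAGES_PER_BLOCK * PAGE_OOB;
}
```

With these assumed constants the device holds 128 MiB of data plus 4 MiB of OOB spread across the pages.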

Now, for a 2048-byte data area on NAND:
s/w ECC as implemented in Linux and u-boot works on 256-byte chunks. This
essentially means we have 1-bit correction and 2-bit detection for every
256 bytes of data.
h/w ECC as implemented in the GPMC works on 512-byte chunks: this means 1-bit
correction and 2-bit detection for every 512 bytes of data.

So, for 2048 bytes of data, s/w ECC can correct up to 8 bit errors and
detect up to 16 (one or two per 256-byte chunk), while h/w ECC does half
as much.
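The 1-bit-correct / 2-bit-detect behaviour per chunk can be sketched with a toy extended Hamming code in C. This is a simplified illustration, not the actual Linux `nand_ecc.c` algorithm (which packs its parity into 3 bytes per 256-byte chunk); here the check value is simply the XOR of the 1-based positions of all set bits plus an overall parity bit:

```c
#include <stdint.h>

/* Toy extended-Hamming ECC over one chunk. `pos` is the XOR of the
 * 1-based bit positions of all set bits; `parity` is the overall
 * parity.  Together they give single-bit correction and double-bit
 * detection, the same property as the 256-byte s/w ECC above. */
struct toy_ecc { uint16_t pos; uint8_t parity; };

static struct toy_ecc toy_ecc_calc(const uint8_t *data, int len)
{
    struct toy_ecc e = { 0, 0 };
    for (int i = 0; i < len; i++)
        for (int b = 0; b < 8; b++)
            if (data[i] & (1u << b)) {
                e.pos ^= (uint16_t)(i * 8 + b + 1); /* 1-based position */
                e.parity ^= 1;
            }
    return e;
}

/* Returns 0 if clean, 1 if a single-bit error was corrected,
 * -1 if an uncorrectable (double-bit) error was detected. */
static int toy_ecc_correct(uint8_t *data, int len, struct toy_ecc stored)
{
    struct toy_ecc now = toy_ecc_calc(data, len);
    uint16_t syn = stored.pos ^ now.pos;
    int pflip = stored.parity ^ now.parity;

    if (!syn && !pflip)
        return 0;                    /* no error */
    if (syn && pflip) {              /* single-bit error in the data */
        int bit = syn - 1;           /* back to a 0-based position */
        data[bit / 8] ^= (uint8_t)(1u << (bit % 8));
        return 1;
    }
    if (!syn && pflip)
        return 1;                    /* error in the stored parity itself */
    return -1;                       /* >= 2 bit errors: detect only */
}
```

Running this over eight independent 256-byte chunks is what gives the "8 correctable / 16 detectable bits per 2048-byte page" arithmetic above.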

Robustness-wise, s/w ECC is twice as good as h/w ECC. Now to the catch:
a) performance: s/w ECC is essentially a CPU-intensive operation, while
the GPMC in h/w ECC mode does it on the fly;
b) size: the number of bytes required to store the ECC data for s/w ECC
is double the number required for h/w ECC (e.g. at 3 ECC bytes per chunk,
a 2048-byte page needs 24 ECC bytes with 256-byte s/w ECC versus 12 with
512-byte h/w ECC, both taken out of a typical 64-byte OOB).

So, for a given application, you'd need to choose the correct balance.

Hope this helped..

Regards,
Nishanth Menon

Not fully correct.
Basically it boils down to what ECC algorithm you use.
A hardware implementation could be just as good as the standard Linux s/w
one (or even better).
However, if you use a better algorithm you will typically need more
bytes to store the ECC data (and the amount of space available is limited).
As long as the ECC data fits in the room that is available in the OOB
area of the NAND (together with the other uses of that area), things
are fine.
Using only half of that room has no additional benefit; you cannot store
more data there anyway.

Frans (who spent quite some time optimizing the kernel ECC
implementation).

> Basically it boils down to what ecc algorithm you use.
> A hardware implementation could be just as good as the std linux sw
> one (or even better)

One more additional thingy -> the OMAP GPMC doesn't just have h/w ECC,
which essentially generates the ECC values while you are writing the
data; it also has a prefetch engine, so your ARM's GPMC access is not
blocked while you are writing. Dump your data as fast as possible
to the NAND, come read the ECC registers once you are done, and voila,
the ECC values are ready for you ;). I'd consider the h/w ECC provided by
the GPMC "better" in terms of performance than s/w ECC any day. Of course I
don't have any such experience with other h/w ECC engines, but the GPMC
sure seems to do it in a pretty smart manner to me :D.

> However if you use a better algorithm you will typically need more
> bytes to store the data (and the amount available is limited).
> As long as the ecc data fits in the room that is available in the OOB
> area of the NAND (together with the other uses of that area) things
> are fine.
> Using half of it has no additional benefit. You cannot store more data
> or so.

That does remind me to ask: considering u-boot and the Linux kernel today,
in the context of the OMAP3530 GPMC and a large-page NAND (2048-byte data
area):
a) does anyone have / care to share statistics on bit failure rate per
million accesses?
b) has anyone done any performance measurements?
That could help us quantify the probability of failure and the interest in
using 256-byte ECC.

> Frans (who spend quite some time optimizing the kernel ecc
> implementation).

:) :D
Regards,
Nishanth Menon (who has no certificates ;) )

I didn't do measurements, and vendors are also not very good at
providing such data.
I guess it depends on the actual process and will vary from supplier
to supplier, and to some extent even from chip to chip.
The only thing I know is that MLC NAND is much worse than SLC NAND
(MLC encodes more bits in one cell by using different levels).

Frans.

With the GPMC HW ECC algorithm, I know real-world production numbers telling
about ~1-2 of 100,000 devices containing more bad bits than correctable by
the OMAP HW ECC scheme just after soldering, at first SW download... The
number was comparable between two different flash vendors and for 128MB
devices... As seen, the problem isn't huge, although it exists and shouldn't
be ignored. What this number doesn't reveal is the number of bit errors due
to aging, reads and writes occurring afterwards...

With MLC the topic is completely different, but the ECC algorithm used there
is likewise able to correct more bit errors. Normally you use a
Bose-Chaudhuri-Hocquenghem (BCH) code capable of correcting several bits
(5 AFAIK) and detecting even more...

More information about the different ECC algorithms and schemes can be found
in the TRM (struf98b.pdf), chapter "11.1.5.14.3 ECC Calculator".

Best regards
  Søren (who has spent several months fighting all kinds of NAND flashes :-))

> With the GPMC HW ECC algorithm I know real world production numbers telling
> about ~1-2 of 100.000 devices containing more bad bits than correctable by
> the OMAP HW ECC scheme just after soldering at first SW download... The
> number was comparable between two different flash vendors and for 128MB
> devices... As seen the problem isn't huge, although it exists and shouldn't
> be ignored. What this number doesn't reveal is the number of bit-errors due
> to aging, reads and writes occurring afterwards...

I'm surprised by this. Most vendors I am aware of will do factory
testing and mark those blocks as bad before shipping them.
Also, block 0 will be guaranteed to be good upon delivery (but it could
still fail upon first write, I assume).

FM

Frans Meulenbroeks said the following on 03/06/2009 11:55 AM:

> I'm surprised by this. Most vendors I am aware of will do factory
> testing and mark those blocks as bad before shipping them.

Yes - the chip vendor will do that -> at least with the Micron parts on the
beagleboard, I have seen the blocks marked; in fact, with the Micron
ones, you could try to erase the marker, but it won't go away ;). But I
think Søren was talking about the factory-floor flashing of the
bootloader at the board manufacturing plant.

> Also block 0 will be guaranteed to be good upon delivery (but it could
> still fail upon first write I assume)

I know the probability of this is pretty low, but yes, I have had the
fortune/misfortune to see 1 board (not beagle) with block 0 bad after a
few hours of usage :( But I have only seen 1 device in the last 2-3
years of playing with NAND devices... still, one is a good enough
number for me to believe block 0 will go bad - someday! :(

Regards,
Nishanth Menon

> > Also block 0 will be guaranteed to be good upon delivery (but it could
> > still fail upon first write I assume)
>
> I know the probability of this is pretty low, but yes, I have had the
> fortune/misfortune to see 1 board (not beagle) with block 0 bad after a
> few hours of usage :( but i have only seen 1 device in the last 2-3
> years of playing with nand devices.. but, one is good enough a
> number for me to believe block 0 will go bad - someday! :(

Block 0 will definitely go bad someday. There is no difference in the
technology manufacturing block 0 versus the other blocks, so it is
vulnerable to the same defects.
The only guarantee you get is that it is good on delivery.
For that reason it greatly helps if the loader can deal with a bad
block 0 (not sure what the OMAP3 does, but a lot of devices cannot).
If a device cannot deal with a bad block 0, take caution and do not
upgrade it unless really needed.
Reading a block rarely gives rise to uncorrectable errors if the block
was written properly.

FM

> Block 0 will definitely go bad someday. There is no difference in
> technology manufacturing block 0 or the other blocks, so it is
> vulnerable to the same defects.
> The only guarantee you get is that it is good on delivery.
> For that reason it greatly helps if the loader can deal with a bad
> block 0 (no sure what omap3 does, but a lot of devices cannot).
> If a device cannot deal with a bad block 0, take caution and do not
> upgrade it unless really needed.
> Reading a block rarely gives rise to uncorrectable errors if the block
> is written properly.

I totally agree with your comments. The statistics I have are from a 128MB
image (actually a bit smaller, since it needs to fit even in spite of bad
blocks :) ) being flashed into a brand-new NAND device - the device turns
bad during the flashing procedure. The device was declared fully functional
by the manufacturer before first flashing, even though it of course could
contain a certain number of bad blocks (other than block 0), as stated in
the device's datasheet. I hope the info was clearer this time?

With respect to the OMAP and NAND booting: the ROM code will search the
first 4 blocks for a valid boot image (more information can be found in the
TRM, chapter 25.4.7.4), thereby being able to boot even if block 0
should turn out to be bad. I however don't think this is currently used by
the official x-loader flashing method - maybe we should improve the
guide with respect to this :)
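That search over the first blocks can be sketched in C as a simulation; `block_is_bad` and `block_has_valid_image` are hypothetical helpers standing in for a bad-block-marker check and an image-header check, not actual ROM code:

```c
#include <stdbool.h>

/* Sketch of the boot-block search behaviour described for the OMAP3
 * ROM: scan the first four NAND blocks for a usable boot image rather
 * than relying on block 0 alone.  The bad/valid flags are supplied by
 * the caller here to keep the simulation self-contained.
 * Returns the block index to boot from, or -1 if none qualify. */
#define BOOT_SEARCH_BLOCKS 4

static int find_boot_block(const bool *block_is_bad,
                           const bool *block_has_valid_image)
{
    for (int b = 0; b < BOOT_SEARCH_BLOCKS; b++)
        if (!block_is_bad[b] && block_has_valid_image[b])
            return b;
    return -1;
}
```

So a flashing tool that also writes the image into a later block among the first four would survive block 0 going bad.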

With respect to the factory-marked bad blocks - take care as well. At least
for some of the earlier devices it was possible to erase the blocks marked
bad by production. This is a severe "problem", since such a block might not
turn up bad immediately for a random write, because the production testing
is normally done with special, secret stressing test patterns...

Therefore, as a general rule of thumb: take care not to erase bad-block
markers by mistake... :) The info from Nishanth could however point in
the direction that the newer generations won't allow you to erase the
factory-marked bad blocks - great :-)...

Best regards
  Søren

Oh, you could check that right away in u-boot:
nand bad -> gives a list of bad blocks
nand scrub -> will force-erase even blocks marked bad (NOT RECOMMENDED)
then try nand bad again.
When I was working with the NAND driver in u-boot, this was a
blessing... but I guess I knew what the heck I was doing.
Regards,
NM
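For context, a `nand bad`-style scan essentially just inspects the factory marker byte in the OOB area of each block's first page. A simulated sketch in C; the marker offset used here is the common large-page convention (byte 0 of the OOB), which is an assumption to check against your part's datasheet:

```c
#include <stdint.h>

/* Simulated factory bad-block check for large-page NAND.  The factory
 * writes a non-0xFF marker into the OOB of a bad block's first page;
 * a block whose marker byte is still 0xFF is considered good.  This
 * operates on a caller-supplied OOB buffer, not real flash. */
enum { OOB_SIZE = 64, BBM_OFFSET = 0 };

static int block_is_factory_bad(const uint8_t *first_page_oob)
{
    return first_page_oob[BBM_OFFSET] != 0xFF;
}
```

This is also why `nand scrub` is dangerous: erasing the block resets the OOB to all 0xFF, wiping the marker this check relies on.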

And of course, if you are playing with things like this, it is always a
good plan to do a nand bad and print that (or at least store it
somewhere), so if you accidentally erase the bad-block info you can
mark the blocks as bad again (often the factory test for bad blocks is
more rigid than what a user app does).

FM