Driving a multiplexed LED matrix directly from GPIO

Hi All,

I’ve been experimenting with embedded Linux and multiplexed LED displays. I started with a Raspberry Pi userspace program, but could see visual artifacts on the display due to inconsistent timing of my sleep calls. So I figured that moving the basic row scanning into the kernel would help. After failing to get the right kernel headers for the Pi, I switched to a BeagleBone White. I’ve now got a working character-device LKM that takes new images as ASCII-formatted hex strings written to the device node in /dev. The performance is pretty good, but not great: I still see visible artifacts, but I’m playing with things.

My basic question is this: I know that Linux is not an RTOS, so timing will never be guaranteed, yet Linux does a lot of things very quickly (video, audio, I2C, etc.). My driver is bit-banging an SPI-like stream over 8 rows at ~3 ms per row (333 Hz row scanning, or ~41 Hz per complete frame) and is really struggling. How does Linux usually deliver large, smooth video at over 60 FPS while doing other things? Is it simply taking advantage of special hardware resources?

The obvious solution for this display is to use a little 8051 or Cortex-M0 microcontroller (or a PRU!) and let the Bone feed it over UART or something, but I really thought I could do better with the LKM.

Am I just doing something totally wrong? Any other ideas?

Thanks!

–David

Show us some code.

OK, so yes, seeing your code would help us understand where your bottleneck is, but I have a pretty good idea why your LED matrix refresh is so slow. It has to do with userspace communicating with kernel space: you’re either invoking system API calls (which are notoriously slow), or you’re copying data from userspace to kernel space, which again is slow.

The reason I said to show us some code, however, is that I think it should be possible to use /dev/mem + mmap() to access whatever you’re controlling. Using mmap() in this manner gives you no userspace-to-kernel overhead... but again, I need to see some relevant code to give you a better idea how that might work.
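For what it’s worth, here’s a minimal sketch of that approach for the BeagleBone’s AM335x. The GPIO1 base address and register offsets are from the AM335x TRM, and the pin bit (28) is a hypothetical example; verify all of them for your board:

```c
/* Sketch: mmap() the AM335x GPIO1 block from userspace via /dev/mem.
 * Base address and offsets per the AM335x TRM; double-check for your part. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define GPIO1_BASE        0x4804C000  /* AM335x GPIO1 module */
#define GPIO_SIZE         0x1000
#define GPIO_CLEARDATAOUT 0x190       /* write 1-bits to clear outputs */
#define GPIO_SETDATAOUT   0x194       /* write 1-bits to set outputs */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *gpio = mmap(NULL, GPIO_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, GPIO1_BASE);
    if (gpio == MAP_FAILED) { perror("mmap"); return 1; }

    /* Toggle GPIO1_28 (hypothetical strobe pin) with no syscall overhead */
    gpio[GPIO_SETDATAOUT / 4]   = 1u << 28;
    gpio[GPIO_CLEARDATAOUT / 4] = 1u << 28;

    munmap((void *)gpio, GPIO_SIZE);
    close(fd);
    return 0;
}
```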

Ok, I posted the full source code here:
https://github.com/davidgood1/ledmsgchar

I’m not sure that userspace has much to do with what I’m seeing right now. I’m using a kthread that runs continuously; a flag indicates when a new buffer of data is ready from userspace, and my task (update_row) copies the buffer, so nothing should be waiting on slow userspace operations. But maybe there’s something I don’t see.

I didn’t know whether I was getting interrupted in the memcpy call in update_row, so I’m going to try copying the buffer one row per call rather than the whole buffer at once.

Also, I notice visual artifacts whenever anything else is going on: SSH sessions, typing on the terminal, etc. CPU usage is ~3% with the timing currently in the file. BTW, the update_row task regulates its timing by calling usleep_range(), which is supposed to be backed by high-resolution timers. I’m using the same number for the upper and lower bounds to try to force stricter timing, because when I allowed +/-10% you could definitely see more visual artifacts.

Also, you’ll notice that the strobe pin is toggled high and then immediately low, which I’ve measured at about 2 µs. So that seems to be the best the GPIO interface can do.

Thanks!

–David

I've often seen artifacts in LED displays where the write-to-display process (which should be amazingly short for each digit) is interrupted. You might be seeing some differences in digit brightness as well.

Once you have the pattern in RAM, ready to be written, you would ideally update right at the beginning of each scan for each digit. You also want to make sure that the pattern is not updated during the active strobe-to-display process. (I'm assuming that there's a hardware latch in there that you write, so you're not depending on the output pins being constant, and that the strobe that activates the display is also a latch.) This approach minimizes the critical time during which the display data cannot be interfered with. That window would ideally be in the microsecond range (i.e. turn off previous strobe, load next data, turn on this strobe). That part of the code cannot be interrupted and should ideally be protected by turning interrupts off during those few instructions.
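In kernel terms, that critical section might look something like this. A sketch only: the PIN_* numbers and the helper structure are hypothetical placeholders, not taken from the posted driver:

```c
#include <linux/gpio.h>
#include <linux/irqflags.h>

/* Hypothetical pin assignments; substitute the driver's real pins. */
#define PIN_BLANK  45
#define PIN_A0     46
#define PIN_A1     47
#define PIN_A2     48
#define PIN_STROBE 49

static void strobe_next_row(int row)
{
	unsigned long flags;

	local_irq_save(flags);            /* nothing may interrupt this window */
	gpio_set_value(PIN_BLANK, 1);     /* blank the display */
	gpio_set_value(PIN_A0, row & 1);  /* row select A0..A2 */
	gpio_set_value(PIN_A1, (row >> 1) & 1);
	gpio_set_value(PIN_A2, (row >> 2) & 1);
	gpio_set_value(PIN_STROBE, 1);    /* latch shifted data to the outputs */
	gpio_set_value(PIN_STROBE, 0);
	gpio_set_value(PIN_BLANK, 0);     /* unblank */
	local_irq_restore(flags);
}
```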

So the first question to ask is "what's going on with the strobes and data?"

Harvey

Hmm… I’m using UCN5821 driver chips from Allegro. They do indeed have a hardware latch (strobe) which latches the received serial data from the internal shift register to the outputs.

My write algorithm is (per row):

1> set_current_state(TASK_RUNNING)
2> If new data is ready in RAM, memcpy it to the real display buffer in RAM
3> Serially shift out the next row's data in the background, but do not latch it yet
4> Blank the display (this is not actually working right now; I never see the blank line move. Hmm...)
5> Adjust the multiplexed row-select pins to the next row (A2...A0)
6> Latch the data already in the serial buffers
7> Unblank the display
8> set_current_state(TASK_INTERRUPTIBLE)
9> usleep until it's time to update the next row
I wanted to spend as little time as possible with the LEDs off, which is why I'm shifting data while the LEDs are still showing the current row's data. Using this technique, I've seen ghosting issues when an MCU is very fast, since the drivers themselves need a little time to switch the state of their outputs, and the high-side switching power transistors sometimes need time to fully turn off. I don't think that's what I'm seeing here, though.

The display artifacts I see look like stuttering when the processor gets busy. I suspect this is due to inconsistent usleep times (Linux isn't an RTOS), but I'm still trying to catch it with my logic analyzer.

My basic question is this: does the Linux kernel "usually" do this kind of bit-banged driving for other things (I2C, video, framebuffers, audio, etc.), or does it "usually" pass these tasks off to hardware peripherals? The question behind the question is: am I doing this the "usual" way, or am I attempting something very specialized?

My goal is to look into the framebuffer devices and see if I can learn anything there, but kernel programming is very new to me.

Thanks for your feedback!

–David

> Hmm... I'm using UCN5821 driver chips from Allegro. They do indeed have a hardware latch (strobe) which latches the received serial data from the internal shift register to the outputs.

> My write algorithm is (per row):
> 1> set_current_state(TASK_RUNNING)

For step 2, I'd be tempted to do a non-interruptible write to the display memory (or protect it with a semaphore). You want to interlock the process so that you have a safe zone for writing new data to the memory.

> 2> If new data is ready in RAM, memcpy it to the real display buffer in RAM

This should not cause a problem unless the data changes somehow.

> 3> Serially shift out the next row's data in the background, but do not latch it yet

This may be too fast for you to see, and you don't want it slow.

> 4> Blank the display (this is not actually working right now; I never see the blank line move. Hmm...)

I'm assuming you have the data for the new row in the buffer by this point, of course.

> 5> Adjust the multiplexed row-select pins to the next row (A2...A0)
> 6> Latch the data already in the serial buffers

Blanking and unblanking the whole display could cause flicker.

> 7> Unblank the display
> 8> set_current_state(TASK_INTERRUPTIBLE)
> 9> usleep until it's time to update the next row

> I wanted to spend as little time as possible with the LEDs off, which is why I'm shifting data while the LEDs are still showing the current row's data.

That shouldn't be a problem at all, the chip is designed for it.

> Using this technique, I've seen ghosting issues when an MCU is very fast, since the drivers themselves need a little time to switch the state of their outputs, and the high-side switching power transistors sometimes need time to fully turn off. I don't think that's what I'm seeing here, though.

Let's assume that it is a problem, so you can fix that.

> The display artifacts I see look like stuttering when the processor gets busy. I suspect this is due to inconsistent usleep times (Linux isn't an RTOS), but I'm still trying to catch it with my logic analyzer.

I've had similar task-interlocking problems with an RTOS, but I'd suggest a slightly different approach.

> My basic question is this: does the Linux kernel "usually" do this kind of bit-banged driving for other things (I2C, video, framebuffers, audio, etc.), or does it "usually" pass these tasks off to hardware peripherals? The question behind the question is: am I doing this the "usual" way, or am I attempting something very specialized?

Not an expert on Linux at all, but it depends on the data rates and whether or not there's hardware available to do it.

> My goal is to look into the framebuffer devices and see if I can learn anything there, but kernel programming is very new to me.

Kernel programming (for microprocessors) is not all that bad, but there are things you probably don't think of when writing kernel-level drivers and code:

1) everything is asynchronous
2) this can be inconvenient
3) lots of mechanisms exist to keep this from happening

I'd be tempted to take the following approach: divide the scan time into one slot per digit, plus one. Use that extra slot to allow a synchronized refresh of the display itself. You may need to limit the on time depending on your sink current (per the driver's data sheet).

Assuming you use a semaphore: while actively scanning digits, the scanning process "owns" it. During the update slot, the scanning process gives it back. The update process can grab it then, update all the RAM buffers that the scanning process uses, and give the semaphore back; the scanning process grabs it again for the next scan. Result: the display does all RAM updates while blanked.
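A sketch of that interlock with a kernel semaphore (illustrative names only; drive_row() is a placeholder, and display_buf follows the 8x18-byte layout from the earlier sketch):

```c
#include <linux/semaphore.h>
#include <linux/string.h>
#include <linux/types.h>

#define NUM_ROWS  8
#define ROW_BYTES 18

static u8 display_buf[NUM_ROWS][ROW_BYTES];
static DEFINE_SEMAPHORE(frame_sem);   /* holder may touch display_buf */

void drive_row(int row);              /* placeholder: shift, latch, strobe */

/* Scanner: owns the semaphore while rows 0..7 are driven; gives it
 * back only during the extra update slot. */
static void scan_one_frame(void)
{
	int row;

	down(&frame_sem);
	for (row = 0; row < NUM_ROWS; row++)
		drive_row(row);
	up(&frame_sem);                   /* update slot: writer may run now */
}

/* Updater: only gets the buffer between frames, so the display never
 * shows a half-written image. */
static void write_new_frame(const u8 *src, size_t len)
{
	if (len > sizeof(display_buf))
		len = sizeof(display_buf);
	down(&frame_sem);
	memcpy(display_buf, src, len);
	up(&frame_sem);
}
```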

Now for the scan itself. Once the update slot has ended, you may want to disable task switching: shift in the new data, switch the row drivers off, latch the column data, then enable the new row drivers. Since Darlingtons are slow, I'd be looking at the driving waveforms from the Darlington outputs to the display for any overlap. There shouldn't be much, if any; some software delays of a few microseconds may be needed. At that point, re-enable task switching for the OS and you're in good shape.

The problem with multiplexed displays run directly by a microprocessor is debugging during the scan cycle: halt with one row enabled and the average display current can exceed the steady-state limits for the display. I'd have been tempted to put in a small CPLD that appears as external registers to the processor, and then let the hardware do all the work. Xilinx CoolRunner-II 32- or 64-macrocell chips are cheap, have 3.3 V I/O, and need (IIRC) a 1.8 V core supply. Not too bad, and the display stays happy.

Harvey

So, let me see if I understand your idea:

My display looks like this:

R0 <Common data outputs: 144 pixels (18 bytes)>
...
R7 <Common data outputs>

Data is written to the LED drivers while power is applied to one row. Then, new data is written and power is applied to the next row, etc.

"I'd be tempted to take the following approach: divide the scan time into one slot per digit, plus one. Use that extra slot to allow a synchronized refresh of the display itself."

When you say "digit", I assume you're thinking of a multiplexed 7-segment readout style display, in which case a digit would be the same thing as a row here. My driver operates on one row at a time before going to sleep. Are you suggesting that I scan through all 8 rows and then have a special 9th "row" slot where I do things like the memcpy? That would be pretty close to a VSYNC idea, right? I suppose this 9th "row" wouldn't have to last a full row time; I could schedule the next row sooner. Hmm...
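In sketch form, reusing the names and declarations from the earlier kthread sketch, the 8+1 slot version would be:

```c
/* Sketch of the 8+1 slot scan; names follow the earlier sketches. */
static int update_row(void *unused)
{
	int row;

	while (!kthread_should_stop()) {
		for (row = 0; row < NUM_ROWS; row++) {  /* slots 0..7: drive rows */
			shift_out_row(display_buf[row]);
			strobe_next_row(row);
			usleep_range(ROW_PERIOD_US, ROW_PERIOD_US);
		}
		if (new_data) {              /* slot 8: "VSYNC", touch RAM safely */
			memcpy(display_buf, pending_buf, sizeof(display_buf));
			new_data = false;
		}
	}
	return 0;
}
```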

You're totally right about MCUs and burning up LEDs. This particular display is safe because the LEDs are not being over-driven. Your idea about a CPLD is a good one. I've never used them, but I have known about them. I found this XC2C32A for $1.80 USD. I'll probably try it out at some point, just for educational purposes :slight_smile:

http://www.digikey.com/product-detail/en/xilinx-inc/XC2C32A-6VQG44I/122-1704-ND/1952030

–David

> So, let me see if I understand your idea:
> My display looks like this:
> R0 <Common data outputs: 144 pixels (18 bytes)>
> ...
> R7 <Common data outputs>

Ah, ok, dot matrix then.

> Data is written to the LED drivers while power is applied to one row. Then, new data is written and power is applied to the next row, etc.

"I'd be tempted to do the following approach: Divide the scan time
into one slot per digit plus one. Use that time to allow synchronized
refresh to the display itself."

When you say "digit", I assume you are thinking in terms of a multiplexed
7-segment readout style display, which in my case would be the same thing
as a row. My driver operates on one row at a time before going to sleep.
Are you suggesting that I scan through all 8 rows and then have a special
9th "row" time where I do things like the memcpy? This would be pretty
close to a VSYNC idea, right? I suppose that this 9th "row" wouldn't have
to wait a full row time, but could schedule the next row sooner. Hmm....

Exactly: VSync or HSync, depending on which analogy you want. In this case, VSync would be right.

If you're running under an operating system, your timing is really generated by interrupts, so I'd use a relatively high-priority hardware timer interrupt here. Alternatively, just give the remaining time back to the OS so it can pick the next task; no need to chew up a whole tick period.
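For Linux specifically, the nearest equivalent is an hrtimer. A minimal sketch; do_one_row() is a placeholder, and note the callback runs in interrupt context, so it must not sleep:

```c
#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer row_timer;
static ktime_t row_period;

void do_one_row(void);   /* placeholder: shift/latch/strobe the next row */

static enum hrtimer_restart row_tick(struct hrtimer *t)
{
	do_one_row();                     /* interrupt context: must not sleep */
	hrtimer_forward_now(t, row_period);
	return HRTIMER_RESTART;           /* re-arm for the next row slot */
}

static void start_scanning(void)
{
	row_period = ktime_set(0, 3 * NSEC_PER_MSEC);    /* 3 ms per row */
	hrtimer_init(&row_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	row_timer.function = row_tick;
	hrtimer_start(&row_timer, row_period, HRTIMER_MODE_REL);
}
```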

In my OS, tasks that are suspended or delayed check the time (or a semaphore) or yield their time back, and the OS simply looks for the next active task.

> You're totally right about MCUs and burning up LEDs. This particular display is safe because the LEDs are not being over-driven. Your idea about a CPLD is a good one. I've never used them, but I have known about them. I found this XC2C32A for $1.80 USD. I'll probably try it out at some point, just for educational purposes :slight_smile:

You'll want the free version of the Xilinx tools (ISE WebPACK), which works fine with those chips. You'll need to find the USB programming cable as well; I got mine from Amazon.

There are times when it's nice to be able to throw hardware at a problem. You have the driver done well enough, but IIRC there are some subsystems out there that do this. A simple Atmel ATmega processor would work well for this, if you wanted.

The nice thing about the XC2C32A and the XC2C64A is that they have exactly the same footprint. Once you get to 128 or 256 macrocells, you go to a TQFP-100 package.

I'd recommend VHDL (simply because I like it) as the design language. Remember that in VHDL you don't have to design components the way you'd wire up chips: you can tell the tools "I want a divide-by-37 counter" by describing how it counts, and let the program build it. That makes for an easier design process than building the counter out of discrete logic yourself.

Harvey

So, I'm just now getting back to you, and I see you all have been discussing things I have not fully read about yet. Anyway, I'll say that I have zero hands-on experience with LKMs, but I do have a decent amount of hands-on with C in userspace. One thing that sticks out to me: https://github.com/davidgood1/ledmsgchar/blob/master/ledmsgchar.c#L359

msleep(n), where n is a value in milliseconds. I honestly do not know how responsive timers are in kernel space, but in userspace high-resolution timers do not really work; it does not matter whether one uses an RT kernel or not. System API calls such as usleep() really do not give you precise delays, as there is system-call overhead involved. In kernel space... again, I have no hands-on experience personally, but I'd be leery of any sleep()-type call until I actually tested it on the specific platform I planned to use it on.

Additionally, just below the call to msleep() linked above, you have nested loops two deep, so... http://stackoverflow.com/questions/24643432/big-o-nested-while-loop

Hi William!

Good eyes, but I don't know if that will affect the row updates. It will affect userspace programs trying to write to the character device, introducing a delay before the write request returns; but since the row updates happen on a separate kthread, it shouldn't be possible for this to disturb the critical timing.

I think the delay actually produced by this call will be 1 "jiffy", which I think is ~20 ms or more, but I didn't think it mattered to make userspace wait that long.

–David

> Hi William!
>
> Good eyes, but I don't know if that will affect the row updates. It will affect userspace programs trying to write to the character device, introducing a delay before the write request returns; but since the row updates happen on a separate kthread, it shouldn't be possible for this to disturb the critical timing.
>
> I think the delay actually produced by this call will be 1 "jiffy", which I think is ~20 ms or more, but I didn't think it mattered to make userspace wait that long.

Hi David,

About the only way I can think of to "accurately" test this would be to write a similar LKM that toggles a single GPIO, and watch the results with a scope or logic analyzer. Also, when you say "~20ms", be wary that "ms" typically means milliseconds; you probably meant microseconds.

Again, I'm not sure about msleep(), but it seems to be called every time the driver is waiting on the buffer, and I don't know how often that method in your module is called. Between the potential latency from the sleep and the big-O nested-loop situation... is your buffer really a two-dimensional array? I'd definitely look into changing that into a linear field that can have bit ops done on it efficiently.

Now here's where kernel space gets really weird: I did actually mean 20 milliseconds. My first version of the LKM tried to use msleep(2), but only got a 6 Hz total image refresh rate. I measured the signals with a logic analyzer and found that I was waking up every ~20 ms. Totally confused, I searched and found the answer here:

https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

It turns out that msleep() doesn't work for very low values of 'm' because it is backed by the legacy (jiffies-based) timers rather than the new high-resolution ones.
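Per that document, the rough rule is udelay() below ~10 µs, usleep_range() from ~10 µs to ~20 ms, and msleep() above that. So, for example:

```c
#include <linux/delay.h>

/* Same intent, very different behavior: */
msleep(2);                 /* jiffies-based: rounds way up (~20 ms observed) */
usleep_range(2000, 2000);  /* hrtimer-backed: sleeps ~2 ms as asked */
```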

The GPIO can be toggled through back-to-back function calls at about 2 µs.

About waiting on "the buffer": there are two buffers:

1> active row data
2> a temporary buffer that is filled by dev_write calls from userspace

The active-row-data buffer is never touched by userspace, so no one ever waits for it. The temp buffer is only used to receive new image data from userspace and, once filled, is consumed by the update_row task. Again, unless I've missed something, the update_row task is not affected: my driver should be getting multiple CPU time slices asynchronously while the dev_write routine is waiting. In actual use I've never seen this wait happen, since it is only triggered if userspace can somehow write faster than my kernel thread can update the active row data. My test-pattern application only writes a new pattern once per second, so I've been treating the LKM gently :slight_smile:

I'll definitely check out flattening the two-dimensional array. I didn't think it would really impact performance, since I'm not dereferencing it with multiple subscripts but just passing buf[row] as a pointer into the serializer function. It probably would be easier to do effects and transitions on a linear array, but I probably won't do too much of that in the kernel module itself. If the performance of writing to a character device is good, I'd prefer to do all image processing and buffer manipulation in userspace. Experimentation will have to guide me there!
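For what it's worth, the flattened layout barely changes the point of use. A sketch (serialize_row() is a stand-in for the real serializer):

```c
#include <linux/string.h>
#include <linux/types.h>

#define NUM_ROWS  8
#define ROW_BYTES 18

static u8 buf[NUM_ROWS * ROW_BYTES];     /* one linear field */

void serialize_row(const u8 *row_data);  /* placeholder for the real one */

static void scan_row(int r)
{
	/* row r still passes as a single pointer, now with an explicit stride */
	serialize_row(&buf[r * ROW_BYTES]);
}

static void clear_frame(void)
{
	/* whole-frame effects become one linear pass over the field */
	memset(buf, 0, sizeof(buf));
}
```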

–David

David,

Without seeing your circuit and how your rows & columns are set up to be driven, I'll take a blind stab at your issue.
To get rid of artifacts / ghosting / etc.:

  1. Shift out all your data
  2. Turn off your drivers and row power if that is available
  3. Delay ~20us
  4. Latch the data into your drivers
  5. Delay ~20us or more
  6. Change your row address
  7. Enable your outputs and/or row power

You must have some dead time between rows or you will get artifacts / ghosting / whatever you choose to call it.
You might get away with less than 20 µs, but you might also need more, depending on your circuit.
Too much dead time and you get flicker; not enough and, you guessed it, artifacts/ghosting.
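As a sketch in code (the pin numbers and helper functions are placeholders, and the 20 µs figures are the starting points suggested above, not measured values):

```c
#include <linux/delay.h>
#include <linux/gpio.h>
#include <linux/types.h>

#define PIN_OE     60                 /* hypothetical output-enable pin */
#define PIN_STROBE 49                 /* hypothetical latch pin */

void shift_out_row(const u8 *row_data);   /* placeholders for the */
void set_row_address(int row);            /* driver's real helpers */

static void scan_next_row(int row, const u8 *row_data)
{
	shift_out_row(row_data);          /* 1: clock all the data out */
	gpio_set_value(PIN_OE, 1);        /* 2: drivers (and row power) off */
	udelay(20);                       /* 3: dead time for turn-off */
	gpio_set_value(PIN_STROBE, 1);    /* 4: latch data into the drivers */
	gpio_set_value(PIN_STROBE, 0);
	udelay(20);                       /* 5: more dead time */
	set_row_address(row);             /* 6: change A2..A0 */
	gpio_set_value(PIN_OE, 0);        /* 7: outputs back on */
}
```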

For what it seems like you are doing, I'd use the SPI interface to shift out your data in blocks of 16 bits: 9 transfers gets you all 144 bits out.
You obviously could bit-bang this, but why, when you have built-in hardware that will do it for you? I'd think it would be fast enough in an ISR; I should think less than 250 µs.
Use GPIO for toggling your latch, output enable, and addr/row select. These are low-speed signals, so no problem.
There are definitely easier ways to do this with external hardware, but for this size of matrix it would be a waste of money. Yes, a $1-2 CPLD would do the trick, but then you need a PCB, etc.
Set up a periodic timer interrupt to sync your shifts / rows of data: take your refresh rate (suggest 55-80 Hz to avoid flicker) and multiply by the number of rows (timer => 440 to 640 Hz row rate).
Use the time between interrupts to set up your next buffer for display.

Get someone here to help you with the timer/interrupt under Linux - I have no idea on that one. Would like to though - so maybe someone will respond with how to do that.

Hope that helps.

Good Luck,
Matt

Yes, I would love for someone to give me tips on setting up a periodic timer interrupt. What I have might be the only way to really do it, but I would assume not.

Thanks for the tips on ghosting. I'll use them when I start putting real data in the buffers rather than my simple test patterns.

About using SPI: yes, this is what I was doing on the Raspberry Pi, but I haven't done it on the BeagleBone yet. I'll definitely give it a shot. Bit-banging the clock and data looks like it's costing me ~1 ms per row as currently implemented.
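From userspace, that's what spidev is for. A minimal sketch, assuming the bus is exposed as /dev/spidev1.0 (the device path and the 1 MHz clock are assumptions; check your setup and the UCN5821 datasheet):

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/spi/spidev.h>

/* Send one 18-byte (144-bit) row to the driver chain; latch/strobe and
 * row select remain ordinary GPIO writes. */
int send_row(int fd, const uint8_t *row, size_t len)
{
    struct spi_ioc_transfer xfer;

    memset(&xfer, 0, sizeof(xfer));
    xfer.tx_buf = (unsigned long)row;
    xfer.len = len;                 /* 18 bytes = 9 sixteen-bit words */
    xfer.speed_hz = 1000000;        /* assumed 1 MHz; check the datasheet */
    xfer.bits_per_word = 8;

    return ioctl(fd, SPI_IOC_MESSAGE(1), &xfer);
}

/* usage: int fd = open("/dev/spidev1.0", O_RDWR); send_row(fd, row, 18); */
```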

On the topic of dedicated hardware offloading the workload: I suppose that's exactly what the PRUs are designed for. Perhaps it's time to say Hello World to one of them. Since this is a personal project there aren't any real design requirements, so every option is open to me right now. I was planning to lay out a PCB to clean things up a bit, because right now I have a hand-soldered perf board with ribbon cables :slight_smile:

Thanks for the suggestions! Let’s see if someone can answer about a kernel timer interrupt.

–David

The way I see it, you have two options: you could use the PRUs, or you could use the McSPI hardware module. I would attempt to help you, but I really don't have any experience with the ws28xx-style serial protocol that many seem to use with external lighting of this type.

I can tell you that a very simple, cheap MSP430G2553 project would be more than enough to do all this, bare metal. So there's no need for a CPLD or FPGA or anything fancy like that. But again, as Matt mentioned above, you have to design a PCB plus the circuit, and the costs add up, despite the fact that a G2553 is only about $2-$3 sold as singles.

Anyway, maybe someone here knows the McSPI module well enough to comment on "shifting out" data to your matrix? As in, actually knowing what they're talking about...

How does this buffer work, anyhow? Is it like a bit field in both dimensions, or what? Kind of hard to grasp from a quick glance at the code :wink:

David,

William's suggestion of the PRU is a really good one. I know there are examples on the forum for using them, even though I personally haven't looked at them.
You should be able to do everything inside the PRU: you only need 144 bytes per buffer (assuming 1 color / 1 bpp), so even having multiple buffers shouldn't be an issue. You should still be able to leverage the SPI interface if desired.
I understand that bit-banging via the PRU is much more efficient than even bare metal on the main CPU.

The UCN5821 driver has a much simpler interface than the ws28xx serial lighting chips. It's an old, retired chip that is essentially a shift register plus a latch, which makes it quite suitable for SPI, as you really only need clock and data to send info to it. It's also not timing-dependent the way the ws28xx's 800 kHz 1-wire protocol is.
These parts are also forgiving if you're word-boundary bound, in that you can always push extra data through the chain, leaving valid data in the registers behind it. So, for instance, if you end up having to do 32-bit transfers into a 144-bit chain, you can shift out 160 bits (5 transfers) and just treat the first 16 bits as don't-care. The remaining 144 stay in the registers and get transferred to the outputs when latched.
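The arithmetic generalizes to any word width; as a sketch:

```c
/* Rounding a 144-bit chain up to whole SPI words: pad at the front. */
#define CHAIN_BITS 144
#define WORD_BITS  32   /* or 16, 8, ... whatever the controller does */

static const int xfers    = (CHAIN_BITS + WORD_BITS - 1) / WORD_BITS; /* = 5  */
static const int pad_bits = xfers * WORD_BITS - CHAIN_BITS;           /* = 16 */
```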

This is a fairly high-power chip (I haven't looked to see whether there's a newer version for new designs), so I'm guessing that you (David) must be recycling an old display. If not, there are a bunch of similar chips out there, most of them 16 bits wide: see TI, ST Micro, Silicon Touch, Macroblock, Toshiba...

Good Luck,
Matt