PocketBeagle 2 (AM62x). Issues Capturing Full BT.656 Frame Using PRU

mikhail_svirydau · January 12, 2026, 10:00pm

Dear Beagle Board community,

I am developing a system based on the AM62x platform running Linux (kernel 6.x). Using the PRU, I aim to capture an incoming digital video stream in BT.656 square pixel format (768×576). The stream provides one byte at a time at a frequency of 29.5 MHz, along with a corresponding clock signal, which I also read from the source device.

For testing purposes, I am currently using a PocketBeagle 2 board with the Debian 13 image (kernel 6.2). Through multiple experiments, I determined that this task requires use of the PRU. I successfully created a device tree overlay, enabled the first 8 pins on PRU0, and increased the PRU clock frequency from 200 MHz to 333 MHz, all of which I verified experimentally.

I then implemented a simple PRU firmware that waits for a rising edge on one of the PRU0 pins and, upon detection, reads the __R31 register and stores its value into a ring buffer, updating an index accordingly. Both the ring buffer and index reside in PRU shared memory. Additionally, I developed a basic kernel driver to map this shared memory into Linux user space, along with a user-space application that accesses it via mmap().

At this point, the firmware appears to be reading BT.656 data. For example, during active video, I consistently observe 0x604 bytes (CrYCbY) following the BT.656 sync sequence 0xFF 00 00 xx, where xx reliably corresponds to header values such as 0x80 or 0xC7. My application is able to read data from the ring buffer, so the overall data path seems functional.

However, I am encountering a significant issue: I do not appear to be capturing the full video data in the Linux application. Instead of the expected 288 lines per field (with two interlaced fields, 0 and 1), I am only seeing approximately 55–60 lines. The resulting image only vaguely resembles the expected output. Since the video source has no lens, I would expect at most a grayscale gradient or shadowing; however, even within these 55–60 lines, I can observe differences when the video core is covered or uncovered.

At times, it also appears that some bytes may be skipped during capture.

Given this situation, I would appreciate your guidance on the following:

- Does my current approach seem viable for reliably capturing the full BT.656 frame?
- Are there specific improvements or adjustments I should consider in the PRU firmware or user-space application to avoid data loss?
- Would it be more appropriate to move the data reception logic from the PRU/user-space path into a kernel-level driver?

Thank you in advance for your time and support.

Best regards,
Mikhail

Juvinski · January 13, 2026, 4:08pm

Hi @mikhail_svirydau

It’s not clear to me whether you are reading a single stream, or whether each PRU pin reads a different stream or a different part of the frame.

To read a stream, I would recommend using eCAP instead of reading the PRU pins directly: with eCAP you can configure the trigger condition, read the captured value, and then re-enable the capture.

mikhail_svirydau · January 16, 2026, 1:59pm

Hi @Juvinski,

Thank you very much for your reply and suggestions.

As requested, here is a clearer description of the incoming signal and the problem I am facing.

The input consists of 9 signals: 8 data bits plus a clock (CLK). The CLK is a square wave at 29.5 MHz. The 8-bit data is valid on the rising edge of CLK. Therefore, the task is:

1. Detect the rising edge of the CLK signal
2. Read one byte from the remaining 8 data lines at that moment

The main challenge is how to do this reliably.

I initially considered using eCAP, but at 29.5 MHz this does not seem feasible. Using eCAP would require generating an interrupt to notify the ARM core that data is ready, and under Linux the interrupt latency makes capturing data at this rate impractical. Also, my understanding is that TI primarily targets eCAP for PWM measurement and general signal detection.

Using GPIO directly also does not work, as it is limited to well below 1 MHz. That leaves the PRU as the only viable option.

Strictly speaking, I do not need a generic “rising edge” detector. What I need is to detect the moment when CLK transitions from 0 to 1 and then immediately read the data. On the AM62x, the PRU runs at 333 MHz, giving about 3 ns per instruction. This is enough to:

Check that the CLK pin (for example, P2.20) is 0
Immediately check that it is 1

If both conditions are met in 2 successive instructions, then the signal transitioned from 0 to 1 within 3 ns. After that, I can read the remaining 8 PRU pins and store the byte into a ring buffer on shared memory.

Debugging PRU code is difficult, but I ran several experiments:

1. Rising-edge detection test
   I implemented CLK “rising edge” detection and toggled another PRU pin (for example, P1.20) on each detected edge. This worked reliably. On the oscilloscope, I observed a stable square wave at 2 × 29.5 MHz, with no missed edges. So I believe the edge detection itself is correct.
2. Ring buffer test with constant data
   I then implemented a ring buffer in shared memory. When I write constant or incremental values from the PRU (0x00, 0x01, 0x02, …), the ARM side reads them back exactly as written. This part also works correctly.
3. Real BT.656 data capture
   With a real BT.656 video source (camera), I can see perfectly valid headers on the ARM side, such as:
   - 0xFF 0x00 0x00 0x80 (SAV)
   - 0xFF 0x00 0x00 0xC7 (EAV)
   I never see corrupted headers (for example, 0xFF 0xZZ 0xYY). However, I do not receive a complete BT.656 line.
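
One way to check the header bytes more strictly is to decode the fourth byte (XY) of each timing reference code. The sketch below follows the standard BT.656 layout (bit 7 set, then F, V, H, then four protection bits); the struct and function names are hypothetical and not taken from the firmware in this thread. Under that layout, H = 0 marks SAV and H = 1 marks EAV, so 0xC7 decodes as SAV of the second field, and the EAV codes for active lines would be 0x9D and 0xDA. This assumes the camera follows the standard layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of the fourth byte (XY) of a BT.656 timing reference code. */
struct bt656_xy {
    bool f;      /* field bit: 0 = first field, 1 = second field */
    bool v;      /* 1 during vertical blanking */
    bool h;      /* 0 = SAV (start of active video), 1 = EAV */
    bool valid;  /* bit 7 is set and the protection bits match F/V/H */
};

/* Decode XY and verify its protection bits (hypothetical helper). */
static struct bt656_xy bt656_decode_xy(uint8_t xy)
{
    struct bt656_xy r;

    r.f = (xy >> 6) & 1;
    r.v = (xy >> 5) & 1;
    r.h = (xy >> 4) & 1;

    /* Protection bits: P3 = V^H, P2 = F^H, P1 = F^V, P0 = F^V^H. */
    uint8_t p = (uint8_t)(((r.v ^ r.h) << 3) | ((r.f ^ r.h) << 2) |
                          ((r.f ^ r.v) << 1) | (r.f ^ r.v ^ r.h));

    r.valid = (xy & 0x80) && ((xy & 0x0F) == p);
    return r;
}
```

Logging F, V, and H for every header found in the ring buffer would show directly which lines and which field the captured data belongs to.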

In square-pixel mode, one BT.656 line should look like this (the header values shown are examples and may differ):
- SAV header: 4 bytes (0xFF 0x00 0x00 0x80)
- Active video: 1536 bytes (e.g. 0x80 0xYY 0x80 0xZZ, …)
- EAV header: 4 bytes (0xFF 0x00 0x00 0xC7)
- Blanking video: 344 bytes (e.g. 0x80 0x10, …)
Total per line: 1888 bytes.

What I actually observe is:
- SAV header
- 896 bytes of active video
- EAV header
- 344 bytes of blanking video
Total: 1248 bytes.

In other cases, it looks like a SAV or EAV header followed by 1244 bytes of mixed active and blanking data, again totaling 1248 bytes. In all cases, 640 bytes per line are missing.
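
The line arithmetic above can be written down as a small sketch. The macro and function names are hypothetical; the byte counts and the 29.5 MHz byte clock are the ones quoted in this post. It also converts a byte count into time on the wire, which shows that a full 1888-byte line lasts 64 µs and that the 640 missing bytes correspond to roughly 21.7 µs of every line.

```c
#include <assert.h>
#include <stdint.h>

#define BT656_SAV_BYTES       4u
#define BT656_EAV_BYTES       4u
#define BT656_ACTIVE_BYTES    (768u * 2u)  /* 768 pixels, 2 bytes per pixel */
#define BT656_BLANKING_BYTES  344u
#define BYTE_CLOCK_HZ         29500000.0   /* one byte per CLK rising edge */

/* Total bytes in one line for a given number of active-video bytes. */
static uint32_t bt656_line_bytes(uint32_t active_bytes)
{
    return BT656_SAV_BYTES + active_bytes +
           BT656_EAV_BYTES + BT656_BLANKING_BYTES;
}

/* Time on the wire, in microseconds, for a given number of bytes. */
static double bt656_bytes_to_us(uint32_t bytes)
{
    assert(BYTE_CLOCK_HZ > 0.0);
    return (double)bytes / BYTE_CLOCK_HZ * 1e6;
}
```

For example, bt656_line_bytes(BT656_ACTIVE_BYTES) gives the expected 1888 bytes and bt656_line_bytes(896) gives the observed 1248 bytes.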

Because the headers are always correct, I assume the PRU is reading the bytes correctly. It seems more likely that something is going wrong when the ARM side reads from the ring buffer.
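
If the ARM side is the suspect, it helps to know how long the ring buffer can hold data before the PRU overwrites it. The sketch below is only an estimate under assumptions: the ring size is not stated in this thread, so 16384 bytes in the usage note is a hypothetical value, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Time, in microseconds, until the producer wraps the whole ring once. */
static double ring_wrap_time_us(uint32_t ring_bytes, double byte_rate_hz)
{
    assert(byte_rate_hz > 0.0);
    return (double)ring_bytes / byte_rate_hz * 1e6;
}

/* Number of video lines that fit in the ring at once. */
static double ring_capacity_lines(uint32_t ring_bytes, uint32_t line_bytes)
{
    assert(line_bytes > 0u);
    return (double)ring_bytes / (double)line_bytes;
}
```

With a hypothetical 16384-byte ring at 29.5 MHz, ring_wrap_time_us(16384, 29.5e6) is about 555 µs, or fewer than 9 lines of 1888 bytes. Any pause of the user-space reader longer than that window would lose data without corrupting individual headers.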

Do you have any suggestions on what might cause this behavior?

With best regards,
Mikhail

mikhail_svirydau · January 18, 2026, 2:18pm

Hi everyone,

I’d like to correct a mistake in my earlier post. In the first experiment, I stated that the stable square wave was at 2 × 29.5 MHz, but the correct frequency is 29.5 MHz / 2. This is because the output PRU pin is toggled.

Specifically, each time a transition of the CLK signal from 0 to 1 is detected on the input pin, the output pin is toggled.

Best regards,
Mikhail

Juvinski · January 18, 2026, 8:04pm

Hi @mikhail_svirydau ,

I understand your scenario now.

What you are seeing is something I always struggle with a little on the PRU.

I assume your code is some kind of loop (for, while, or similar).

You are looping and reading the pins in sequence. I don’t know what arithmetic or other operations you perform internally, but let me clarify one concept. With the PRU running at 333 MHz, each instruction takes one CPU cycle, so if your code does a simple if comparing two values, you lose one cycle per read, and over 8 reads you lose 8 cycles. I can’t be sure, but when you say you are losing 640 bytes, that works out to 80 bytes per channel.

In my case, the problem was generating output signals. For some of them, instead of a timer + time approach, I switched to driving the output of each channel directly. Losing 16 cycles with the direct approach was better than the timer + time approach, because my 1 and 0 bits were unbalanced: a 1 bit required more cycles than a 0 bit.

If you can share your logic, it would be easier to check.

mikhail_svirydau · January 18, 2026, 10:26pm

Hi,

Below is my firmware loop. It is intentionally very simple, as it cannot exceed 8–9 assembly instructions per loop iteration. At 333 MHz (≈3 ns per instruction), this results in a worst-case execution time of about 30 ns per iteration:
9 × 3 ns = 27 ns, plus ~3 ns for memory access.

There are two “write to memory” instructions, each requiring three extra cycles, although some of this latency is hidden by ALU operations. Overall, one loop iteration takes ~30 ns, which fits within the 33.9 ns period of a 29.5 MHz clock. Notably, based on my experiments, adding even a single additional NOP to this loop results in completely corrupted data.
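
The timing budget described above can be expressed in PRU cycles rather than nanoseconds. This is a rough sketch with hypothetical function names; the 333 MHz and 29.5 MHz figures and the estimate of about 10 cycles per iteration (~30 ns) come from this post, and real memory stalls may differ.

```c
#include <assert.h>

/* PRU cycles available between two rising edges of the byte clock. */
static double cycles_per_sample(double pru_hz, double clk_hz)
{
    assert(pru_hz > 0.0 && clk_hz > 0.0);
    return pru_hz / clk_hz;
}

/* Cycles left after one loop iteration; negative means samples are missed. */
static double slack_cycles(double pru_hz, double clk_hz, double loop_cycles)
{
    return cycles_per_sample(pru_hz, clk_hz) - loop_cycles;
}
```

At 333 MHz there are about 11.3 cycles per byte, so a 10-cycle iteration leaves only about 1.3 cycles of slack. That is a very thin margin, which is roughly in line with the observation that a single extra NOP breaks the capture.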

do {
    // temp1 is set to 0 via a function call. If I use a literal instead,
    // clpru generates two instructions:
    // 1) load the literal into a register
    // 2) copy that register into idx (also a register).
    // idx, temp, and temp1 are all int16_t.

    idx = temp1;

    do {
        asm volatile (
            "||LL1||: QBBS ||LL1||, r31, 0x0c \n"
            "||LL2||: QBBC ||LL2||, r31, 0x0c \n"
        );

        idx_ptr = idx;
        ptr[idx] = (uint8_t)__R31;
        ++idx;

        // temp is assigned RING_SIZE (see note about temp1 above).
        // The condition idx < temp compiles into two instructions
        // (subtract + jump), whereas idx != temp compiles into one.
    } while (idx != temp);

} while (1);

As you can see, there is no need to cycle through individual pins. All pins can be sampled in parallel by simply reading the R31 register. This register also contains the value of the PRU pin carrying the CLK signal (P1.20, or bit 12 of R31 in my case). The code is a modification of the toggle_led example from the PRU lab and only reads pins—it does not write to them.

On the ARM side, the logic is the inverse of the firmware: it reads the ring buffer index and then consumes bytes from the ring buffer, advancing its own index until it catches up with the firmware index.
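
The consumer logic described above can be sketched as follows. This is not the actual application code: the function names, parameters, and the free-running counter in the second helper are assumptions made for illustration. The first function drains the ring up to the published index. The second shows one way to detect an overrun, assuming the PRU also published a free-running 32-bit count of bytes written, which the firmware in this thread does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy bytes from tail up to the producer's head; returns bytes copied. */
static size_t ring_drain(const volatile uint8_t *ring, size_t ring_size,
                         size_t head, size_t *tail,
                         uint8_t *out, size_t out_cap)
{
    assert(ring_size > 0 && head < ring_size && *tail < ring_size);

    size_t n = 0;
    while (*tail != head && n < out_cap) {
        out[n++] = ring[*tail];
        *tail = (*tail + 1) % ring_size;
    }
    return n;
}

/*
 * Bytes overwritten before they were read, given free-running counters.
 * The unsigned subtraction stays correct across 32-bit wraparound.
 */
static uint32_t ring_overrun_bytes(uint32_t produced, uint32_t consumed,
                                   uint32_t ring_size)
{
    uint32_t pending = produced - consumed;
    return pending > ring_size ? pending - ring_size : 0u;
}
```

A wrapped index alone cannot reveal a lapped consumer: if the producer writes a full ring plus two bytes between two polls, ring_drain sees the head advance by only two positions and the rest is lost silently. This is one possible explanation for lines that look valid but are shorter than expected, not a confirmed diagnosis.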

At this point, I’m not sure what is wrong. It may be a logic error, but if that were the case I would expect to see corrupted BT.656 headers, which I do not. Likewise, I would expect random byte patterns rather than the expected 80 XX 80 YY 80 ZZ … sequence. Instead, the data appears correct, except that each BT.656 line is consistently 640 bytes shorter than expected.

As a next step, I plan to conduct an experiment using a BBB rev A5A to generate both the CLK signal and the data in a controlled, “regular” way, and then observe where the data loss occurs. On this board the PRU runs at 200 MHz, giving 5 ns per instruction. I believe I can implement a 7-instruction loop per cycle, resulting in a ~35 ns loop period (≈28.6 MHz). While this is not exactly the designed 29.5 MHz, it should provide regular and reliable data transmission and allow me to isolate where the problem is occurring.
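
For the controlled experiment described above, an incrementing byte pattern makes losses easy to locate on the ARM side. The checker below is a minimal sketch with a hypothetical function name; it assumes the generator sends 0x00, 0x01, …, 0xFF and then wraps. Because the pattern is only 8 bits wide, the size of each gap is known only modulo 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count discontinuities in a stream that should increment by 1 per byte
 * (wrapping from 0xFF to 0x00). The total number of skipped bytes,
 * modulo 256 per gap, is stored in *bytes_missing.
 */
static size_t count_gaps(const uint8_t *buf, size_t len, size_t *bytes_missing)
{
    assert(buf != NULL && bytes_missing != NULL);

    size_t gaps = 0;
    *bytes_missing = 0;

    for (size_t i = 1; i < len; i++) {
        uint8_t step = (uint8_t)(buf[i] - buf[i - 1]);
        if (step != 1) {
            gaps++;
            *bytes_missing += (uint8_t)(step - 1);
        }
    }
    return gaps;
}
```

Recording the buffer offset of each gap as well would show whether the loss happens at a fixed position in the ring or at regular time intervals.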

Best regards,
Mikhail