Enhanced Media Experience with AI-Powered Commercial Detection and Replacement

Deep-learning based commercial detection and real-time replacement

Detailed summary below.

Goal: Build a deep-learning model, training data set and training scripts, and a run-time for detection and modification of the video stream
Hardware Skills: Ability to capture and display video streams
Software Skills: Python, TensorFlow, GStreamer, OpenCV
Possible Mentors: @jkridner, @lorforlinux
Expected size of project: 350 hours
Rating: medium
Upstream Repository: TBD

  • TBD

Project Overview

This idea proposal was aided by ChatGPT-4.

This project aims to develop an innovative system that uses neural networks for detecting and replacing commercials in video streams on BeagleBoard hardware. Leveraging the capabilities of BeagleBoard’s powerful processing units, the project will focus on creating a real-time, efficient solution that enhances media consumption experiences by seamlessly integrating custom audio streams during commercial breaks.


  • Develop a neural network model: Combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to analyze video and audio data, accurately identifying commercial segments within video streams.
  • Implement a GStreamer plugin: Create a custom GStreamer plugin for BeagleBoard that utilizes the trained model to detect commercials in real-time and replace them with alternative content or obfuscate them, alongside replacing the audio with predefined streams.
  • Optimize for BeagleBoard: Ensure the entire system is optimized for real-time performance on BeagleBoard hardware, taking into account its unique computational capabilities and constraints.

Technical Approach

  • Data Collection and Preparation: Use publicly available datasets to train the model, with manual marking of commercial segments to create a robust dataset.
  • Model Training and Evaluation: Utilize TensorFlow to train and evaluate the CNN-RNN model, focusing on achieving high accuracy while maintaining efficiency for real-time processing.
  • Integration with BeagleBoard: Develop and integrate the GStreamer plugin, ensuring seamless operation within the BeagleBoard ecosystem for live and recorded video streams.
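To make the "manual marking of commercial segments" step concrete, here is a minimal sketch of how hand-marked time ranges could be turned into per-clip training labels. The clip length, the overlap threshold, and the `label_clips` name are all assumptions chosen for illustration, not part of the proposal.

```python
from typing import List, Tuple

# (start_s, end_s) of one hand-marked commercial; segments are assumed
# not to overlap each other.
Segment = Tuple[float, float]


def label_clips(duration_s: float, commercials: List[Segment],
                clip_s: float = 2.0, min_overlap: float = 0.5) -> List[int]:
    """Return one 0/1 label per fixed-length clip (1 = commercial)."""
    labels = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_s, duration_s)
        # Total time this clip overlaps any marked commercial segment.
        overlap = sum(max(0.0, min(end, c_end) - max(start, c_start))
                      for c_start, c_end in commercials)
        # Label the clip as a commercial if enough of it is covered.
        labels.append(1 if overlap >= min_overlap * (end - start) else 0)
        start += clip_s
    return labels


if __name__ == "__main__":
    # A 10 s recording with a commercial marked from 3.0 s to 7.0 s.
    print(label_clips(10.0, [(3.0, 7.0)]))
```

Labeling fixed-length clips rather than single frames keeps the labels aligned with how a video classifier would likely consume the stream, but the right clip length would have to be found by experiment.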

Expected Outcomes

  • A fully functional prototype capable of detecting and replacing commercials in video streams in real-time on BeagleBoard.
  • Comprehensive documentation and a guide on deploying and using the system, contributing to the BeagleBoard community’s knowledge base.
  • Performance benchmarks and optimization strategies specific to running AI models on BeagleBoard hardware.

Potential Challenges

  • Balancing model accuracy with the computational limitations of embedded systems like BeagleBoard.
  • Ensuring real-time performance without significant latency in live video streams.

Community and Educational Benefits

This project will not only enhance the media consumption experience for users of BeagleBoard hardware but also serve as an educational resource on integrating AI and machine learning capabilities into embedded systems. It will provide valuable insights into:

  • The practical challenges of deploying neural network models in resource-constrained environments.
  • The development of custom GStreamer plugins for multimedia processing.
  • Real-world applications of machine learning in enhancing digital media experiences.

Mentorship and Collaboration

Seeking mentorship from experts in the BeagleBoard community, especially those with experience in machine learning, multimedia processing, and embedded system optimization, to guide the project towards success.

When submitting this as a GSoC project proposal, ensure you clearly define your milestones, deliverables, and a timeline. Additionally, demonstrating any prior experience with machine learning, embedded systems, or multimedia processing can strengthen your proposal. Engaging with the BeagleBoard.org community through forums or mailing lists to discuss your idea and gather feedback before submission can also be beneficial.


I think we can achieve real-time performance by leveraging the BeagleBoard's hardware accelerators, such as the GPU, DSP, or AI accelerators. We can use TIDL or OpenCL for optimization, and offload computationally intensive operations such as neural network inference to those accelerators. Am I thinking on the right track, and can someone please point me to some resources?

I believe you are thinking down the right track, but I feel the most challenging aspects aren’t running the inference engine, but designing the model architecture (time component? audio component? etc.), training the model and integrating into a real-time framework that will make adjustments to the live video stream.

Have you been able to install that on Linux? They have Vcpkg, but it is an MS-sponsored tool… It looks like they are steering users towards Vcpkg, even though the most obvious choice would be CMake.

TI has gone away from using OpenCL for their machine learning accelerators. They used it on TDA3/AM5, but on TDA4/AM6 devices they are using OpenVX.

Probably best to focus on the TensorFlow Lite, ONNX, etc. interfaces.


I researched model architectures and could not find past research on commercial detection in videos. Then I wondered how ad blockers such as uBlock Origin for Chrome, or YouTube Vanced with its built-in ad blocker, block ads in videos, and I found that none of these blockers use ML to identify and block commercials.
How do they block commercials, then, if not with ML? This is what I found:
Ad blockers intercept network requests made by web browsers and analyze web page content to identify and block elements associated with advertising, such as ad scripts, images, and iframes. They use filter lists containing patterns or rules to determine which content to block, preventing ads from being displayed or executed in the browser.
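As a rough illustration of the filter-list idea described above, here is a minimal sketch that checks request URLs against wildcard rules. The rules, hosts, and function names are hypothetical, and real filter-list syntax is considerably richer than shell-style wildcards.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical rules written as shell-style wildcards for simplicity.
BLOCK_RULES = [
    "*://ads.example.com/*",   # everything served from an ad host
    "*/adframe/*",             # iframes under a known ad path
    "*/ad_script_*.js",        # ad scripts identified by file name
]


def should_block(url: str, rules: Iterable[str] = BLOCK_RULES) -> bool:
    """Return True if the URL matches any rule in the filter list."""
    return any(fnmatchcase(url, rule) for rule in rules)


def filter_requests(urls: Iterable[str]) -> List[str]:
    """Keep only the requests that no rule blocks."""
    return [u for u in urls if not should_block(u)]


if __name__ == "__main__":
    requests = [
        "https://video.example.com/stream/segment1.ts",
        "https://ads.example.com/banner.png",
        "https://cdn.example.com/js/ad_script_v2.js",
    ]
    print(filter_requests(requests))
```

This only works because the ad arrives as a separate, identifiable request. A commercial embedded in a broadcast stream has no such URL to match, which is why the content itself would have to be classified.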
So, if we want to use ML for commercial detection, that would be a totally new task in machine learning, and I think a fun one.
At a high level, I think we can achieve this by combining CNNs for spatial features in the video with RNNs such as LSTMs for temporal features in the audio, fusing them with an attention mechanism, and then using fully connected layers with a sigmoid activation for the final binary classification.
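To show what the fusion and classification step might look like numerically, here is a minimal pure-Python sketch. It assumes the video and audio branches have already produced fixed-size feature vectors; the `query`, `w_out`, and `b_out` values stand in for weights that would be learned in TensorFlow, and every name here is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_and_classify(video_feat: Sequence[float], audio_feat: Sequence[float],
                      query: Sequence[float], w_out: Sequence[float],
                      b_out: float) -> Tuple[float, List[float]]:
    """Attention-weighted fusion of two modality vectors, then a sigmoid unit.

    Returns (probability of "commercial", attention weights per modality).
    """
    feats = [video_feat, audio_feat]
    # One attention score per modality: similarity to a (learned) query vector.
    weights = softmax([dot(f, query) for f in feats])
    # Fused vector is the attention-weighted sum of the modality vectors.
    fused = [sum(w * f[i] for w, f in zip(weights, feats))
             for i in range(len(query))]
    # A single fully connected output unit with sigmoid activation.
    logit = dot(fused, w_out) + b_out
    return 1.0 / (1.0 + math.exp(-logit)), weights


if __name__ == "__main__":
    prob, attn = fuse_and_classify([1.0, 0.0], [0.0, 1.0],
                                   query=[10.0, 0.0], w_out=[2.0, -2.0], b_out=0.0)
    print(prob, attn)
```

In a trained model the attention weights would shift between video and audio depending on which is more informative for a given clip; here they are driven only by the placeholder query vector.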
If needed, we can also find audio transcripts using Whisper V3 and then do topic modeling. This could give additional context for commercial detection.

Yes, you can install OpenCL on Linux using native package managers like apt on Debian-based distributions or others specific to your Linux distribution. While Vcpkg is more commonly associated with Microsoft and primarily used on Windows, it’s possible to use it on Linux as well. However, for OpenCL, the standard approach on Linux is to use native package managers for installation.

I will also research Visual transformers to see if we can use them for commercial detection.
Am I on the right track?

We don’t need OpenCL. Just focus on TensorFlow.


So we are on the same page, I am assuming this is the correct starting point.

That is a good starting point for model execution on an embedded device. BeagleBone AI-64 has its own version of TensorFlow Lite to access its native accelerators, but the difference is how it gets installed, not how to run it. If you install and run TensorFlow Lite in any way that works, you’ll be on the right path for model execution. The Edge AI BeagleBone AI-64 images have TensorFlow Lite already installed with acceleration enabled.

You can also look at the way the TI Edge AI code integrates with GStreamer, which is useful. I’d say you wouldn’t want to build flows the way they do with .yaml files, but the underlying Python and GStreamer plugins are indeed useful.

Model architecture and training are separate from executing the model. I think figuring out how you are going to structure the model and build up a set of training data is the most significant part of this project.

Thank you for that, to keep everyone on the same page please post links to the tools you feel will function together.

Is this found in the TI sdk for the TDA4?

@jkridner, considering the lack of existing work on commercial detection models, can we consider this model architecture and this project research-based? As we design and implement our model architecture, we will likely need to experiment with various approaches, parameters, and techniques to find what works best for commercial detection. This iterative process of experimentation is characteristic of research.

SDK for TDA4


There is certainly some amount of research to be done, but I would think of this project as more about developing frameworks. I think we will largely leave to others defining what the ideal network is and how to generate sufficient training data, but we should tackle the “just enough” approach regarding the model generation and establish some tools future researchers could use to tackle this problem and see real-time results, perhaps even introducing some level of reinforcement learning during real-time usage.

Take a look at existing work on video classification, such as:

If you look at the training videos and model they use, it seems to me that it should be possible to accumulate some amount of open data from live video streams that includes commercial segments, and to train this same model on it across various show types and commercial styles.

Perhaps a good starting point is just running TensorFlow’s video classification model as-is, but adding the integration around it to transform live video streams based on specific classifications. Imagine that we’ve specified we want “Sports” classifications to stream as-is and all other categories to be blurred and audio replaced with our own audio content. Can we put together such a system and use it to watch “Sports” without disruption?
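The "Sports passes through, everything else is blurred" idea can be sketched as a small decision policy sitting between the classifier and the stream. The class name, the action strings, and the majority-vote window below are assumptions for illustration; the smoothing is one possible way to stop a single misclassified frame from toggling the output.

```python
from collections import Counter, deque
from typing import Iterable


class StreamPolicy:
    """Decide per classification whether to pass the stream or obscure it."""

    def __init__(self, allowed: Iterable[str] = ("Sports",), window: int = 5):
        self.allowed = set(allowed)
        # Only the most recent `window` labels take part in the vote.
        self.recent = deque(maxlen=window)

    def decide(self, label: str) -> str:
        self.recent.append(label)
        # A majority vote over recent labels suppresses one-off misclassifications.
        majority, _ = Counter(self.recent).most_common(1)[0]
        if majority in self.allowed:
            return "pass"
        return "blur_and_replace_audio"


if __name__ == "__main__":
    policy = StreamPolicy(window=3)
    for label in ["Sports", "Sports", "News", "Sports", "News", "News"]:
        print(label, "->", policy.decide(label))
```

A larger window gives a steadier picture at the cost of reacting more slowly when the content really does change.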

So, yes, there is a real amount of research, but let’s not get bogged down in writing a thesis. Create an open source tool that allows applications of video classification models to real-time data streams (a few seconds of lag is fine, but we want to avoid hiccups in the stream) for the purpose of encouraging further research and training, as well as introducing new people to real-time video processing.
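Since a few seconds of lag is acceptable, one way to avoid hiccups is to hold frames in a fixed-length delay buffer so a late classification can still be applied before the frames are shown. Below is a minimal sketch; the `DelayBuffer` class and its methods are hypothetical, and the buffer length is an assumption (at 30 frames per second, a 3 second delay would be 90 frames).

```python
from collections import deque
from typing import Any, Optional, Tuple


class DelayBuffer:
    """Hold frames for a fixed number of slots before emitting them.

    The delay gives the classifier time to label a frame before it is shown,
    so a decision can still reach frames that arrived before it was made.
    """

    def __init__(self, delay_frames: int):
        self.delay_frames = delay_frames
        self.pending = deque()  # entries are [frame, obscure] pairs

    def push(self, frame: Any, obscure: bool) -> Optional[Tuple[Any, bool]]:
        """Add a frame; return the oldest (frame, obscure) once the delay is filled."""
        self.pending.append([frame, obscure])
        if len(self.pending) > self.delay_frames:
            old_frame, old_obscure = self.pending.popleft()
            return old_frame, old_obscure
        return None

    def mark_pending(self, obscure: bool) -> None:
        """Retroactively apply a late classification to every buffered frame."""
        for entry in self.pending:
            entry[1] = obscure


if __name__ == "__main__":
    buf = DelayBuffer(delay_frames=2)
    print(buf.push("f0", False))   # still filling the delay
    print(buf.push("f1", False))   # still filling the delay
    buf.mark_pending(True)         # classifier decides a commercial has started
    print(buf.push("f2", True))    # f0 leaves the buffer already marked
```

Because one frame leaves for every frame that enters once the buffer is full, the output rate stays constant; the cost is a fixed latency equal to the buffer length.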

Does this help? The research side of things is very open-ended. The execution of this GSoC contribution will have a very strict deadline. A set of open source code with which others can reasonably easily reproduce (i.e., CI-based build, apt install, my-cool-demo) real-time model execution on live video streams is the requirement.

We’ll want to use the Debian-based SDK, not TI’s.

See Edge AI — BeagleBoard Documentation

We are continuously trying to educate TI software developers about software. They are embedded developers typically trained in electrical engineering, so you have to cut them a bit of slack for not being computer scientists working with experts in DevOps and applications development through their careers. Somehow, they thought it was alright to call executable bits compiled against a Yocto run-time from an Ubuntu run-time running in a Docker container. Eventually, they’ll figure out they just need to release source we can compile within Ubuntu and Debian directly.

They’ve started to put a bunch of stuff up at Texas Instruments · GitHub. However, this script right here shows they haven’t found a clue yet:

Some highlights…

export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1


But, that’s nothing compared to this gem:

cd /usr/lib/
if [  ! -L libti_rpmsg_char.so.0 ];then
    ln -s /host/usr/lib/libti_rpmsg_char.so
    ln -s /host/usr/lib/libti_rpmsg_char.so.0

If you are scratching your head, I’m proud of you. You probably never thought of this little number used to create the container from the image:

    docker run -it \
        -v /dev:/dev \
        -v /opt:/opt \
        -v /:/host \
        -v /mnt:/mnt \
        --privileged \
        --network host \
        edgeai_tidl_tools_u20 $USE_PROXY

:man_facepalming: :man_facepalming: :man_facepalming: :punch:

That was what I was talking about regarding calling Yocto binaries from Ubuntu.

And, of course, none of that script is performed inside a Dockerfile so that it exists within the original container image. Nope. You are supposed to “Run the commands inside container”:

cd /host/<path_to_edge_ai_tidl_tools>/dockers/ubuntu_20.04
source container_stup.sh # This will take care of additional dependencies  (opencv not included)
source container_stup.sh --opencv_install 1 # This will take care of additional dependencies  (opencv included)

I guess they didn’t try those instructions, but one can imagine what is expected.

Also, every one of the interesting packages is binary in this setup pulled from a TI server, so you cannot build it on another system, like Debian.

None of it has packages, so you cannot uninstall it.

It isn’t all doom-and-gloom as they have provided instructions for rebuilding much of those bits within qemu. I guess they aren’t aware of very powerful Arm machines being readily available? Anyway, you can find those instructions at:

I can’t make much sense of it, but I can see it is there.

We unrolled this mess once to build the 8.2 bits on Debian. It should be enough for this project. I’m hopeful we’ll get this cleared up for newer releases.

That is me too!!

TI really needs to get their act together. It is really critical to have a vendor that has roots in the USA; pretty sure you understand, and no need to comment on that.

Thanks, @jkridner, this helps a lot.
I will start looking into the resources you provided.

After going through various video classification models (MoViNets, Conv3D, Conv2D, ResNet50, Conv+LSTM), I found that MoViNet-(A0-A2)Stream would be a great choice for the project, as these are the smaller models that can feasibly run in real time. Other models like Conv3D are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to run on edge devices. Conv2D models, on the other hand, have low accuracy because they do not process the temporal features of videos. MoViNets maintain accuracy comparable to Conv3D at a much lower computational cost, which makes them suitable for edge devices.

I also went through many datasets in search of a commercials category, and eventually I found the YouTube-8M Dataset to be a good choice, as it includes Television Advertisement as one of its categories. We can filter the dataset to focus specifically on the advertisement-related videos based on their labels. This article explains the feature extraction technique in detail. In short, they used a pre-trained network to reduce the dimensionality of the video to 1024 features per frame. They also included a 128-dimensional feature vector for audio, derived from a spectrogram computed with an STFT.

We can fine-tune the MoViNet model using transfer learning as mentioned here. MoViNet gives good results on live streams and works well on edge devices.
So, once our model is ready, building the GStreamer plugin shouldn’t take much time, as I have done image processing based on image classification with OpenCV in one of my past projects. From a deep-learning point of view, the only differences in video processing are the temporal dependencies and the audio component.
We can also consider building our own dataset of commercial and normal videos; since our problem is binary classification (commercial or non-commercial), that would also work well.

If you could share your thoughts on my current implementation ideas, that would be great! I would really appreciate your guidance on how to proceed.

Looking forward to your response!

@jkridner and @foxsquirrel, I found the idea of removing TV Commercials very fascinating. Indeed, this is a real issue. Having a choice in the content we consume is important, and it’s understandable to seek alternatives when the current offerings don’t align with personal preferences.
I am ready to invest more time into this idea. However, it is not listed in the official BeagleBoard ideas list, so I’m not sure whether it will be a mentored project for GSoC’24. There has also been no activity in this channel for the past 5 days.
@jkridner, I implemented the sports classification model (which you mentioned above as a good starting point) using the ResNet-50 architecture and then did the image processing based on the classification results using OpenCV. You can see the results here. Only sports classified as swimming are played unmodified; if any other sport is detected, the video is blurred.
GitHub link: GitHub - AryanNanda17/VideoProcessing-Based-on-Video-classifcation
Can you please confirm the status of this project? If no mentor will be assigned to it, I will invest my time in other projects.