Enhanced Media Experience with AI-Powered Commercial Detection and Replacement

foxsquirrel · February 28, 2024, 12:26pm

Please post a screenshot; YouTube blocks me from viewing it.

jkridner · February 28, 2024, 3:02pm

“I’m confused if it will be a mentored project”

I am interested in mentoring, but I am looking for support from established developers in the Beagle community.

foxsquirrel · February 28, 2024, 3:35pm

What board do you want to use for this?
Do you know if TI is planning to release a Vulkan driver for TDA4VM?

jkridner · February 28, 2024, 4:13pm

BeagleBone AI-64 makes the most sense to me. It would be good to add BeagleV-Ahead with its DL acceleration, but that isn’t quite as mature right now.

Don’t think so, but I don’t think we need it. The TFLite or OpenVX interface should be enough. They have GStreamer plugins we’ve integrated into our Debian image as well.

Aryan_Nanda · February 28, 2024, 4:48pm

I compressed the video to fit the forum's 4 MB upload limit, so the quality is low. For better quality, see the link below. You may be able to access it now.

Timestamps of transitions:

3:50
4:35
5:55

YouTube video link → https://youtu.be/hoKE2dr2nT4

Aryan_Nanda · February 28, 2024, 5:03pm

I downloaded random YouTube videos of different sports, merged them using an online tool, and then ran inference along with video processing. Only the swimming videos are displayed as-is; the rest get blurred.
Inference is slow because I didn’t use any accelerator, but using the BeagleBone AI-64 hardware accelerator through TFLite should give a much better inference time.

foxsquirrel · February 28, 2024, 5:23pm

That is exactly what I thought.

Aryan_Nanda · March 9, 2024, 12:59pm

Here, it should be Caffe (Convolutional Architecture for Fast Feature Embedding), right?

jkridner · March 9, 2024, 9:47pm

I might have confused it with Keras.

How did you train your model, i.e., how did you choose the optimizer?

https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough

Aryan_Nanda · March 10, 2024, 9:39am

I used the SGD (Stochastic Gradient Descent) optimizer with momentum = 0.9 and learning rate = 0.0001.

Aryan_Nanda · March 10, 2024, 9:43am

SGD with a small learning rate makes small, stable updates during training, which is beneficial when fine-tuning pretrained models because the pretrained weights are less likely to be overwritten.
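
For readers following along, the update rule behind those settings can be sketched in plain Python. This is an illustrative sketch of classical SGD with momentum, not the training code used for the model; the helper name `sgd_momentum_step` and the toy loss f(w) = w^2 are assumptions made for the example.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.0001, momentum=0.9):
    """One classical SGD-with-momentum update (hypothetical helper)."""
    # The velocity accumulates a decaying sum of past gradients.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    # The weights move along the velocity rather than the raw gradient.
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy problem (assumption): minimise f(w) = w^2, whose gradient is 2w.
weights, velocity = [1.0], [0.0]
for _ in range(1000):
    grads = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(weights, grads, velocity)
```

With a learning rate this small each step changes the weights only slightly, which is the behaviour you generally want when nudging a pretrained network.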

jkridner · March 11, 2024, 4:46pm

Have you made any progress on your proposal? Do you have thoughts on next steps?

Aryan_Nanda · March 12, 2024, 9:46am

Hello @jkridner,
I am currently in the middle of my mid-semester exams. I will begin actively working on my proposal starting this Saturday.

I think figuring out the pipeline for video processing based on video classification using GStreamer and OpenCV would be a good next step. At a high level it should be something like this:

Capture frames from the video using OpenCV-Python.
Preprocess each frame and run inference on it with the ML model.
Blur the original frame if it is a commercial.
Stream the frames via GStreamer RTSP, using OpenCV.
Open VLC to watch the frames in real time.
Here, the BeagleBone AI-64 will be used for accelerated real-time inference of the ML model.
What do you think about this?
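
The capture, classify, and blur steps listed above can be sketched without any dependencies. This is a minimal sketch under stated assumptions: frames are plain lists of grayscale pixel rows, `box_blur` stands in for a real blur such as OpenCV's, and `is_commercial` is a hypothetical callable standing in for the ML model. Streaming and display are left out.

```python
def box_blur(frame, radius=1):
    """Blur a grayscale frame (list of rows of ints) with a mean filter."""
    height, width = len(frame), len(frame[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            # Average the pixels in the window, clipped at the frame edges.
            window = [
                frame[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(window) // len(window))
        blurred.append(row)
    return blurred


def process_stream(frames, is_commercial):
    """Yield each frame, blurred when the classifier flags it as a commercial."""
    for frame in frames:
        yield box_blur(frame) if is_commercial(frame) else frame
```

Because `process_stream` is a generator, frames are handled one at a time, which matches how a live capture source would deliver them.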

jkridner · March 12, 2024, 11:04am

“I think figuring out the pipeline for video processing based on video classification using GStreamer and OpenCV would be a good next step. At a high level it should be something like this:

Capture frames from videos using OpenCV-Python.”

Because we want real-time operation, we might want to take OpenCV/Python out of the critical path. If we have a pure C++ GStreamer pipeline from input to output, that might be better. There should be a way to fetch video frames and the audio stream and to trigger the blurring and audio replacement.

A USB3 HDMI capture dongle is probably the easiest real-time input.

“Do some processing on it and then inferencing it using the ML model.

Blur the original frame if it is a commercial.

stream these frames via GStreamer RTSP, using OpenCV.”

Why RTSP? We should render the video locally.

“open vlc player to watch the real-time frames.”

Why not render the frames to DisplayPort from GStreamer?

Aryan_Nanda · March 13, 2024, 8:01am

• Opting for a pure C++ GStreamer pipeline could indeed improve real-time performance significantly.
• Since we would be using a pure C++ pipeline, inference would have to go through the TFLite C++ API.
I have experience working with C++ codebases. Recently, I merged this PR in one repository which uses OpenCV in C++.

Rendering the frames directly to DisplayPort from GStreamer does indeed seem like a more efficient approach for viewing real-time frames. It eliminates the need for setting up an RTSP server and configuring external players like VLC, simplifying the setup process and reducing potential latency. I will do more research on it.

Aryan_Nanda · March 13, 2024, 8:37am

What are your views on this GStreamer pipeline?

Use the v4l2src element to grab video frames from a USB3 HDMI capture dongle.
Do video preprocessing using elements like videoscale and videoconvert.
Pass the frames to appsink name=videosink, which acts as a sink for video, and appsink name=audiosink, which acts as a sink for audio.
Load the model and run inference using tensorflowlite model=model.tflite input=frame output=class_probabilities.
We can do parallel processing using the tee element. One branch of the pipeline can be dedicated to commercial detection, where the ML model processes the video frames to identify commercial segments. Another branch can handle the rendering of the video stream to display the output locally.
After inference, we can do postprocessing and then render the output frame.
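
The tee layout described above can be written down as a gst-launch style description string. This is only a sketch of the video side: the device path `/dev/video0`, the 224x224 RGB input size, and the helper name `build_pipeline` are assumptions, the audio branch is omitted, and the inference step itself is left to whatever consumes the appsink.

```python
def build_pipeline(device="/dev/video0", width=224, height=224):
    """Return a pipeline description with a display branch and an inference branch."""
    # Capture from the V4L2 device and split the stream with a tee.
    source = f"v4l2src device={device} ! videoconvert ! tee name=t"
    # Branch 1: render the video locally.
    display = "t. ! queue ! autovideosink"
    # Branch 2: scale frames to the model input size and hand them to the app.
    # The leaky queue and drop=true keep a slow model from stalling the display.
    inference = (
        "t. ! queue leaky=downstream max-size-buffers=1 ! videoscale ! videoconvert ! "
        f"video/x-raw,format=RGB,width={width},height={height} ! "
        "appsink name=videosink max-buffers=1 drop=true"
    )
    return " ".join([source, display, inference])
```

A string like this can be handed to `Gst.parse_launch` (or its C/C++ equivalent `gst_parse_launch`), after which the application pulls frames from the element named `videosink`.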

Aryan_Nanda · March 18, 2024, 1:23am

Hello, @jkridner, @lorforlinux and other GSoC mentors!
Please review my proposal here.
Please tell me what you think of my ideas and whether I need to do more research in some areas. Also, check the timeline I have proposed and suggest changes, if any.

Aryan_Nanda · March 21, 2024, 3:57pm

NNStreamer is specifically designed for efficient and flexible data streaming in machine learning applications, so I think a combination of GStreamer and NNStreamer might be appropriate for our task. We can use GStreamer for video playback and video/audio processing, and NNStreamer for running the neural network on video frames to detect commercials.

Please give your thoughts on this:

Use GStreamer to read the video file and extract individual frames.
Process each frame using NNStreamer to perform commercial detection using a pre-trained neural network model.
Based on the commercial detection results, apply the necessary actions (e.g., blur video frames and replace audio for commercial segments).
Use GStreamer to display the processed video with the applied effects.
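
The first two steps (read the file, run each frame through the model) could look roughly like the description below. It is a sketch, not a tested pipeline: the file names, the 224x224 RGB input size, and the helper name `nnstreamer_pipeline` are assumptions, it covers only the video detection branch, and a float model would likely need a `tensor_transform` stage before the filter.

```python
def nnstreamer_pipeline(video="input.mp4", model="model.tflite", size=224):
    """Return a description string for the detection branch only."""
    stages = [
        f"filesrc location={video}",   # read the video file
        "decodebin",                   # decode it into raw frames
        "videoconvert",
        "videoscale",
        f"video/x-raw,format=RGB,width={size},height={size}",  # model input shape
        "tensor_converter",            # wrap raw video frames as tensors
        f"tensor_filter framework=tensorflow-lite model={model}",  # run the model
        "tensor_sink name=result",     # hand the output tensor to the application
    ]
    return " ! ".join(stages)
```

The application would then read class probabilities from the element named `result` and decide whether to blur the frame and replace the audio.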

If you also like this idea, I will add it to my GSoC proposal as well.

foxsquirrel · March 21, 2024, 4:15pm

Just cut the commercial out completely and splice the ends. Assume 5 minutes is the most commercial time you will ever be exposed to, so buffer, cut, and then play with an overall delay of 5 minutes.

This way you will have a continuous experience. Programming interrupted by a commercial or by a blank/fuzzy screen would be equally annoying to the end user.

The AI64 can handle the 5 or so minutes of buffering when you have an NVMe drive on the board.
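
The buffer-and-splice idea can be sketched as a small delay buffer. This is a minimal in-memory sketch with hypothetical names (`SpliceBuffer`, `tick`); a real implementation would more likely buffer compressed video on the NVMe drive. At an assumed 30 frames per second, a 5 minute delay corresponds to 5 * 60 * 30 = 9000 frames.

```python
from collections import deque


class SpliceBuffer:
    """Delay playback by a fixed number of frames and drop commercial frames."""

    def __init__(self, delay_frames):
        self.delay_frames = delay_frames
        self.frames = deque()
        self.playing = False

    def tick(self, frame, is_commercial):
        """Feed one captured frame; return the frame to display now, or None."""
        if not is_commercial:
            self.frames.append(frame)      # commercial frames are never stored
        if not self.playing and len(self.frames) >= self.delay_frames:
            self.playing = True            # the initial delay has been built up
        if self.playing and self.frames:
            return self.frames.popleft()   # keeps playing while a commercial airs
        return None                        # still filling, or the buffer ran dry
```

During a commercial break nothing is added, so playback drains the buffer; the viewer sees uninterrupted program frames as long as the total commercial time stays below the delay.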

foxsquirrel · March 22, 2024, 12:34am

Also, if the project does get awarded to you, make sure you use features of the AI64 SoC that are only available on the TI chips.