Weekly Progress Report Thread: Enhanced Media Experience with AI-Powered Commercial Detection and Replacement

Week 10 Updates:

  • Collected a 10 GB dataset from a set-top box, including content and commercials from news, the Olympic Games, and dramas.
  • Wrote the feature extraction code for videos: it extracts the audio and visual features, merges them, pre-processes them so they can be fed into the model, and performs inference.
Details of Visual Feature Extraction from Videos
  • Load the InceptionV3 model, TFLite model, and PCA parameters.
  • Resize the frame, preprocess it, and extract features using InceptionV3.
  • Process the video in chunks of 150 frames and extract features.
  • Apply PCA, quantize the features to 8 bits, and perform inference (a code sketch follows this list).
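A minimal sketch of the visual path in Python, assuming Keras' stock InceptionV3, PCA parameters saved as .npy files, and a 1024-dimensional reduced feature; the file names, shapes, and quantization scheme are illustrative, not the project's exact code:

```python
import cv2
import numpy as np
import tensorflow as tf

# InceptionV3 trimmed to a pooled 2048-dim feature vector per frame.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

# PCA parameters saved from training (hypothetical file names and shapes).
pca_mean = np.load("pca_mean.npy")       # (2048,)
pca_matrix = np.load("pca_matrix.npy")   # (2048, 1024)

def frame_features(frame_bgr):
    """Resize and preprocess one frame, then extract its InceptionV3 features."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (299, 299))
    batch = tf.keras.applications.inception_v3.preprocess_input(
        resized.astype(np.float32)[None, ...])
    return base(batch, training=False).numpy()[0]        # (2048,)

def reduce_and_quantize(chunk):
    """Apply PCA to a (150, 2048) chunk and quantize the result to 8 bits."""
    reduced = (chunk - pca_mean) @ pca_matrix            # (150, 1024)
    lo, hi = reduced.min(), reduced.max()
    return np.round(255 * (reduced - lo) / (hi - lo + 1e-8)).astype(np.uint8)
```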
Details of Audio Feature Extraction from Videos
  • Download the VGGish model, load the checkpoint, and freeze the graph for feature extraction.
  • Use FFmpeg to extract audio from the given video file.
  • Load the audio file, convert it to mono, and resample it to the target sample rate.
  • Compute the Short-Time Fourier Transform (STFT) and convert it to a log-mel spectrogram.
  • Load the frozen VGGish model and extract audio features from the spectrogram.
  • Normalize the extracted features and quantize them to 8-bit integers, ready for inference (a code sketch follows this list).
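A rough sketch of the audio path, using FFmpeg through subprocess and librosa for the STFT/log-mel step; the mel parameters and tensor names follow the public AudioSet VGGish release, and the frozen-graph file name is a placeholder:

```python
import subprocess
import numpy as np
import librosa
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def extract_audio(video_path, wav_path="audio.wav", sr=16000):
    """Use FFmpeg to pull a mono 16 kHz WAV track out of the video."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-ac", "1", "-ar", str(sr), wav_path], check=True)
    return wav_path

def log_mel_patches(wav_path, sr=16000):
    """Load the audio, compute an STFT, and build VGGish-style log-mel patches."""
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160,
        n_mels=64, fmin=125, fmax=7500)
    log_mel = np.log(mel + 0.01).T                       # (frames, 64)
    n = log_mel.shape[0] // 96                           # 0.96 s examples of 96 frames
    return log_mel[: n * 96].reshape(n, 96, 64)

def vggish_embeddings(patches, frozen_graph="vggish_frozen.pb"):
    """Run the frozen VGGish graph to get one 128-dim embedding per patch."""
    graph_def = tf.GraphDef()
    with open(frozen_graph, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as g:
        tf.import_graph_def(graph_def, name="")
        with tf.Session(graph=g) as sess:
            return sess.run("vggish/embedding:0",
                            {"vggish/input_features:0": patches})

def normalize_and_quantize(embeddings):
    """Normalize the embeddings and quantize them to 8-bit integers."""
    lo, hi = embeddings.min(), embeddings.max()
    return np.round(255 * (embeddings - lo) / (hi - lo + 1e-8)).astype(np.uint8)
```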

Workflow:
The video is loaded from its file, and frames are sampled at 250 ms intervals (i.e., 4 frames per second). Once 150 frames have been collected (150 / 4 = 37.5 seconds of video), inference is performed on that chunk. The result is stored as a tuple containing the label and the start and end frame indices at the video's native frame rate (not at 4 frames per second, which would make the output video flicker). After inference has been performed and the results stored for all frames, chunk by chunk, the output video is displayed: when a frame's label is 0 (non-commercial), the frame is shown as-is; when the label is 1 (commercial), a black screen is shown instead.
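A simplified sketch of this loop, reusing the hypothetical frame_features() and reduce_and_quantize() helpers from the visual sketch above; the TFLite model path, the (1, 150, 1024) input layout, and the two-class output are assumptions about the actual model:

```python
import cv2
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="commercial_detector.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify_chunk(chunk_features):
    """Run the TFLite model on one quantized (150, 1024) chunk; return 0 or 1."""
    interpreter.set_tensor(inp["index"],
                           chunk_features[None, ...].astype(inp["dtype"]))
    interpreter.invoke()
    return int(np.argmax(interpreter.get_tensor(out["index"])))

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
step = int(round(fps * 0.25))            # one sampled frame every 250 ms (~4 fps)

segments, chunk, chunk_start, frame_idx = [], [], 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        chunk.append(frame_features(frame))
        if len(chunk) == 150:
            label = classify_chunk(reduce_and_quantize(np.stack(chunk)))
            # Store the label with start/end indices at the native frame rate.
            segments.append((label, chunk_start, frame_idx))
            chunk, chunk_start = [], frame_idx + 1
    frame_idx += 1
cap.release()

# Playback: show the frame as-is for label 0, a black screen for label 1.
cap = cv2.VideoCapture("input.mp4")
for label, start, end in segments:
    for _ in range(start, end + 1):
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("output", frame if label == 0 else np.zeros_like(frame))
        cv2.waitKey(int(1000 / fps))
cap.release()
cv2.destroyAllWindows()
```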

Note
Initially, when I followed this workflow and started running inference on videos, I found that extracting audio features took significantly longer than extracting visual features: for a 30-second video, visual features were extracted in 30-35 seconds, while audio features took about 5 minutes. Since accuracy was similar with or without the audio features, I decided to exclude them and trained a new CNN model using only visual features, with input shape (150, 1024) (the 1152-dimensional merged features minus the 128-dimensional audio features). The results below are based on that model.
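The post only states the input shape, so the 1-D CNN below is a hypothetical stand-in for the retrained visual-only model, not its actual architecture:

```python
import tensorflow as tf

# Hypothetical visual-only classifier: one (150, 1024) chunk in, one label out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(150, 1024)),
    tf.keras.layers.Conv1D(64, 5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # 0 = non-commercial, 1 = commercial
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```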

Results:
I tested the complete pipeline on three videos: a random commercial video downloaded from YouTube, a non-commercial news video, and a mix of commercial and non-commercial content (drama + commercials). (Small chunks of the compressed videos are included with the results.)

  • Results on a random commercial video:
    Video length: 150 seconds.
    Processing time: 151 seconds.
    Accuracy: 80%

  • Results on the non-commercial video (news):
    Video length: 176 seconds.
    Processing time: 186 seconds.
    Accuracy: 80%

  • Results on the mixed video (drama + commercials):
    Video length: 30 minutes.
    Processing time: 33-34 minutes.
    Accuracy: 65-70%
    Here, note the transition at 1:20, when the commercial ends and the drama begins.

  • The above video after post-processing:

(Sorry for the poor quality of the videos; I had to compress them to about 2% of their original size to post them here.)

There is a slight decrease in FPS, but it is barely noticeable.
I’m currently running this on my system and will be testing it on the BeagleBone AI-64 next.