Weekly Progress Report Thread: Enhanced Media Experience with AI-Powered Commercial Detection and Replacement

Week 10 Updates:

  • Collected a 10 GB dataset from a set-top box, including content and commercials from news, the Olympic Games, and dramas.
  • Wrote the feature extraction code for videos: it extracts the audio and visual features, merges them, pre-processes them so they can be fed into the model, and performs inference.
Details of Visual Feature Extraction from Videos
  • Load the InceptionV3 model, TFLite model, and PCA parameters.
  • Resize the frame, preprocess it, and extract features using InceptionV3.
  • Process the video in chunks of 150 frames and extract features.
  • Apply PCA, quantize the features to 8 bits, and perform inference (a code sketch follows this list).
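A minimal sketch of the visual path in Python, assuming Keras' stock InceptionV3, PCA parameters saved as .npy files, and a 1024-dimensional reduced feature; the file names, shapes, and quantization scheme are illustrative, not the project's exact code:

```python
import cv2
import numpy as np
import tensorflow as tf

# InceptionV3 trimmed to a pooled 2048-dim feature vector per frame.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

# PCA parameters saved from training (hypothetical file names and shapes).
pca_mean = np.load("pca_mean.npy")       # (2048,)
pca_matrix = np.load("pca_matrix.npy")   # (2048, 1024)

def frame_features(frame_bgr):
    """Resize and preprocess one frame, then extract its InceptionV3 features."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (299, 299))
    batch = tf.keras.applications.inception_v3.preprocess_input(
        resized.astype(np.float32)[None, ...])
    return base(batch, training=False).numpy()[0]        # (2048,)

def reduce_and_quantize(chunk):
    """Apply PCA to a (150, 2048) chunk and quantize the result to 8 bits."""
    reduced = (chunk - pca_mean) @ pca_matrix            # (150, 1024)
    lo, hi = reduced.min(), reduced.max()
    return np.round(255 * (reduced - lo) / (hi - lo + 1e-8)).astype(np.uint8)
```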
Details of Audio Feature Extraction from Videos
  • Download the VGGish model, load the checkpoint, and freeze the graph for feature extraction.
  • Use FFmpeg to extract audio from the given video file.
  • Load the audio file, convert it to mono, and resample it to the target sample rate.
  • Compute the Short-Time Fourier Transform (STFT) and convert it to a log-mel spectrogram.
  • Load the frozen VGGish model and extract audio features from the spectrogram.
  • Normalize the extracted features and quantize them to 8-bit integers, ready for inference (a code sketch follows this list).
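A rough sketch of the audio path, using FFmpeg through subprocess and librosa for the STFT/log-mel step; the mel parameters and tensor names follow the public AudioSet VGGish release, and the frozen-graph file name is a placeholder:

```python
import subprocess
import numpy as np
import librosa
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def extract_audio(video_path, wav_path="audio.wav", sr=16000):
    """Use FFmpeg to pull a mono 16 kHz WAV track out of the video."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-ac", "1", "-ar", str(sr), wav_path], check=True)
    return wav_path

def log_mel_patches(wav_path, sr=16000):
    """Load the audio, compute an STFT, and build VGGish-style log-mel patches."""
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160,
        n_mels=64, fmin=125, fmax=7500)
    log_mel = np.log(mel + 0.01).T                       # (frames, 64)
    n = log_mel.shape[0] // 96                           # 0.96 s examples of 96 frames
    return log_mel[: n * 96].reshape(n, 96, 64)

def vggish_embeddings(patches, frozen_graph="vggish_frozen.pb"):
    """Run the frozen VGGish graph to get one 128-dim embedding per patch."""
    graph_def = tf.GraphDef()
    with open(frozen_graph, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as g:
        tf.import_graph_def(graph_def, name="")
        with tf.Session(graph=g) as sess:
            return sess.run("vggish/embedding:0",
                            {"vggish/input_features:0": patches})

def normalize_and_quantize(embeddings):
    """Normalize the embeddings and quantize them to 8-bit integers."""
    lo, hi = embeddings.min(), embeddings.max()
    return np.round(255 * (embeddings - lo) / (hi - lo + 1e-8)).astype(np.uint8)
```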

Workflow:
The video is loaded from its file, and frames are sampled at 250 ms intervals (i.e., 4 frames per second). Once 150 frames have been collected (150 / 4 = 37.5 seconds of video), inference is performed on that chunk. The result is stored as a tuple containing the label and the start and end frame indices at the video's native frame rate (not at 4 frames per second, which would make the output video flicker). After inference has been performed and the results stored for all frames, chunk by chunk, the output video is displayed: when a frame's label is 0 (non-commercial), the frame is shown as-is; when the label is 1 (commercial), a black screen is shown instead.
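A simplified sketch of this loop, reusing the hypothetical frame_features() and reduce_and_quantize() helpers from the visual sketch above; the TFLite model path, the (1, 150, 1024) input layout, and the two-class output are assumptions about the actual model:

```python
import cv2
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="commercial_detector.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify_chunk(chunk_features):
    """Run the TFLite model on one quantized (150, 1024) chunk; return 0 or 1."""
    interpreter.set_tensor(inp["index"],
                           chunk_features[None, ...].astype(inp["dtype"]))
    interpreter.invoke()
    return int(np.argmax(interpreter.get_tensor(out["index"])))

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
step = int(round(fps * 0.25))            # one sampled frame every 250 ms (~4 fps)

segments, chunk, chunk_start, frame_idx = [], [], 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        chunk.append(frame_features(frame))
        if len(chunk) == 150:
            label = classify_chunk(reduce_and_quantize(np.stack(chunk)))
            # Store the label with start/end indices at the native frame rate.
            segments.append((label, chunk_start, frame_idx))
            chunk, chunk_start = [], frame_idx + 1
    frame_idx += 1
cap.release()

# Playback: show the frame as-is for label 0, a black screen for label 1.
cap = cv2.VideoCapture("input.mp4")
for label, start, end in segments:
    for _ in range(start, end + 1):
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("output", frame if label == 0 else np.zeros_like(frame))
        cv2.waitKey(int(1000 / fps))
cap.release()
cv2.destroyAllWindows()
```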

Note
Initially, when I followed this workflow and started running inference on videos, I found that extracting audio features took significantly longer than extracting visual features: for a 30-second video, visual features were extracted in 30-35 seconds, while audio features took about 5 minutes. Since accuracy was similar with or without the audio features, I decided to exclude them and trained a new CNN model using only visual features, with input shape (150, 1024) (the 1152-dimensional merged features minus the 128-dimensional audio features). The results below are based on that model.
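The post only states the input shape, so the 1-D CNN below is a hypothetical stand-in for the retrained visual-only model, not its actual architecture:

```python
import tensorflow as tf

# Hypothetical visual-only classifier: one (150, 1024) chunk in, one label out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(150, 1024)),
    tf.keras.layers.Conv1D(64, 5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # 0 = non-commercial, 1 = commercial
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```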

Results:
I tested the complete pipeline on three videos: a random commercial video downloaded from YouTube, a non-commercial news video, and a mix of commercial and non-commercial content (drama + commercials). (Small chunks of the compressed videos are included with the results.)

  • Results on a random commercial video:
    Video length: 150 seconds.
    Processing time: 151 seconds.
    Accuracy: 80%

  • Results on the non-commercial video (news):
    Video length: 176 seconds.
    Processing time: 186 seconds.
    Accuracy: 80%

  • Results on the mixed video (drama + commercials):
    Video length: 30 minutes.
    Processing time: 33-34 minutes.
    Accuracy: 65-70%
    Here, note the transition at 1:20, when the commercial ends and the drama begins.

  • The above video after post-processing:

(Sorry for the poor quality of the videos; I had to compress them to about 2% of their original size to post them here.)

There is a slight decrease in FPS, but it is barely noticeable.
I’m currently running this on my system and will be testing it on the BeagleBone AI-64 next.