Week 10 Updates:
- Collected a 10GB dataset from a Set-Top Box, including content and commercials from news broadcasts, the Olympic Games, and dramas.
- Wrote the feature-extraction code for videos: it extracts the audio-visual features, pre-processes and merges them, and feeds them to the model for inference.
Details of Visual Feature Extraction from Videos:
- Load the InceptionV3 model, TFLite model, and PCA parameters.
- Resize the frame, preprocess it, and extract features using InceptionV3.
- Process the video in chunks of 150 frames and extract features.
- Apply PCA, quantize the features to 8-bit integers, and perform inference (a sketch of these steps is shown below).
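For reference, here is a minimal sketch of these visual steps, assuming OpenCV, Keras's InceptionV3, NumPy-stored PCA parameters, and a TFLite classifier. The file names, the PCA projection, and the clip range used for 8-bit quantization are illustrative placeholders, not the exact values from my code:

```python
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Illustrative paths; the real project uses its own trained artifacts.
PCA_MEAN = np.load("pca_mean.npy")          # shape: (2048,)
PCA_EIGVECS = np.load("pca_eigvecs.npy")    # shape: (2048, 1024)
interpreter = tf.lite.Interpreter(model_path="classifier.tflite")
interpreter.allocate_tensors()

# InceptionV3 without its classification head gives a 2048-d feature per frame.
feature_extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def frame_features(frame_bgr):
    """Resize, preprocess, and extract a 2048-d InceptionV3 feature for one frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (299, 299))
    batch = preprocess_input(resized.astype(np.float32)[np.newaxis, ...])
    return feature_extractor.predict(batch, verbose=0)[0]

def pca_and_quantize(chunk_features):
    """Project a (150, 2048) chunk to (150, 1024) with PCA, then quantize to 8 bits."""
    reduced = (chunk_features - PCA_MEAN) @ PCA_EIGVECS
    clipped = np.clip(reduced, -2.0, 2.0)                 # assumed clip range
    return ((clipped + 2.0) * (255.0 / 4.0)).astype(np.uint8)

def classify_chunk(quantized_chunk):
    """Run the TFLite classifier on one quantized chunk of features."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    data = quantized_chunk[np.newaxis, ...].astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], data)
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])
```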
Details of Audio Feature Extraction from Videos:
- Download the VGGish model, load the checkpoint, and freeze the graph for feature extraction.
- Use FFmpeg to extract audio from the given video file.
- Load the audio file, convert it to mono, and resample it to the target sample rate.
- Compute the Short-Time Fourier Transform (STFT) and convert it to a log-mel spectrogram.
- Load the frozen VGGish model and extract audio features from the spectrogram.
- Normalize the extracted features and quantize them to 8-bit integers, at which point they are ready to be merged with the visual features (a sketch of these steps is shown below).
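A rough sketch of the audio path, assuming FFmpeg is on the PATH, librosa is used for the log-mel spectrogram, and the reference frozen VGGish graph is available. The file names and the quantization range are placeholders; the tensor names follow the public VGGish release:

```python
import subprocess
import numpy as np
import librosa
import tensorflow.compat.v1 as tf

SAMPLE_RATE = 16000  # VGGish expects 16 kHz mono audio

def extract_wav(video_path, wav_path="audio.wav"):
    """Use FFmpeg to pull a mono 16 kHz WAV track out of the video file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", str(SAMPLE_RATE), wav_path],
        check=True,
    )
    return wav_path

def log_mel_examples(wav_path):
    """Load the audio, compute an STFT-based log-mel spectrogram, and frame it into
    0.96 s examples of shape (96, 64), which is what VGGish consumes."""
    audio, _ = librosa.load(wav_path, sr=SAMPLE_RATE, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=64
    )
    log_mel = np.log(mel + 1e-6).T                        # (frames, 64)
    n_examples = log_mel.shape[0] // 96
    return log_mel[: n_examples * 96].reshape(n_examples, 96, 64)

def vggish_embeddings(examples, frozen_graph="vggish_frozen.pb"):
    """Run the frozen VGGish graph to get one 128-d embedding per example."""
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(frozen_graph, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
        with tf.Session(graph=graph) as sess:
            return sess.run(
                "vggish/embedding:0", {"vggish/input_features:0": examples}
            )

def normalize_and_quantize(features, lo=-2.0, hi=2.0):
    """Clip the 128-d embeddings to an assumed range and quantize to 8-bit integers."""
    clipped = np.clip(features, lo, hi)
    return ((clipped - lo) * (255.0 / (hi - lo))).astype(np.uint8)
```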
Workflow:
The video is loaded from its file, and frames are sampled at 250 ms intervals (i.e., 4 frames per second). Once 150 frames have been collected (150 / 4 = 37.5 seconds of video), inference is performed on that chunk. Each result is stored as a tuple containing the label and the start and end frame indices, expressed at the video's original frame rate rather than at 4 frames per second, so the output video does not flicker. After inference has been performed and results stored for all 150-frame chunks, the output video is displayed: when the label for a frame is 0 (non-commercial), the frame is shown as is; when the label is 1 (commercial), a black screen is shown instead.
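A condensed sketch of this workflow, assuming OpenCV for reading and displaying frames. `extract_features` and `run_inference` stand in for the feature-extraction and TFLite-inference pieces described above, and each chunk is labeled as a whole here for simplicity:

```python
import cv2
import numpy as np

SAMPLE_MS = 250          # sample one frame every 250 ms (4 fps)
CHUNK_FRAMES = 150       # run inference once 150 sampled frames are collected

def label_video(video_path, extract_features, run_inference):
    """Sample frames at 4 fps, classify each 150-frame chunk, and return
    (label, start_frame, end_frame) tuples indexed at the video's native rate."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps * SAMPLE_MS / 1000.0)), 1)   # native frames per sample
    results, sampled, chunk_start, frame_idx = [], [], 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            sampled.append(extract_features(frame))
            if len(sampled) == CHUNK_FRAMES:
                label = run_inference(np.stack(sampled))
                results.append((label, chunk_start, frame_idx))
                sampled, chunk_start = [], frame_idx + 1
        frame_idx += 1
    cap.release()
    return results

def play_with_blackout(video_path, results):
    """Replay the video, showing a black screen wherever the label is 1 (commercial)."""
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        label = next((l for l, s, e in results if s <= frame_idx <= e), 0)
        cv2.imshow("output", np.zeros_like(frame) if label == 1 else frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
        frame_idx += 1
    cap.release()
    cv2.destroyAllWindows()
```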
Note:
Initially, when I followed this workflow and started performing inference on videos, I realized that extracting audio features was taking significantly longer than extracting visual features. For example, for a 30-second video, visual features were extracted in 30-35 seconds, but audio features took about 5 minutes. Since the accuracy was similar with or without the audio features, I decided to exclude them and trained a new CNN model using only visual features, with input shape (150, 1024) instead of (150, 1152) (1152 - 128 audio dimensions = 1024). The results below are based on that model.
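Purely for illustration, the change amounts to dropping the 128 audio dimensions from each of the 150 feature rows, i.e. an input of shape (150, 1024) instead of (150, 1152). A hypothetical Keras definition of such a visual-only 1-D CNN might look like the following (the layer sizes are placeholders, not my actual architecture):

```python
import tensorflow as tf

# Input: 150 sampled frames x 1024 PCA-reduced visual dimensions
# (previously 150 x 1152 = 1024 visual + 128 audio).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(150, 1024)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = commercial, 0 = non-commercial
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```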
Results:
I tested the complete pipeline on three videos: a random commercial video downloaded from YouTube, a non-commercial news video, and a mix of commercial and non-commercial content (drama + commercials). (Small chunks of the compressed videos are included with the results.)
- Results on a random commercial video:
  - Video length: 150 seconds
  - Processing time: 151 seconds
  - Accuracy: 80%
- Results on a non-commercial video (news):
  - Video length: 176 seconds
  - Processing time: 186 seconds
  - Accuracy: 80%
- Results on a mixed video (drama + commercials):
  - Video length: 30 minutes
  - Processing time: 33-34 minutes
  - Accuracy: 65-70%
Here, notice how the transition happens at 1:20, when the commercial ends and the drama begins.
- The above video after post-processing:
(Sorry for the poor video quality; I had to compress the videos to 2% of their original size to post them here.)
There is a slight decrease in FPS, but it is barely noticeable.
I’m currently running this on my system and will be testing it on the BeagleBone AI-64 next.