Cleaned Dataset Log Summary & Script Benefits

Hi all,
As discussed on Discord, I’ve run the full cleaning pipeline on the BeagleBoard dataset created by @fayezzouari sir, and here’s a breakdown of the results from the generated logs along with the benefits each stage brings:

[2026-01-02 17:19:23,272] INFO - Starting pipeline. Reading input from: c:\Users\Sanchit\OneDrive\Desktop\Hugging-Face Beagleboard\beagleboard-docs\Cleaning Scripts\multi-turn-conversations-t7ESxvRZkHn2-2025-12-07.json
[2026-01-02 17:19:23,947] INFO - Ingested 5674 raw conversations.
[2026-01-02 17:19:23,949] INFO - System prompt normalization complete: 0 inserted, 0 replaced.
[2026-01-02 17:19:23,949] INFO - Normalized system prompts.
[2026-01-02 17:19:23,949] INFO - Starting deduplication on 5674 conversations...
[2026-01-02 17:19:24,037] INFO - Removed 0 exact duplicates.
[2026-01-02 17:19:25,789] INFO - Use pytorch device_name: cuda:0
[2026-01-02 17:19:25,789] INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
[2026-01-02 17:19:56,454] INFO - Removed 629 semantically similar duplicates using threshold 0.9.
[2026-01-02 17:19:56,454] INFO - Deduplication complete. Final count: 5045 conversations.
[2026-01-02 17:19:56,511] INFO - Data size after deduplication: 5045
[2026-01-02 17:19:56,511] INFO - Running quality filter on 5045 conversations...
[2026-01-02 17:19:56,814] INFO - Dropped 0 conversations due to missing user/assistant roles.
[2026-01-02 17:19:56,814] INFO - Dropped 2213 due to low-quality assistant replies.
[2026-01-02 17:19:56,814] INFO - Dropped 4 due to very short user prompts.
[2026-01-02 17:19:56,814] INFO - Quality filter retained 2828 valid conversations.
[2026-01-02 17:19:56,814] INFO - Data size after quality filtering: 2828
[2026-01-02 17:19:56,814] INFO - Tagging metadata for 2828 conversations...
[2026-01-02 17:19:56,815] INFO - Use pytorch device_name: cuda:0
[2026-01-02 17:19:56,815] INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
[2026-01-02 17:20:43,572] INFO - Tagging complete. Total tags assigned across all conversations: 12779
[2026-01-02 17:20:43,572] INFO - Tags assigned to conversations.
[2026-01-02 17:20:43,572] INFO - Validating 2828 conversations with a max message limit of 10...
[2026-01-02 17:20:43,577] INFO - Validation complete. 2828 conversations passed the checks.
[2026-01-02 17:20:43,577] INFO - Data size after validation: 2828
[2026-01-02 17:20:43,577] INFO - Splitting 2828 conversations into chunks of max 2 QA pairs each...
[2026-01-02 17:20:43,587] INFO - Total split conversations created: 5656
[2026-01-02 17:20:43,587] INFO - Data size after multi-turn splitting: 5656
[2026-01-02 17:20:43,587] INFO - Tagging metadata for 5656 conversations...
[2026-01-02 17:20:43,588] INFO - Use pytorch device_name: cuda:0
[2026-01-02 17:20:43,588] INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
[2026-01-02 17:21:51,766] INFO - Tagging complete. Total tags assigned across all conversations: 20048
[2026-01-02 17:21:51,766] INFO - Tags reassigned after splitting.
[2026-01-02 17:21:52,084] INFO - Exported cleaned dataset to: c:\Users\Sanchit\OneDrive\Desktop\Hugging-Face Beagleboard\beagleboard-docs\Cleaning Scripts\clean_multi-turn-conversations-t7ESxvRZkHn2-2025-12-07.json
[2026-01-02 17:21:52,084] INFO - Final dataset contains 5656 conversation samples.  
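For anyone curious how the semantic deduplication step in the logs above works, here is a minimal sketch. The real pipeline embeds conversations with sentence-transformers' all-MiniLM-L6-v2 on CUDA; `embed_fn` below is a stand-in so the idea is clear without the model, and the greedy keep/drop loop is my illustration of the approach, not the exact implementation.

```python
# Sketch of semantic deduplication: embed each conversation, then drop
# any item whose cosine similarity to an already-kept item exceeds the
# threshold (0.9 in the logs). embed_fn is any callable mapping text to
# a vector; the real pipeline uses a SentenceTransformer model.
import numpy as np

def dedup_semantic(texts, embed_fn, threshold=0.9):
    """Greedy near-duplicate removal via cosine similarity."""
    kept_texts, kept_vecs = [], []
    for text in texts:
        vec = np.asarray(embed_fn(text), dtype=float)
        vec = vec / np.linalg.norm(vec)  # unit-normalize for cosine sim
        if any(float(vec @ k) > threshold for k in kept_vecs):
            continue  # near-duplicate of something already kept: drop it
        kept_texts.append(text)
        kept_vecs.append(vec)
    return kept_texts
```

With unit-normalized vectors, the dot product is exactly the cosine similarity, so the comparison against 0.9 matches the threshold reported in the logs.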

Cleaning Pipeline Summary (from logs)

  • Raw conversations ingested: 5674
  • Exact duplicates removed: 0
  • Semantic duplicates removed: 629
    Near-duplicates detected via cosine similarity of SentenceTransformer (all-MiniLM-L6-v2) embeddings, using a 0.9 threshold
  • Low-quality assistant responses dropped: 2213
    Filtered generic, vague, or very short assistant replies (e.g., “sorry”, “I don’t know”, etc.)
  • Short user prompts dropped: 4
  • Remaining high-quality conversations: 2828
  • Conversations split (max 2 QA pairs each): 5656
    Maintains brevity, as per guidance, ensuring each sample is focused and self-contained
  • Tags assigned (before + after split): 20048 total
    All tags are sourced directly from beagleboard.org and official forum threads, keeping them consistent with official terminology and easy to search
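To illustrate the quality-filter stage from the summary above: conversations are dropped if they lack user/assistant roles, if an assistant reply is generic or very short, or if the user prompt is too short. The phrase list and length cutoffs below are illustrative assumptions, not the pipeline's exact values.

```python
# Rough sketch of the quality filter. GENERIC_REPLIES, MIN_REPLY_CHARS,
# and MIN_PROMPT_CHARS are assumed values for illustration only.
GENERIC_REPLIES = {"sorry", "i don't know", "i'm not sure", "no idea"}
MIN_REPLY_CHARS = 20    # assumed cutoff for "very short" assistant replies
MIN_PROMPT_CHARS = 10   # assumed cutoff for "very short" user prompts

def keep_conversation(messages):
    """messages: list of {'role': ..., 'content': ...} dicts."""
    roles = {m["role"] for m in messages}
    if not {"user", "assistant"} <= roles:
        return False  # drop: missing user or assistant turns
    for m in messages:
        text = m["content"].strip()
        if m["role"] == "assistant":
            if len(text) < MIN_REPLY_CHARS or text.lower() in GENERIC_REPLIES:
                return False  # drop: generic or very short reply
        elif m["role"] == "user" and len(text) < MIN_PROMPT_CHARS:
            return False  # drop: very short user prompt
    return True
```

Each conversation either passes all checks or is dropped whole, which matches the per-stage drop counts reported in the logs.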

Benefits of the Cleaning Process

  1. Improved Data Quality
    Poor responses and semantic duplicates are eliminated, retaining only strong, relevant QA pairs.
  2. Semantic Tagging
    Tags help models better route queries and reduce lookup costs. They’re also helpful for downstream dataset filtering and evaluation.
  3. Scalable + Reproducible
    Each step is logged in detail, so the impact of every stage is transparent and easy to validate.
  4. Optimized for Training
    Splitting long threads into 2-pair conversations makes the data ideal for fine-tuning chat models with multi-turn capabilities.
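The splitting step behind benefit 4 can be sketched as follows. This is a simplified version under my assumptions (messages as role/content dicts, user and assistant turns strictly alternating, system prompt re-attached to every chunk so each sample stays self-contained); the pipeline's actual code may differ.

```python
# Sketch of multi-turn splitting: break a conversation into chunks of at
# most max_pairs (user, assistant) QA pairs, prepending the system prompt
# to each chunk so every sample is self-contained.
def split_conversation(messages, max_pairs=2):
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Group consecutive (user, assistant) messages into QA pairs.
    pairs = [turns[i:i + 2] for i in range(0, len(turns), 2)]
    chunks = []
    for i in range(0, len(pairs), max_pairs):
        chunk = system + [m for pair in pairs[i:i + max_pairs] for m in pair]
        chunks.append(chunk)
    return chunks
```

For example, a conversation with four QA pairs yields two chunks of two pairs each, which is consistent with 2828 conversations expanding to 5656 samples in the logs.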

You can also view the dataset on the Hugging Face Hub: SanchitKS12/beagleboard-docs (branch: cleaned-dataset-pr)
Looking forward to any feedback or suggestions!
Thanks a lot!

— Sanchit