Hi all,
As discussed on Discord, I’ve run the full cleaning pipeline on the BeagleBoard dataset created by @fayezzouari, and here’s a breakdown of the results from the generated logs, along with the benefits each stage brings:
[2026-01-02 17:19:23,272] INFO - Starting pipeline. Reading input from: c:\Users\Sanchit\OneDrive\Desktop\Hugging-Face Beagleboard\beagleboard-docs\Cleaning Scripts\multi-turn-conversations-t7ESxvRZkHn2-2025-12-07.json
[2026-01-02 17:19:23,947] INFO - Ingested 5674 raw conversations.
[2026-01-02 17:19:23,949] INFO - System prompt normalization complete: 0 inserted, 0 replaced.
[2026-01-02 17:19:23,949] INFO - Normalized system prompts.
[2026-01-02 17:19:23,949] INFO - Starting deduplication on 5674 conversations...
[2026-01-02 17:19:24,037] INFO - Removed 0 exact duplicates.
[2026-01-02 17:19:25,789] INFO - Use pytorch device_name: cuda:0
[2026-01-02 17:19:25,789] INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
[2026-01-02 17:19:56,454] INFO - Removed 629 semantically similar duplicates using threshold 0.9.
[2026-01-02 17:19:56,454] INFO - Deduplication complete. Final count: 5045 conversations.
[2026-01-02 17:19:56,511] INFO - Data size after deduplication: 5045
[2026-01-02 17:19:56,511] INFO - Running quality filter on 5045 conversations...
[2026-01-02 17:19:56,814] INFO - Dropped 0 conversations due to missing user/assistant roles.
[2026-01-02 17:19:56,814] INFO - Dropped 2213 due to low-quality assistant replies.
[2026-01-02 17:19:56,814] INFO - Dropped 4 due to very short user prompts.
[2026-01-02 17:19:56,814] INFO - Quality filter retained 2828 valid conversations.
[2026-01-02 17:19:56,814] INFO - Data size after quality filtering: 2828
[2026-01-02 17:19:56,814] INFO - Tagging metadata for 2828 conversations...
[2026-01-02 17:19:56,815] INFO - Use pytorch device_name: cuda:0
[2026-01-02 17:19:56,815] INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
[2026-01-02 17:20:43,572] INFO - Tagging complete. Total tags assigned across all conversations: 12779
[2026-01-02 17:20:43,572] INFO - Tags assigned to conversations.
[2026-01-02 17:20:43,572] INFO - Validating 2828 conversations with a max message limit of 10...
[2026-01-02 17:20:43,577] INFO - Validation complete. 2828 conversations passed the checks.
[2026-01-02 17:20:43,577] INFO - Data size after validation: 2828
[2026-01-02 17:20:43,577] INFO - Splitting 2828 conversations into chunks of max 2 QA pairs each...
[2026-01-02 17:20:43,587] INFO - Total split conversations created: 5656
[2026-01-02 17:20:43,587] INFO - Data size after multi-turn splitting: 5656
[2026-01-02 17:20:43,587] INFO - Tagging metadata for 5656 conversations...
[2026-01-02 17:20:43,588] INFO - Use pytorch device_name: cuda:0
[2026-01-02 17:20:43,588] INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
[2026-01-02 17:21:51,766] INFO - Tagging complete. Total tags assigned across all conversations: 20048
[2026-01-02 17:21:51,766] INFO - Tags reassigned after splitting.
[2026-01-02 17:21:52,084] INFO - Exported cleaned dataset to: c:\Users\Sanchit\OneDrive\Desktop\Hugging-Face Beagleboard\beagleboard-docs\Cleaning Scripts\clean_multi-turn-conversations-t7ESxvRZkHn2-2025-12-07.json
[2026-01-02 17:21:52,084] INFO - Final dataset contains 5656 conversation samples.
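For context, the semantic deduplication stage works roughly like this. This is a minimal sketch of the thresholding logic only, using placeholder embeddings; in the actual pipeline the embeddings come from the all-MiniLM-L6-v2 SentenceTransformer shown in the logs:

```python
import numpy as np

def dedup_by_cosine(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep the first conversation of each near-duplicate cluster.

    embeddings: (n, d) array of conversation embeddings (placeholder here;
    the real pipeline embeds each conversation with all-MiniLM-L6-v2).
    Returns the indices of the conversations to keep.
    """
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    keep: list[int] = []
    for i, vec in enumerate(normed):
        # Compare against everything already kept; drop if too similar.
        if keep and np.max(normed[keep] @ vec) > threshold:
            continue
        keep.append(i)
    return keep

# Toy example: rows 0 and 1 are near-identical, row 2 is distinct.
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(dedup_by_cosine(emb))  # keeps [0, 2]
```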
Cleaning Pipeline Summary (from logs)
- Raw conversations ingested: 5674
- Exact duplicates removed: 0
- Semantic duplicates removed: 629
  ➤ Ensures unique content using cosine similarity with SentenceTransformer (threshold > 0.9)
- Low-quality assistant responses dropped: 2213
  ➤ Filtered generic, vague, or very short assistant replies (e.g., “sorry”, “I don’t know”, etc.)
- Short user prompts dropped: 4
- Remaining high-quality conversations: 2828
- Conversations split (max 2 QA pairs each): 5656
  ➤ Maintains brevity, as per guidance, ensuring each sample is focused and self-contained
- Tags assigned (before + after split): 20048 total
  ➤ All tags are sourced directly from beagleboard.org and official forum threads for optimal searchability
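The multi-turn splitting step can be sketched as follows. This is a hypothetical helper, assuming the usual chat format of a list of `{role, content}` messages with alternating user/assistant turns; the system prompt is repeated in every chunk so each sample stays self-contained:

```python
def split_conversation(messages: list[dict], max_pairs: int = 2) -> list[list[dict]]:
    """Split one conversation into chunks of at most max_pairs QA pairs."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Group consecutive (user, assistant) messages into QA pairs.
    pairs = [turns[i:i + 2] for i in range(0, len(turns), 2)]
    chunks = []
    for i in range(0, len(pairs), max_pairs):
        # Prepend the system prompt so every chunk is self-contained.
        chunk = system + [m for pair in pairs[i:i + max_pairs] for m in pair]
        chunks.append(chunk)
    return chunks

convo = [{"role": "system", "content": "You are a BeagleBoard assistant."}]
for i in range(3):  # 3 QA pairs -> expect 2 chunks (2 pairs + 1 pair)
    convo += [{"role": "user", "content": f"q{i}"},
              {"role": "assistant", "content": f"a{i}"}]
print(len(split_conversation(convo)))  # 2
```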
Benefits of the Cleaning Process
- Improved Data Quality: Poor responses and semantic duplicates are eliminated, retaining only strong, relevant QA pairs.
- Semantic Tagging: Tags help models better route queries and reduce lookup costs. They’re also helpful for downstream dataset filtering and evaluation.
- Scalable + Reproducible: Each step is logged in detail, so the impact of every stage is transparent and easy to validate.
- Optimized for Training: Splitting long threads into 2-pair conversations makes the data ideal for fine-tuning chat models with multi-turn capabilities.
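The quality filter described above can be approximated like this. The exact blocklist and length cutoffs here are assumptions for illustration, not the pipeline's actual values:

```python
GENERIC_REPLIES = {"sorry", "i don't know", "thanks"}  # assumed blocklist
MIN_ASSISTANT_CHARS = 20  # assumed cutoff for "very short" replies
MIN_USER_CHARS = 5        # assumed cutoff for "very short" prompts

def is_valid(conversation: list[dict]) -> bool:
    """Return True if the conversation passes the three quality checks."""
    roles = {m["role"] for m in conversation}
    if not {"user", "assistant"} <= roles:
        return False  # missing user/assistant roles
    for m in conversation:
        text = m["content"].strip()
        if m["role"] == "assistant" and (
            len(text) < MIN_ASSISTANT_CHARS or text.lower() in GENERIC_REPLIES
        ):
            return False  # generic, vague, or very short assistant reply
        if m["role"] == "user" and len(text) < MIN_USER_CHARS:
            return False  # very short user prompt
    return True

good = [{"role": "user", "content": "How do I flash a BeagleBone image?"},
        {"role": "assistant", "content": "Write the image to a microSD card, "
                                         "then boot while holding the boot button."}]
bad = [{"role": "user", "content": "Help?"},
       {"role": "assistant", "content": "sorry"}]
print(is_valid(good), is_valid(bad))  # True False
```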
You can also view the dataset on the Hugging Face Hub here: SanchitKS12/beagleboard-docs at cleaned-dataset-pr
Looking forward to any feedback or suggestions!
Thanks a lot!
— Sanchit