Fine-tuning LLMs and building few-shot prompts require clean, well-structured data. But real-world datasets are messy — inconsistent formatting, embedded PII, broken encodings, and wasted tokens. Here's how to prepare your data properly.
Why data cleaning matters for fine-tuning
The quality of your fine-tuned model directly reflects the quality of your training data. Common problems:
- PII in training data: Customer names, emails, and phone numbers baked into your model's weights — a compliance nightmare
- Inconsistent formatting: Mixed JSON structures confuse the model during training
- Wasted tokens: Verbose fields, redundant whitespace, and unnecessary metadata inflate costs without adding signal
- Encoding issues: UTF-8 BOM markers, Unicode replacement characters, and escaped entities corrupt text
Cleaning before training prevents all of these.
Cleaning JSON for fine-tuning
Step 1: Validate and normalize structure
Most LLM fine-tuning pipelines expect JSONL (JSON Lines) format — one JSON object per line. Use CleanMyPrompt's JSON cleaner to validate and normalize your data.
Common issues the tool catches:
- Trailing commas (invalid JSON)
- Single quotes instead of double quotes
- Unescaped special characters
- Inconsistent whitespace
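If you prefer to script this step in your own pipeline, the validate-and-normalize pass is a few lines of Python. The `normalize_jsonl` helper below is an illustrative sketch, not part of any tool; it rejects invalid lines with a line number and re-emits valid ones in canonical compact form:

```python
import json

def normalize_jsonl(raw_text):
    """Parse each non-empty line as JSON and re-emit it in canonical form.

    Raises ValueError with the offending line number, which surfaces
    trailing commas, single quotes, and unescaped characters early.
    """
    clean_lines = []
    for i, line in enumerate(raw_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"line {i}: {e}") from None
        # Re-serialize compactly: normalizes whitespace and quoting.
        clean_lines.append(
            json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
        )
    return "\n".join(clean_lines)
```

The strictness of `json.loads` is the point here: invalid records fail loudly at cleaning time instead of corrupting a training run later.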
Step 2: Strip PII from training records
If your fine-tuning data comes from customer interactions, it almost certainly contains PII.
Before cleaning:
{"prompt": "Customer john.smith@acme.com called about order #12345", "completion": "Resolved billing issue for John Smith at 555-123-4567"}
After cleaning:
{"prompt": "Customer [EMAIL] called about order #12345", "completion": "Resolved billing issue for [NAME] at [PHONE]"}
The model learns the pattern (how to handle billing issues) without memorizing specific customers. This is critical for GDPR compliance — Article 17 (right to erasure) becomes impossible to enforce if PII is embedded in model weights.
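A minimal sketch of pattern-based redaction, using the placeholder tags shown above; the `redact` helper and its regexes are illustrative only. Emails and phone numbers match mechanically, but names generally need an NER pass, so don't rely on regexes alone for `[NAME]`:

```python
import re

# Simple patterns for common PII; a sketch, not production-grade coverage.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]

def redact(text):
    """Replace each matched PII span with its placeholder tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Run this over both `prompt` and `completion` fields before formatting the record.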
Step 3: Minify to reduce token waste
Fine-tuning is billed by token. Pretty-printed JSON wastes tokens on whitespace:
Pretty-printed (78 tokens):
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
Minified (34 tokens):
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is the capital of France?"}]}
Same data, 56% fewer tokens. Over a 10,000-record training set, this saves significant cost.
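The minification step is just compact serialization. A sketch using Python's `json` module (exact token counts depend on the tokenizer, so the character reduction is only a proxy):

```python
import json

# The record from the example above, as a Python dict.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
}

pretty = json.dumps(record, indent=2)  # the "before" form
minified = json.dumps(record, separators=(",", ":"))  # drops spaces after , and :

# Same data either way; only the whitespace differs.
assert json.loads(pretty) == json.loads(minified)
```

`separators=(",", ":")` is what removes the default space after each comma and colon; `indent=None` (the default) already suppresses newlines.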
Cleaning CSV for LLM workflows
Converting CSV to JSON
LLMs generally handle JSON better than raw CSV: each record carries explicit keys instead of positional columns the model has to track. The CSV to JSON converter transforms tabular data into structured objects.
CSV input:
name,email,issue,status
John Smith,john@example.com,Billing error,Open
Jane Doe,jane@example.com,Shipping delay,Resolved
JSON output (after conversion + PII redaction):
[
  {"name": "[NAME]", "email": "[EMAIL]", "issue": "Billing error", "status": "Open"},
  {"name": "[NAME]", "email": "[EMAIL]", "issue": "Shipping delay", "status": "Resolved"}
]
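Scripted, the conversion plus field-level redaction can look like the sketch below; `csv_to_json` and its `redact_fields` parameter are made-up names for illustration:

```python
import csv
import io

def csv_to_json(csv_text, redact_fields=("name", "email")):
    """Convert CSV text to a list of dicts, replacing the values of
    known-PII columns with placeholder tags like [NAME]."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in redact_fields:
            if field in row:
                row[field] = f"[{field.upper()}]"
        rows.append(row)
    return rows
```

Redacting by column works when you know which fields hold PII; combine it with content scanning for free-text fields like `issue`.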
Handling large CSVs
For datasets with thousands of rows:
- Sample first: Process a representative sample through CleanMyPrompt to verify the cleaning pattern works
- Use the API: For batch processing, integrate the CleanMyPrompt REST API into your data pipeline
- Validate after cleaning: Ensure the JSON output is valid before feeding it into your training script
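The "sample first" step can be done without loading the whole file. This reservoir-sampling sketch (the `sample_rows` name is hypothetical) pulls k representative rows from a CSV of any size in one pass:

```python
import csv
import random

def sample_rows(path, k=100, seed=0):
    """Reservoir-sample k rows from a large CSV without loading it all."""
    random.seed(seed)
    sample = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if len(sample) < k:
                sample.append(row)
            else:
                # Each later row replaces a sampled one with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    sample[j] = row
    return sample
```

Verify your cleaning pattern against the sample before committing to a full batch run.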
OpenAI fine-tuning format
OpenAI expects JSONL with a specific messages structure:
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Cleaning pipeline for OpenAI fine-tuning:
- Extract conversation pairs from your source data
- Clean each content field: strip PII, fix formatting, remove noise
- Format as JSONL
- Validate:
python -c "import json; [json.loads(l) for l in open('data.jsonl')]"
Common mistakes:
- Trailing newlines: JSONL files should not have a trailing empty line
- Mixed encodings: Ensure all content is UTF-8 before processing
- HTML entities in text: decode &amp;, &lt;, and &#39; before training
- Duplicate records: deduplicate by content, not by ID
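The structure check and content-based deduplication can run in one pass. The `validate_and_dedupe` sketch below is illustrative; it enforces the messages shape and keys duplicates on serialized content rather than any ID field:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_and_dedupe(lines):
    """Check each JSONL line has the expected messages shape,
    then drop records whose content is identical."""
    seen, records = set(), []
    for i, line in enumerate(lines, start=1):
        obj = json.loads(line)
        msgs = obj.get("messages")
        if not isinstance(msgs, list) or not msgs:
            raise ValueError(f"line {i}: missing 'messages' list")
        for m in msgs:
            if m.get("role") not in VALID_ROLES or "content" not in m:
                raise ValueError(f"line {i}: bad message {m}")
        # sort_keys makes the dedupe key insensitive to key order.
        key = json.dumps(obj, sort_keys=True)
        if key not in seen:
            seen.add(key)
            records.append(obj)
    return records
```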
Few-shot prompt data preparation
For few-shot prompting (not fine-tuning), the same cleaning principles apply but at smaller scale. You're curating 3-10 examples to include in your prompt.
Best practices:
- Diversity: Choose examples that cover different scenarios
- Consistency: Keep the format identical across all examples
- Brevity: Compress each example to minimize token usage
- Privacy: Redact any real PII — use realistic but synthetic data instead
Example prompt with cleaned data:
Given a support ticket, classify the category and urgency.
Example 1:
Input: "[EMAIL] reports product defect, requests replacement"
Output: {"category": "product", "urgency": "high"}
Example 2:
Input: "[EMAIL] asks about shipping timeline for order [ORDER-ID]"
Output: {"category": "shipping", "urgency": "low"}
Now classify:
Input: "Customer reports login issues after password reset"
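If you assemble prompts like this programmatically, format consistency across examples falls out of the code for free. The `build_few_shot_prompt` helper below is a hypothetical sketch of that approach:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from cleaned (input, output) pairs
    in a consistent Input/Output layout."""
    parts = [instruction, ""]
    for i, (inp, out) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f'Input: "{inp}"', f"Output: {out}", ""]
    parts += ["Now classify:", f'Input: "{query}"']
    return "\n".join(parts)
```

Feed it only examples that have already been through the redaction step, so no real PII reaches the prompt.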
The bottom line
Clean data produces better models. The workflow is: validate structure → strip PII → normalize formatting → compress tokens → verify. CleanMyPrompt handles all of these steps in your browser, so your training data never touches an external server.
Start with the JSON cleaner or CSV converter to process your next batch of training data.