Fine-tuning LLMs and building few-shot prompts require clean, well-structured data. But real-world datasets are messy — inconsistent formatting, embedded PII, broken encodings, and wasted tokens. Here's how to prepare your data properly.
Why data cleaning matters for fine-tuning
The quality of your fine-tuned model directly reflects the quality of your training data. Common problems:
- PII in training data: Customer names, emails, and phone numbers baked into your model's weights — a compliance nightmare
- Inconsistent formatting: Mixed JSON structures confuse the model during training
- Wasted tokens: Verbose fields, redundant whitespace, and unnecessary metadata inflate costs without adding signal
- Encoding issues: UTF-8 BOM markers, Unicode replacement characters, and escaped entities corrupt text
Cleaning before training prevents all of these.
Cleaning JSON for fine-tuning
Step 1: Validate and normalize structure
Most LLM fine-tuning pipelines expect JSONL (JSON Lines) format — one JSON object per line. Use CleanMyPrompt's JSON cleaner to validate and normalize your data.
Common issues the tool catches:
- Trailing commas (invalid JSON)
- Single quotes instead of double quotes
- Unescaped special characters
- Inconsistent whitespace
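If you prefer to script this step in your own pipeline, the validate-and-normalize pass is a few lines of Python. The `normalize_jsonl` helper below is an illustrative sketch, not part of any tool; it rejects invalid lines with a line number and re-emits valid ones in canonical compact form:

```python
import json

def normalize_jsonl(raw_text):
    """Parse each non-empty line as JSON and re-emit it in canonical form.

    Raises ValueError with the offending line number, which surfaces
    trailing commas, single quotes, and unescaped characters early.
    """
    clean_lines = []
    for i, line in enumerate(raw_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"line {i}: {e}") from None
        # Re-serialize compactly: normalizes whitespace and quoting.
        clean_lines.append(
            json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
        )
    return "\n".join(clean_lines)
```

The strictness of `json.loads` is the point here: invalid records fail loudly at cleaning time instead of corrupting a training run later.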
Step 2: Strip PII from training records
If your fine-tuning data comes from customer interactions, it almost certainly contains PII.
Before cleaning:
{"prompt": "Customer john.smith@acme.com called about order #12345", "completion": "Resolved billing issue for John Smith at 555-123-4567"}
After cleaning:
{"prompt": "Customer [EMAIL] called about order #12345", "completion": "Resolved billing issue for [NAME] at [PHONE]"}
The model learns the pattern (how to handle billing issues) without memorizing specific customers. This is critical for GDPR compliance — Article 17 (right to erasure) becomes impossible to enforce if PII is embedded in model weights.
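A minimal sketch of pattern-based redaction, using the placeholder tags shown above; the `redact` helper and its regexes are illustrative only. Emails and phone numbers match mechanically, but names generally need an NER pass, so don't rely on regexes alone for `[NAME]`:

```python
import re

# Simple patterns for common PII; a sketch, not production-grade coverage.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]

def redact(text):
    """Replace each matched PII span with its placeholder tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Run this over both `prompt` and `completion` fields before formatting the record.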
Step 3: Minify to reduce token waste
Fine-tuning is billed by token. Pretty-printed JSON wastes tokens on whitespace:
Pretty-printed (78 tokens):
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
Minified (34 tokens):
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is the capital of France?"}]}
Same data, 56% fewer tokens. Over a 10,000-record training set, this saves significant cost.
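The minification step is just compact serialization. A sketch using Python's `json` module (exact token counts depend on the tokenizer, so the character reduction is only a proxy):

```python
import json

# The record from the example above, as a Python dict.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
}

pretty = json.dumps(record, indent=2)  # the "before" form
minified = json.dumps(record, separators=(",", ":"))  # drops spaces after , and :

# Same data either way; only the whitespace differs.
assert json.loads(pretty) == json.loads(minified)
```

`separators=(",", ":")` is what removes the default space after each comma and colon; `indent=None` (the default) already suppresses newlines.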
Cleaning CSV for LLM workflows
Converting CSV to JSON
LLMs generally handle JSON better than raw CSV: each record carries explicit keys instead of positional columns the model has to track. The CSV to JSON converter transforms tabular data into structured objects.
CSV input:
name,email,issue,status
John Smith,john@example.com,Billing error,Open
Jane Doe,jane@example.com,Shipping delay,Resolved
JSON output (after conversion + PII redaction):
[
  {"name": "[NAME]", "email": "[EMAIL]", "issue": "Billing error", "status": "Open"},
  {"name": "[NAME]", "email": "[EMAIL]", "issue": "Shipping delay", "status": "Resolved"}
]
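Scripted, the conversion plus field-level redaction can look like the sketch below; `csv_to_json` and its `redact_fields` parameter are made-up names for illustration:

```python
import csv
import io

def csv_to_json(csv_text, redact_fields=("name", "email")):
    """Convert CSV text to a list of dicts, replacing the values of
    known-PII columns with placeholder tags like [NAME]."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in redact_fields:
            if field in row:
                row[field] = f"[{field.upper()}]"
        rows.append(row)
    return rows
```

Redacting by column works when you know which fields hold PII; combine it with content scanning for free-text fields like `issue`.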
Handling large CSVs
For datasets with thousands of rows:
- Sample first: Process a representative sample through CleanMyPrompt to verify the cleaning pattern works
- Use the API: For batch processing, integrate the CleanMyPrompt REST API into your data pipeline
- Validate after cleaning: Ensure the JSON output is valid before feeding it into your training script
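The "sample first" step can be done without loading the whole file. This reservoir-sampling sketch (the `sample_rows` name is hypothetical) pulls k representative rows from a CSV of any size in one pass:

```python
import csv
import random

def sample_rows(path, k=100, seed=0):
    """Reservoir-sample k rows from a large CSV without loading it all."""
    random.seed(seed)
    sample = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if len(sample) < k:
                sample.append(row)
            else:
                # Each later row replaces a sampled one with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    sample[j] = row
    return sample
```

Verify your cleaning pattern against the sample before committing to a full batch run.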
OpenAI fine-tuning format
OpenAI expects JSONL with a specific messages structure:
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Cleaning pipeline for OpenAI fine-tuning:
- Extract conversation pairs from your source data
- Clean each content field: strip PII, fix formatting, remove noise
- Format as JSONL
- Validate:
python -c "import json; [json.loads(l) for l in open('data.jsonl')]"
Common mistakes:
- Trailing newlines: JSONL files should not have a trailing empty line
- Mixed encodings: Ensure all content is UTF-8 before processing
- HTML entities in text: decode &amp;, &lt;, and &#39; before training
- Duplicate records: deduplicate by content, not by ID
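The structure check and content-based deduplication can run in one pass. The `validate_and_dedupe` sketch below is illustrative; it enforces the messages shape and keys duplicates on serialized content rather than any ID field:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_and_dedupe(lines):
    """Check each JSONL line has the expected messages shape,
    then drop records whose content is identical."""
    seen, records = set(), []
    for i, line in enumerate(lines, start=1):
        obj = json.loads(line)
        msgs = obj.get("messages")
        if not isinstance(msgs, list) or not msgs:
            raise ValueError(f"line {i}: missing 'messages' list")
        for m in msgs:
            if m.get("role") not in VALID_ROLES or "content" not in m:
                raise ValueError(f"line {i}: bad message {m}")
        # sort_keys makes the dedupe key insensitive to key order.
        key = json.dumps(obj, sort_keys=True)
        if key not in seen:
            seen.add(key)
            records.append(obj)
    return records
```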
Few-shot prompt data preparation
For few-shot prompting (not fine-tuning), the same cleaning principles apply but at smaller scale. You're curating 3-10 examples to include in your prompt.
Best practices:
- Diversity: Choose examples that cover different scenarios
- Consistency: Keep the format identical across all examples
- Brevity: Compress each example to minimize token usage
- Privacy: Redact any real PII — use realistic but synthetic data instead
Example prompt with cleaned data:
Given a support ticket, classify the category and urgency.
Example 1:
Input: "[EMAIL] reports product defect, requests replacement"
Output: {"category": "product", "urgency": "high"}
Example 2:
Input: "[EMAIL] asks about shipping timeline for order [ORDER-ID]"
Output: {"category": "shipping", "urgency": "low"}
Now classify:
Input: "Customer reports login issues after password reset"
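If you assemble prompts like this programmatically, format consistency across examples falls out of the code for free. The `build_few_shot_prompt` helper below is a hypothetical sketch of that approach:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from cleaned (input, output) pairs
    in a consistent Input/Output layout."""
    parts = [instruction, ""]
    for i, (inp, out) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f'Input: "{inp}"', f"Output: {out}", ""]
    parts += ["Now classify:", f'Input: "{query}"']
    return "\n".join(parts)
```

Feed it only examples that have already been through the redaction step, so no real PII reaches the prompt.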
The bottom line
Clean data produces better models. The workflow is: validate structure → strip PII → normalize formatting → compress tokens → verify. CleanMyPrompt handles all of these steps in your browser, so your training data never touches an external server.
Start with the JSON cleaner or CSV converter to process your next batch of training data.