Preparing JSON and CSV Data for LLM Fine-Tuning: A Cleaning Guide

2026-03-27

Fine-tuning LLMs and building few-shot prompts require clean, well-structured data. But real-world datasets are messy — inconsistent formatting, embedded PII, broken encodings, and wasted tokens. Here's how to prepare your data properly.

Why data cleaning matters for fine-tuning

The quality of your fine-tuned model directly reflects the quality of your training data. Common problems:

  1. Inconsistent formatting across records, which teaches the model conflicting patterns
  2. Embedded PII that the model can memorize and later reproduce
  3. Broken encodings and invisible characters that corrupt training examples
  4. Wasted tokens from whitespace and redundant structure, which inflates cost

Cleaning before training prevents all of these.

Cleaning JSON for fine-tuning

Step 1: Validate and normalize structure

Most LLM fine-tuning expects JSONL (JSON Lines) format — one JSON object per line. Use CleanMyPrompt's JSON cleaner to validate and normalize your data.

Common issues to catch at this stage include trailing commas, single quotes in place of double quotes, unescaped control characters, and pretty-printed multi-line objects that break the one-object-per-line rule.

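As a quick local sanity check, the validate-and-normalize step can be sketched in a few lines of Python (`normalize_jsonl` is an illustrative helper, not a CleanMyPrompt API):

```python
import json

def normalize_jsonl(text: str) -> list[str]:
    """Validate each non-empty line as JSON and re-emit it minified.

    Raises ValueError with the offending line number on invalid JSON.
    """
    normalized = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: {exc}") from exc
        normalized.append(json.dumps(record, separators=(",", ":")))
    return normalized
```

Failing fast with the line number makes it easy to locate the broken record in a large training file.
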
Step 2: Strip PII from training records

If your fine-tuning data comes from customer interactions, it almost certainly contains PII: names, emails, phone numbers, account details.

Before cleaning:

{"prompt": "Customer john.smith@acme.com called about order #12345", "completion": "Resolved billing issue for John Smith at 555-123-4567"}

After cleaning:

{"prompt": "Customer [EMAIL] called about order #12345", "completion": "Resolved billing issue for [NAME] at [PHONE]"}

The model learns the pattern (how to handle billing issues) without memorizing specific customers. This is critical for GDPR compliance — Article 17 (right to erasure) becomes impossible to enforce if PII is embedded in model weights.
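
A minimal regex-based redaction pass might look like the sketch below. Note the limits: regexes catch emails and simple phone formats reliably, but person names generally need NER (for example spaCy); these patterns are illustrative, not exhaustive.

```python
import re

# Illustrative patterns, not production-grade: emails and US-style phone
# numbers redact well with regex; names require NER, not pattern matching.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Run this over every prompt and completion field before the records ever reach a training file.
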

Step 3: Minify to reduce token waste

Fine-tuning is billed by token. Pretty-printed JSON wastes tokens on whitespace:

Pretty-printed (78 tokens):

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}

Minified (34 tokens):

{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is the capital of France?"}]}

Same data, 56% fewer tokens. Over a 10,000-record training set, this saves significant cost.
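
If you are preparing data in Python, the same minification is a single `json.dumps` call with custom separators:

```python
import json

def minify_record(obj) -> str:
    # separators=(",", ":") drops the spaces json.dumps inserts by default;
    # ensure_ascii=False keeps non-ASCII text readable instead of \u-escaped
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
```

Applied to every record before writing the JSONL file, this gives the compact form shown above with no manual editing.
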

Cleaning CSV for LLM workflows

Converting CSV to JSON

LLMs generally handle JSON more reliably than CSV: every field travels with its key, so the model never has to infer which column a value belongs to. The CSV to JSON converter transforms tabular data into structured objects.

CSV input:

name,email,issue,status
John Smith,john@example.com,Billing error,Open
Jane Doe,jane@example.com,Shipping delay,Resolved

JSON output (after conversion + PII redaction):

[
  {"name": "[NAME]", "email": "[EMAIL]", "issue": "Billing error", "status": "Open"},
  {"name": "[NAME]", "email": "[EMAIL]", "issue": "Shipping delay", "status": "Resolved"}
]
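
For a quick local conversion without the tool, Python's standard `csv` module handles the CSV-to-records step (PII redaction would still be a separate pass over the values):

```python
import csv
import io

def csv_to_records(csv_text: str) -> list[dict]:
    """Convert CSV text into a list of dicts, one per row,
    keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

csv_text = (
    "name,email,issue,status\n"
    "John Smith,john@example.com,Billing error,Open\n"
    "Jane Doe,jane@example.com,Shipping delay,Resolved\n"
)
records = csv_to_records(csv_text)
```

`csv.DictReader` also handles quoted fields and embedded commas correctly, which naive `split(",")` code does not.
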

Handling large CSVs

For datasets with thousands of rows:

  1. Sample first: Process a representative sample through CleanMyPrompt to verify the cleaning pattern works
  2. Use the API: For batch processing, integrate the CleanMyPrompt REST API into your data pipeline
  3. Validate after cleaning: Ensure the JSON output is valid before feeding it into your training script
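
The sampling step can be sketched with reservoir sampling, which keeps memory use constant no matter how large the file is (a sketch; the function name is illustrative):

```python
import random

def sample_rows(reader, k: int, seed: int = 0) -> list:
    """Reservoir-sample k rows from any row iterator without
    loading the whole dataset into memory."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            sample.append(row)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample
```

Pass it a `csv.DictReader` over the open file to sample rows for a trial cleaning run.
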

OpenAI fine-tuning format

OpenAI expects JSONL with a specific messages structure:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Cleaning pipeline for OpenAI fine-tuning:

  1. Extract conversation pairs from your source data
  2. Clean each content field: strip PII, fix formatting, remove noise
  3. Format as JSONL
  4. Validate: python -c "import json; [json.loads(l) for l in open('data.jsonl')]"

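Steps 1-3 can be sketched as a single helper (illustrative; `clean` stands in for whatever redaction and normalization function you use):

```python
import json

def to_openai_record(system: str, user: str, assistant: str, clean) -> str:
    """Build one JSONL line in OpenAI's chat fine-tuning format,
    running every content field through the supplied cleaning function."""
    messages = [
        {"role": "system", "content": clean(system)},
        {"role": "user", "content": clean(user)},
        {"role": "assistant", "content": clean(assistant)},
    ]
    # Minified so the record stays on a single line, as JSONL requires
    return json.dumps({"messages": messages}, separators=(",", ":"))
```

Write one returned line per training example, joined by newlines, then run the validation one-liner from step 4.
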
Common mistakes: pretty-printing records across multiple lines (which breaks JSONL's one-object-per-line rule), omitting the assistant message the model is supposed to learn from, and using inconsistent system prompts across records.

Few-shot prompt data preparation

For few-shot prompting (not fine-tuning), the same cleaning principles apply but at smaller scale. You're curating 3-10 examples to include in your prompt.

Best practices:

  1. Diversity: Choose examples that cover different scenarios
  2. Consistency: Keep the format identical across all examples
  3. Brevity: Compress each example to minimize token usage
  4. Privacy: Redact any real PII — use realistic but synthetic data instead

Example prompt with cleaned data:

Given a support ticket, classify the category and urgency.

Example 1:
Input: "[EMAIL] reports product defect, requests replacement"
Output: {"category": "product", "urgency": "high"}

Example 2:
Input: "[EMAIL] asks about shipping timeline for order [ORDER-ID]"
Output: {"category": "shipping", "urgency": "low"}

Now classify:
Input: "Customer reports login issues after password reset"
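
Programmatically, a prompt like the one above can be assembled from a list of cleaned examples (a sketch; the exact layout is just a convention):

```python
def build_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: task description, numbered
    input/output examples, then the query to classify."""
    parts = [task, ""]
    for i, (inp, out) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Input: {inp}", f"Output: {out}", ""]
    parts += ["Now classify:", f"Input: {query}"]
    return "\n".join(parts)

examples = [
    ('"[EMAIL] reports product defect, requests replacement"',
     '{"category": "product", "urgency": "high"}'),
    ('"[EMAIL] asks about shipping timeline for order [ORDER-ID]"',
     '{"category": "shipping", "urgency": "low"}'),
]
```

Keeping the examples in a list makes it trivial to swap them out or enforce the same format across all of them.
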

The bottom line

Clean data produces better models. The workflow is: validate structure → strip PII → normalize formatting → compress tokens → verify. CleanMyPrompt handles all of these steps in your browser, so your training data never touches an external server.

Start with the JSON cleaner or CSV converter to process your next batch of training data.