How to Clean PDF Text for LLMs (ChatGPT, Claude, Gemini)

2026-03-22

PDFs are among the most common document formats in business, but they make poor input for LLMs. When you copy-paste from a PDF into ChatGPT, you get line breaks in the middle of sentences, page numbers scattered through paragraphs, headers and footers repeated on every page, and Unicode artifacts that are invisible to the eye.

This wastes tokens and confuses the model. Here's how to fix it.

Why PDF text is problematic for LLMs

PDFs store text as positioned glyphs, not flowing paragraphs. When you copy text, your PDF viewer has to reconstruct the reading order from those positions, and it often does so poorly. Common artifacts:

Broken line breaks

This is a perfectly normal sen-
tence that was split across
two lines in the PDF layout.
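
To make the repair concrete, here is a minimal Python sketch of one way to undo this kind of wrapping. It is an illustration, not CleanMyPrompt's actual implementation; the function name `unwrap_lines` and the two regular expressions are assumptions chosen for this example.

```python
import re

def unwrap_lines(text: str) -> str:
    """Rejoin words hyphenated across line breaks, then unwrap single newlines."""
    # "sen-\ntence" -> "sentence": a letter, a hyphen, a newline, then a lowercase letter.
    text = re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)
    # A lone newline inside a paragraph becomes a space; blank lines (paragraph breaks) stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

raw = (
    "This is a perfectly normal sen-\n"
    "tence that was split across\n"
    "two lines in the PDF layout."
)
print(unwrap_lines(raw))
# This is a perfectly normal sentence that was split across two lines in the PDF layout.
```

One known limitation of this heuristic: a genuinely hyphenated compound that happens to break at the hyphen (for example "well-" at the end of a line, "known" on the next) is also joined, producing "wellknown". A dictionary check would be needed to tell the two cases apart.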

Page artifacts

Annual Report 2025                                    Page 47

The quarterly results showed...

© ConfidentialCo 2025
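
A rough sketch of how lines like these could be filtered out in Python. The patterns below are hypothetical and tuned only to the sample above (a right-aligned "Page N" and a line starting with ©); real documents usually need their own patterns.

```python
import re

# Hypothetical patterns tuned to the sample above; adjust them per document.
ARTIFACT_PATTERNS = [
    re.compile(r"\s{2,}Page\s+\d+\s*$"),  # running header ending in a right-aligned "Page 47"
    re.compile(r"^\s*©"),                 # copyright footer line
]

def strip_page_artifacts(text: str) -> str:
    """Drop whole lines that look like running headers or footers."""
    kept = [
        line for line in text.splitlines()
        if not any(p.search(line) for p in ARTIFACT_PATTERNS)
    ]
    return "\n".join(kept).strip()

sample = (
    "Annual Report 2025                                    Page 47\n"
    "\n"
    "The quarterly results showed...\n"
    "\n"
    "© ConfidentialCo 2025"
)
print(strip_page_artifacts(sample))
# The quarterly results showed...
```

Requiring at least two spaces before "Page" is a deliberate choice: it keeps ordinary sentences such as "See Page 12 for details." from being deleted by mistake.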

Unicode garbage

Zero-width spaces (U+200B), soft hyphens (U+00AD), and non-breaking spaces (U+00A0) are copied along invisibly and still consume tokens.
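
A small Python sketch of stripping these characters, assuming you want zero-width spaces and soft hyphens deleted and non-breaking spaces turned into ordinary spaces. The function name `strip_invisible` is made up for this example.

```python
# Map code points to replacements: None deletes the character.
INVISIBLE_MAP = {
    0x200B: None,  # zero-width space: delete
    0x00AD: None,  # soft hyphen: delete
    0x00A0: " ",   # non-breaking space: replace with a normal space
}

def strip_invisible(text: str) -> str:
    """Remove or normalize invisible characters commonly copied from PDFs."""
    return text.translate(INVISIBLE_MAP)

dirty = "to\u200bken\u00a0count with a so\u00adft hyphen"
print(strip_invisible(dirty))
# token count with a soft hyphen
```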

The cleaning pipeline

Step 1: Extract text

Copy from your PDF viewer, or drag-and-drop the PDF file into CleanMyPrompt's upload zone for automatic text extraction.

Step 2: Standard Clean

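Standard mode handles the heavy lifting: it rejoins words and sentences broken across lines and removes page headers, footers, and page numbers, as the before-and-after example below shows.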

Step 3: (Optional) Compress

If you're paying per token, switch to Squeeze mode after cleaning to further reduce the token count by 25-40%.

Step 4: Format as JSON

For structured extraction tasks, use JSON mode to convert the cleaned text into a structured format that LLMs can process more reliably.
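
The article does not say what shape JSON mode produces, so the sketch below is only an illustration of the idea: wrapping cleaned text in a simple structure. The keys `source` and `paragraphs` and the function name `to_json` are hypothetical, not the tool's schema.

```python
import json

def to_json(cleaned: str, source: str = "document.pdf") -> str:
    """Wrap cleaned text in a simple JSON structure (hypothetical shape)."""
    # Treat blank lines as paragraph boundaries and drop empty fragments.
    paragraphs = [p.strip() for p in cleaned.split("\n\n") if p.strip()]
    payload = {"source": source, "paragraphs": paragraphs}
    # ensure_ascii=False keeps characters such as © readable instead of escaped.
    return json.dumps(payload, ensure_ascii=False, indent=2)

print(to_json("Revenue grew in Q4.\n\nGrowth was driven by enterprise adoption."))
```

Explicit paragraph boundaries give the model something unambiguous to refer to, for example "summarize paragraph 2".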

Before and after

Before (raw PDF copy):

Annual Report        Q4 Highlights                         Page 12

Our revenue grew signi-
ficantly in Q4, reaching
$4.2M compared to $3.1M
in Q3. The growth was dri-
ven primarily by enterprise
adoption.

© 2025 Acme Corp                                          Page 12

After (cleaned):

Our revenue grew significantly in Q4, reaching $4.2M compared to $3.1M in Q3. The growth was driven primarily by enterprise adoption.

Token count: 67 → 29 (57% reduction).
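
For reference, the percentage is the tokens saved divided by the original count. A tiny Python check using the figures quoted above (actual token counts depend on the tokenizer your model uses):

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage of tokens saved relative to the original count."""
    return (before - after) / before * 100

print(round(reduction_percent(67, 29)))
# 57
```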

Handling large documents

For documents over 10,000 words, use the chunking feature. CleanMyPrompt automatically splits output into LLM-friendly chunks (configurable size) so you can process them sequentially.
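
If you want to see what chunking involves, here is a minimal Python sketch that packs whole paragraphs into chunks under a word budget. It is not CleanMyPrompt's algorithm; the function name `chunk_paragraphs` and the default of 2,000 words are assumptions for illustration.

```python
def chunk_paragraphs(text: str, max_words: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_words words where possible."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "a b c\n\nd e\n\nf g h i"
print(chunk_paragraphs(doc, max_words=5))
# ['a b c\n\nd e', 'f g h i']
```

Splitting on paragraph boundaries keeps each chunk readable on its own. The trade-off is that a single paragraph longer than the budget is not split and becomes one oversized chunk.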

Try it

Head to Clean PDF for ChatGPT — paste your PDF text and see the difference instantly. Everything runs locally in your browser.