CleanMyPrompt
2026-03-22 · CleanMyPrompt Team · 2 min read

How to Clean PDF Text for LLMs (ChatGPT, Claude, Gemini)

PDF text extraction is messy — broken lines, page numbers, headers. Here's how to clean it for AI consumption in seconds.

Tags: pdf, formatting, chatgpt, claude

PDFs are one of the most common document formats in business, but they make poor input for AI. When you copy-paste from a PDF into ChatGPT, you get line breaks in the middle of sentences, page numbers scattered through paragraphs, headers and footers repeated on every page, and invisible Unicode artifacts.

This wastes tokens and confuses the model. Here's how to fix it.

Why PDF text is problematic for LLMs

PDFs store text as glyphs positioned on a page, not as flowing paragraphs. When you copy text, your PDF viewer has to reconstruct the reading order from those positions, and it often does so poorly. Common artifacts:

Broken line breaks

This is a perfectly normal sen-
tence that was split across
two lines in the PDF layout.
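
A minimal sketch of how this kind of repair can work, assuming paragraphs are separated by blank lines and that a trailing hyphen followed by a lowercase letter marks a split word. The function name `join_broken_lines` is made up for illustration and is not CleanMyPrompt's actual code.

```python
import re

def join_broken_lines(text: str) -> str:
    """Rejoin words split by a line-end hyphen, then unwrap soft line breaks."""
    # "sen-\ntence" -> "sentence": hyphen at end of line, lowercase letter on the next
    text = re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)
    # A single newline inside a paragraph becomes a space;
    # blank lines (paragraph breaks) are left alone
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

raw = (
    "This is a perfectly normal sen-\n"
    "tence that was split across\n"
    "two lines in the PDF layout."
)
print(join_broken_lines(raw))
```

One known weakness of this heuristic: a genuine compound that happens to break at its hyphen (for example "well-" / "known") is merged into one word, so a dictionary check would be needed for higher accuracy.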

Page artifacts

Annual Report 2025                                    Page 47

The quarterly results showed...

© ConfidentialCo 2025
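
One way to approach this, sketched under assumptions: page numbers look like "Page N" at the end of a line, copyright footers start with "©", "(c)" or "Copyright", and any line that repeats at least `repeat_threshold` times is a running header or footer. All names here are hypothetical.

```python
import re
from collections import Counter

# "Page N" on its own line, or separated from the header text by a wide gap
PAGE_NUMBER = re.compile(r"(?:^|\s{2,})Page\s+\d+\s*$", re.IGNORECASE)
COPYRIGHT = re.compile(r"^\s*(©|\(c\)|copyright\b)", re.IGNORECASE)

def strip_page_artifacts(text: str, repeat_threshold: int = 3) -> str:
    """Drop page numbers, copyright footers, and lines repeated across pages."""
    # Remove page numbers first so "Header   Page 1" and "Header   Page 2"
    # collapse to the same line and can be counted as repeats
    lines = [PAGE_NUMBER.sub("", line) for line in text.split("\n")]
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        key = line.strip()
        if key and (COPYRIGHT.match(key) or counts[key] >= repeat_threshold):
            continue  # copyright footer or repeated header/footer
        kept.append(line)
    return "\n".join(kept)

pages = "\n".join([
    "Annual Report 2025      Page 1", "Intro paragraph.", "© ConfidentialCo 2025",
    "Annual Report 2025      Page 2", "The quarterly results showed growth.", "© ConfidentialCo 2025",
    "Annual Report 2025      Page 3", "Closing remarks.", "© ConfidentialCo 2025",
])
print(strip_page_artifacts(pages))
```

The repeat count is a blunt instrument: a short body line that legitimately appears three times would also be dropped, so the threshold is worth tuning per document.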

Unicode garbage

Zero-width spaces (U+200B), soft hyphens (U+00AD), and non-breaking spaces (U+00A0) are copied along with the visible text. You can't see them, but they still consume tokens.
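
A small sketch of stripping these characters with a translation table. The byte-order mark (U+FEFF) is an extra entry added here as an assumption; the function name `strip_invisible` is illustrative.

```python
# Map each invisible character to its replacement
INVISIBLE = str.maketrans({
    "\u200b": "",   # zero-width space: delete
    "\u00ad": "",   # soft hyphen: delete
    "\ufeff": "",   # byte-order mark (assumed extra): delete
    "\u00a0": " ",  # non-breaking space: turn into a regular space
})

def strip_invisible(text: str) -> str:
    """Remove zero-width characters and soft hyphens; normalize non-breaking spaces."""
    return text.translate(INVISIBLE)

print(strip_invisible("to\u200bken\u00adized\u00a0text"))
```

Zero-width joiners are deliberately left out of the table, since some scripts and emoji sequences rely on them.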

The cleaning pipeline

Step 1: Extract text

Copy from your PDF viewer, or drag-and-drop the PDF file into CleanMyPrompt's upload zone for automatic text extraction.

Step 2: Standard Clean

The Standard mode handles the heavy lifting:

  • Merges broken lines: Detects hyphenated line breaks and joins them
  • Strips page artifacts: Removes "Page X", "Copyright ©", repeated headers/footers
  • Normalizes whitespace: Collapses multiple spaces, removes zero-width characters
  • Fixes Unicode: Replaces smart quotes, em-dashes, and other problematic characters with clean ASCII equivalents
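
The last two bullets can be approximated in a few lines. This is a rough sketch, not the tool's implementation: the character map is a small assumed subset, and `normalize_text` is a hypothetical name.

```python
import re

# Assumed subset of typographic characters and their ASCII stand-ins
ASCII_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
})

def normalize_text(text: str) -> str:
    """Replace typographic characters with ASCII and collapse extra whitespace."""
    text = text.translate(ASCII_MAP)
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)   # runs of blank lines -> one blank line
    return text.strip()

print(normalize_text("\u201cQ4\u201d   results \u2014 it\u2019s up\u2026"))
```

Mapping to ASCII loses some typographic nuance, which is usually an acceptable trade when the only reader is a model.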

Step 3: (Optional) Compress

If you're paying per token, switch to Squeeze mode after cleaning to further reduce the token count by 25-40%.

Step 4: Format as JSON

For structured extraction tasks, use JSON mode to convert the cleaned text into a structured format that LLMs can process more reliably.
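
The article doesn't specify what JSON mode outputs, so the schema below (`source`, `paragraph_count`, `paragraphs`) is purely an assumption. It only illustrates the general idea of wrapping cleaned paragraphs in a structure a model can reference by index.

```python
import json

def to_json_payload(cleaned_text: str, source: str = "unknown.pdf") -> str:
    """Wrap cleaned paragraphs in a simple JSON structure (hypothetical schema)."""
    paragraphs = [p.strip() for p in cleaned_text.split("\n\n") if p.strip()]
    payload = {
        "source": source,
        "paragraph_count": len(paragraphs),
        "paragraphs": [
            {"index": i, "text": p} for i, p in enumerate(paragraphs, start=1)
        ],
    }
    # ensure_ascii=False keeps any legitimate non-ASCII text readable
    return json.dumps(payload, ensure_ascii=False, indent=2)

print(to_json_payload("Our revenue grew in Q4.\n\nGrowth was driven by enterprise adoption.", "report.pdf"))
```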

Before and after

Before (raw PDF copy):

Annual Report        Q4 Highlights                         Page 12

Our revenue grew signi-
ficantly in Q4, reaching
$4.2M compared to $3.1M
in Q3. The growth was dri-
ven primarily by enterprise
adoption.

© 2025 Acme Corp                                          Page 12

After (cleaned):

Our revenue grew significantly in Q4, reaching $4.2M compared to $3.1M in Q3. The growth was driven primarily by enterprise adoption.

Token count: 67 → 29 (57% reduction).
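
The percentage is simply the tokens saved divided by the original count. A tiny helper makes the arithmetic explicit (the function name is illustrative; the token counts themselves depend on which tokenizer you measure with):

```python
def token_reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens saved relative to the original count."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return 100 * (before - after) / before

# (67 - 29) / 67 = 38 / 67, which rounds to 57%
print(round(token_reduction_pct(67, 29)))
```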

Handling large documents

For documents over 10,000 words, use the chunking feature. CleanMyPrompt automatically splits output into LLM-friendly chunks (configurable size) so you can process them sequentially.
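
If you want to chunk text yourself, one simple approach is to pack whole paragraphs into chunks up to a word budget. This sketch assumes blank-line paragraph breaks and counts words rather than tokens; `chunk_text` and `max_words` are hypothetical names, not CleanMyPrompt settings.

```python
def chunk_text(text: str, max_words: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_words words each."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "\n\n".join(["one two three", "four five", "six seven eight nine"])
print(chunk_text(sample, max_words=5))
```

Splitting on paragraph boundaries keeps sentences intact. The trade-off is that a single paragraph longer than `max_words` still ends up as one oversized chunk.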

Try it

Head to Clean PDF for ChatGPT — paste your PDF text and see the difference instantly. Everything runs locally in your browser.
