PDFs are the most common document format in business, but they're terrible for AI. When you copy-paste from a PDF into ChatGPT, you get broken line breaks mid-sentence, page numbers scattered through paragraphs, headers and footers repeated on every page, and unicode artifacts invisible to the eye.
This wastes tokens and confuses the model. Here's how to fix it.
Why PDF text is problematic for LLMs
PDFs store text as positioned graphics, not flowing paragraphs. When you copy text, the PDF viewer has to reconstruct the reading order from those positions, and it often does so poorly. Common artifacts:
Broken line breaks
This is a perfectly normal sen-
tence that was split across
two lines in the PDF layout.
Page artifacts
Annual Report 2025 Page 47
The quarterly results showed...
© ConfidentialCo 2025
Unicode garbage
Zero-width spaces (U+200B), soft hyphens (U+00AD), and non-breaking spaces (U+00A0) are copied along with the visible text. You can't see them, but they still consume tokens.
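To check whether a pasted snippet carries any of these characters, a few lines of Python are enough. This is a minimal sketch: the function name and the sample string are made up for illustration, and the character set covers only the three code points named above, not every invisible character Unicode defines.

```python
# Invisible characters named in the text above (illustrative, not exhaustive).
INVISIBLES = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u00ad": "SOFT HYPHEN",
    "\u00a0": "NO-BREAK SPACE",
}


def find_invisibles(text: str) -> list[tuple[int, str]]:
    """Return (index, name) for every known invisible character in text."""
    return [(i, INVISIBLES[ch]) for i, ch in enumerate(text) if ch in INVISIBLES]


# Hypothetical sample: looks like plain text when printed, but hides three characters.
sample = "quarterly\u200b results\u00a0showed soft\u00adhyphen"

for index, name in find_invisibles(sample):
    print(f"index {index}: {name}")
```

Running it on your own clipboard contents shows quickly whether the text needs cleaning before it goes into a prompt.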
The cleaning pipeline
Step 1: Extract text
Copy from your PDF viewer, or drag-and-drop the PDF file into CleanMyPrompt's upload zone for automatic text extraction.
Step 2: Standard Clean
The Standard mode handles the heavy lifting:
- Merges broken lines: Detects hyphenated line breaks and joins them
- Strips page artifacts: Removes "Page X" markers, "Copyright ©" lines, and repeated headers/footers
- Normalizes whitespace: Collapses multiple spaces, removes zero-width characters
- Fixes unicode: Replaces smart quotes, em-dashes, and other problematic characters with clean ASCII equivalents
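The four operations above can be approximated in a short script. This is a sketch of the general technique, not CleanMyPrompt's actual implementation, which the article does not show. The function name, the regular expressions, and the character map are all assumptions, and the heuristics are deliberately crude: any line ending in "Page N" is dropped, every hyphen at a line end is treated as a split word (so a genuine compound like "well-known" broken across lines would lose its hyphen), and paragraph breaks are not preserved.

```python
import re

# Assumed mapping of problematic characters to ASCII equivalents.
UNICODE_MAP = str.maketrans({
    "\u200b": "",    # zero-width space: drop
    "\u00ad": "",    # soft hyphen: drop
    "\u00a0": " ",   # non-breaking space: regular space
    "\u2018": "'",   # left smart single quote
    "\u2019": "'",   # right smart single quote
    "\u201c": '"',   # left smart double quote
    "\u201d": '"',   # right smart double quote
    "\u2014": "-",   # em dash
})

# Heuristic: a line that ends in "Page N", or starts with a copyright marker.
PAGE_ARTIFACT = re.compile(r"^(?:.*\bPage\s+\d+|(?:©|Copyright\b).*)$", re.IGNORECASE)


def standard_clean(raw: str) -> str:
    """Clean copy-pasted PDF text into a single flowing paragraph."""
    # 1. Fix unicode: remove invisible characters, map the rest to ASCII.
    text = raw.translate(UNICODE_MAP)
    # 2. Strip page artifacts and blank lines.
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line and not PAGE_ARTIFACT.match(line)]
    text = "\n".join(lines)
    # 3. Merge words hyphenated across a line break ("signi-\nficantly").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 4. Join the remaining line breaks and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


RAW = """Annual Report Q4 Highlights Page 12
Our revenue grew signi-
ficantly in Q4, reaching
$4.2M compared to $3.1M
in Q3. The growth was dri-
ven primarily by enterprise
adoption.
© 2025 Acme Corp Page 12"""

print(standard_clean(RAW))
```

The order matters: invisible characters are removed first so a stray soft hyphen or zero-width space can't stop the page-artifact and hyphenation patterns from matching.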
Step 3: (Optional) Compress
If you're paying per token, switch to Squeeze mode after cleaning to further reduce the token count by 25-40%.
Step 4: Format as JSON
For structured extraction tasks, use JSON mode to convert the cleaned text into a structured format that LLMs can process more reliably.
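As a rough illustration of what a structured format could look like, the sketch below wraps cleaned text in a small JSON envelope. The schema (source, paragraph_count, paragraphs) and the function name are hypothetical; the article does not describe what CleanMyPrompt's JSON mode actually emits.

```python
import json


def to_json(cleaned: str, source: str = "unknown") -> str:
    """Wrap cleaned text in a JSON envelope (hypothetical schema)."""
    # Treat blank lines as paragraph boundaries.
    paragraphs = [p.strip() for p in cleaned.split("\n\n") if p.strip()]
    payload = {
        "source": source,
        "paragraph_count": len(paragraphs),
        "paragraphs": paragraphs,
    }
    # ensure_ascii=False keeps legitimate non-ASCII text readable.
    return json.dumps(payload, ensure_ascii=False, indent=2)


print(to_json("Revenue grew in Q4.\n\nGrowth was driven by enterprise adoption.",
              source="annual-report.pdf"))
```

Explicit paragraph boundaries give the model something concrete to reference ("paragraph 2") instead of leaving it to infer structure from whitespace.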
Before and after
Before (raw PDF copy):
Annual Report Q4 Highlights Page 12
Our revenue grew signi-
ficantly in Q4, reaching
$4.2M compared to $3.1M
in Q3. The growth was dri-
ven primarily by enterprise
adoption.
© 2025 Acme Corp Page 12
After (cleaned):
Our revenue grew significantly in Q4, reaching $4.2M compared to $3.1M in Q3. The growth was driven primarily by enterprise adoption.
Token count: 67 → 29 (a 57% reduction).
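The percentage is the drop in tokens divided by the original count. A small helper makes the arithmetic explicit; the function name is illustrative, and the two counts are taken from the example above rather than measured with a tokenizer.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage by which the token count shrank."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# (67 - 29) / 67 = 38 / 67, roughly 56.7%
print(round(reduction_percent(67, 29)))  # 57
```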
Handling large documents
For documents over 10,000 words, use the chunking feature. CleanMyPrompt automatically splits output into LLM-friendly chunks (configurable size) so you can process them sequentially.
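If you want to see what chunking involves, or do it yourself, a word-count splitter is the simplest version. This is a sketch under stated assumptions: the function name and the default size of 2,000 words are invented for illustration, and it cuts purely on word count, so a chunk can end mid-sentence. The article does not say how CleanMyPrompt chooses its boundaries.

```python
def chunk_words(text: str, max_words: int = 2000) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
for number, chunk in enumerate(chunks, start=1):
    print(f"chunk {number}: {chunk}")
```

When you process chunks sequentially, it helps to tell the model which part it is looking at (for example "part 2 of 5") so it doesn't treat each chunk as a complete document.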
Try it
Head to Clean PDF for ChatGPT, paste your PDF text, and see the difference instantly. Everything runs locally in your browser.