Why Redaction Matters Before You Prompt
Sending unredacted text to AI models creates two distinct risks:
Security risk: API keys, database credentials, private keys, and authentication tokens transmitted to third-party AI infrastructure can be exposed through logging, breaches, or session syncing. A leaked Stripe live key can generate fraudulent charges within seconds of exposure.
Compliance risk: Under GDPR Article 6, processing personal data requires a lawful basis. Pasting customer names, emails, and phone numbers into ChatGPT is unlikely to be covered by the basis you originally collected them under. Under HIPAA, sharing protected health information with a vendor, including as text sent to an AI service, requires a business associate agreement (BAA), which most AI providers don't offer on free tiers.
The fix is not to stop using AI tools. The fix is a redaction step before text reaches the model.
The Two Categories of Sensitive Data
Category 1: Secrets and Credentials
These are the most urgent — automated systems can abuse them within seconds:
| Type | Example pattern | Risk level |
|---|---|---|
| OpenAI API key | sk-proj-... | HIGH |
| Anthropic key | sk-ant-api03-... | HIGH |
| AWS key pair | AKIA... + secret | HIGH |
| Stripe live key | sk_live_... | HIGH |
| GitHub token | ghp_..., ghs_... | HIGH |
| Slack token | xoxb-..., xoxp-... | HIGH |
| Database URL | postgres://user:pass@host/db | HIGH |
| Private key PEM | -----BEGIN RSA PRIVATE KEY----- | HIGH |
| Bearer token | Authorization: Bearer eyJ... | HIGH |
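As a rough illustration of how the prefixes in this table translate into rules, here is a minimal Python sketch. The character classes and minimum lengths are assumptions made for the example, not official key formats, and the labels (`[STRIPE-KEY]` and so on) are illustrative.

```python
import re

# Illustrative patterns only: prefixes come from the table above,
# but the lengths and character sets are assumptions, not official formats.
SECRET_PATTERNS = {
    "OPENAI-KEY": re.compile(r"\bsk-proj-[A-Za-z0-9_-]{20,}"),
    "ANTHROPIC-KEY": re.compile(r"\bsk-ant-api03-[A-Za-z0-9_-]{20,}"),
    "AWS-ACCESS-KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "STRIPE-KEY": re.compile(r"\bsk_live_[A-Za-z0-9]{10,}"),
    "GITHUB-TOKEN": re.compile(r"\bgh[ps]_[A-Za-z0-9]{36,}"),
    "SLACK-TOKEN": re.compile(r"\bxox[bp]-[A-Za-z0-9-]{10,}"),
    "DATABASE-URL": re.compile(r"\bpostgres(?:ql)?://[^\s:/@]+:[^\s@]+@[^\s/]+/\S*"),
    "PRIVATE-KEY": re.compile(
        r"-----BEGIN (?:RSA )?PRIVATE KEY-----[\s\S]*?-----END (?:RSA )?PRIVATE KEY-----"
    ),
    # Lookbehind keeps the "Bearer " prefix so the model still sees the header shape.
    "BEARER-TOKEN": re.compile(r"(?<=Bearer )eyJ[A-Za-z0-9_.-]+"),
}


def redact_secrets(text: str) -> str:
    """Replace every match of every pattern with its bracketed label."""
    for label, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact_secrets('STRIPE="sk_live_abcdefghij1234"'))
print(redact_secrets("Authorization: Bearer eyJhbGci.payload.sig"))
```

A production rule set would need many more patterns and regular updates as providers change their key formats.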
Category 2: Personal Data (PII)
These trigger compliance obligations:
| Type | Example | Regulations |
|---|---|---|
| Email address | jane@company.com | GDPR, CCPA |
| Phone number | +1 (555) 867-5309 | GDPR, CCPA |
| SSN / National ID | 123-45-6789 | HIPAA, GDPR |
| Credit card | 4111 1111 1111 1111 | PCI-DSS |
| IBAN | DE89 3704 0044 0532... | GDPR |
| IP address | 192.168.1.105 | GDPR |
| Name + context | Dr. Sarah Johnson | GDPR |
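A few of these PII types can be matched with plain regular expressions. The sketch below covers email, SSN, and credit card, and uses the Luhn checksum to avoid flagging arbitrary 16-digit numbers as cards. The patterns are simplified assumptions for illustration and will miss many real-world variants.

```python
import re

# Simplified, illustrative patterns (assumptions, not a complete rule set).
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    # Only redact digit runs that are plausible card numbers.
    text = CARD.sub(lambda m: "[CARD]" if luhn_valid(m.group()) else m.group(), text)
    text = SSN.sub("[SSN]", text)
    return text


print(redact_pii("Mail jane@company.com, card 4111 1111 1111 1111, SSN 123-45-6789"))
# Mail [EMAIL], card [CARD], SSN [SSN]
```

The checksum step is a design choice: it trades a little recall for far fewer false positives on order numbers and tracking IDs.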
Redaction vs. Anonymization vs. Pseudonymization
These terms are often used interchangeably, but they mean different things legally and technically:
Redaction — Replace the value with a generic label. john@example.com → [EMAIL]. The label is consistent, so the model understands structure but cannot reconstruct the original.
Anonymization — Irreversibly remove identifying information. Under GDPR, truly anonymized data is outside the regulation's scope. But true anonymization is hard — a name + employer + city can re-identify someone even without an email address.
Pseudonymization — Replace identifiers with pseudonyms or tokens. john@example.com → user_a7f2c. Reversible with a mapping table. GDPR still applies to pseudonymized data, but it's treated as a risk-reduction measure.
For AI prompts, redaction is the right tool. You want the model to understand the structure of the data without seeing the actual values.
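To make the difference concrete, here is a small sketch that applies both redaction and pseudonymization to the same text. The `user_` token format mirrors the example above; using a keyed HMAC to derive the token, and the helper names, are assumptions of this sketch rather than a prescribed method.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def redact(text: str) -> str:
    """Redaction: every email becomes the same generic label."""
    return EMAIL.sub("[EMAIL]", text)


def pseudonymize(text: str, key: bytes):
    """Pseudonymization: each distinct email gets a stable token, plus a mapping."""
    mapping = {}

    def repl(match):
        original = match.group()
        digest = hmac.new(key, original.lower().encode(), hashlib.sha256).hexdigest()
        token = "user_" + digest[:5]  # short for readability; collisions are possible
        mapping[token] = original
        return token

    return EMAIL.sub(repl, text), mapping


def restore(text: str, mapping: dict) -> str:
    """Reverse the pseudonymization using the mapping table."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


text = "john@example.com wrote to jane@example.com, cc john@example.com"
print(redact(text))
pseudo, table = pseudonymize(text, key=b"demo-key")
print(pseudo)
print(restore(pseudo, table) == text)
```

The mapping table is exactly what keeps pseudonymized data within GDPR scope: whoever holds it can recover the originals.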
How to Redact: Three Approaches
Approach 1: Manual Review (Doesn't Scale)
Ctrl+F for known patterns. Works for one-off cases. Fails for:
- Patterns you haven't thought to check (NPI numbers, Twilio SIDs, IBAN variants)
- High-volume workflows (50 support tickets a day)
- Junior team members who don't know every sensitive pattern
Approach 2: Rule-Based Automated Redaction (Recommended)
Regex-based pattern matching against a comprehensive rule set. Fast, deterministic, explainable. CleanMyPrompt uses this approach with 35+ pattern groups across secrets, PII, and code-context rules.
# CLI: redact a file in-place
npm install -g cleanmyprompt
cmp fix customer-data.txt
# REST API: redact programmatically
curl -X POST https://cleanmyprompt.io/api/v1/clean \
-H "Content-Type: application/json" \
-d '{"text": "Contact john@acme.com re: sk_live_xxx charge", "redact": true}'
# Returns: "Contact [EMAIL] re: [STRIPE-KEY] charge"
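The same REST call can be made from Python with only the standard library. The endpoint and JSON fields are taken from the curl example; the response schema is not documented here, so this sketch returns the raw body and leaves parsing to you. The function names are hypothetical.

```python
import json
import urllib.request

API_URL = "https://cleanmyprompt.io/api/v1/clean"  # endpoint from the curl example


def build_clean_request(text: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = json.dumps({"text": text, "redact": True}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def clean(text: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body.

    Assumption: the response schema is unknown, so no parsing is attempted.
    """
    with urllib.request.urlopen(build_clean_request(text), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Splitting request construction from sending makes the payload easy to inspect or unit-test without touching the network.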
Approach 3: NLP-Based Entity Recognition (For Edge Cases)
NER-based tools (spaCy, Amazon Comprehend, Microsoft Presidio) catch names and organizations that regex misses. They are slower, require extra infrastructure, and produce false positives. Best used as a second pass after regex redaction for high-stakes use cases.
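One way to structure the "second pass" is to keep the entity recognizer pluggable. In the sketch below, `recognizer` is any callable that returns `(start, end, label)` spans; `toy_recognizer` is a stand-in written for this example, not a real NER model. With spaCy, for instance, a recognizer could return `(ent.start_char, ent.end_char, ent.label_)` for each entity in `nlp(text).ents`.

```python
import re
from typing import Callable, Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def regex_pass(text: str) -> str:
    """First pass: deterministic pattern rules (only email here, for brevity)."""
    return EMAIL.sub("[EMAIL]", text)


def ner_pass(text: str, recognizer: Callable[[str], Iterable[Span]]) -> str:
    """Second pass: replace spans reported by the recognizer.

    Spans are applied right to left so earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    for start, end, label in sorted(recognizer(text), reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def redact(text: str, recognizer: Callable[[str], Iterable[Span]]) -> str:
    return ner_pass(regex_pass(text), recognizer)


def toy_recognizer(text: str):
    """Stand-in for a real NER model: matches 'Dr. First Last' only."""
    pattern = re.compile(r"Dr\. [A-Z][a-z]+ [A-Z][a-z]+")
    return [(m.start(), m.end(), "PERSON") for m in pattern.finditer(text)]


print(redact("Dr. Sarah Johnson (sarah@clinic.org) called", toy_recognizer))
# [PERSON] ([EMAIL]) called
```

Running regex first means the slower model only has to find what the rules could not.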
The Patterns That Get Missed Most
Rule-based redaction catches known patterns. These are the ones that commonly slip through even good scanners:
Contextual secrets — Values that don't have a recognizable prefix but are clearly secrets by context:
api_key = "a7f2c9b4e1d8..." # high-entropy string in assignment context
SECRET_KEY = "xJ9kLmPqRs..." # .env-style pattern
process.env.TOKEN = "..." # Node.js env assignment
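A common heuristic for contextual secrets like these is to combine the variable name with the Shannon entropy of the assigned value. The sketch below is one possible version; the keyword list, the 3.0-bit threshold, and the 12-character minimum are assumptions that would need tuning on real data.

```python
import math
import re
from collections import Counter

# Name contains a secret-like keyword, followed by a quoted value.
ASSIGNMENT = re.compile(
    r"(?P<name>[\w.]*(?:key|secret|token|password)\w*)\s*[:=]\s*"
    r"(?P<quote>[\"'])(?P<value>[^\"'\n]+)(?P=quote)",
    re.IGNORECASE,
)


def shannon_entropy(s: str) -> float:
    """Entropy in bits per character of the string's character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def redact_contextual(text: str, min_entropy: float = 3.0, min_length: int = 12) -> str:
    def repl(m):
        value = m.group("value")
        if len(value) < min_length or shannon_entropy(value) < min_entropy:
            return m.group()  # short or repetitive: probably not a secret
        whole = m.group()
        start = m.start("value") - m.start()
        end = m.end("value") - m.start()
        return whole[:start] + "[SECRET]" + whole[end:]

    return ASSIGNMENT.sub(repl, text)


print(redact_contextual('api_key = "a7f2c9b4e1d8f3a6"'))
print(redact_contextual('theme_key = "dark"'))
```

The first call redacts the value; the second leaves `"dark"` alone because it is too short to look like a credential.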
Composite identifiers — No single field is sensitive, but combined they are:
"Sarah Johnson, DOB 1985-03-15, Acme Corp, New York"
Encoded secrets — Base64 or hex-encoded credentials embedded in config:
{"auth": "c2tfbGl2ZV94eHh4eHh4eA=="} // base64-encoded Stripe key (sk_live_xxxxxxxx)
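One approach to encoded secrets is to decode anything that looks like base64 and rescan the result. The sketch below does this for a single layer of encoding. The candidate pattern, the 16-character minimum, and the small `SECRET` rule are assumptions for illustration; `SECRET` accepts both `-` and `_` separators to stay tolerant of loosely formatted examples.

```python
import base64
import binascii
import re

B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
# Tiny illustrative rule set to apply to the decoded text.
SECRET = re.compile(r"sk[-_]live[-_][A-Za-z0-9]+|AKIA[0-9A-Z]{16}")


def redact_encoded(text: str) -> str:
    def repl(m):
        try:
            decoded = base64.b64decode(m.group(), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return m.group()  # not valid base64, or not text: leave untouched
        return "[ENCODED-SECRET]" if SECRET.search(decoded) else m.group()

    return B64_CANDIDATE.sub(repl, text)


blob = base64.b64encode(b"sk_live_xxxxxxxx").decode()
print(redact_encoded('{"auth": "' + blob + '"}'))
# {"auth": "[ENCODED-SECRET]"}
```

Hex and nested encodings would need additional decoders, and every extra decoding layer increases scan time.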
Implicit secrets — Values that are only sensitive in context:
database:
password: my-prod-password-2026 # low entropy, not a token, but definitely sensitive
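Low-entropy values like this one can only be caught by looking at the key name. A minimal sketch, assuming simple `key: value` or `KEY=value` lines and a hypothetical list of sensitive key names:

```python
import re

# Matches lines whose key name suggests a secret, regardless of the value's shape.
SENSITIVE_KEY = re.compile(
    r"^(?P<prefix>[ \t]*(?:[\w-]*_)?(?:password|passwd|secret|token|api[_-]?key)"
    r"[ \t]*[:=][ \t]*)(?P<value>[^\s#]+)",
    re.IGNORECASE | re.MULTILINE,
)


def redact_by_key_name(text: str) -> str:
    """Keep the key and any trailing comment; replace only the value."""
    return SENSITIVE_KEY.sub(lambda m: m.group("prefix") + "[REDACTED]", text)


config = "database:\n  host: db.internal\n  password: my-prod-password-2026  # prod\n"
print(redact_by_key_name(config))
```

This does not parse YAML, so quoted values containing spaces and multi-line strings are out of scope for the sketch.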
Building a Redaction Workflow for Your Team
Step 1: Establish What's Sensitive
Create a data classification policy. At minimum: credentials (always HIGH), customer PII (MEDIUM+), internal IDs (LOW). Communicate this to the team with examples.
Step 2: Choose the Right Tool
| If you... | Use... |
|---|---|
| Paste into AI tools manually | Web app or browser extension |
| Work in VS Code with Copilot | VS Code extension |
| Process files in CI/CD | CLI with cmp fix --recursive |
| Have a multi-step pipeline | REST API |
Step 3: Add Pre-Commit Protection
cmp install-hook # adds .git/hooks/pre-commit
Blocks commits that include secrets. No configuration needed. Works alongside gitleaks and detect-secrets.
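For teams curious about what a pre-commit check does under the hood, here is a rough sketch of the core step: scanning only the added lines of a diff. It is not how `cmp install-hook` is implemented; the function name and the three patterns are assumptions for illustration.

```python
import re

# Small illustrative rule set; a real hook would use a much larger one.
SECRET = re.compile(
    r"\bsk_live_[A-Za-z0-9]{10,}|\bAKIA[0-9A-Z]{16}\b|\bgh[ps]_[A-Za-z0-9]{36,}"
)


def secrets_in_added_lines(diff_text: str):
    """Return (file, redacted line) for each added line that contains a secret."""
    findings = []
    current_file = "<unknown>"
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:]
            current_file = path[2:] if path.startswith("b/") else path
        elif line.startswith("+") and SECRET.search(line):
            # Report the line with the secret masked so logs stay clean.
            findings.append((current_file, SECRET.sub("[SECRET]", line[1:]).strip()))
    return findings


diff = "\n".join([
    "diff --git a/config.py b/config.py",
    "--- a/config.py",
    "+++ b/config.py",
    "@@ -1,1 +1,2 @@",
    '+STRIPE = "sk_live_abcdefghij1234"',
    '-OLD = "AKIAIOSFODNN7EXAMPLE"',
    "+DEBUG = True",
])
print(secrets_in_added_lines(diff))
# [('config.py', 'STRIPE = "[SECRET]"')]
```

A hook script would feed this function the output of `git diff --cached` and exit with a non-zero status when the list is non-empty, which makes Git abort the commit.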
Step 4: Validate With Tests
For teams building their own redaction logic:
// Always test your rules against:
// 1. Known-good examples from each pattern category
// 2. Edge cases (empty values, already-redacted values, similar-but-not-sensitive strings)
// 3. Idempotency: redacting an already-redacted file should produce identical output
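The three checks above might look like this in practice. The `redact` function here is a one-rule stand-in for your own logic, and the test names are hypothetical; the functions follow pytest naming but can also be called directly.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def redact(text: str) -> str:
    """Stand-in for the redaction logic under test (email rule only)."""
    return EMAIL.sub("[EMAIL]", text)


def test_known_good_example():
    assert redact("Contact john@acme.com") == "Contact [EMAIL]"


def test_edge_cases():
    assert redact("") == ""                                    # empty value
    assert redact("[EMAIL]") == "[EMAIL]"                      # already redacted
    assert redact("ping @john at 5pm") == "ping @john at 5pm"  # similar but not sensitive


def test_idempotent():
    once = redact("a@b.io and c@d.org")
    assert redact(once) == once


if __name__ == "__main__":
    test_known_good_example()
    test_edge_cases()
    test_idempotent()
    print("all redaction tests passed")
```

Idempotency matters because redaction often runs more than once in a pipeline, for example in an editor extension and again in CI.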
The Quick Check: Are You Already Leaking?
Go to cleanmyprompt.io, paste the last text you sent to ChatGPT or Copilot Chat, and toggle Redact PII. If findings appear, you sent those values to the model.
For ongoing protection: install the VS Code extension or browser extension. Both run locally — nothing leaves your machine.
See also: API Key Leaks in AI Prompts — Real-World Risks · GDPR-Compliant AI Prompts