Engineering March 2026

Building an AI proofreader that actually shows you what changed


ProofMate Engineering

How we built it

6 min read

Most AI writing tools work the same way: paste text in, get rewritten text back, squint at both versions trying to figure out what actually changed. This is fine for a paragraph. It falls apart completely for a 60-page thesis.

We built ProofMate to solve a specific problem: check entire PDF documents for errors and show every correction as an inline diff, with the original and corrected versions side by side and the changed words highlighted. Here is how the processing pipeline works and what we learned about making LLM output useful for proofreading.

The pipeline

The system has four stages. Each stage after the first exists because the output of the previous one is not directly useful to a human.

1. Text extraction and cleanup

We extract text from the uploaded PDF page by page. This sounds trivial until you deal with real-world documents. Academic PDFs — especially those generated by typesetting systems — split words across lines with hyphens ("docu-ment", "re-search"). Some producers encode special characters incorrectly (diacritics rendered as two separate characters). Page numbers appear as isolated digits.

The extraction step cleans up most of this: it reconstructs broken characters, strips page-number lines, and skips empty pages. Line-break hyphens survive extraction and are handled later, by the prompt and the post-processing filter. Crucially, we preserve the page number mapping so every correction can be traced back to its exact location in the original document.
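
As an illustration of this cleanup, here is a minimal Python sketch. It assumes the PDF library has already produced one raw string per page, that "two separate characters" means a base letter followed by a combining mark (which Unicode NFC normalization merges), and that a page-number line is a line holding only one to four digits. The function name and the regex are hypothetical, chosen for the example.

```python
import re
import unicodedata

# Assumption: a page-number line contains nothing but 1 to 4 digits.
PAGE_NUMBER_LINE = re.compile(r"^\s*\d{1,4}\s*$")


def clean_pages(raw_pages):
    """Return {page_number: cleaned_text} for non-empty pages (1-based)."""
    cleaned = {}
    for page_number, raw in enumerate(raw_pages, start=1):
        # NFC merges a base letter plus combining mark into one character.
        text = unicodedata.normalize("NFC", raw)
        # Drop lines that look like isolated page numbers.
        lines = [ln for ln in text.splitlines() if not PAGE_NUMBER_LINE.match(ln)]
        text = "\n".join(lines).strip()
        # Skip empty pages, but key by the original page number so a
        # correction can still be traced back to its page.
        if text:
            cleaned[page_number] = text
    return cleaned
```

Keying the result by the original page number, rather than renumbering after empty pages are dropped, is what keeps the page mapping intact.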

2. LLM-based error detection

Each page goes to a language model with a structured prompt. The model returns an array of errors, where each error contains the original sentence, the corrected sentence, an error type (typo, grammar, style, punctuation), a confidence score, and a brief explanation.

The critical design choice: we ask the model to return the original and the corrected sentence with identical surrounding context. It returns not just the wrong word and the right word, but the sentence around them, with several words of context on each side and only the problematic part changed. This constraint is what makes diff generation possible downstream.
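
To make the shape of one error record concrete, here is a hedged Python sketch of parsing the model response. The JSON key names ("original", "corrected", "type", "confidence", "explanation") and the 0 to 1 confidence scale are assumptions for the example; the article only states which fields exist, not how they are named or encoded.

```python
import json
from dataclasses import dataclass

ERROR_TYPES = {"typo", "grammar", "style", "punctuation"}


@dataclass(frozen=True)
class Correction:
    page: int
    original: str
    corrected: str
    error_type: str
    confidence: float  # assumed to be on a 0..1 scale
    explanation: str


def parse_corrections(page: int, model_output: str) -> list[Correction]:
    """Parse a JSON array of errors, skipping malformed entries."""
    try:
        items = json.loads(model_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    results = []
    for item in items:
        try:
            correction = Correction(
                page=page,
                original=str(item["original"]),
                corrected=str(item["corrected"]),
                error_type=str(item["type"]),
                confidence=float(item["confidence"]),
                explanation=str(item.get("explanation", "")),
            )
        except (KeyError, TypeError, ValueError, AttributeError):
            continue  # entry is missing a field or has the wrong shape
        if correction.error_type in ERROR_TYPES and 0.0 <= correction.confidence <= 1.0:
            results.append(correction)
    return results
```

Treating the model output as untrusted input, and dropping anything that does not fit the schema, keeps one bad entry from discarding a whole page of results.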

Pages are checked concurrently and results stream to the browser via Server-Sent Events as each page completes. The user sees corrections appearing within seconds of uploading, not after the entire document is processed.
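
For readers unfamiliar with Server-Sent Events, the wire format is plain text: an optional "event:" line, one or more "data:" lines, and a blank line that ends the message. The sketch below formats one such message in Python. The event name "page_done" and the payload fields are hypothetical.

```python
import json


def sse_event(event: str, payload: dict) -> str:
    """Format one Server-Sent Events message.

    json.dumps escapes newlines inside strings, so the payload always
    fits on a single "data:" line.
    """
    data = json.dumps(payload, ensure_ascii=False)
    # The trailing blank line marks the end of the message.
    return f"event: {event}\ndata: {data}\n\n"
```

A server would write these strings to a response with the content type text/event-stream, and the browser would receive them through an EventSource listener for the same event name.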

3. Post-processing and filtering

Language models hallucinate corrections. This is the part nobody talks about in "AI proofreading" marketing. Our post-processing pipeline catches three categories of false positives:

  • Phantom corrections: The model returns an "error" where original and corrected text are the same string. This happens more often than you would expect. Simple string comparison catches it.
  • Extraction artifact corrections: Despite explicit prompt instructions, the model still flags line-break hyphens ("docu-ment" → "document") as typos. We normalize both versions by stripping hyphens and whitespace and compare — if they match, it was an artifact, not an error. Dropped.
  • Mixed corrections: Sometimes a sentence has both a real error and an extraction artifact. We reconstruct the artifact-free version of both strings first, then check if a real difference remains. If yes, keep the correction with cleaned text. If no, drop it.
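
The first two checks in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the production filter: the function names and the return labels are invented for the example, and the normalization strips only ASCII hyphens and whitespace.

```python
import re


def normalize(text: str) -> str:
    """Strip hyphens and whitespace so 'docu-\\nment' equals 'document'."""
    return re.sub(r"[-\s]+", "", text)


def classify(original: str, corrected: str) -> str:
    """Label a sentence pair as 'phantom', 'artifact', or 'keep'."""
    if original == corrected:
        return "phantom"  # model reported an error but changed nothing
    if normalize(original) == normalize(corrected):
        return "artifact"  # the only difference is hyphens or whitespace
    return "keep"
```

One trade-off of this normalization is that a genuine hyphenation fix, such as "wellknown" to "well-known", is also classified as an artifact and dropped.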

4. Word-level diff rendering

This is where the sentence pair becomes useful. The browser receives both strings and runs a word-level diff algorithm. Removed words get a red highlight, inserted words get green. The user never has to read two sentences and mentally spot the difference — it is immediately visible.

This only works because of the context constraint we enforce in step 2. If the model rephrases the whole sentence, the diff lights up everywhere and becomes useless. By anchoring the surrounding context, the diff highlights exactly the 1-3 words that actually changed.
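
A word-level diff of this kind can be sketched with Python's standard difflib module. The real diff runs in the browser, so this is only an illustration of the technique; the function name and the operation labels are assumptions.

```python
from difflib import SequenceMatcher


def word_diff(original: str, corrected: str):
    """Return a list of (operation, word) pairs.

    Operations are 'equal', 'removed' (highlight red) and
    'inserted' (highlight green).
    """
    a, b = original.split(), corrected.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops += [("equal", word) for word in a[i1:i2]]
        else:
            # 'replace', 'delete' and 'insert' all reduce to removed
            # words from the original plus inserted words from the fix.
            ops += [("removed", word) for word in a[i1:i2]]
            ops += [("inserted", word) for word in b[j1:j2]]
    return ops
```

For the pair "The measured preformance of the system" and "The measured performance of the system", the only non-equal entries are ("removed", "preformance") and ("inserted", "performance"). If the model had rephrased the sentence, almost every word would come back as removed or inserted.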

What the user sees

typo 97%
Page 23

Misspelling of "performance". The surrounding context is preserved identically.

Original

The measured preformance of the system exceeded initial expectations by a wide margin.

Corrected

The measured performance of the system exceeded initial expectations by a wide margin.

Why sentence pairs, not full rewrites

The obvious approach to AI proofreading is: send the text to the model, get back the corrected version, diff the whole thing. We tried this early on. It does not work, for three reasons.

First, the model reformulates sentences it considers "unclear" even when there is no actual error. A whole-page diff between original and model output becomes a wall of red and green with no way to distinguish corrections from rewrites.

Second, output cost scales with document length, because you are asking the model to emit the entire corrected text. With sentence pairs, the model only outputs sentences that contain errors. For a clean page, the response is essentially empty.

Third, aligning a full-text diff back to the original document layout is genuinely hard. With sentence pairs, each correction is self-contained and maps to a specific page.

Streaming results to the user

Proofreading a 40-page document takes 30-60 seconds depending on content density. Making the user stare at a loading spinner for that long is unacceptable.

Instead, the upload returns a results page immediately — showing document statistics (page count, word count) while analysis runs in the background. Pages are analyzed concurrently, bounded to avoid overwhelming the model API. As each page completes, its corrections are pushed to the browser as events. Cards animate in one by one, error counters update in real time.
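
Bounded concurrency with results delivered in completion order can be sketched with asyncio. The limit of 4 and the check_page callable, which stands in for the model API call, are assumptions for the example.

```python
import asyncio


async def check_pages(pages, check_page, max_concurrent=4):
    """Yield (page_number, corrections) as each page finishes.

    pages is a {page_number: text} mapping; check_page is an async
    callable standing in for the model request (hypothetical).
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_number, text):
        # At most max_concurrent requests are in flight at once.
        async with semaphore:
            return page_number, await check_page(text)

    tasks = [asyncio.create_task(bounded(n, t)) for n, t in pages.items()]
    # as_completed hands back results in completion order, not page order.
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

A caller would iterate with "async for" and push each result to the browser as soon as it arrives, which is what lets the first corrections appear while later pages are still being checked.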

The user starts reading corrections for one part of the document while the rest is still being processed. Time-to-first-result drops from 30+ seconds to 2-3 seconds.

Dealing with messy PDFs

Academic PDFs are the primary use case, and they are hostile to text extraction. The hyphenation problem is the biggest: typesetting systems break words across lines and insert hyphens. When you extract the text, you get broken words. The model sees these and "corrects" them, which is not a correction — it is an extraction artifact.

We handle this at two levels. The prompt explicitly instructs the model to ignore hyphenated words that look like line breaks. This eliminates about 70% of these hyphenation false positives. The remaining 30% or so are caught by the post-processing filter that normalizes both strings and compares them.

The tricky edge case: a sentence that has both a real error and a line-break artifact. The model returns one correction that fixes both. The filter detects this, reconstructs both strings without the artifact, confirms the real error still exists, and keeps the correction with cleaned text.
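
One way to implement this reconstruction is to align the two sentences word by word and undo only the replacements that are pure hyphenation. The Python sketch below does that with difflib. It is an assumption about how such a filter could work, not a description of ProofMate's code, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple


def _squash(text: str) -> str:
    """Strip hyphens and whitespace so 'docu- ment' equals 'document'."""
    return re.sub(r"[-\s]+", "", text)


def strip_artifacts(original: str, corrected: str) -> Optional[Tuple[str, str]]:
    """Return (cleaned_original, corrected), or None if nothing real remains."""
    o, c = original.split(), corrected.split()
    rebuilt = []
    matcher = SequenceMatcher(a=o, b=c, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        is_artifact = tag == "replace" and (
            _squash(" ".join(o[i1:i2])) == _squash(" ".join(c[j1:j2]))
        )
        if is_artifact:
            rebuilt.extend(c[j1:j2])  # adopt the de-hyphenated words
        else:
            rebuilt.extend(o[i1:i2])  # keep the original words as they were
    cleaned = " ".join(rebuilt)
    if cleaned == corrected:
        return None  # only artifacts differed, so drop the correction
    return cleaned, corrected
```

Two limitations are worth noting. The cleaned string is rebuilt with single spaces, and when a real error sits directly next to a broken word the aligner merges both into one replacement, so the artifact stays in the kept correction.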

What we learned

The key insight is that the hard part of AI proofreading is not finding errors — language models are good at that. The hard part is presenting corrections in a way that humans can quickly evaluate. A list of "here is what was wrong" is not enough. You need to show the exact change in its original context, with the difference highlighted at word level.

This requires constraining the model output carefully: same context, minimal changes, structured format. And it requires a post-processing layer that catches the cases where the model ignores those constraints. The model is the engine, but the pipeline around it is what makes the product usable.

"The hard part of AI proofreading is not finding errors. It is showing corrections in a way humans can evaluate in seconds."

Try it yourself

Upload a PDF and watch the corrections stream in with highlighted diffs.