Extraction Tax
You know the drill. You generate a PDF: an architectural diagram, a draft blog post, a technical spec. You send it to stakeholders for review. They do exactly what you asked: they mark it up with comments.
The creation part is solved. The commenting part is solved. But the extraction part is a tax on your sanity.
If you have 40+ bubbles of feedback, your afternoon looks like this:
- Open the PDF on monitor one.
- Open your ticket tracker or Markdown file on monitor two.
- Click comment. Copy text. Alt-Tab. Paste. Alt-Tab. Repeat.
It's manual, error-prone, and a waste of time. You miss comments. You miscopy context. The feedback remains trapped in a proprietary layer on top of your document, completely divorced from your actual workflow.
I didn't want a "better PDF viewer." I wanted a parser that would strip-mine a document for tasks.
Why Existing Tools Fail
Before building, I looked for existing tools. The options were terrible.
- Adobe Acrobat: It can export comments to FDF or XFDF, proprietary formats that no one actually wants to read. Exporting to Word or RTF results in a formatting nightmare.
- Online Converters: Most "PDF to Text" tools ignore annotations entirely. The ones that don't usually require you to upload sensitive documents to a mysterious server.
- Python Scripts: You can use PyPDF2, but it requires a dev environment and custom logic for every document structure.
The gap was clear: a utility that runs entirely in the browser. Drag, drop, copy, done.
Local-First by Design
Stack: Next.js 14 (static export), TypeScript, Tailwind, and pdfjs-dist.
Privacy is Portability
PDFs often contain sensitive data: contracts, internal memos, unreleased specs. By building this as a client-side app using pdfjs-dist, the file never leaves your browser.
There is no server-side processing. You could disconnect from the internet and the extraction would still work. This isn't just a privacy feature; it's a speed feature: no upload, no queue, no waiting on someone else's server.
Reconstructing Context
This was the technical hurdle. When you highlight text in a PDF, the file doesn't store the words. It stores coordinates: "User drew a yellow rectangle at [x, y]."
To get your data back, the tool has to perform a geometric intersection:
- Extract Geometry: Get the quad points (corners) of every highlight.
- Map the Text: Parse the page to get the bounding box of every text item.
- Intersect: Run a collision detection loop. If a text item overlaps with the highlight, it belongs to that comment.
- Sort: PDF text isn't always stored in reading order. The tool sorts items by Y then X coordinates to reconstruct the sentence naturally.
This turns abstract coordinates back into the human language you actually need.
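The steps above boil down to an axis-aligned rectangle overlap test plus a reading-order sort. Here is a minimal sketch of that logic; the `Rect` and `TextItem` shapes are simplified stand-ins for what pdfjs-dist returns from `getTextContent()` and `getAnnotations()` (in practice a highlight's quad points are first reduced to a bounding rectangle), and the function names are mine, not the tool's.

```typescript
// Simplified stand-ins for pdfjs-dist structures (assumption, not the real API).
interface Rect {
  x: number;      // left edge, PDF user-space units
  y: number;      // bottom edge (PDF y grows upward)
  width: number;
  height: number;
}

interface TextItem extends Rect {
  str: string;    // the characters of one text run
}

// Axis-aligned overlap test: two rects intersect unless one lies
// entirely to the side of, or above/below, the other.
function intersects(a: Rect, b: Rect): boolean {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Collect every text item overlapping the highlight, then sort
// top-to-bottom (descending y, since PDF y grows upward) and
// left-to-right to reconstruct reading order.
function textUnderHighlight(highlight: Rect, items: TextItem[]): string {
  return items
    .filter((item) => intersects(highlight, item))
    .sort((a, b) => (b.y - a.y) || (a.x - b.x))
    .map((item) => item.str)
    .join(" ");
}
```

The Y-then-X sort is what rescues sentences whose runs are stored out of order in the content stream; without it, the joined text can come out shuffled.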
Zero Dropped Packets
Two export paths:
| Action | Use Case |
|---|---|
| Copy Checklist | Paste into Google Docs (formatted list) or GitHub/Notion (GFM checkboxes). |
| Copy / Download Markdown | Deep work. Includes page numbers and full context for AI agents or your favorite markdown editor. |
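The checklist path is a straightforward serialization step. A hedged sketch, assuming a `Comment` shape of page, author, and text (my simplification, not the tool's actual data model):

```typescript
// Assumed comment shape; the real tool's model may carry more fields.
interface Comment {
  page: number;
  author: string;
  text: string;
}

// Render comments as GitHub-flavored Markdown checkboxes, one per line,
// so each piece of feedback becomes a tickable box.
function toChecklist(comments: Comment[]): string {
  return comments
    .map((c) => `- [ ] (p.${c.page}, ${c.author}) ${c.text}`)
    .join("\n");
}
```

Because the output is plain GFM, the same string pastes cleanly into GitHub, Notion, or any Markdown editor.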
The new loop takes seconds: receive the PDF, drop it into pdfcomments.app, and paste the checklist into your Google Doc.
It saves about 15 minutes of mindless copying per document. More importantly, it ensures every comment becomes a tickable box.
Try It
I built this on a snowy Saturday in about the runtime of the new Tron movie. It works for my use case – your mileage may vary. Free and private. View on GitHub.