Problem
Getting content into Seed is still harder than it should be. PDFs are everywhere -- research papers, reports, legal docs, slide exports, scanned notes -- and right now there's no way to bring them in without manually recreating everything block by block.
We've had heuristic PDF extraction code sitting in the codebase for a while (pdfjs, GROBID), but it's brittle. It chokes on multi-column layouts, drops figures, loses formatting, and generally produces output that needs so much cleanup you might as well have typed it. Academic papers with complex layouts? Forget it.
The real pain: users who want to build a knowledge base in Seed have filing cabinets full of PDFs. Without a solid import path, they're stuck copying and pasting, which means they just don't bother.
User Stories
As a researcher, I want to drop a PDF paper into Seed so I can annotate it, link it to other documents, and make it part of my knowledge graph -- without spending 30 minutes reformatting.
As a team member, I want to import a report PDF into our shared space so the whole team can comment on it and reference specific sections.
As a user with a large document collection, I want to batch-import PDFs through the chatbot so I can migrate my existing library into Seed over time.
Solution
The approach: give the PDF to an LLM
Instead of building yet another extraction pipeline with heuristics and edge-case handling, we hand the PDF to an LLM and let it do what LLMs are good at: reading documents and understanding their structure.
The idea is the same as uploading a PDF to ChatGPT and asking it to reproduce the content. We give the LLM:
The PDF (rendered pages as images, or text content, or both -- whatever the model handles best)
The Seed document format spec as context
An instruction to faithfully transcribe the PDF content into Seed format
That's it. We don't prescribe a rigid pipeline. We don't tell the LLM to "first extract headings, then extract paragraphs, then..." -- we let the model figure out the best approach for each document. A two-column academic paper needs different handling than a single-column report, and the LLM can adapt without us writing rules for every layout.
Default intent: faithful transcription
The default behavior -- what happens when a user just drops a PDF -- is faithful transcription. Every heading, paragraph, list, table, image, link, code block, and formula in the PDF should appear in the resulting Seed document, preserving the original structure and reading order.
This is not summarization. Not "extract the key points." Not "rewrite for clarity." The user's baseline expectation is: "I dropped my PDF, and now I have the same document in Seed." If the original PDF has a typo, the Seed document has the same typo.
Because this goes through the chatbot, users can ask for something different if they want to. "Import this but skip the appendices." "Just pull out the methodology section." "Summarize this into a one-pager." But those are explicit requests -- the default is always a full, faithful reproduction.
Image handling
LLMs can read images in PDFs but can't produce binary image data. So image extraction is the one piece of system-level work we need to do outside the LLM. Before (or alongside) the LLM transcription, we extract embedded images from the PDF using tools like pdfimages (poppler) or pypdfium2, save them to temp files, and make the file paths available to the LLM. The LLM then references those paths in the output (file:///tmp/pdf-images/img-001.png), and the publishing step resolves them to IPFS.
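A minimal sketch of that extraction step, using poppler's pdfimages CLI (the function names and output layout here are assumptions, not settled design):

```python
import subprocess
from pathlib import Path

def extract_images(pdf_path: str, out_dir: str) -> None:
    """Shell out to poppler's pdfimages, which writes img-000.png,
    img-001.png, ... into out_dir (requires poppler to be installed)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdfimages", "-png", pdf_path, str(Path(out_dir) / "img")],
        check=True,
    )

def image_manifest(out_dir: str) -> list[str]:
    """file:// URIs for the extracted images, in extraction order, to hand
    to the LLM so it can reference them in the transcription."""
    return [p.resolve().as_uri() for p in sorted(Path(out_dir).glob("img-*.png"))]
```

The manifest URIs are exactly what the LLM embeds in its output; the publish step later swaps them for IPFS references.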
Publishing
Once the LLM produces the Seed-format content (markdown with frontmatter or JSON blocks), the system publishes it as a normal Seed document. Local file references (file://) get resolved to IPFS, and the document goes live. This is the same publish path every Seed document uses -- nothing special here.
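The file:// resolution step could look like the following sketch -- the `upload` callable stands in for Seed's real IPFS-add machinery, which isn't specified here:

```python
import re
from typing import Callable

# Match file:// URIs as they'd appear in markdown links or prose.
FILE_URI = re.compile(r"file://[^)\s\"']+")

def resolve_local_refs(doc: str, upload: Callable[[str], str]) -> str:
    """Replace every file:// reference in the LLM's output with an ipfs://
    reference, using `upload` (local path -> CID) for each one."""
    def repl(m: re.Match) -> str:
        cid = upload(m.group(0))
        return f"ipfs://{cid}"
    return FILE_URI.sub(repl, doc)
```

After this pass the document contains no local paths and goes through the same publish path as any other Seed document.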
Entry point: the chatbot
This feature lives in the desktop app's AI chatbot. Two ways to trigger it:
Drag and drop -- Drop a PDF onto the chat area. The chatbot picks it up, imports it, and confirms when done.
Ask the chatbot -- "Import this PDF" with a file attachment, or "Import the PDF at ~/Documents/report.pdf."
The chatbot provides progress feedback ("Reading your PDF...", "Transcribing page 3 of 12...") and lets the user review the result before publishing, or publishes directly when the user has asked it to.
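The gate described below can be a single check at the top of the import command. A minimal sketch, assuming a settings dict and an error type the chatbot surfaces to the user (both names are hypothetical):

```python
class LLMNotConfigured(Exception):
    """Raised when PDF import is attempted without an LLM provider."""

def require_llm(settings: dict) -> str:
    """Return the configured provider name, or fail with a message that
    points the user at settings. The "llm_provider" key is illustrative."""
    provider = settings.get("llm_provider")
    if not provider:
        raise LLMNotConfigured(
            "PDF import needs an LLM provider. Configure one in Settings."
        )
    return provider
```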
Requires LLM capabilities
This feature only works if the user has an LLM provider configured in Seed. No LLM, no PDF import. We're not building a fallback heuristic path -- the whole point is that the LLM approach produces dramatically better results than heuristics, and maintaining two paths adds complexity for marginal benefit.
If a user without LLM access tries to import a PDF, the app should tell them they need to configure an LLM provider and point them to the settings.
Relationship to LLM Wiki
This project is a prerequisite for the LLM Wiki project, which lists "Document imports" as one of its dependencies. LLM Wiki envisions a world where the chatbot can ingest existing documents (PDFs, web pages, etc.) and integrate them into a structured knowledge base. PDF import is the first and most important piece of that puzzle.
No Gos
No heuristic/pdfjs fallback path. We're not maintaining the old extraction code or building a non-LLM alternative. If the LLM approach works (and it does), there's no reason to keep two paths.
No server-side processing. PDF import runs locally on the user's machine using their configured LLM provider. We're not routing PDFs through our servers.
No batch import UI. Users can ask the chatbot to import multiple PDFs in sequence, but we're not building a dedicated bulk-import interface. That's a separate project if there's demand.
No PDF editing. This is one-way: PDF goes in, Seed document comes out. We're not building a PDF viewer, annotator, or round-trip editor.
No OCR for scanned documents. If the PDF is just scanned images with no text layer, the LLM's vision capabilities may handle it, but we're not building dedicated OCR infrastructure. That's a rabbit hole.
References
LLM Wiki project -- the downstream project that depends on PDF import.
File Import Libraries Research -- prior research on extraction tools and approaches.