Convert PDF Documents to Searchable HTML for Investigative Research

Investigative journalists and document researchers working with large PDF releases (FOIA responses, leaked document sets, regulatory filings, court records) need to search and quote from those documents efficiently. Ctrl+F inside a PDF reader doesn't scale across hundreds of files. Converting PDFs to HTML gives you text you can search with standard tools such as grep or import into document analysis software.

A typical FOIA release arrives as a multi-hundred-page PDF or a ZIP of individual PDF files. Finding the twenty relevant pages in a 600-page release, or locating a specific name across 80 separate documents, requires text access that PDF viewers handle poorly at scale. Converting each PDF to HTML produces files you can search with browser find or command-line text search, or load into tools like Overview, Hypothesis, or a simple text editor for annotation.

Deliteful processes up to 50 PDFs per batch (300 MB per file, 2 GB total), outputting one HTML file per PDF with extracted, HTML-escaped text. For a document dump of manageable size, this replaces a manual extraction workflow and removes the need to install and configure local PDF tools. One limitation to note: scanned documents without a text layer, which are common in older FOIA releases, require OCR before this tool can extract their text.

How it works

  1. Sign up free

    Create your Deliteful account with Google: no credit card, about three clicks.

  2. Upload your document PDFs

    Add up to 50 PDFs from your document release, up to 300 MB each.

  3. Convert to HTML

    Deliteful extracts embedded text from each PDF and outputs one HTML file per document.

  4. Search and quote

    Open HTML files in a browser or text editor to search, highlight, and extract quotes for your reporting.

Frequently asked questions

Can I use this to search across a large document release from a FOIA request?
Yes, for digitally created PDFs with a text layer. Convert the batch to HTML files, then use browser search or command-line grep to search across all files simultaneously. Scanned documents need OCR first.
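If grep is not available, a similar cross-file search can be sketched in a few lines of Python. This is only an illustration, not part of Deliteful: the function names are hypothetical, and it assumes the converted HTML files sit together in one folder and are UTF-8 encoded.

```python
import html
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Strip tags first, then decode entities such as &amp; back to characters."""
    return html.unescape(TAG_RE.sub(" ", markup))


def search_folder(folder: str, term: str, context: int = 60):
    """Yield (filename, snippet) for each case-insensitive match of term.

    Assumes every converted document is a *.html file directly inside folder.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for path in sorted(Path(folder).glob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        # Collapse whitespace so a phrase broken across lines still matches.
        text = " ".join(text.split())
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            yield path.name, text[start:end]
```

Calling `list(search_folder("converted", "Jane Doe"))` would return each file name alongside a short snippet of surrounding text, which makes it easier to decide which documents to open.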
Will redacted sections of PDFs appear in the HTML output?
Properly redacted text, where the underlying text was removed from the PDF and not just covered, will not appear in the output. However, a PDF redacted only visually, with a black box drawn over text that remains in the text layer, may still expose that text during extraction. This is a property of the PDF format, not specific to this tool. Always verify sensitive documents before distributing extracted output.
Does the tool work on court filing PDFs from PACER?
Yes, for PACER documents that contain a selectable text layer, which most digitally filed court documents do. Older scanned filings without OCR will produce no extracted text.
How do I handle a document release that's too large for one batch?
Run multiple batches of up to 50 files each. Each batch has a 2 GB total limit. Process them sequentially and collect the HTML outputs together for searching.
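Splitting a large release by hand is error-prone, so the grouping can be planned in advance. The sketch below is a hypothetical helper, not a Deliteful feature. It assumes the limits are decimal units (1 MB = 1,000,000 bytes), which is the more conservative reading, and it groups files greedily in name order.

```python
from pathlib import Path

MAX_FILES = 50                    # files per batch
MAX_FILE_BYTES = 300 * 10**6      # assumption: 300 MB in decimal units
MAX_BATCH_BYTES = 2 * 10**9       # assumption: 2 GB in decimal units


def plan_batches(sizes: dict[str, int]):
    """Group files into batches that respect the per-batch limits.

    sizes maps file name to size in bytes. Returns (batches, oversized),
    where oversized lists files above the per-file limit.
    """
    batches, oversized = [], []
    current, current_bytes = [], 0
    for name, size in sorted(sizes.items()):
        if size > MAX_FILE_BYTES:
            oversized.append(name)
            continue
        # Start a new batch when adding this file would break either limit.
        if current and (len(current) >= MAX_FILES
                        or current_bytes + size > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches, oversized


def folder_sizes(folder: str) -> dict[str, int]:
    """Collect sizes of the *.pdf files directly inside folder."""
    return {p.name: p.stat().st_size for p in Path(folder).glob("*.pdf")}
```

For example, `plan_batches(folder_sizes("foia_release"))` returns the list of batches to upload one after another, plus any files that exceed the per-file limit and need to be split or compressed first.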
Is the extracted text accurate enough to quote directly in articles?
You should verify any direct quotes against the original PDF. Text extraction is generally accurate for clean digital PDFs, but character encoding, ligatures, and layout complexity can introduce minor errors. Always cite the original document as your source.
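Ligatures and typographic characters are a common reason a quote typed by hand fails to match the extracted text. A minimal sketch of a normalized comparison follows, using Python's standard `unicodedata` module. The function names are hypothetical, and a match only means the quote is consistent with the extracted text; it does not replace checking the original PDF.

```python
import unicodedata

# Map curly quotes to straight ones and drop soft hyphens (U+00AD).
CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u00ad": None,
})


def normalize(text: str) -> str:
    """Fold ligatures and non-breaking spaces (NFKC), then collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(CHAR_MAP)
    return " ".join(text.split())


def quote_matches(quote: str, source_text: str) -> bool:
    """Return True if the normalized quote appears in the normalized source."""
    return normalize(quote) in normalize(source_text)
```

Here `source_text` would be the text taken from a converted HTML file and `quote` the passage as typed into a draft. A `False` result flags a quote worth rechecking character by character against the original document.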

Create your free Deliteful account with Google and start converting your document releases to searchable HTML for faster investigative research.