Extract

The extract node uses AI to pull structured data from documents based on a schema you define: you specify the fields you want, and the AI reads them from the document content. It is the core action node for document processing in DocPipe.

When to use extract

You want structured fields out of a document: invoices, forms, contracts, statements, IDs.
You want the AI to handle OCR and field-finding in one step. Extract runs OCR internally; you don’t need a separate parse node in front of it.
You want a confidence signal per field for downstream review. Enable Field confidence with Engine 1.

Use parse instead when you only need raw text, or classify when the goal is to sort documents rather than read fields off them.

For schema and engine guidance, see extract best practices and schema design.

Configuration

Field	Type	Required	Description
Engine	select	Yes	Extraction engine: `Engine 1` or `Engine 2`
Schema	schema editor	Yes	Defines the fields to extract (name, type, description)
Precision	select	No	Processing precision: `Small`, `Medium`, or `High`
Chunk Strategy	select	No	How document chunks are processed: `Parallel` (default) or `Sequential`
Extraction hints	string	No	Natural language instructions to guide the AI extraction
Field confidence	toggle	No	When enabled, the AI returns a confidence score (0–1) for each extracted field. Available with Engine 1 only
Force OCR	toggle	No	Forces OCR processing on the document even when it contains selectable text. Useful for scanned PDFs with inaccurate text layers
OCR grounding	toggle	No	Uses OCR output to ground LLM extraction, improving accuracy on complex layouts. Available with Engine 1 only
JSON path	string	No	JSONPath expression to extract a subset of the output
Custom prompt	string	No	Natural language instructions for AI schema generation (max 2000 characters)
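
As a rough illustration, the options in the table can be pictured as a plain Python dictionary with a small consistency check. The key names and the `validate_extract_config` helper are hypothetical and are not DocPipe's actual API or export format. Only the allowed values and the Engine 1 restrictions come from the table above.

```python
# Hypothetical representation of an extract node's settings.
# Key names are illustrative; DocPipe's real config format may differ.
extract_config = {
    "engine": "Engine 1",
    "schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID as printed"},
            "total": {"type": "number", "description": "Grand total including tax"},
        },
    },
    "chunk_strategy": "Parallel",
    "field_confidence": True,
    "ocr_grounding": False,
    "force_ocr": False,
}

# Options the table marks as "Available with Engine 1 only".
ENGINE_1_ONLY = ("field_confidence", "ocr_grounding")


def validate_extract_config(config: dict) -> list[str]:
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    if config.get("engine") not in ("Engine 1", "Engine 2"):
        problems.append("engine must be 'Engine 1' or 'Engine 2'")
    if not config.get("schema"):
        problems.append("schema is required")
    if config.get("chunk_strategy", "Parallel") not in ("Parallel", "Sequential"):
        problems.append("chunk_strategy must be 'Parallel' or 'Sequential'")
    if len(config.get("custom_prompt", "")) > 2000:
        problems.append("custom_prompt exceeds 2000 characters")
    if config.get("engine") != "Engine 1":
        for option in ENGINE_1_ONLY:
            if config.get(option):
                problems.append(f"{option} is available with Engine 1 only")
    return problems
```

Calling `validate_extract_config(extract_config)` on the dictionary above returns an empty list. Switching the engine to `Engine 2` while leaving `field_confidence` enabled reports the Engine 1 restriction.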

Engines

Engine 1: Faster, lower cost. Best for simple documents with clear structure.
Engine 2: More capable. Best for complex documents, handwriting, or ambiguous layouts.

Chunk strategy

Long documents are split into chunks before extraction. The Chunk Strategy controls how those chunks are processed.

Parallel (default) runs chunks concurrently for the fastest wall-clock time.
Sequential runs chunks one at a time. It’s slower but can improve accuracy on long documents with tables or array-heavy schemas, where context from earlier chunks helps the model align rows in later ones.

Start with Parallel. Switch to Sequential if you see misaligned rows or duplicated array items on long documents.
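
The difference between the two strategies can be sketched conceptually as follows. This is not DocPipe's implementation: `extract_chunk` is a hypothetical stand-in for the model call, and it only records how many earlier results it was given so that the contrast is visible.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_chunk(chunk: str, context: list[dict]) -> dict:
    """Hypothetical stand-in for the model call: records how much earlier context it saw."""
    return {"chunk": chunk, "context_seen": len(context)}


def run_parallel(chunks: list[str]) -> list[dict]:
    # Every chunk is processed independently, so none sees earlier results.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: extract_chunk(c, []), chunks))


def run_sequential(chunks: list[str]) -> list[dict]:
    # Each chunk receives the results of all chunks before it.
    results: list[dict] = []
    for chunk in chunks:
        results.append(extract_chunk(chunk, list(results)))
    return results
```

In this sketch the sequential run gives the third chunk two earlier results to draw on, while every chunk in the parallel run sees none. That carried-over context is what can help keep table rows aligned across chunk boundaries.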

Custom prompt for schema generation

When using Generate to let AI create a schema from a sample document, you can provide a custom prompt to specify which fields to extract. The AI will include only the fields described in your prompt, ignoring other visible data in the document. This is useful when a document contains many fields but you only need a subset.

You can click Generate with an empty schema or with only a few fields partially filled in. DocPipe no longer requires every field to be populated before running generation.

JSON path

Leave JSON path blank to keep the full extraction output. Set it when you want to narrow the result down to a specific section of the extracted JSON. For example, use it to pull a single nested array out and pass only that array to downstream nodes.
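
To make the narrowing concrete, here is a minimal sketch that resolves a small subset of JSONPath (`$`, `.key`, and `[index]`) against a sample extraction result. The `select` helper, the field names, and the sample data are all illustrative assumptions. DocPipe's own JSONPath support may cover more of the syntax and behave differently on malformed paths.

```python
import re


def select(data, path: str):
    """Resolve a small JSONPath subset: `$`, `.key`, and `[index]`."""
    if not path:
        return data  # a blank path keeps the full output
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = data
    # Each match is a (key, index) pair; exactly one of the two is non-empty.
    for key, index in re.findall(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]", path[1:]):
        current = current[key] if key else current[int(index)]
    return current


# Hypothetical extraction output for an invoice schema.
extraction = {
    "invoice_number": "INV-001",
    "line_items": [
        {"description": "Widget", "amount": 10.0},
        {"description": "Gadget", "amount": 25.5},
    ],
}

# Pass only the nested array to downstream nodes.
line_items = select(extraction, "$.line_items")
```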

Schema field types

The schema editor accepts the JSON Schema primitives (string, number, integer, boolean, object, array) plus three semantic types: date, email, and phone. Semantic types serialize to string with a JSON Schema format, so any downstream consumer that reads the schema continues to work.

date fields take an output format on the field itself (ISO 8601, US, EU, or Long). Extract normalizes the value it reads from the document to that format, regardless of how the document writes the date. See date output formats in the schema design guide.
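
A small sketch of both ideas follows: how a semantic type might serialize, and how one date renders under each output format. The exact `format` keywords (in particular for phone) and the `strftime` patterns chosen for the four format names are assumptions made for illustration. The schema design guide is the authority on the real values.

```python
from datetime import date

# Assumed strftime patterns for the four named output formats.
DATE_OUTPUT_FORMATS = {
    "ISO 8601": "%Y-%m-%d",
    "US": "%m/%d/%Y",
    "EU": "%d/%m/%Y",
    "Long": "%B %d, %Y",
}


def serialize_field(semantic_type: str) -> dict:
    """Serialize a semantic type as a string with a JSON Schema format (assumed keyword)."""
    if semantic_type not in ("date", "email", "phone"):
        raise ValueError(f"not a semantic type: {semantic_type}")
    return {"type": "string", "format": semantic_type}


def format_date(value: date, output_format: str) -> str:
    """Render an already-parsed date in the field's chosen output format."""
    return value.strftime(DATE_OUTPUT_FORMATS[output_format])
```

Because the serialized field is still a plain string type, a consumer that ignores `format` sees nothing unusual. Under these assumed patterns, the same parsed date produces a different string for each output format, which is the normalization step described above.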

Unverified fields

When Field confidence is enabled, fields whose confidence score falls below an internal threshold are marked as unverified in the extraction result. You can pair this with a Review node configured with Pause mode = Unverified Fields to automatically queue runs for human review whenever the extraction is uncertain.
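
The marking rule can be sketched as a simple threshold check. The threshold DocPipe uses is internal and not documented here, so `ASSUMED_THRESHOLD` is a placeholder. The function names and the sample scores are also hypothetical.

```python
ASSUMED_THRESHOLD = 0.8  # placeholder; the real threshold is internal to DocPipe


def unverified_fields(
    confidences: dict[str, float], threshold: float = ASSUMED_THRESHOLD
) -> list[str]:
    """Return the names of fields whose confidence falls below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


def should_pause_for_review(confidences: dict[str, float]) -> bool:
    """Mimic a Review node in Unverified Fields mode: pause if any field is unverified."""
    return bool(unverified_fields(confidences))


# Hypothetical per-field scores from an extraction run.
scores = {"invoice_number": 0.97, "total": 0.62, "due_date": 0.8}
```

With these sample scores and the assumed threshold, only `total` falls below the cutoff, so the run would be queued for review.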

Bounding box overlay

After an extraction run completes, the document viewer displays colored bounding box overlays on the original document showing where each field value was found:

The overlay color indicates the confidence tier: green for high confidence, amber for medium, red for low.
Toggle the overlay on or off using the bbox toggle above the document viewer.
Switch between overlay view and table view to inspect all detected regions at once.
Hover over a bounding box to see the exact confidence score.
Click a bounding box to select the corresponding field in the side panel.
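
The color tiers amount to bucketing the 0–1 confidence score. The cutoffs between tiers are not documented on this page, so the `high` and `medium` defaults below are illustrative guesses, and `overlay_color` is a hypothetical helper.

```python
def overlay_color(confidence: float, high: float = 0.9, medium: float = 0.6) -> str:
    """Map a confidence score to an overlay color using assumed tier cutoffs."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return "green"  # high confidence
    if confidence >= medium:
        return "amber"  # medium confidence
    return "red"  # low confidence
```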

Inputs and outputs

Allowed inputs: Trigger nodes, route, parse, review.
Output: Structured JSON data matching the configured schema.

Related

Extract best practices: Engine choice, confidence patterns, and pitfalls.
Schema design: How to write JSON Schema for DocPipe extractions.
Review action: Pause runs with unverified fields for human review.
Validation action: Enforce required fields and formats after extraction.
