The extract node uses AI to pull structured data from documents based on a schema you define. It is the core action node for document processing in DocPipe. You define the fields you want, and the AI extracts them from the document content.

When to use extract

  • You want structured fields out of a document: invoices, forms, contracts, statements, IDs.
  • You want the AI to handle OCR and field-finding in one step. Extract runs OCR internally; you don’t need a separate parse node in front of it.
  • You want a confidence signal per field for downstream review. Enable Field confidence with Engine 1.
Use parse instead when you only need raw text, or classify when the goal is to sort documents rather than read fields off them.
For schema and engine guidance, see extract best practices and schema design.

Configuration

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| Engine | select | Yes | Extraction engine: Engine 1 or Engine 2 |
| Schema | schema editor | Yes | Defines the fields to extract (name, type, description) |
| Precision | select | No | Processing precision: Small, Medium, or High |
| Chunk Strategy | select | No | How document chunks are processed: Parallel (default) or Sequential |
| Extraction hints | string | No | Natural language instructions to guide the AI extraction |
| Field confidence | toggle | No | When enabled, the AI returns a confidence score (0–1) for each extracted field. Available with Engine 1 only |
| Force OCR | toggle | No | Forces OCR processing on the document even when it contains selectable text. Useful for scanned PDFs with inaccurate text layers |
| OCR grounding | toggle | No | Uses OCR output to ground LLM extraction, improving accuracy on complex layouts. Available with Engine 1 only |
| JSON path | string | No | JSONPath expression to extract a subset of the output |
| Custom prompt | string | No | Natural language instructions for AI schema generation (max 2000 characters) |
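
The constraints in this table can be checked before a workflow is saved. The following is a minimal sketch, not DocPipe's API: the config is modeled as a plain Python dict, and the key names (`engine`, `schema`, `precision`, `chunk_strategy`, `field_confidence`, `ocr_grounding`, `custom_prompt`) are hypothetical. Only the allowed values, the Engine 1 restriction, and the 2000-character limit come from the table.

```python
MAX_CUSTOM_PROMPT_CHARS = 2000
ENGINE_1_ONLY = ("field_confidence", "ocr_grounding")
ALLOWED_VALUES = {
    "precision": ("Small", "Medium", "High"),
    "chunk_strategy": ("Parallel", "Sequential"),
}


def validate_extract_config(config: dict) -> list:
    """Return the problems found in a hypothetical extract-node config dict."""
    problems = []
    engine = config.get("engine")

    # Engine and Schema are the two required fields.
    if engine not in ("Engine 1", "Engine 2"):
        problems.append("engine must be 'Engine 1' or 'Engine 2'")
    if not config.get("schema"):
        problems.append("schema is required")

    # Optional selects must use one of the documented values when present.
    for key, allowed in ALLOWED_VALUES.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key} must be one of {', '.join(allowed)}")

    # Field confidence and OCR grounding are available with Engine 1 only.
    for option in ENGINE_1_ONLY:
        if config.get(option) and engine != "Engine 1":
            problems.append(f"{option} is available with Engine 1 only")

    if len(config.get("custom_prompt", "")) > MAX_CUSTOM_PROMPT_CHARS:
        problems.append("custom_prompt exceeds 2000 characters")
    return problems
```

An empty list means the config satisfies every rule in the table. Each message names the setting to change.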

Engines

  • Engine 1: Faster, lower cost. Best for simple documents with clear structure.
  • Engine 2: More capable. Best for complex documents, handwriting, or ambiguous layouts.

Chunk strategy

Long documents are split into chunks before extraction. The Chunk Strategy controls how those chunks are processed.
  • Parallel (default) runs chunks concurrently for the fastest wall-clock time.
  • Sequential runs chunks one at a time. It’s slower but can improve accuracy on long documents with tables or array-heavy schemas, where context from earlier chunks helps the model align rows in later ones.
Start with Parallel. Switch to Sequential if you see misaligned rows or duplicated array items on long documents.
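
The difference between the two strategies can be illustrated with a toy model. This is not how DocPipe implements chunking. `extract_rows` is a hypothetical stand-in for the model call, and the assumption that one row appears in two adjacent chunks is made up for the example. The sketch only shows why carrying context forward can prevent duplicated array items.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_rows(chunk, seen=frozenset()):
    """Hypothetical stand-in for the model call: return rows not already seen."""
    return [row for row in chunk if row not in seen]


def run_parallel(chunks):
    # Chunks are processed independently, so none can see the others' rows.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(extract_rows, chunks))  # map preserves order
    return [row for rows in results for row in rows]


def run_sequential(chunks):
    # Each chunk receives the rows found so far as context.
    rows = []
    for chunk in chunks:
        rows.extend(extract_rows(chunk, frozenset(rows)))
    return rows


# "row B" straddles the chunk boundary (an assumption for this example).
chunks = [["row A", "row B"], ["row B", "row C"]]
print(run_parallel(chunks))    # ['row A', 'row B', 'row B', 'row C']
print(run_sequential(chunks))  # ['row A', 'row B', 'row C']
```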

Custom prompt for schema generation

When using Generate to let AI create a schema from a sample document, you can provide a custom prompt to specify which fields to extract. The AI will include only the fields described in your prompt, ignoring other visible data in the document. This is useful when a document contains many fields but you only need a subset.
You can click Generate with an empty schema or with only a few fields partially filled in. DocPipe no longer requires every field to be populated before running generation.

JSON path

Leave JSON path blank to keep the full extraction output. Set it when you want to narrow the result down to a specific section of the extracted JSON. For example, use it to pull a single nested array out and pass only that array to downstream nodes.
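
As an illustration of narrowing the output to one nested array, here is a minimal sketch. The sample output, the field names, and the `select_path` helper are all hypothetical. The helper resolves only simple dotted child paths such as `$.invoice.line_items`. Which JSONPath features DocPipe supports is not stated here.

```python
def select_path(data, path: str):
    """Resolve a simple dotted JSONPath such as '$.invoice.line_items'."""
    if not path:
        return data  # a blank path keeps the full extraction output
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = data
    for key in path[1:].split("."):
        if key:  # skip the empty segment before the first dot
            current = current[key]
    return current


# Hypothetical extraction output for an invoice schema.
result = {
    "invoice": {
        "number": "INV-001",
        "line_items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    }
}

# Downstream nodes would receive only the nested array.
line_items = select_path(result, "$.invoice.line_items")
```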

Schema field types

The schema editor accepts the JSON Schema primitives (string, number, integer, boolean, object, array) plus three semantic types: date, email, and phone. Semantic types serialize to string with a JSON Schema format, so any downstream consumer that reads the schema continues to work.
Fields of type date take an output format on the field itself (ISO 8601, US, EU, or Long). Extract normalizes the value it reads from the document to that format, regardless of how the document writes the date. See date output formats in the schema design guide.
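
The sketch below shows one way to picture both behaviors. The source does not spell out these details, so several are assumptions: the `format` strings emitted for semantic types, and the concrete patterns behind the US, EU, and Long output formats (taken here to be MM/DD/YYYY, DD/MM/YYYY, and "Month D, YYYY"). The helper names are hypothetical.

```python
from datetime import date

SEMANTIC_TYPES = {"date", "email", "phone"}
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def to_json_schema(field_type: str) -> dict:
    """Map an editor field type to a JSON Schema fragment (assumed shape)."""
    if field_type in SEMANTIC_TYPES:
        # Semantic types serialize to string plus a format annotation.
        return {"type": "string", "format": field_type}
    return {"type": field_type}


def format_date(value: date, output_format: str) -> str:
    """Render a parsed date in one of the four output formats (assumed patterns)."""
    if output_format == "ISO 8601":
        return value.isoformat()
    if output_format == "US":
        return f"{value.month:02d}/{value.day:02d}/{value.year}"
    if output_format == "EU":
        return f"{value.day:02d}/{value.month:02d}/{value.year}"
    if output_format == "Long":
        return f"{MONTHS[value.month - 1]} {value.day}, {value.year}"
    raise ValueError(f"unknown output format: {output_format}")
```

Under these assumptions, a document date of 5 March 2024 comes out as `2024-03-05` with ISO 8601 and as `03/05/2024` with US, however the document itself writes it.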

Unverified fields

When Field confidence is enabled, fields whose confidence score falls below an internal threshold are marked as unverified in the extraction result. You can pair this with a Review node configured with Pause mode = Unverified Fields to automatically queue runs for human review whenever the extraction is uncertain.
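
The marking logic can be sketched as follows. The real threshold is internal and not documented here, so the `0.8` default is a placeholder. The shape of the confidence map, a field name mapped to a score between 0 and 1, is also an assumption.

```python
def unverified_fields(confidences: dict, threshold: float = 0.8) -> list:
    """Return the names of fields whose score falls below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


def needs_review(confidences: dict, threshold: float = 0.8) -> bool:
    """A run would be queued for review if any field is unverified."""
    return bool(unverified_fields(confidences, threshold))


# Hypothetical per-field scores from one extraction run.
scores = {"invoice_number": 0.97, "total": 0.62, "due_date": 0.80}
flagged = unverified_fields(scores)
```

Here only `total` is flagged. A score equal to the threshold is not below it, so `due_date` passes.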

Bounding box overlay

After an extraction run completes, the document viewer displays colored bounding box overlays on the original document showing where each field value was found:
  • The overlay color indicates the confidence tier: green for high confidence, amber for medium, red for low.
  • Toggle the overlay on or off using the bbox toggle above the document viewer.
  • Switch between overlay view and table view to inspect all detected regions at once.
  • Hover over a bounding box to see the exact confidence score.
  • Click a bounding box to select the corresponding field in the side panel.

Inputs and outputs

  • Allowed inputs: Trigger nodes, route, parse, review.
  • Output: Structured JSON data matching the configured schema.

Extract best practices

Engine choice, confidence patterns, and pitfalls

Schema design

How to write JSON Schema for DocPipe extractions

Review action

Pause runs with unverified fields for human review

Validation action

Enforce required fields and formats after extraction