The extract node uses AI to pull structured data from documents based on a schema you define. It is the core action node for document processing in DocPipe. You define the fields you want, and the AI extracts them from the document content.
When to use extract
- You want structured fields out of a document: invoices, forms, contracts, statements, IDs.
- You want the AI to handle OCR and field-finding in one step. Extract runs OCR internally; you don’t need a separate parse node in front of it.
- You want a confidence signal per field for downstream review. Enable Field confidence with Engine 1.
- Use parse instead when you only need raw text, or classify when the goal is to sort documents rather than read fields off them.
Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| Engine | select | Yes | Extraction engine: Engine 1 or Engine 2 |
| Schema | schema editor | Yes | Defines the fields to extract (name, type, description) |
| Precision | select | No | Processing precision: Small, Medium, or High |
| Chunk Strategy | select | No | How document chunks are processed: Parallel (default) or Sequential |
| Extraction hints | string | No | Natural language instructions to guide the AI extraction |
| Field confidence | toggle | No | When enabled, the AI returns a confidence score (0–1) for each extracted field. Available with Engine 1 only |
| Force OCR | toggle | No | Forces OCR processing on the document even when it contains selectable text. Useful for scanned PDFs with inaccurate text layers |
| OCR grounding | toggle | No | Uses OCR output to ground LLM extraction, improving accuracy on complex layouts. Available with Engine 1 only |
| JSON path | string | No | JSONPath expression to extract a subset of the output |
| Custom prompt | string | No | Natural language instructions for AI schema generation (max 2000 characters) |
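
The constraints in the table can be checked before a node is saved. The following is a minimal sketch, not a DocPipe API: the `ExtractConfig` class, its field names, and the `validate` helper are hypothetical and only mirror the table rows above.

```python
from dataclasses import dataclass
from typing import Optional

ENGINES = {"Engine 1", "Engine 2"}
PRECISIONS = {"Small", "Medium", "High"}
CHUNK_STRATEGIES = {"Parallel", "Sequential"}
MAX_CUSTOM_PROMPT_CHARS = 2000


@dataclass
class ExtractConfig:
    # Hypothetical field names; they mirror the table rows, not a real DocPipe API.
    engine: str
    schema: dict
    precision: Optional[str] = None
    chunk_strategy: str = "Parallel"
    extraction_hints: str = ""
    field_confidence: bool = False
    force_ocr: bool = False
    ocr_grounding: bool = False
    json_path: str = ""
    custom_prompt: str = ""


def validate(config: ExtractConfig) -> list[str]:
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    if config.engine not in ENGINES:
        problems.append(f"Unknown engine: {config.engine!r}")
    if not config.schema:
        problems.append("Schema is required")
    if config.precision is not None and config.precision not in PRECISIONS:
        problems.append(f"Unknown precision: {config.precision!r}")
    if config.chunk_strategy not in CHUNK_STRATEGIES:
        problems.append(f"Unknown chunk strategy: {config.chunk_strategy!r}")
    # Both toggles below are documented as Engine 1 only.
    if config.field_confidence and config.engine != "Engine 1":
        problems.append("Field confidence is available with Engine 1 only")
    if config.ocr_grounding and config.engine != "Engine 1":
        problems.append("OCR grounding is available with Engine 1 only")
    if len(config.custom_prompt) > MAX_CUSTOM_PROMPT_CHARS:
        problems.append("Custom prompt exceeds 2000 characters")
    return problems
```

For example, `validate(ExtractConfig(engine="Engine 2", schema={"type": "object"}, field_confidence=True))` reports the Engine 1 restriction on Field confidence.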
Engines
- Engine 1: Faster, lower cost. Best for simple documents with clear structure.
- Engine 2: More capable. Best for complex documents, handwriting, or ambiguous layouts.
Chunk strategy
Long documents are split into chunks before extraction. The Chunk Strategy controls how those chunks are processed.
- Parallel (default) runs chunks concurrently for the fastest wall-clock time.
- Sequential runs chunks one at a time. It’s slower but can improve accuracy on long documents with tables or array-heavy schemas, where context from earlier chunks helps the model align rows in later ones.
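The difference between the two strategies can be illustrated with a small sketch. This is not how DocPipe implements chunking; `run_parallel`, `run_sequential`, and the stand-in `fake_extract` function are hypothetical and exist only to show why earlier context can help later chunks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical signature: a chunk plus optional rows from earlier chunks.
ExtractFn = Callable[[str, Optional[list]], list]


def run_parallel(chunks: list[str], extract_chunk: ExtractFn) -> list:
    """Process chunks concurrently; no chunk sees another chunk's output."""
    with ThreadPoolExecutor() as pool:
        # map() returns results in input order even though the work overlaps.
        per_chunk = list(pool.map(lambda c: extract_chunk(c, None), chunks))
    return [row for rows in per_chunk for row in rows]


def run_sequential(chunks: list[str], extract_chunk: ExtractFn) -> list:
    """Process chunks in order, passing the rows found so far as context."""
    rows: list = []
    for chunk in chunks:
        rows.extend(extract_chunk(chunk, list(rows)))
    return rows


def fake_extract(chunk: str, context: Optional[list]) -> list:
    # Stand-in for the model call: numbers table rows, continuing the
    # numbering from the context when one is provided.
    start = len(context) if context else 0
    return [(start + i + 1, line) for i, line in enumerate(chunk.splitlines())]


chunks = ["widget\ngadget", "gizmo"]
parallel_rows = run_parallel(chunks, fake_extract)
sequential_rows = run_sequential(chunks, fake_extract)
```

In this toy setup the parallel run restarts row numbering in the second chunk, while the sequential run continues it, which mirrors the row-alignment benefit described above.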
Custom prompt for schema generation
When using Generate to let AI create a schema from a sample document, you can provide a custom prompt to specify which fields to extract. The AI will include only the fields described in your prompt, ignoring other visible data in the document. This is useful when a document contains many fields but you only need a subset.
JSON path
Leave JSON path blank to keep the full extraction output. Set it when you want to narrow the result down to a specific section of the extracted JSON. For example, use it to pull a single nested array out and pass only that array to downstream nodes.
Schema field types
The schema editor accepts the JSON Schema primitives (string, number, integer, boolean, object, array) plus three semantic types: date, email, and phone. Semantic types serialize to string with a JSON Schema format, so any downstream consumer that reads the schema continues to work.
date fields take an output format on the field itself (ISO 8601, US, EU, or Long). Extract normalizes the value it reads from the document to that format, regardless of how the document writes the date. See date output formats in the schema design guide.
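As a rough illustration of the two points above, the sketch below shows one way semantic types could appear in a schema and how the four date output formats might render the same value. The `format` keywords and the exact renderings of US, EU, and Long are assumptions; this page does not specify them, so check the schema design guide for the authoritative forms.

```python
from datetime import date

# Assumed serialization: semantic types become strings with a JSON Schema
# "format". The exact keywords DocPipe emits are not stated on this page.
schema = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string", "format": "date"},
        "contact_email": {"type": "string", "format": "email"},
        "contact_phone": {"type": "string", "format": "phone"},
        "total": {"type": "number"},
    },
}

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def format_date(value: date, output_format: str) -> str:
    """Render a date in one of the four named output formats (assumed layouts)."""
    if output_format == "ISO 8601":
        return value.isoformat()
    if output_format == "US":
        return value.strftime("%m/%d/%Y")
    if output_format == "EU":
        return value.strftime("%d/%m/%Y")
    if output_format == "Long":
        return f"{MONTHS[value.month - 1]} {value.day}, {value.year}"
    raise ValueError(f"Unknown output format: {output_format!r}")


# A document might write this date as "5 Mar 2024" or "3/5/24"; after
# normalization the field holds a single, predictable representation.
parsed = date(2024, 3, 5)
iso_value = format_date(parsed, "ISO 8601")
```

Because every semantic type in this sketch is still a `string` at the JSON Schema level, a consumer that only understands the primitives keeps working.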
Unverified fields
When Field confidence is enabled, fields whose confidence score falls below an internal threshold are marked as unverified in the extraction result. You can pair this with a Review node configured with Pause mode = Unverified Fields to automatically queue runs for human review whenever the extraction is uncertain.
Bounding box overlay
After an extraction run completes, the document viewer displays colored bounding box overlays on the original document showing where each field value was found:
- The overlay color indicates the confidence tier: green for high confidence, amber for medium, red for low.
- Toggle the overlay on or off using the bbox toggle above the document viewer.
- Switch between overlay view and table view to inspect all detected regions at once.
- Hover over a bounding box to see the exact confidence score.
- Click a bounding box to select the corresponding field in the side panel.
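The mapping from a confidence score to an overlay color can be pictured as a simple tiering function. The sketch below is illustrative only: the cut-off values `HIGH_CUTOFF` and `MEDIUM_CUTOFF` are hypothetical, since this page does not publish the real tier boundaries.

```python
# Hypothetical cut-offs; the actual tier boundaries are not documented here.
HIGH_CUTOFF = 0.9
MEDIUM_CUTOFF = 0.6


def overlay_color(confidence: float) -> str:
    """Map a 0-1 confidence score to the overlay color for its tier."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= HIGH_CUTOFF:
        return "green"
    if confidence >= MEDIUM_CUTOFF:
        return "amber"
    return "red"
```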
Inputs and outputs
Allowed inputs: Trigger nodes, route, parse, review.
Output: Structured JSON data matching the configured schema.
Related
- Extract best practices: Engine choice, confidence patterns, and pitfalls
- Schema design: How to write JSON Schema for DocPipe extractions
- Review action: Pause runs with unverified fields for human review
- Validation action: Enforce required fields and formats after extraction