The extract node uses AI to pull structured data out of documents. It is the core action node for document processing in Ingestly: you define a schema of the fields you want, and the AI extracts them from the document content.

When to use extract

  • You want structured fields out of a document: invoices, forms, contracts, statements, IDs.
  • You want the AI to handle OCR and field-finding in one step. Extract runs OCR internally; you don’t need a separate parse node in front of it.
  • You want a confidence signal per field for downstream review. Enable Field confidence, which is available with Engine 1 only.
Use parse instead when you only need raw text, and classify when the goal is to sort documents rather than read fields off them.
For schema and engine guidance, see extract best practices and schema design.

Configuration

Field | Type | Required | Description
--- | --- | --- | ---
Engine | select | Yes | Extraction engine: Engine 1 or Engine 2
Schema | schema editor | Yes | Defines the fields to extract (name, type, description)
Precision | select | No | Processing precision: Auto (default), Small, Medium, or High. Auto lets Ingestly pick the tier per document based on your schema and document content; Small, Medium, and High override it
Chunk Strategy | select | No | How document chunks are processed: Parallel (default) or Sequential
Extraction hints | string | No | Natural language instructions to guide the AI extraction
Field confidence | toggle | No | When enabled, the AI returns a confidence score (0–1) for each extracted field. Available with Engine 1 only
Force OCR | toggle | No | Forces OCR processing on the document even when it contains selectable text. Useful for scanned PDFs with inaccurate text layers
OCR grounding | toggle | No | Uses OCR output to ground LLM extraction, improving accuracy on complex layouts. Available with Engine 1 only
JSON path | string | No | JSONPath expression to extract a subset of the output
Custom prompt | string | No | Natural language instructions for AI schema generation (max 2000 characters)
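
The constraints in the table above can be checked before a node is saved. The following is a minimal sketch, not Ingestly's API: the `ExtractConfig` class, its attribute names, and the `validate` helper are hypothetical, and only the allowed values, the Engine 1 restrictions, and the 2000-character limit come from the table.

```python
from dataclasses import dataclass

# Allowed values taken from the configuration table.
ENGINES = {"Engine 1", "Engine 2"}
PRECISIONS = {"Auto", "Small", "Medium", "High"}
CHUNK_STRATEGIES = {"Parallel", "Sequential"}
MAX_CUSTOM_PROMPT_CHARS = 2000


@dataclass
class ExtractConfig:
    """Hypothetical in-memory model of an extract node's settings."""
    engine: str
    schema: dict
    precision: str = "Auto"
    chunk_strategy: str = "Parallel"
    field_confidence: bool = False
    ocr_grounding: bool = False
    custom_prompt: str = ""


def validate(config: ExtractConfig) -> list[str]:
    """Return human-readable problems; an empty list means the config is consistent."""
    problems = []
    if config.engine not in ENGINES:
        problems.append(f"Unknown engine: {config.engine}")
    if config.precision not in PRECISIONS:
        problems.append(f"Unknown precision: {config.precision}")
    if config.chunk_strategy not in CHUNK_STRATEGIES:
        problems.append(f"Unknown chunk strategy: {config.chunk_strategy}")
    # Field confidence and OCR grounding are documented as Engine 1 only.
    if config.engine != "Engine 1":
        if config.field_confidence:
            problems.append("Field confidence requires Engine 1")
        if config.ocr_grounding:
            problems.append("OCR grounding requires Engine 1")
    if len(config.custom_prompt) > MAX_CUSTOM_PROMPT_CHARS:
        problems.append("Custom prompt exceeds 2000 characters")
    return problems


print(validate(ExtractConfig(engine="Engine 2", schema={}, field_confidence=True)))
```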

Engines

  • Engine 1: Faster, lower cost. Best for simple documents with clear structure.
  • Engine 2: More capable. Best for complex documents, handwriting, or ambiguous layouts.

Precision

Auto (default) lets Ingestly pick the tier per document. Small, Medium, and High pin a fixed tier; a higher tier is more accurate but slower and costs more credits. Because the tier can vary under Auto, the editor cannot show a single per-page cost, so the node displays a 1–6 credits/page range badge. The exact charge is shown after the run.
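
For budgeting, the Auto badge can be turned into a rough lower and upper bound for a document. This is a back-of-the-envelope sketch: the helper name is hypothetical, and it assumes every page is billed somewhere inside the 1 to 6 credits/page range shown on the badge.

```python
# Range shown on the Auto badge in the editor.
AUTO_MIN_CREDITS_PER_PAGE = 1
AUTO_MAX_CREDITS_PER_PAGE = 6


def estimate_auto_credits(page_count: int) -> tuple[int, int]:
    """Return (lowest, highest) possible credit charge for a document on Auto.

    Assumes each page is charged within the badge range; the exact charge
    is only known after the run.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return (
        page_count * AUTO_MIN_CREDITS_PER_PAGE,
        page_count * AUTO_MAX_CREDITS_PER_PAGE,
    )


low, high = estimate_auto_credits(12)
print(f"12 pages on Auto: between {low} and {high} credits")
```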

Chunk strategy

Long documents are split into chunks before extraction. The Chunk Strategy controls how those chunks are processed.
  • Parallel (default) runs chunks concurrently for the fastest wall-clock time.
  • Sequential runs chunks one at a time. It’s slower but can improve accuracy on long documents with tables or array-heavy schemas, where context from earlier chunks helps the model align rows in later ones.
Start with Parallel. Switch to Sequential if you see misaligned rows or duplicated array items on long documents.
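
The difference between the two strategies can be illustrated with a toy model. This is a conceptual sketch only and not how Ingestly is implemented: `extract_chunk` is a hypothetical stand-in for a model call, and the "context" is simply the rows seen in earlier chunks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def extract_chunk(chunk: str, context: Optional[list] = None) -> dict:
    """Stand-in for one extraction call; records how much earlier context it saw."""
    return {"rows": chunk.split(), "context_rows_seen": len(context or [])}


def run_parallel(chunks: list) -> list:
    """Process chunks concurrently: fastest, but no chunk sees another's output."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order even though calls run concurrently.
        return list(pool.map(extract_chunk, chunks))


def run_sequential(chunks: list) -> list:
    """Process chunks one at a time, feeding earlier rows forward as context."""
    results, context = [], []
    for chunk in chunks:
        result = extract_chunk(chunk, context)
        results.append(result)
        context.extend(result["rows"])
    return results


chunks = ["row1 row2", "row3"]
print(run_parallel(chunks))
print(run_sequential(chunks))
```

In the sequential run the second chunk sees two rows of earlier context, which is the property that helps with row alignment on long tables; in the parallel run every chunk starts from nothing.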

Custom prompt for schema generation

When using Generate to let AI create a schema from a sample document, you can provide a custom prompt to specify which fields to extract. The AI will include only the fields described in your prompt, ignoring other visible data in the document. This is useful when a document contains many fields but you only need a subset.
You can click Generate with an empty schema or with only a few fields partially filled in. Ingestly no longer requires every field to be populated before running generation.

JSON path

Leave JSON path blank to keep the full extraction output. Set it when you want to narrow the result down to a specific section of the extracted JSON. For example, use it to pull a single nested array out and pass only that array to downstream nodes.
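
To make the narrowing concrete, here is a small sketch of what a path such as `$.invoice.line_items` selects. The sample output and field names are invented for illustration, and the resolver handles only simple dotted paths; it is not a full JSONPath implementation and does not describe which JSONPath features Ingestly supports.

```python
def select_path(data, path: str):
    """Resolve a simple dotted path like '$.invoice.line_items'.

    A blank path returns the full output, mirroring the blank-field behavior.
    Only '$' followed by dotted keys is supported in this sketch.
    """
    if not path:
        return data
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = data
    for key in path[1:].split("."):
        if key:  # skip the empty segment produced by the leading '.'
            current = current[key]
    return current


# Invented extraction output for illustration.
extraction = {
    "invoice": {
        "number": "INV-001",
        "line_items": [
            {"description": "Widget", "amount": 10.0},
            {"description": "Gadget", "amount": 5.5},
        ],
    }
}

print(select_path(extraction, "$.invoice.line_items"))
```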

Schema field types

The schema editor accepts the JSON Schema primitives (string, number, integer, boolean, object, array) plus three semantic types: date, email, and phone. Semantic types serialize to string with a JSON Schema format, so any downstream consumer that reads the schema continues to work. date fields take an output format on the field itself (ISO 8601, US, EU, or Long). Extract normalizes the value it reads from the document to that format, regardless of how the document writes the date. See date output formats in the schema design guide.
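
The sketch below shows the idea of a date field that serializes to a string with a format, and of normalizing differently written dates to one output format. The concrete patterns behind ISO 8601, US, EU, and Long are assumptions made for this example (see the schema design guide for the real ones), as are the input patterns and the helper name.

```python
from datetime import datetime

# A semantic date field expressed as a string with a JSON Schema format.
invoice_date_field = {"type": "string", "format": "date"}

# Assumed input spellings this sketch can read; real documents vary far more.
INPUT_PATTERNS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

# Assumed meaning of the four output formats; not confirmed by the source.
OUTPUT_PATTERNS = {
    "ISO 8601": "%Y-%m-%d",
    "US": "%m/%d/%Y",
    "EU": "%d/%m/%Y",
    "Long": "%B %d, %Y",
}


def normalize_date(raw: str, output_format: str = "ISO 8601") -> str:
    """Parse a date in any known input pattern and emit it in the chosen format."""
    for pattern in INPUT_PATTERNS:
        try:
            parsed = datetime.strptime(raw.strip(), pattern)
        except ValueError:
            continue  # try the next input pattern
        return parsed.strftime(OUTPUT_PATTERNS[output_format])
    raise ValueError(f"Unrecognized date: {raw!r}")


print(normalize_date("March 5, 2024"))
print(normalize_date("2024-03-05", "EU"))
```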

Unverified fields

When Field confidence is enabled, fields whose confidence score falls below an internal threshold are marked as unverified in the extraction result. You can pair this with a Review node configured with Pause mode = Unverified Fields to automatically queue runs for human review whenever the extraction is uncertain.
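
The same idea can be mirrored in downstream code when you consume per-field scores yourself. This is a sketch under stated assumptions: Ingestly's threshold is internal and not documented here, so `ASSUMED_THRESHOLD` is an arbitrary placeholder, and the shape of the confidence data is hypothetical.

```python
# Placeholder value only; the real threshold is internal to Ingestly.
ASSUMED_THRESHOLD = 0.8


def unverified_fields(confidences: dict, threshold: float = ASSUMED_THRESHOLD) -> list:
    """Return the names of fields whose confidence falls below the threshold."""
    return [name for name, score in confidences.items() if score < threshold]


def should_pause_for_review(confidences: dict, threshold: float = ASSUMED_THRESHOLD) -> bool:
    """A run is queued for review when at least one field is unverified."""
    return bool(unverified_fields(confidences, threshold))


# Invented scores for illustration.
scores = {"invoice_number": 0.97, "total": 0.42, "due_date": 0.79}
print(unverified_fields(scores))
print(should_pause_for_review(scores))
```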

Advanced duplicate-value disambiguation

A collapsible Advanced duplicate-value disambiguation section appears under the extract editor only when Engine 1 is selected and OCR grounding is enabled. Its settings control what the node does when two or more fields appear to have been read from the same source text.
  • Duplicate Verifier (toggle): runs a short follow-up step to disambiguate two fields that appear to have landed on the same source text.
  • Ensemble Verification (toggle): runs the extraction multiple times in parallel and keeps the most consistent result, for when accuracy matters more than throughput.
  • Higher-Tier Verifier (toggle, disabled until Duplicate Verifier is on): runs the disambiguation step one precision tier above the main extraction. It has no effect when the extraction already runs at the highest tier.
  • Max Fields per Verifier Call (number, 1 to 8, default 4): the maximum number of colliding fields the verifier resolves in a single call. Collisions involving more fields than this are skipped.
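
One way to picture the Max Fields per Verifier Call limit is as a grouping step: fields that landed on the same source text form a collision, and only collisions within the limit go to the verifier. The sketch below is an illustration of that reading, not Ingestly's implementation; the function name and the mapping of field name to source text are hypothetical.

```python
from collections import defaultdict


def plan_verifier_calls(field_sources: dict, max_fields: int = 4):
    """Split collisions into those sent to the verifier and those skipped.

    field_sources maps a field name to the source text it was read from.
    Returns (to_verify, skipped), each a list of field-name groups.
    """
    if not 1 <= max_fields <= 8:
        raise ValueError("max_fields must be between 1 and 8")
    by_source = defaultdict(list)
    for field_name, source_text in field_sources.items():
        by_source[source_text].append(field_name)

    to_verify, skipped = [], []
    for fields in by_source.values():
        if len(fields) < 2:
            continue  # a single field on a source text is not a collision
        if len(fields) <= max_fields:
            to_verify.append(fields)
        else:
            skipped.append(fields)  # too many colliding fields for one call
    return to_verify, skipped


# Invented example: two fields resolved to the same text on the page.
sources = {"total": "$100.00", "subtotal": "$100.00", "date": "2024-01-01"}
print(plan_verifier_calls(sources))
```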

Bounding box overlay

After an extraction run completes, the document viewer displays colored bounding box overlays on the original document showing where each field value was found:
  • The overlay color indicates the confidence tier: green for high confidence, amber for medium, red for low.
  • Toggle the overlay on or off using the bbox toggle above the document viewer.
  • Switch between overlay view and table view to inspect all detected regions at once.
  • Hover over a bounding box to see the exact confidence score.
  • Click a bounding box to select the corresponding field in the side panel.

Linking schema fields to the document in the builder

When you open the extract node in the editor and select a document, the document preview panel shows the same bounding box overlays you see on the Runs page.
  • Selecting a schema row in the editor draws a connector line from the row to every matching bbox on the document; for array items the line splits so each row in the document lights up.
  • Clicking a bbox highlights the matching schema row and scrolls it into view.
  • A small chip on each leaf row shows the match count, and a warning appears for fields that did not match anywhere in the document.
  • If you edit the schema or prompt after the last run, the preview panel shows a stale-run banner so you know the overlays may not reflect the current configuration. Use test step to refresh them.

Inputs and outputs

  • Allowed inputs: trigger nodes, route, parse, review.
  • Output: structured JSON data matching the configured schema.

Extract best practices

Engine choice, confidence patterns, and pitfalls

Schema design

How to write JSON Schema for Ingestly extractions

Review action

Pause runs with unverified fields for human review

Validation action

Enforce required fields and formats after extraction