The extraction schema defines what structured data the AI extracts from your documents. A well-designed schema produces more accurate and consistent results.
Schema structure
A schema uses JSON Schema Draft-07 format. The root must be an object type, with properties defining each field.
The required array lists the fields that should always be extracted. Fields not listed in required may return null if they are not found in the document.
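As a minimal sketch of this structure, the schema below declares three fields and marks two of them as required. The field names are illustrative assumptions, and the $comment keyword is standard Draft-07 but may or may not be preserved by the schema editor, so drop it if it causes trouble.

```json
{
  "$comment": "Illustrative sketch: the field names are assumptions, not product defaults.",
  "type": "object",
  "properties": {
    "vendor_name": { "type": "string" },
    "invoice_number": { "type": "string" },
    "total": { "type": "number" }
  },
  "required": ["vendor_name", "invoice_number"]
}
```

Here total is not listed in required, so it may come back as null when the document does not show it.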
Field types
The schema editor offers JSON Schema primitives plus a few semantic types that map to JSON Schema format annotations under the hood.
| Type | JSON Schema serialization | Example value |
|---|---|---|
| string | { "type": "string" } | "Acme Corp" |
| number | { "type": "number" } | 1250.00 |
| integer | { "type": "integer" } | 42 |
| boolean | { "type": "boolean" } | true |
| object | { "type": "object", "properties": {...} } | See below |
| array | { "type": "array", "items": {...} } | See below |
| date | { "type": "string", "format": "date", "x-docpipe": { "outputFormat": "iso" } } | "2024-08-27" |
| email | { "type": "string", "format": "email" } | "vendor@acme.com" |
| phone | { "type": "string", "format": "phone" } | "+1 415 555 0100" |
The semantic types (date, email, phone) serialize to string with a format, so any downstream consumer that reads JSON Schema continues to work. The date type also stores the chosen output format in an x-docpipe extension.
Date output formats
When you pick the date type in the schema editor, you choose one of four output formats. The extract node normalizes the value it reads from the document into the format you select, so dates land in your pipeline output looking the same regardless of how the source document writes them (27/08/2024, Aug 27 2024, 2024-08-27, etc.).
| Output format | x-docpipe.outputFormat | Example |
|---|---|---|
| ISO 8601 | iso | 2024-08-27 |
| US | us | 08/27/2024 |
| EU | eu | 27/08/2024 |
| Long | long | August 27, 2024 |
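Following the serialization pattern shown in the Field types table, a date field set to the US output format might serialize as in the sketch below. The field name due_date is an illustrative assumption.

```json
{
  "$comment": "Illustrative sketch: due_date is an assumed field name.",
  "type": "object",
  "properties": {
    "due_date": {
      "type": "string",
      "format": "date",
      "x-docpipe": { "outputFormat": "us" }
    }
  },
  "required": ["due_date"]
}
```

With this setting, a source date written as 27/08/2024 or Aug 27 2024 would be expected to arrive in the output as 08/27/2024.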
Nested fields and arrays
For repeating data like line items, use the array type with an items object that defines the structure of each element.
Note that items is a single schema object (not an array) that describes the shape of each array element.
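A minimal sketch of an array of line items, assuming hypothetical field names (line_items, description, quantity, unit_price, amount):

```json
{
  "$comment": "Illustrative sketch: all field names are assumptions.",
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}
```

This is two levels of nesting: a root object containing an array of objects.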
Table extraction
For documents with tabular data (invoices, purchase orders, statements), use arrays to capture table rows. The AI identifies table structures and maps columns to your defined fields.
Tips for table extraction:
- Name fields to match common column headers
- Include a description that mentions the column header name if it differs from the field name
- Test with documents that have varying table formats
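The tips above can be sketched as a row schema whose descriptions name the column headers the values are likely to sit under. The field names and header wordings are illustrative, and the sketch assumes that a field's description is stored in the standard JSON Schema description keyword.

```json
{
  "$comment": "Illustrative sketch: field names and column headers are assumptions.",
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {
            "type": "string",
            "description": "Text from the 'Description' or 'Item' column"
          },
          "quantity": {
            "type": "number",
            "description": "Value from the 'Qty' column"
          },
          "unit_price": {
            "type": "number",
            "description": "Value from the 'Rate' or 'Unit Price' column"
          }
        }
      }
    }
  }
}
```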
Best practices
Use specific field names
Choose field names that clearly indicate the data. The field name is the first signal the AI uses when several values on the page could plausibly fit.
Write descriptions that locate the value
Descriptions tell the AI where to look, not just what the field means. Anchor each description to a section heading, label, or visual region from the document.
Keep nesting shallow
Two levels (an object containing an array of objects) is fine. Deeper structures get fragile because the AI has to populate every level consistently.
Extract only what’s on the page
The extract node is a reader, not a calculator. Asking it to compute, infer, or normalize a value that isn’t visible in the document leads to fabricated answers. Extract the values that are printed on the document (subtotal, tax_amount, due_date, currency) and compute the rest in a transform node, where the math is deterministic.
Add extraction instructions
Use the extract action’s Instructions field to provide context the AI can use:
- “Dates should be in ISO 8601 format (YYYY-MM-DD)”
- “If a field is not found in the document, return null”
- “The total should include tax. If tax is listed separately, add it to the subtotal”
Start simple, iterate
Begin with a small number of high-value fields. Test with real documents, review the results, and gradually add more fields as you confirm accuracy.
Handle missing data
Not every document contains every field. The AI returns null for fields it cannot find. Design your downstream processing to handle missing values gracefully.
Example schemas
Invoice
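A minimal sketch of an invoice schema that applies the practices above: specific field names, descriptions that point at labels on the page, and two levels of nesting. All field names and descriptions are illustrative assumptions, as is the use of the standard description keyword.

```json
{
  "$comment": "Illustrative sketch: field names and descriptions are assumptions.",
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Value next to the 'Invoice #' or 'Invoice No.' label"
    },
    "vendor_name": {
      "type": "string",
      "description": "Company name in the header or letterhead"
    },
    "invoice_date": {
      "type": "string",
      "format": "date",
      "x-docpipe": { "outputFormat": "iso" }
    },
    "due_date": {
      "type": "string",
      "format": "date",
      "x-docpipe": { "outputFormat": "iso" }
    },
    "currency": { "type": "string" },
    "subtotal": { "type": "number" },
    "tax_amount": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "amount": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoice_number", "vendor_name", "invoice_date"]
}
```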
Receipt
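A minimal sketch of a receipt schema, kept deliberately small as a starting point to iterate on. All field names and descriptions are illustrative assumptions.

```json
{
  "$comment": "Illustrative sketch: field names and descriptions are assumptions.",
  "type": "object",
  "properties": {
    "merchant_name": {
      "type": "string",
      "description": "Store or business name printed at the top of the receipt"
    },
    "transaction_date": {
      "type": "string",
      "format": "date",
      "x-docpipe": { "outputFormat": "iso" }
    },
    "total": {
      "type": "number",
      "description": "Amount next to the 'Total' label"
    },
    "payment_method": { "type": "string" },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "number" }
        }
      }
    }
  },
  "required": ["merchant_name", "total"]
}
```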
Related
- Extract action: Configuration reference for the extract node
- Extract best practices: Engine choice, confidence, and review patterns
- Validation action: Enforce required fields and formats after extraction
- Review action: Pause runs with unverified fields for human review