Schema design

The extraction schema defines what structured data the AI extracts from your documents. A well-designed schema produces more accurate and consistent results.

This guide focuses on schema mechanics (types, nesting, required fields, examples). For broader extraction guidance (engine choice, confidence, review workflows), see Extract best practices.

Schema structure

A schema uses JSON Schema Draft-07 format. The root must be an object type with properties defining each field:

{
  "type": "object",
  "properties": {
    "vendor_name": {
      "type": "string",
      "description": "Name of the vendor or supplier"
    },
    "total_amount": {
      "type": "number",
      "description": "Total amount due including tax"
    }
  },
  "required": ["vendor_name", "total_amount"]
}

The required array lists fields that should always be extracted. Fields not listed in required may return null if not found in the document.
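
The relationship between the root type, properties, and required can be checked before a schema is used. The following is a minimal sketch in Python using only the standard library; check_schema_root is a hypothetical helper, not part of the product, and the rule that the root needs at least one property is an assumption drawn from the description above.

```python
import json


def check_schema_root(schema: dict) -> list[str]:
    """Return a list of problems found at the root of an extraction schema."""
    problems = []
    if schema.get("type") != "object":
        problems.append("root type must be 'object'")
    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        # Assumption: a root without properties has nothing to extract.
        problems.append("root must define at least one property")
        properties = {}
    # Every name in "required" should point at a declared property.
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f"required field '{name}' is not defined in properties")
    return problems


schema = json.loads("""
{
  "type": "object",
  "properties": {
    "vendor_name": {"type": "string", "description": "Name of the vendor or supplier"},
    "total_amount": {"type": "number", "description": "Total amount due including tax"}
  },
  "required": ["vendor_name", "total_amount"]
}
""")

print(check_schema_root(schema))  # prints []
```

A check like this catches the common mistake of renaming a property and leaving the old name in required.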

Field types

The schema editor offers JSON Schema primitives plus a few semantic types that map to JSON Schema format annotations under the hood.

Type	JSON Schema serialization	Example value
`string`	`{ "type": "string" }`	`"Acme Corp"`
`number`	`{ "type": "number" }`	`1250.00`
`integer`	`{ "type": "integer" }`	`42`
`boolean`	`{ "type": "boolean" }`	`true`
`object`	`{ "type": "object", "properties": {...} }`	See below
`array`	`{ "type": "array", "items": {...} }`	See below
`date`	`{ "type": "string", "format": "date", "x-docpipe": { "outputFormat": "iso" } }`	`"2024-08-27"`
`email`	`{ "type": "string", "format": "email" }`	`"vendor@acme.com"`
`phone`	`{ "type": "string", "format": "phone" }`	`"+1 415 555 0100"`

Semantic types (date, email, phone) serialize to string with a format, so any downstream consumer that reads JSON Schema continues to work. The date type also stores the chosen output format in an x-docpipe extension.

Date output formats

When you pick the date type in the schema editor, you choose one of four output formats. The extract node normalizes the value it reads from the document into the format you select, so dates land in your pipeline output looking the same regardless of how the source document writes them (27/08/2024, Aug 27 2024, 2024-08-27, etc.).

Output format	`x-docpipe.outputFormat`	Example
ISO 8601	`iso`	`2024-08-27`
US	`us`	`08/27/2024`
EU	`eu`	`27/08/2024`
Long	`long`	`August 27, 2024`
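
To make the four formats concrete, here is a small Python sketch that renders one date in each of them. It only mirrors the table above and says nothing about how the extract node performs normalization internally; render_date is a hypothetical name, and downstream code might use something similar to re-render an ISO value for display.

```python
from datetime import date

# Month names are spelled out so the result does not depend on the system locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def render_date(value: date, output_format: str) -> str:
    """Render a date using one of the outputFormat keys from the table."""
    if output_format == "iso":
        return value.isoformat()
    if output_format == "us":
        return value.strftime("%m/%d/%Y")
    if output_format == "eu":
        return value.strftime("%d/%m/%Y")
    if output_format == "long":
        return f"{MONTHS[value.month - 1]} {value.day}, {value.year}"
    raise ValueError(f"unknown output format: {output_format}")


d = date(2024, 8, 27)
for key in ("iso", "us", "eu", "long"):
    print(key, render_date(d, key))
```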

Nested fields and arrays

For repeating data like line items, use the array type with an items object that defines the structure of each element:

{
  "line_items": {
    "type": "array",
    "description": "Individual line items on the invoice",
    "items": {
      "type": "object",
      "properties": {
        "description": { "type": "string", "description": "Item description" },
        "quantity": { "type": "number", "description": "Quantity ordered" },
        "unit_price": { "type": "number", "description": "Price per unit" },
        "amount": { "type": "number", "description": "Line total" }
      }
    }
  }
}

Note that items is a single schema object (not an array) describing the shape of each array element.
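
Because items is one schema, the same field definitions apply to every row. The sketch below checks a list of rows against a single items schema in Python. It is an illustration under assumptions: element_matches is a hypothetical helper, it handles primitive field types only, and the sample rows are made up.

```python
# Python types accepted for each primitive JSON Schema type.
PRIMITIVES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def element_matches(element, item_schema: dict) -> bool:
    """Check one array element against the single `items` schema."""
    if not isinstance(element, dict):
        return False
    for name, value in element.items():
        field = item_schema["properties"].get(name)
        if field is None:
            return False  # key not declared in the schema
        if value is None:
            continue  # a field that was not found comes back as null
        expected = PRIMITIVES.get(field["type"])
        if expected is None:
            return False  # nested objects/arrays are out of scope for this sketch
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and field["type"] != "boolean":
            return False
        if not isinstance(value, expected):
            return False
    return True


item_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "unit_price": {"type": "number"},
        "amount": {"type": "number"},
    },
}

rows = [
    {"description": "Widget", "quantity": 2, "unit_price": 9.5, "amount": 19.0},
    {"description": "Shipping", "quantity": None, "unit_price": None, "amount": 5.0},
]
print(all(element_matches(row, item_schema) for row in rows))
```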

Table extraction

For documents with tabular data (invoices, purchase orders, statements), use arrays to capture table rows. The AI identifies table structures and maps columns to your defined fields. Tips for table extraction:

Name fields to match common column headers
Include a description that mentions the column header name if it differs from the field name
Test with documents that have varying table formats

Best practices

Use specific field names

Choose field names that clearly indicate the data. The field name is the first signal the AI uses when several values on the page could plausibly fit.

// Effective
"invoice_date": { "type": "string", "description": "Invoice issue date (ISO 8601)" },
"due_date":     { "type": "string", "description": "Payment due date (ISO 8601)" },
"ship_date":    { "type": "string", "description": "Date goods shipped (ISO 8601)" }

// Problematic
"date1": { "type": "string" },
"date2": { "type": "string" },
"date3": { "type": "string" }

Write descriptions that locate the value

Descriptions tell the AI where to look, not just what the field means. Anchor each description to a section heading, label, or visual region from the document.

// Effective
"total_amount": {
  "type": "number",
  "description": "Grand total in the bottom-right summary box, including tax and shipping"
}

// Problematic
"total_amount": {
  "type": "number",
  "description": "The total"
}

When a document has several plausible candidates (subtotal, tax, total, balance due), a locating description is the simplest fix.

Keep nesting shallow

Two levels (an object containing an array of objects) are fine. Deeper structures become fragile because the AI has to populate every level consistently.

// Effective: flat top level, line items at depth 2
{
  "vendor_name":  { "type": "string" },
  "total_amount": { "type": "number" },
  "line_items": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "description": { "type": "string" },
        "amount":      { "type": "number" }
      }
    }
  }
}

// Problematic: nesting added for organizational reasons only
{
  "header": {
    "type": "object",
    "properties": {
      "vendor": {
        "type": "object",
        "properties": {
          "identity": {
            "type": "object",
            "properties": {
              "name": { "type": "string" }
            }
          }
        }
      }
    }
  }
}

Reach for nesting only when the document itself has nested structure (line items, parties, addresses). Otherwise flatten.
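
Nesting depth can be measured mechanically. The Python sketch below counts object levels and treats an array as taking the depth of its items, so an object containing an array of objects comes out as 2. nesting_depth is a hypothetical helper, and the counting convention is an assumption chosen to match the "two levels" wording above.

```python
def nesting_depth(schema: dict) -> int:
    """Count nested object levels; an array takes the depth of its items."""
    kind = schema.get("type")
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((nesting_depth(child) for child in children), default=0)
    if kind == "array":
        return nesting_depth(schema.get("items", {}))
    return 0  # primitives add no depth


shallow = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"amount": {"type": "number"}},
            },
        },
    },
}

deep = {
    "type": "object",
    "properties": {
        "header": {
            "type": "object",
            "properties": {
                "vendor": {
                    "type": "object",
                    "properties": {
                        "identity": {
                            "type": "object",
                            "properties": {"name": {"type": "string"}},
                        }
                    },
                }
            },
        }
    },
}

print(nesting_depth(shallow), nesting_depth(deep))  # prints 2 4
```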

Extract only what’s on the page

The extract node is a reader, not a calculator. Asking it to compute, infer, or normalize a value that isn’t visible in the document leads to fabricated answers.

// Problematic: derived values
"tax_rate":      { "type": "number", "description": "Tax rate as a decimal" },
"days_overdue":  { "type": "number", "description": "Days past due date" },
"total_in_usd":  { "type": "number", "description": "Total converted to USD" }

Extract the raw inputs (subtotal, tax_amount, due_date, currency) and compute the rest in a transform node, where the math is deterministic.
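
As an illustration of computing derived values outside extraction, the Python sketch below derives tax_rate and days_overdue from raw extracted fields. It is plain Python standing in for whatever logic a transform node runs; derive_fields and the sample values are hypothetical, and it assumes due_date was extracted with the iso output format. Currency conversion is left out because it needs an exchange-rate source.

```python
from datetime import date


def derive_fields(extracted: dict, today: date) -> dict:
    """Compute derived values from raw extracted fields, leaving gaps as None."""
    subtotal = extracted.get("subtotal")
    tax_amount = extracted.get("tax_amount")
    due_date = extracted.get("due_date")

    derived = {"tax_rate": None, "days_overdue": None}
    # Skip the division when either input is missing or the subtotal is zero.
    if subtotal and tax_amount is not None:
        derived["tax_rate"] = round(tax_amount / subtotal, 4)
    if due_date:
        days = (today - date.fromisoformat(due_date)).days
        derived["days_overdue"] = max(days, 0)  # not yet due counts as 0
    return derived


extracted = {"subtotal": 1000.0, "tax_amount": 80.0, "due_date": "2024-08-27"}
print(derive_fields(extracted, today=date(2024, 9, 6)))
```

Passing today in as a parameter keeps the function deterministic and easy to test.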

Add extraction instructions

Use the extract action’s Instructions field to provide context the AI can use:

“Dates should be in ISO 8601 format (YYYY-MM-DD)”
“If a field is not found in the document, return null”
“The total should include tax. If both a subtotal and a grand total are printed, use the grand total”

Start simple, iterate

Begin with a small number of high-value fields. Test with real documents, review the results, and gradually add more fields as you confirm accuracy.

Handle missing data

Not every document contains every field. The AI returns null for fields it cannot find. Design your downstream processing to handle missing values gracefully.
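
One way to handle nulls downstream is to separate missing required fields from missing optional ones before acting on a result. The Python sketch below shows the idea; find_missing and the sample result are hypothetical, and the shape of the extract output (a flat dictionary with None for fields that were not found) is an assumption.

```python
def find_missing(extracted: dict, required: list[str]) -> tuple[list[str], list[str]]:
    """Split null fields into (missing required, missing optional)."""
    # A required field counts as missing if it is absent or null.
    missing_required = [name for name in required if extracted.get(name) is None]
    missing_optional = [
        name for name, value in extracted.items()
        if value is None and name not in required
    ]
    return missing_required, missing_optional


result = {
    "vendor_name": "Acme Corp",
    "invoice_number": None,
    "total_amount": 1250.00,
    "due_date": None,
}
required = ["vendor_name", "invoice_number", "total_amount"]

missing_required, missing_optional = find_missing(result, required)
if missing_required:
    print("needs review:", missing_required)
```

Missing optional fields can usually pass through as null, while missing required fields are a natural trigger for validation or human review.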

Example schemas

Invoice

{
  "type": "object",
  "properties": {
    "vendor_name": { "type": "string", "description": "Name of the vendor or company issuing the invoice" },
    "invoice_number": { "type": "string", "description": "Invoice or reference number" },
    "invoice_date": { "type": "string", "format": "date", "x-docpipe": { "outputFormat": "iso" }, "description": "Date the invoice was issued" },
    "due_date": { "type": "string", "format": "date", "x-docpipe": { "outputFormat": "iso" }, "description": "Payment due date" },
    "subtotal": { "type": "number", "description": "Subtotal before tax" },
    "tax_amount": { "type": "number", "description": "Total tax amount" },
    "total_amount": { "type": "number", "description": "Grand total including tax" },
    "currency": { "type": "string", "description": "Currency code (e.g., USD, EUR, GBP)" },
    "line_items": {
      "type": "array",
      "description": "Invoice line items",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "description": "Item or service description" },
          "quantity": { "type": "number", "description": "Quantity" },
          "unit_price": { "type": "number", "description": "Price per unit" },
          "amount": { "type": "number", "description": "Line total" }
        }
      }
    }
  },
  "required": ["vendor_name", "invoice_number", "total_amount"]
}

Receipt

{
  "type": "object",
  "properties": {
    "merchant_name": { "type": "string", "description": "Name of the store or merchant" },
    "transaction_date": { "type": "string", "format": "date", "x-docpipe": { "outputFormat": "iso" }, "description": "Date of purchase" },
    "total": { "type": "number", "description": "Total amount paid" },
    "payment_method": { "type": "string", "description": "Payment method used (cash, credit card, etc.)" },
    "items": {
      "type": "array",
      "description": "Purchased items",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Item name" },
          "price": { "type": "number", "description": "Item price" }
        }
      }
    }
  },
  "required": ["merchant_name", "total"]
}

Related

Extract action

Configuration reference for the extract node

Extract best practices

Engine choice, confidence, and review patterns

Validation action

Enforce required fields and formats after extraction

Review action

Pause runs with unverified fields for human review
