
Selectors, Extractors, and Validation Rules

SpecD schemas work with the content of your artifact files — not just their existence. To do that, schemas need a way to describe exactly which part of a file they care about: a heading, a list item, a YAML key, a specific sequence entry. That is what selectors do.

Selectors, extractors, and validation rules are three layers that build on each other:

  • A selector identifies one or more nodes in a file's parsed structure.
  • An extractor wraps a selector with post-processing instructions: what text to pull out, how to clean it up, and optionally how to map it into structured objects.
  • A validation rule wraps a selector with a structural assertion: does this node exist? does its content match a pattern?

All three are used across schemas — in validations, deltaValidations, and metadataExtraction. Understanding selectors is the foundation for all of them.


How artifacts are parsed

When SpecD reads an artifact file, it parses it into a tree of typed nodes — an Abstract Syntax Tree (AST). Each node has a type, a label (the identifying text), and a value (the content).

The node types available depend on the file's format:

Format | Node types
Markdown | document, section, paragraph, list, list-item, code-block, thematic-break
JSON | document, object, property, array, array-item
YAML | document, mapping, pair, sequence, sequence-item
Plain text | document, paragraph, line

The label is what a selector's matches field is tested against: the heading text for a Markdown section, the key name for a YAML pair or JSON property. The value is what a selector's contains field is tested against: the paragraph text for a Markdown paragraph, the scalar value for a YAML pair.

Markdown example

Given this Markdown:

## Requirements

### Requirement: Authentication

Users must be able to log in with email and password.

The AST contains:

  • A section node with label: "Requirements" at level 2
    • A section node with label: "Requirement: Authentication" at level 3
      • A paragraph node with value: "Users must be able to log in with email and password."

YAML example

Given this YAML:

workflow:
  - name: Run tests
    command: pnpm test
  - name: Deploy
    command: ./deploy.sh

The AST contains:

  • A mapping at document root
    • A pair with label: "workflow"
      • A sequence containing:
        • A sequence-item (an object) with child pairs
          • A pair with label: "name", value: "Run tests"
          • A pair with label: "command", value: "pnpm test"
        • A sequence-item (an object) with child pairs
          • A pair with label: "name", value: "Deploy"
          • ...

Selectors

A selector is an object that describes the nodes to match. All selector fields are optional except type.

Selector fields

Field | Type | Description
type | string | The node type to match. Required. Must be one of the type names for the target format (see table above).
matches | string | Regex matched case-insensitively against the node's label. Plain strings match anywhere in the label; use anchors for exact matches.
contains | string | Regex matched case-insensitively against the node's value. Useful for finding leaf nodes by content rather than name.
parent | selector | Constrains the search to nodes whose nearest ancestor matches this selector. Used to disambiguate nodes with the same label at different levels.
index | integer | For array-item and sequence-item nodes: targets the item at this zero-based position. Mutually exclusive with where.
where | object | For array-item and sequence-item nodes that are objects: targets the item whose fields all match the given key–value pairs. Values are case-insensitive regexes. Mutually exclusive with index.
level | integer | For Markdown section nodes only: matches sections at exactly this heading level (1 = #, 2 = ##, etc.).

Matching with matches

matches is a case-insensitive regular expression tested against the node's label. Without anchors, it matches anywhere in the label text.

Pattern | What it matches
'Login' | Any label containing Login (e.g. Login, User Login, Login flow)
'^Requirements$' | Exactly the string Requirements
'^Requirement:' | Any label starting with Requirement:
'_url$' | Any label ending with _url
'^Scenario: .+ \(skipped\)$' | Labels like Scenario: Login (skipped)
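contains follows the same regex rules as matches, but it is tested against the node's value instead of its label. The two fragments below are a minimal sketch and are not taken from schema-std. The first assumes a hypothetical Markdown file that mentions deprecations. The second targets the workflow YAML shown earlier and assumes that matches and contains may be combined on one selector, which the field table suggests but does not state outright.

```yaml
# Hypothetical: any Markdown paragraph whose text mentions "deprecated"
# (case-insensitive and unanchored, so "Deprecated since v2" also matches)
selector:
  type: paragraph
  contains: 'deprecated'

# Hypothetical: the YAML pair whose key is exactly "command"
# and whose scalar value starts with "pnpm"
selector:
  type: pair
  matches: '^command$'
  contains: '^pnpm\b'
```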

Narrowing with parent

When the same heading appears at multiple levels of a document, parent disambiguates. Without parent, all matching nodes are returned regardless of where they appear.

# Matches every section whose label starts with "Requirement:"
# anywhere in the document, including sections nested under other headings
selector:
  type: section
  matches: '^Requirement:'

# Matches only those sections nested directly under the Requirements section
selector:
  type: section
  matches: '^Requirement:'
  parent:
    type: section
    matches: '^Requirements$'

Targeting by position with index

For ordered collections (array-item, sequence-item), index selects a specific item by its zero-based position:

# The first item in a sequence
selector:
  type: sequence-item
  index: 0

Targeting by field values with where

For object-typed sequence or array items, where matches items whose named fields satisfy all conditions. Values are case-insensitive regexes:

# The sequence item whose "op" field matches "added" or "modified"
selector:
  type: sequence-item
  where:
    op: 'added|modified'

# The sequence item whose "name" field is exactly "Run tests"
selector:
  type: sequence-item
  where:
    name: '^Run tests$'
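Because where requires every listed field to match, adding a second key narrows the selection further. The fragment below is a sketch against the workflow YAML from the parsing section. It is illustrative only and does not come from schema-std.

```yaml
# Matches only the item whose name is exactly "Deploy"
# AND whose command ends with "deploy.sh"
selector:
  type: sequence-item
  where:
    name: '^Deploy$'
    command: 'deploy\.sh$'
```

An item that satisfies only one of the two conditions is not selected.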

Targeting by heading level with level

In Markdown, level restricts a section selector to a specific heading depth:

# Only the H1 heading (the document title)
selector:
  type: section
  level: 1

# Only H3 headings, not H2 or H4
selector:
  type: section
  level: 3
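level can presumably be combined with matches when the same heading text may recur at several depths. This sketch assumes that all fields on a selector must hold at once, which the field table implies but does not spell out.

```yaml
# Only an H2 heading that reads exactly "Requirements";
# an H3 or H4 with the same text would be ignored
selector:
  type: section
  level: 2
  matches: '^Requirements$'
```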

Practical selector examples

These examples come directly from the standard schema (schema-std).

Selecting the document title (H1 heading)

selector:
  type: section
  level: 1

Matches the H1 heading of a Markdown spec file. A spec typically has only one; if there are several, the selector matches each of them.

Selecting a section by exact heading name

selector:
  type: section
  matches: '^Purpose$'

Matches a section whose heading is exactly Purpose. Case-insensitive, so purpose and PURPOSE also match.

Selecting all Requirement sections under Requirements

selector:
  type: section
  matches: '^Requirement:'
  parent:
    type: section
    matches: '^Requirements$'

Matches every ### Requirement: ... heading that is a direct child of the ## Requirements section. Without the parent constraint, this would also match any Requirement: heading that appears elsewhere in the document.

Selecting list items within a section

selector:
  type: list-item
  parent:
    type: section
    matches: '^Constraints$'

Matches every bullet point under a ## Constraints heading.

Selecting a YAML sequence item by field value

selector:
  type: sequence-item
  where:
    op: 'added|modified'

Matches every sequence item (object) whose op field is added or modified. Used in deltaValidations to assert that added or modified delta entries meet structural requirements.


Extractors: pulling data from the AST

An extractor pairs a selector with instructions for what to do with the matched nodes. Extractors are used in metadataExtraction to pull structured content out of artifact files.

Every extractor has a selector field plus a post-processing pipeline. The pipeline runs in this order:

  1. Select nodes using selector
  2. Extract text using extract
  3. Strip unwanted patterns using strip
  4. Capture a portion of the text using capture
  5. Transform the result using a named callback with transform
  6. Group results using groupBy

Or, for complex structured data:

  • Map to objects using fields
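To make the ordering concrete, here is a hypothetical extractor that uses three of the steps at once. The "(draft)" suffix is an invented convention for illustration and is not part of schema-std. The comments trace one label through the steps in the order listed above.

```yaml
extractor:
  selector:
    type: section
    matches: '^Requirement:'
  extract: label                     # step 2: "Requirement: Authentication (draft)"
  strip: '\s*\(draft\)$'             # step 3: "Requirement: Authentication"
  capture: '^Requirement:\s*(.+)$'   # step 4: "Authentication"
```

Because strip runs before capture, the capture group never sees the suffix.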

Extract modes

The extract field controls what text is pulled from each matched node:

Value | What it returns
'content' | The full rendered text of the node's subtree (default). For a Markdown section, this is everything below the heading.
'label' | Only the node's identifying text: the heading text, key name, or item label.
'both' | The label followed by the content, separated.
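A sketch of the both mode, for cases where the heading and the body are wanted together. The selector is illustrative. The exact separator between label and content is not specified here, so no output is shown.

```yaml
extractor:
  selector:
    type: section
    matches: '^Constraints$'
  extract: both   # heading text followed by the section body
```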

Simple extraction: a section's content

extractor:
  selector:
    type: section
    matches: '^Purpose$'
  extract: content

Returns the full text of everything inside the Purpose section — paragraphs, lists, everything below the heading.

Extracting just the label

extractor:
  selector:
    type: section
    level: 1
  extract: label

Returns the heading text of the H1 section — the document title — without any of the section's content.

Stripping a prefix from labels

When labels contain a structured prefix you want to remove:

extractor:
  selector:
    type: section
    matches: '^Requirement:'
    parent:
      type: section
      matches: '^Requirements$'
  extract: label
  strip: '^Requirement:\s*'

Each matched label (Requirement: Authentication, Requirement: Session expiry) has the Requirement: prefix removed, leaving Authentication and Session expiry.

Capturing a portion of the text

capture is a regex with a capture group. Only the captured portion is kept:

extractor:
  selector:
    type: section
    matches: '^Spec Dependencies$'
  extract: content
  capture: '(?:^|\n)\s*-\s+(?:\[`?|`)?([^`\]\n]+?)(?:(?:`?\]\(([^)]+)\)|`)|(?=\s*(?:—|$)))'
  transform:
    name: resolveSpecPath
    args: ['$2']

This extracts dependency entries from the Spec Dependencies section, capturing the visible label as the primary value and the optional href as $2. resolveSpecPath first tries the label itself (for canonical entries like [`core:core/config`](../config/spec.md) or `core:core/config`), then falls back to the captured href when the label is legacy text.

Grouping results by label

groupBy: label groups matched nodes by their label (after strip is applied). This is useful when the same heading pattern repeats and you want the results keyed by name:

extractor:
  selector:
    type: section
    matches: '^Requirement:'
    parent:
      type: section
      matches: '^Requirements$'
  groupBy: label
  strip: '^Requirement:\s*'
  extract: content

Instead of returning a flat array of content strings, this returns an object keyed by stripped label:

{
  "Authentication": "Users must be able to log in...",
  "Session expiry": "Sessions expire after 30 minutes..."
}

Structured extraction with fields

For complex data like BDD scenarios, a single matched node needs to produce a structured object with multiple named fields. The fields option does this: each key in fields declares how to populate one property of the output object.

Each field mapping can use:

Field | Description
from | Source of the value: 'label' (the matched node's label), 'parentLabel' (the parent node's label), or 'content' (the rendered subtree).
childSelector | A selector applied within the matched node to find child nodes for this field.
capture | Regex with a capture group applied to the extracted text.
strip | Regex removed from the extracted text.
followSiblings | Regex matching sibling nodes that follow a childSelector match. Enables sequential grouping; see below.

Example: extracting BDD scenarios

The standard schema extracts structured scenario objects from verify.md files. Each scenario section (#### Scenario: ...) under a requirement section needs to produce an object with name, requirement, given, when, and then fields.

The verify.md looks like this:

### Requirement: Authentication

#### Scenario: Valid login

- GIVEN the user is on the login page
- WHEN the user enters valid credentials
- AND the user submits the form
- THEN the user is redirected to the dashboard
- AND a session cookie is set

The extractor:

extractor:
  selector:
    type: section
    matches: '^Scenario:'
    parent:
      type: section
      matches: '^Requirement:'
  fields:
    name:
      from: label
      strip: '^Scenario:\s*'
    requirement:
      from: parentLabel
      strip: '^Requirement:\s*'
    given:
      childSelector: { type: list-item, matches: '^GIVEN\b' }
      capture: '^GIVEN\s+(.+)'
      followSiblings: '^(?:AND|OR)\b'
    when:
      childSelector: { type: list-item, matches: '^WHEN\b' }
      capture: '^WHEN\s+(.+)'
      followSiblings: '^(?:AND|OR)\b'
    then:
      childSelector: { type: list-item, matches: '^THEN\b' }
      capture: '^THEN\s+(.+)'
      followSiblings: '^(?:AND|OR)\b'

Each matched Scenario: section produces one object. For the example above:

{
  "name": "Valid login",
  "requirement": "Authentication",
  "given": ["the user is on the login page"],
  "when": ["the user enters valid credentials", "the user submits the form"],
  "then": ["the user is redirected to the dashboard", "a session cookie is set"]
}

The followSiblings pattern

followSiblings enables collecting AND/OR continuation lines that follow a primary clause. When a childSelector matches a node (e.g. WHEN the user enters valid credentials), followSiblings matches the subsequent sibling nodes that belong to that clause (AND the user submits the form). The matched siblings are appended to the preceding field's result array.

If followSiblings contains a capture group, only the captured portion is used. Without a capture group, the full node text is returned. The collection stops when a sibling does not match followSiblings or when another field's childSelector matches.
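If the continuation keyword should be dropped from the collected text, the rule above suggests adding a capture group to followSiblings. The fragment below is a hypothetical variant of the when field, not the schema-std definition.

```yaml
when:
  childSelector: { type: list-item, matches: '^WHEN\b' }
  capture: '^WHEN\s+(.+)'
  # The group keeps only the text after AND/OR, e.g.
  # "AND the user submits the form" -> "the user submits the form"
  followSiblings: '^(?:AND|OR)\s+(.+)'
```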


Validation rules

Validation rules assert structural constraints on artifact content. They appear in validations (checked against the artifact) and deltaValidations (checked against delta files before application).

Validation rule fields

Field | Type | Description
id | string | Optional identifier for this rule. Useful for error messages and for targeting with overrides.
selector | selector | Identifies the node(s) to validate.
path | string | JSONPath expression for targeting a value in JSON or YAML artifacts. Alternative to selector.
required | boolean | Whether the matched node must exist. Defaults to true. When false, the rule produces a warning rather than a hard failure if the node is absent.
contentMatches | string | Regex the rendered node content must match.
children | array | Nested validation rules evaluated against the matched node's children.

Asserting a section exists

validations:
  - id: has-purpose
    selector:
      type: section
      matches: '^Purpose$'
    required: true

Fails if no Purpose section is found in the artifact.
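For JSON or YAML artifacts, a similar existence check can be written with path in place of selector. This is a minimal sketch, assuming a hypothetical artifact with a top-level version key. The rule ID and the key name are illustrative and do not come from schema-std.

```yaml
validations:
  - id: has-version
    path: '$.version'   # JSONPath: the "version" key at the document root
    required: true
```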

Asserting nested structure

children rules are evaluated within each node matched by the parent rule. This is how you assert that a section not only exists but also contains the right structure:

validations:
  - id: has-requirements
    selector:
      type: section
      matches: '^Requirements$'
    required: true
    children:
      - id: has-requirement-block
        selector:
          type: section
          matches: '^Requirement:'
        required: true
        children:
          - id: has-scenario
            selector:
              type: section
              matches: '^Scenario:'
            required: true

This enforces the full structure: a Requirements section must exist, it must contain at least one Requirement: subsection, and each Requirement: subsection must contain at least one Scenario: subsection.

Asserting content with contentMatches

deltaValidations:
  - id: added-has-scenario
    selector:
      type: sequence-item
      where:
        op: 'added|modified'
    contentMatches: '#### Scenario:'
    required: true

Every delta entry with op: added or op: modified must have content that includes a #### Scenario: heading. This is how the standard schema enforces that verification scenarios are included when specs change.

required: false for warnings

When required: false, the absence of a matching node produces a warning rather than blocking validation:

validations:
  - id: has-spec-dependencies
    selector:
      type: section
      matches: '^Spec Dependencies$'
    required: false

Use this for recommended but not mandatory structure.


Metadata extraction configuration

metadataExtraction is a top-level field in schema.yaml that wires extractors to named metadata categories. SpecD uses these declarations to pull structured information from artifact files — for context compilation, impact analysis, and tooling.

The configuration is keyed by category name:

Scalar categories (single extractor entry):

Category | Description
title | The spec title, typically the H1 heading.
description | A prose description, typically the Purpose section.
dependsOn | Dependency spec paths extracted from link references.
keywords | Keyword terms.

Array categories (one or more extractor entries):

Category | Description
rules | Structured rule groups extracted from the spec.
constraints | Constraint strings.
scenarios | Structured scenario objects from the verify artifact.
context | Always-included context content.

Each entry has three fields:

Field | Required | Description
id | no | Identifier for this entry. Used by schemaOverrides to target the entry for modification.
artifact | yes | The artifact type ID to extract from (e.g. 'specs', 'verify').
extractor | yes | The extraction configuration: a selector plus post-processing fields.
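As a sketch of how the three fields fit together in a scalar category, a keywords entry might look like the fragment below. The Keywords section name and the spec-keywords ID are made-up conventions for illustration, not something schema-std is known to define.

```yaml
metadataExtraction:
  keywords:
    id: spec-keywords          # optional; lets schemaOverrides target this entry
    artifact: specs            # artifact type ID to read from
    extractor:
      selector:
        type: list-item
        parent: { type: section, matches: '^Keywords$' }
      extract: label           # the text of each bullet
```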

Complete example: the standard schema's metadataExtraction

This is the full metadataExtraction configuration from schema-std, with explanations:

metadataExtraction:
  # Pull the spec title from the H1 heading (label only, no content)
  title:
    artifact: specs
    extractor:
      selector: { type: section, level: 1 }
      extract: label

  # Pull the spec description from the Purpose section's content
  description:
    artifact: specs
    extractor:
      selector: { type: section, matches: '^Purpose$' }
      extract: content

  # Extract dependency spec paths from the entries in Spec Dependencies
  # capture takes each entry's visible label (and the link target as $2),
  # transform resolves them to spec IDs
  dependsOn:
    artifact: specs
    extractor:
      selector: { type: section, matches: '^Spec Dependencies$' }
      extract: content
      capture: '(?:^|\n)\s*-\s+(?:\[`?|`)?([^`\]\n]+?)(?:(?:`?\]\(([^)]+)\)|`)|(?=\s*(?:—|$)))'
      transform:
        name: resolveSpecPath
        args: ['$2']

  # Extract all Requirement sections, keyed by requirement name
  # strip removes the "Requirement: " prefix from each key
  rules:
    - id: spec-requirements
      artifact: specs
      extractor:
        selector:
          type: section
          matches: '^Requirement:'
          parent: { type: section, matches: '^Requirements$' }
        groupBy: label
        strip: '^Requirement:\s*'
        extract: content

  # Extract bullet points from the Constraints section
  constraints:
    - id: spec-constraints
      artifact: specs
      extractor:
        selector:
          type: list-item
          parent: { type: section, matches: '^Constraints$' }
        extract: label

  # Extract structured scenario objects from the verify artifact
  scenarios:
    - id: verify-scenarios
      artifact: verify
      extractor:
        selector:
          type: section
          matches: '^Scenario:'
          parent: { type: section, matches: '^Requirement:' }
        fields:
          name: { from: label, strip: '^Scenario:\s*' }
          requirement: { from: parentLabel, strip: '^Requirement:\s*' }
          given:
            childSelector: { type: list-item, matches: '^GIVEN\b' }
            capture: '^GIVEN\s+(.+)'
            followSiblings: '^(?:AND|OR)\b'
          when:
            childSelector: { type: list-item, matches: '^WHEN\b' }
            capture: '^WHEN\s+(.+)'
            followSiblings: '^(?:AND|OR)\b'
          then:
            childSelector: { type: list-item, matches: '^THEN\b' }
            capture: '^THEN\s+(.+)'
            followSiblings: '^(?:AND|OR)\b'

  # Always include the Purpose section in compiled context
  context:
    - id: spec-overview
      artifact: specs
      extractor:
        selector: { type: section, matches: '^Purpose$' }
        extract: content

Where to go next