Structure documents for AI consumption, not just human reading

Most institutional documents were written for human readers. They use complex formatting, embedded images, headers and footers, branded covers, proprietary file formats, track-changes markers, comments, and styling that signals status or context at a glance. A person reading the document uses all of that. An AI reading the document has to strip it away before the substance becomes usable, and that stripping is noisy, lossy, and expensive.

The working rule is that documents intended to be AI-accessible should be structured for AI consumption, not for human reading alone. In practice, that means plain-text formats with minimal, explicit structure — Markdown being the current default, because it is human-readable, AI-readable, version-controllable, searchable, and trivially convertible into presentation formats when human reading is needed.

What “structure for AI” actually means

Four characteristics separate AI-friendly documents from human-first ones.

Plain text as the substrate. No proprietary formats, no rich-text encoding, no layout-as-meaning. The semantic content and the formatting are separable, and the formatting does not carry information the AI cannot recover from the text alone.

Explicit structure. Headings that mean what they say. Lists that are lists. Tables that are tables. Metadata at the top of the file, not inferred from position on the page. The goal is that the document is unambiguously parseable without visual interpretation.

Small, focused files rather than monoliths. An AI working with a 200-page master document struggles to locate the relevant passage; the same content split into focused documents with clear titles is far more useful. This is a point about retrieval as much as about format.

Consistent terminology and structure across the corpus. The same concept referred to by different names in different documents creates friction for retrieval and reasoning; so do identical headings that mean different things in different contexts. Consistency is cheap to produce at authoring time and expensive to retrofit, which means the standards for terminology, document structure and metadata pay off disproportionately once AI is operating over the whole body of material.
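A minimal sketch of what explicit structure buys, assuming a Markdown file that opens with a simple key: value metadata block fenced by --- lines (a common convention, not one this note prescribes). The function names and sample fields are hypothetical. The point is that both the metadata and the section boundaries can be recovered from the text alone, with no visual interpretation.

```python
def parse_front_matter(text):
    """Split a document into (metadata dict, body) using a leading '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no metadata block: everything is body
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing fence found: the rest of the file is the body.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return {}, text  # unterminated block: treat the whole file as body


def split_by_heading(body, marker="## "):
    """Return (heading, content) pairs, one per heading that starts with marker."""
    sections = []
    heading, buffer = None, []
    for line in body.splitlines():
        if line.startswith(marker):
            if heading is not None:
                sections.append((heading, "\n".join(buffer).strip()))
            heading, buffer = line[len(marker):].strip(), []
        elif heading is not None:
            buffer.append(line)
    if heading is not None:
        sections.append((heading, "\n".join(buffer).strip()))
    return sections


# Hypothetical sample document, for illustration only.
SAMPLE = """---
title: Expense policy
status: current
owner: Finance
---
## Scope
Applies to all staff.

## Approval limits
Managers approve up to the limit set by Finance.
"""

meta, body = parse_front_matter(SAMPLE)
sections = split_by_heading(body)
```

Here meta holds the three fields as a dictionary and sections holds one entry per heading, which is also the natural unit for splitting a monolith into smaller, focused files.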

Three properties of an AI-retrievable document store

The four document-level characteristics above are necessary but not sufficient. Once AI is retrieving across a whole document store, three corpus-level properties also matter, and they compound — getting two right without the third still leaves retrieval fragile.

Uniqueness. For any single canonical document — a current policy, a current procedure, a current form — the search should return exactly one authoritative hit. The common failure mode is the same document appearing in three to five places at different vintages, with no signal about which is current. Consolidation at the corpus level fixes this; better search alone does not.

Filterability by intent. When a query lands in the corpus, the AI needs to narrow fast: is this a current policy or an archived one, an approved version or a draft, an organisational document or a personal copy. This is what consistent metadata and sensitivity labels are for. Without them the AI is left to infer from semantic content alone, and it will surface old or personal-copy versions alongside current ones because the surface text often looks similar.

Predictable location. If retrieval can expect a current policy to live at a specific location and there is a discoverable convention for that, search can be directed rather than purely semantic. Directed retrieval is both faster and more accurate than semantic ranking across a noisy corpus.
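The three properties can be sketched together, assuming each document carries a small metadata record. The field names (doc_id, status, scope, path), the sample catalogue, and the policies/current/ location convention are all hypothetical, chosen only to show the shape of the checks.

```python
from collections import Counter

# Hypothetical catalogue entries; field names and paths are illustrative only.
CATALOGUE = [
    {"doc_id": "expense-policy", "status": "current", "scope": "organisation",
     "path": "policies/current/expense-policy.md"},
    {"doc_id": "expense-policy", "status": "archived", "scope": "organisation",
     "path": "policies/archive/2021/expense-policy.md"},
    {"doc_id": "expense-policy", "status": "current", "scope": "personal",
     "path": "users/jsmith/expense-policy-copy.md"},
    {"doc_id": "travel-policy", "status": "draft", "scope": "organisation",
     "path": "policies/drafts/travel-policy.md"},
]


def filter_by_intent(entries, status="current", scope="organisation"):
    """Filterability: narrow on metadata before any semantic ranking happens."""
    return [e for e in entries if e["status"] == status and e["scope"] == scope]


def uniqueness_violations(entries):
    """Uniqueness: doc_ids with more than one current organisational copy."""
    counts = Counter(e["doc_id"] for e in filter_by_intent(entries))
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


def expected_path(doc_id):
    """Predictable location: the assumed convention for current policies."""
    return f"policies/current/{doc_id}.md"


def directed_lookup(entries, doc_id):
    """Look in the conventional location first; None if nothing lives there."""
    target = expected_path(doc_id)
    for entry in filter_by_intent(entries):
        if entry["path"] == target:
            return entry
    return None
```

With this catalogue, filtering leaves a single expense-policy entry, the uniqueness check reports nothing, and the directed lookup finds the current policy without ranking the archived or personal copies at all. A draft-only document such as travel-policy returns None, which is the signal to fall back to broader search.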

The single highest-leverage corpus-level move is also the cheapest. For the top fifty canonical documents, ask the owner to write a two-sentence summary in a metadata field that the search engine and the AI can read without opening the document. This is roughly a one-day exercise, and it pays off on every subsequent retrieval, because each retrieval over those documents now starts with a curator-written description of what the document actually contains rather than with a fragment of header text the search engine happened to match.
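As a rough illustration of why the summary field helps, the sketch below ranks documents using only owner-written summaries, without opening any document body. The summaries, the document ids, and the word-overlap scoring are assumptions made for the example; a real search engine would use its own ranking.

```python
import re

# Hypothetical owner-written summaries, as they might sit in a metadata field.
SUMMARIES = {
    "expense-policy": "Sets approval limits and receipt rules for staff expenses. "
                      "Applies to all employees and contractors.",
    "travel-policy": "Covers booking rules for flights and hotels. "
                     "Explains when pre-approval for travel is required.",
}


def tokens(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_by_summary(query, summaries):
    """Order documents by how many query words appear in their summary."""
    query_words = tokens(query)
    scored = [(len(query_words & tokens(summary)), doc_id)
              for doc_id, summary in summaries.items()]
    # Highest overlap first; ties broken alphabetically; zero-overlap dropped.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]
```

Calling rank_by_summary("approval limits for expenses", SUMMARIES) puts expense-policy first, because the match is made against a description of what the document contains rather than against whatever header text happens to share a word with the query.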

The outbound case

The same heuristic applies to client deliverables, not only to internal knowledge. A growing share of inbound material at mid-tier firms is first read by a client’s AI before any human on their side engages (see The first reader is an AI), and when that is the consumption pattern, the deliverable has to survive it. A twenty-page memo with the critical caveats buried in footnotes will be summarised badly. A deck that carries its meaning in visual design rather than in explicit text will lose its substance when it is abstracted. A recommendation spread across several paragraphs, which an AI will compress, may arrive at the human reader as a different recommendation from the one that was sent.

The practical response is the same one the internal-knowledge case calls for — cleaner document structure, explicit headings, content that does not rely on visual design to carry meaning — applied now to work the firm is sending out rather than only to work it holds internally. For consequential deliverables, a machine-readable version alongside the formatted one is worth the small additional effort, because it puts the firm’s own framing into the client’s AI context rather than leaving the AI to reconstruct it from the formatted version.

How to apply the heuristic

Not every document in an organisation needs to be converted. The high-leverage targets are the documents that will repeatedly sit in AI context: frequently referenced policies, current client context, methodology and practice standards, worked examples, decision history. Marketing collateral and historical archives can stay as they are.

A more recent instance of the heuristic is the AI skill bundle — a packaged corpus built to be loaded into an AI’s working context. The bundle’s shape rehearses the same rules: a metadata-bearing entry file, separated reference files small enough to be retrieved selectively, canonical assets attached rather than re-typed, explicit registry tables where multiple alternatives need disambiguation. The format works because it was designed for AI traversal from the start; the same discipline applied retroactively to an existing knowledge corpus produces equivalent results, with more friction.

The broader implication is that format is a knowledge management variable, not a cosmetic one. The same content in two different forms produces different AI utility. Firms investing in knowledge management without thinking about format are leaving most of the gain on the table (see Useful AI is a context problem).