What data do I need to give an AI agent for it to be useful?
For most small-business AI agents, the answer is less than people assume: a few hundred pages of reasonably structured docs (SOPs, FAQs, customer history, price book) is usually plenty. The model handles general knowledge. Your data covers the parts the model can’t guess: your terminology, your prices, your customer history, your specific rules.
A frequent mistake is to assume an AI agent needs the same kind of training set as a traditional machine learning model: millions of examples. That was true in 2018; it isn’t in 2026. Foundation models bring the general intelligence; you bring the specific knowledge. The minimum useful corpus for a small-business agent is roughly whatever would let a new employee do the job: a written SOP, an FAQ, the price book, and a sample of past customer interactions.
In practice the assets that matter most are: (1) Any document that answers "what do we do when..."; these are the rule-bearing docs. (2) Your last 6–12 months of customer interactions (emails, transcripts), for tone calibration and edge cases. (3) Your structured data (price book, service catalog, hours, locations), in any format. (4) Your boundaries: what the agent should refuse to do, route to a human, or flag for review. Most small businesses already have some version of (1)–(3); (4) is usually new and worth writing down.
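To make item (4) concrete, here is one way boundaries could be written down as data rather than left in someone's head. This is a minimal sketch, not a recommended implementation: the `Boundary` class, the action names, the keywords, and the `route` function are all hypothetical, and simple keyword matching stands in for whatever matching a real build would use.

```python
from dataclasses import dataclass

# Hypothetical actions an agent can take when a boundary rule matches.
REFUSE = "refuse"
ESCALATE = "escalate"
FLAG = "flag_for_review"
ANSWER = "answer"


@dataclass(frozen=True)
class Boundary:
    """One written-down rule: if any keyword appears, take the action."""
    keywords: tuple[str, ...]
    action: str
    note: str


# Example rules only; a real business would write its own.
BOUNDARIES = [
    Boundary(("refund", "chargeback"), ESCALATE, "Money back goes to a human."),
    Boundary(("legal advice", "lawsuit"), REFUSE, "The agent does not give legal advice."),
    Boundary(("discount",), FLAG, "Answer, but queue for owner review."),
]


def route(message: str, boundaries: list[Boundary] = BOUNDARIES) -> str:
    """Return the action for the first boundary whose keyword appears in the message."""
    text = message.lower()
    for rule in boundaries:
        if any(keyword in text for keyword in rule.keywords):
            return rule.action
    # No boundary matched, so the agent may answer normally.
    return ANSWER
```

Calling `route("Can I get a refund for last week?")` returns `"escalate"`, while a question about opening hours falls through to `"answer"`. The point of the sketch is the list itself: once the rules exist in writing, they can be reviewed and changed without touching the rest of the agent.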
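What you don’t need: a labeled dataset, training infrastructure, or a data scientist. The "training" in 2026 happens at the prompt and retrieval layer, not in the model: the relevant documents are looked up and placed in the prompt when a question comes in, and the model’s weights are left unchanged.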
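As a rough illustration of what "prompt and retrieval layer" means, the sketch below picks the most relevant snippets for a question and assembles them into a prompt. It is a toy under stated assumptions: the sample documents are invented, the word-overlap ranking is a stand-in for the embedding-based search a production system would more likely use, and no model is actually called.

```python
import re

# Hypothetical snippets standing in for a business's SOPs, FAQ, and price book.
DOCS = [
    "Price book: standard drain cleaning is $149; after-hours calls add $75.",
    "SOP: if a customer reports a gas smell, tell them to leave and call the utility.",
    "FAQ: we are open Monday to Friday, 8am to 5pm.",
]


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by how many words they share with the question; keep the top k."""
    q = tokens(question)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    # Drop docs that share no words with the question at all.
    return [d for d in ranked[:k] if q & tokens(d)]


def build_prompt(question: str, docs: list[str]) -> str:
    """Assemble the text that would be sent to a model; no model is called here."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (
        "Answer using only the business context below. "
        "If the context does not cover the question, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

For the question "How much is drain cleaning?", only the price-book snippet shares words with the question, so it is the only context placed in the prompt. Updating a price then means editing a document, not retraining anything.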
Key facts
- Typical useful corpus for a small-business agent: 100–500 pages of mixed docs.
- Customer interaction history (last 6–12 months) calibrates tone and edge-case handling.
- Boundaries (what to refuse, what to escalate) are usually the most overlooked input.
- Structured data (price books, service catalogs) often matters more than long prose.
Common follow-ups
What if my docs are bad?
They probably are; most small-business docs are scattered or stale. The build often includes a doc-cleanup phase. Sometimes the cleanup itself is the most valuable part of the engagement, because AI exposes the gaps your team has been working around.
Will my data be used to train the model?
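Not by default on the paid API tiers of OpenAI, Anthropic, or Google as of 2026: their stated policy is not to train on API customer data unless you opt in. That is not the same as zero data retention. Inputs may still be stored for a limited period (for example, for abuse monitoring), and zero retention is typically a separate arrangement you have to request. On the consumer tiers (free ChatGPT, free Claude.ai) terms vary; check before pasting sensitive content.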