9 min read

Prompt Engineering Beyond the Basics: Patterns That Scale

Advanced prompt engineering techniques I use in production, from chain-of-thought to structured output, few-shot calibration, and self-consistency.

AI
LLMs
Prompt Engineering

Most prompt engineering advice stops at "be specific" and "give examples." That gets you 60% of the way. This post covers the patterns that handle the other 40%: the techniques I rely on in production AI systems where reliability matters.

Chain-of-Thought Is Not Optional

For any task requiring reasoning, such as classification with edge cases, multi-step calculations, or code generation, explicitly instructing the model to think step-by-step before answering improves accuracy dramatically. But the real trick is structured chain-of-thought: tell the model exactly what steps to reason through.

You are a code review assistant. For each code snippet:

1. IDENTIFY the programming language and framework
2. LIST any bugs or logic errors (be specific about line references)
3. ASSESS security vulnerabilities (injection, auth, data exposure)
4. SUGGEST performance improvements with estimated impact
5. RATE overall quality: Critical / Needs Work / Acceptable / Good

Think through each step before providing your final assessment.
Format your response as JSON matching this schema:
{
  "language": string,
  "bugs": [{"line": number, "issue": string, "severity": string}],
  "security": [{"type": string, "description": string}],
  "performance": [{"suggestion": string, "impact": "low"|"medium"|"high"}],
  "rating": string
}
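A prompt like this can be sent as a chat-completions request with JSON mode enabled so the model is constrained to the schema. Here is a minimal sketch of the request payload; the helper name `build_review_request` and the model string are illustrative, and the actual client call is left out:

```python
REVIEW_SYSTEM_PROMPT = (
    "You are a code review assistant. For each code snippet:\n"
    "1. IDENTIFY the programming language and framework\n"
    "2. LIST any bugs or logic errors (be specific about line references)\n"
    "3. ASSESS security vulnerabilities (injection, auth, data exposure)\n"
    "4. SUGGEST performance improvements with estimated impact\n"
    "5. RATE overall quality: Critical / Needs Work / Acceptable / Good\n"
    "Think through each step before providing your final assessment.\n"
    "Format your response as JSON matching the agreed schema."
)

def build_review_request(snippet: str, model: str = "gpt-4o") -> dict:
    """Assemble a chat-completions request body for one code review."""
    return {
        "model": model,
        # JSON mode: the model must emit syntactically valid JSON
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": REVIEW_SYSTEM_PROMPT},
            {"role": "user", "content": snippet},
        ],
    }
```

The payload dict maps directly onto the OpenAI-style chat API, but the same shape works with any client that accepts a system/user message list.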

Few-Shot Calibration

Examples in prompts aren't just for showing format. They calibrate the model's behavior. Choose examples that cover edge cases and boundary conditions, not just happy paths. Three well-chosen examples beat ten generic ones. I always include at least one example where the correct answer is "I don't know" or "not applicable" to prevent the model from always producing an answer.
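As a concrete sketch, here is how I lay out few-shot examples as alternating user/assistant turns for a hypothetical support-ticket classifier. The labels and the `build_messages` helper are made up for illustration; the key detail is the third example, whose correct answer is "not_applicable":

```python
# Few-shot examples as (input, correct label) pairs. The last one
# calibrates the model to abstain instead of forcing a category.
FEW_SHOT_EXAMPLES = [
    ("Refund hasn't arrived after 10 days", "billing"),
    ("App crashes when I rotate my phone", "bug"),
    ("asdf test ignore", "not_applicable"),  # the "I don't know" case
]

def build_messages(ticket: str) -> list[dict]:
    """Build a chat message list with few-shot turns before the real input."""
    messages = [{
        "role": "system",
        "content": ("Classify the ticket as billing, bug, "
                    "feature_request, or not_applicable."),
    }]
    for text, label in FEW_SHOT_EXAMPLES:
        # Each example is a full user -> assistant exchange
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": ticket})
    return messages
```

Presenting examples as real conversational turns, rather than inlined in one big prompt, tends to make the expected output format unambiguous.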

Self-Consistency: Ask Three Times

For high-stakes outputs, I run the same prompt 3-5 times with temperature > 0 and take the majority answer. This is called self-consistency, and it catches the ~15% of cases where a single generation might hallucinate or reason incorrectly. The cost is linear, but the reliability gain is worth it for classification and extraction tasks.
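The voting logic is a few lines. In this sketch, `generate` stands in for one sampled model call (temperature > 0); everything else is plain majority voting:

```python
from collections import Counter
from typing import Callable

def self_consistent(generate: Callable[[], str], n: int = 5) -> str:
    """Sample the same prompt n times and return the majority answer.

    `generate` wraps a single non-deterministic model call; with
    temperature > 0 each invocation can return a different answer.
    """
    answers = [generate() for _ in range(n)]
    # most_common(1) gives [(answer, count)] for the plurality winner
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

In practice I also log the vote split: a 5-0 result and a 3-2 result both return an answer, but the latter is a signal to route the case to a human or a stronger model.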

Structured Outputs Are a Superpower

Always force structured output when possible:

  • JSON mode or function calling eliminates parsing headaches
  • Define explicit schemas. The model follows constraints better when they're formal
  • Include enums for categorical fields to prevent creative interpretations
  • Add a "confidence" field. Models are surprisingly well-calibrated when you ask them to self-assess

The Meta-Lesson

Prompt engineering is software engineering. Version your prompts, test them against eval sets, measure regressions, and treat them as code that gets reviewed. The teams shipping reliable AI products are the ones treating prompts with the same rigor as any other critical system component.
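The eval loop doesn't need heavy tooling to start. A minimal sketch of the idea (function names and the tolerance value are illustrative): score a prompt variant against a labeled eval set and fail the check if accuracy drops below the stored baseline.

```python
from typing import Callable

def eval_accuracy(predict: Callable[[str], str],
                  eval_set: list[tuple[str, str]]) -> float:
    """Fraction of (input, expected) pairs the predictor gets right."""
    correct = sum(predict(x) == expected for x, expected in eval_set)
    return correct / len(eval_set)

def check_regression(accuracy: float, baseline: float,
                     tolerance: float = 0.02) -> bool:
    """Pass if accuracy is within `tolerance` of the stored baseline."""
    return accuracy >= baseline - tolerance
```

Wire `predict` to your actual prompt-plus-model call, run it in CI on every prompt change, and a quietly degraded prompt fails the build instead of reaching users.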