1. Core Architecture Patterns
Most startup LLM architectures follow one of three patterns:
Pattern A: Direct API
Frontend → Backend → LLM API. Simplest pattern, works for most use cases. Add an abstraction layer when you need to switch providers.
Pattern B: Queue-Based
Frontend → Backend → Queue → Worker → LLM. Use when AI calls are slow or expensive and users can tolerate async responses.
Pattern C: Agent Loop
Backend → LLM → Tool Execution → LLM → ... Use for complex multi-step tasks. Requires careful timeout and cost management.
Start with Pattern A. Move to B or C only when you have a specific reason.
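If a task does push you to Pattern C, give the loop hard stops. Below is a minimal sketch that caps both iterations and wall-clock time; `call_llm` and `run_tool` are hypothetical callables standing in for your provider client and tool dispatcher, not a specific SDK.

```python
import time

MAX_STEPS = 8        # hard cap on LLM round-trips per task
MAX_SECONDS = 60     # wall-clock budget for the whole task

def agent_loop(task: str, call_llm, run_tool) -> str:
    """Pattern C sketch: alternate LLM calls and tool execution with hard stops."""
    messages = [{"role": "user", "content": task}]
    deadline = time.monotonic() + MAX_SECONDS

    for _ in range(MAX_STEPS):
        if time.monotonic() > deadline:
            return "Aborted: time budget exceeded."

        # Assumed contract: call_llm returns {"content": str, "tool_call": dict | None}.
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply["content"]})

        if reply["tool_call"] is None:   # the model produced a final answer
            return reply["content"]

        result = run_tool(reply["tool_call"])
        messages.append({"role": "tool", "content": str(result)})

    return "Aborted: step budget exceeded."
```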
2. Prompt Management at Scale
Prompts are code. Treat them that way:
- Version control prompts: Store in your repo, not in the database or LLM provider dashboard.
- Template system: Use a templating library (Jinja2, Handlebars) for dynamic prompt assembly.
- Typed inputs: Define TypeScript/Python interfaces for prompt variables so a compiler or static type checker catches missing or misnamed fields before runtime (see the sketch after this list).
- A/B testing: Build infrastructure to run multiple prompt versions in production with metrics.
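A minimal Python sketch of the template-plus-typed-inputs approach, assuming Jinja2 is installed; the `SUMMARIZE_V2` prompt and `SummarizeVars` dataclass are illustrative names, not a prescribed structure.

```python
from dataclasses import dataclass, asdict
from jinja2 import Environment, StrictUndefined

# The prompt lives in the repo (inline here; in practice a versioned .j2 file).
SUMMARIZE_V2 = """You are a support assistant for {{ product_name }}.
Summarize the ticket below in at most {{ max_sentences }} sentences.

Ticket:
{{ ticket_text }}"""

@dataclass
class SummarizeVars:
    # A static checker (mypy/pyright) flags missing or misnamed fields at call sites.
    product_name: str
    ticket_text: str
    max_sentences: int = 3

env = Environment(undefined=StrictUndefined)  # fail loudly on missing variables

def render_summarize_prompt(variables: SummarizeVars) -> str:
    return env.from_string(SUMMARIZE_V2).render(**asdict(variables))

prompt = render_summarize_prompt(
    SummarizeVars(product_name="Acme", ticket_text="App crashes on login.")
)
```

`StrictUndefined` makes rendering fail on any missing variable, which is usually what you want for prompts: a silently blank slot is worse than an error.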
3. Context Window Strategies
Context windows are your most precious resource:
- Measure usage: Log token counts for every call. Know where your context budget goes.
- Summarize aggressively: For long conversations, periodically summarize older messages into a single context block.
- Prioritize recency: Recent context usually matters more than old context. Implement sliding windows.
- Chunk intelligently: When splitting documents, overlap chunks by 10-20% to preserve context at boundaries.
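A rough sketch of overlapping chunking, using whitespace tokens as a stand-in for a real tokenizer; in practice, swap in your model's tokenizer so counts match what you are billed for.

```python
def chunk_with_overlap(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into ~chunk_size-token chunks with ~15% overlap at boundaries."""
    tokens = text.split()                 # crude: replace with your model's tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                         # the last chunk already reaches the end
    return chunks
```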
4. Caching and Cost Optimization
LLM APIs are expensive. Caching is essential:
- Semantic caching: Cache based on prompt similarity, not exact match. Similar questions get similar answers.
- Prefix caching: When prompts share common prefixes, cache the prefix processing. Some providers support this natively.
- Response caching: For deterministic queries (temperature=0), cache the full response (see the sketch at the end of this section).
- Embedding caching: Embedding calls are cheaper than completions. Cache embeddings aggressively.
A well-implemented cache can reduce LLM costs by 40-70%.
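As a concrete example of the response-caching case, here is a minimal in-memory sketch keyed on a hash of model, prompt, and sampling parameters. `call_llm` is a hypothetical wrapper around your provider client; a production version would use Redis or similar, with a TTL.

```python
import hashlib
import json

_cache: dict[str, str] = {}  # in-memory stand-in for Redis/Memcached

def cached_completion(model: str, prompt: str, call_llm, temperature: float = 0.0) -> str:
    """Cache full responses only when sampling is deterministic (temperature=0)."""
    if temperature != 0.0:
        return call_llm(model=model, prompt=prompt, temperature=temperature)

    key = hashlib.sha256(
        json.dumps(
            {"model": model, "prompt": prompt, "temperature": temperature},
            sort_keys=True,
        ).encode()
    ).hexdigest()

    if key not in _cache:
        _cache[key] = call_llm(model=model, prompt=prompt, temperature=temperature)
    return _cache[key]
```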
5. RAG Architecture Decisions
Retrieval-Augmented Generation requires several decisions:
- Vector store: Pinecone for simplicity, pgvector for PostgreSQL shops, Weaviate for hybrid search.
- Chunking strategy: 500-1000 tokens with overlap works for most text. Adjust based on your content structure.
- Hybrid search: Combine vector similarity with keyword search. Neither alone is sufficient.
- Reranking: Retrieve more candidates than needed, then use a reranker model to pick the best.
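The retrieve-then-rerank step might look like the sketch below, where `vector_search` and `rerank` are hypothetical callables standing in for your vector store and reranker model.

```python
def retrieve(query: str, vector_search, rerank,
             k_final: int = 5, k_candidates: int = 25) -> list[str]:
    """Over-retrieve by vector similarity, then keep the top results after reranking."""
    candidates = vector_search(query, k=k_candidates)   # cheap, recall-oriented
    scored = rerank(query, candidates)                   # expensive, precision-oriented; assumed (doc, score) pairs
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _score in scored[:k_final]]
```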
6. Multi-Model Routing
Smart startups use different models for different tasks:
- GPT-4/Claude for complex tasks: Reasoning, analysis, nuanced writing.
- GPT-3.5/Claude Haiku for simple tasks: Classification, extraction, simple formatting.
- Specialized models: Use dedicated models for embeddings, image generation, speech.
- Fallback chains: When the primary model fails or times out, fall back to alternatives.
Build a router abstraction from day one. You will change models more than you expect.
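A router can start as small as the sketch below: a tier-to-model mapping plus a fallback loop. `call_model` is a hypothetical wrapper over your provider SDKs, and the model names are illustrative placeholders.

```python
import logging

# Ordered fallback chains per task tier; model names are illustrative placeholders.
ROUTES = {
    "complex": ["gpt-4", "claude-3-sonnet"],
    "simple": ["gpt-3.5-turbo", "claude-3-haiku"],
}

def complete(tier: str, prompt: str, call_model) -> str:
    """Try each model in the tier's chain until one succeeds."""
    last_error: Exception | None = None
    for model in ROUTES[tier]:
        try:
            return call_model(model, prompt)     # assumed to raise on error or timeout
        except Exception as exc:                  # rate limit, outage, timeout...
            logging.warning("model %s failed: %s", model, exc)
            last_error = exc
    raise RuntimeError(f"all models in tier '{tier}' failed") from last_error
```

Keeping every call site on `complete(tier, prompt)` means swapping providers or reordering a chain is a one-line config change rather than a refactor.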