Context Window
The maximum amount of text (in tokens) an LLM can consider in one inference.
Every LLM has a context window: a hard limit on how many tokens it can see at once. GPT-4o supports 128K tokens (roughly 96,000 English words), Claude Sonnet 200K, and Gemini 1.5 Pro up to 2M.
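The word figure above relies on a common rule of thumb of about 0.75 English words per token. The real ratio depends on the tokenizer and the language, so treat the constant below as an assumption. A quick sketch of the conversion:

```python
# Assumed rule of thumb: ~0.75 English words per token.
# Actual ratios vary with the tokenizer and the language of the text.
WORDS_PER_TOKEN = 0.75

def approx_words(tokens: int) -> int:
    """Rough number of English words that fit in `tokens` tokens."""
    return round(tokens * WORDS_PER_TOKEN)

windows = [("GPT-4o", 128_000), ("Claude Sonnet", 200_000), ("Gemini 1.5 Pro", 2_000_000)]
for model, window in windows:
    print(f"{model}: {window:,} tokens = ~{approx_words(window):,} words")
```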
The context window has to hold everything in a request: the system prompt, the retrieved RAG chunks, the chat history, the new user message, and room for the model's response. When the total would exceed the limit, something has to give: typically the oldest messages are dropped or condensed into a summary.
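A minimal sketch of the "drop the oldest messages" strategy, working only on token counts. A real system would measure each piece with the model's own tokenizer; the function name `fit_history` and the idea of a fixed reply reserve are illustrative assumptions, not any particular API:

```python
def fit_history(window: int, system: int, rag: int, history: list[int],
                user_msg: int, reserve_for_reply: int) -> list[int]:
    """Drop the oldest history messages until the whole request fits in the window.

    All arguments are token counts; `history` is ordered oldest first.
    Returns the token counts of the history messages that are kept.
    """
    fixed = system + rag + user_msg + reserve_for_reply
    if fixed > window:
        raise ValueError("fixed parts alone exceed the context window")
    kept = list(history)
    while kept and fixed + sum(kept) > window:
        kept.pop(0)  # the oldest message is dropped first
    return kept

# Hypothetical numbers: an 8K window with a 4,200-token history that does not fit.
kept = fit_history(window=8_000, system=1_000, rag=3_000,
                   history=[1_500, 1_200, 900, 600],
                   user_msg=200, reserve_for_reply=1_000)
print(kept)  # [1200, 900, 600]
```

Summarizing instead of dropping would replace the popped messages with one shorter summary message rather than discarding them outright.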
Bigger windows are tempting, but filling them is expensive: cost and latency both grow with the number of input tokens.
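Under simple per-token pricing, input cost grows linearly with prompt length. The sketch below uses a hypothetical placeholder price (not any provider's actual rate) to show how quickly a full window adds up per request; latency is not modeled:

```python
# Hypothetical placeholder price for illustration only; real prices vary by provider and model.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, assumed

def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost in USD of sending `tokens` input tokens at a flat per-token price."""
    return tokens / 1_000_000 * price_per_million

for tokens in (8_000, 128_000, 2_000_000):
    print(f"{tokens:>9,} input tokens -> ${input_cost(tokens):.2f} per request")
```

In a chat, the history is typically resent on every turn, so this per-request cost repeats as the conversation continues.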
Example in GlobalChatbot
GlobalChatbot manages context smartly: it always retains the system prompt and the user's last 8 turns, and uses RAG to surface only the most relevant 4–6 KB chunks per query.
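The policy described above can be sketched as follows. This is not GlobalChatbot's actual code: the names `Turn`, `score`, and `build_context` are hypothetical, a "turn" is assumed to be one message in the history, and the word-overlap `score` is a toy stand-in for the embedding-based retrieval a real RAG pipeline would use:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # "system", "user" or "assistant"
    text: str

def score(query: str, chunk: str) -> int:
    """Toy relevance: count distinct query words that also appear in the chunk
    (whitespace split, no punctuation handling)."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in chunk_words)

def build_context(system_prompt: str, history: list[Turn], query: str,
                  chunks: list[str], keep_turns: int = 8, max_chunks: int = 6) -> list[Turn]:
    # Always keep the system prompt; keep only the most recent `keep_turns` turns.
    recent = history[-keep_turns:] if keep_turns > 0 else []
    # Rank chunks by relevance and keep at most `max_chunks` that match at all.
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    top = [c for c in ranked[:max_chunks] if score(query, c) > 0]
    if top:
        system_prompt += "\n\nRelevant documents:\n" + "\n\n".join(top)
    return [Turn("system", system_prompt), *recent, Turn("user", query)]

history = [Turn("user" if i % 2 == 0 else "assistant", f"message {i}") for i in range(12)]
chunks = ["Refund policy: refunds within 30 days.",
          "Shipping takes 3-5 days.",
          "Our refund form is online."]
messages = build_context("You are a support agent.", history, "what is the refund policy", chunks)
print(len(messages))  # system + 8 recent turns + new user message
```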