Architecture & models

Context Window

The maximum amount of text (in tokens) an LLM can consider in one inference.

Every LLM has a context window — a hard limit on how many tokens it can see at once. GPT-4o supports 128K tokens (~96,000 words), Claude Sonnet 200K, Gemini 1.5 Pro up to 2M.

Your context window has to fit the system prompt, the retrieved RAG chunks, the chat history, the new user message, and room for the response. When the total exceeds the limit, something has to give: typically the application drops the oldest messages or replaces them with a summary.
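The budgeting described above can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: `count_tokens` is a crude stand-in that counts whitespace-separated words (a real system would use the model's own tokenizer), and `fit_history` and its parameter names are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_history(
    system_prompt: str,
    rag_chunks: List[str],
    history: List[str],
    user_message: str,
    context_window: int,
    reserved_for_response: int,
) -> List[str]:
    """Return the newest history messages that still fit in the window.

    The system prompt, RAG chunks, new user message, and response reserve
    are treated as fixed costs; history gets whatever budget is left.
    """
    fixed = (
        count_tokens(system_prompt)
        + sum(count_tokens(chunk) for chunk in rag_chunks)
        + count_tokens(user_message)
    )
    budget = context_window - reserved_for_response - fixed
    if budget < 0:
        raise ValueError("fixed parts alone exceed the context window")

    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so the oldest messages are dropped first.
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Swapping the `break` for a call to a summarizer would give the "summarize instead of drop" variant mentioned above.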

Bigger windows are tempting but expensive: cost and latency both scale with input length.
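As a rough illustration of the cost side, input billing is usually linear in token count. The sketch below assumes a flat per-million-token input price; the function name and the price used in the usage note are hypothetical, not any provider's actual rate.

```python
def estimate_input_cost(input_tokens: int, price_per_million_tokens: float) -> float:
    """Estimate the input cost of one request, assuming flat per-token pricing."""
    return input_tokens / 1_000_000 * price_per_million_tokens
```

Under these assumptions, filling a 128K-token window at a hypothetical 2.50 per million input tokens costs about 0.32 per request, while a trimmed 8K-token prompt costs about 0.02.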

Example in GlobalChatbot

GlobalChatbot manages context smartly: it always retains the system prompt and the user's last 8 turns, and uses RAG to surface only the most relevant 4–6 KB chunks per query.
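A policy like this one could look roughly as follows. This is a hedged sketch, not GlobalChatbot's actual implementation: `build_context` is a hypothetical helper, the relevance scores are assumed to come from some retriever, and "4–6 KB chunks" is read here as a cap of up to six knowledge-base chunks, which is an assumption.

```python
from typing import List, Tuple


def build_context(
    system_prompt: str,
    turns: List[str],
    scored_chunks: List[Tuple[float, str]],
    user_message: str,
    max_turns: int = 8,
    max_chunks: int = 6,
) -> List[str]:
    """Assemble the prompt parts: system prompt, top chunks, recent turns, new message.

    scored_chunks holds (relevance_score, chunk_text) pairs; higher is more relevant.
    """
    # Keep only the highest-scoring chunks.
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    top_chunks = [text for _, text in ranked[:max_chunks]]
    # Keep only the most recent turns; older ones fall out of the context.
    recent_turns = turns[-max_turns:] if max_turns > 0 else []
    return [system_prompt] + top_chunks + recent_turns + [user_message]
```

The system prompt is always first and never trimmed, so the assistant's instructions survive no matter how long the conversation gets.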
