The life cycle of an AI model

A Large Language Model (LLM) undergoes two main phases:
  1. The training phase
    • The model is trained on large data sets
  2. The usage phase
    • The model can be used to generate an answer
    • The model cannot learn anymore
[Figure: Breakdown of the lifecycle of an AI model into training and usage phases]

Training an LLM

What is a Token? A token is a piece of text (roughly a word or word fragment) that the model processes. On average, 1 token equals about 4 characters. For example, “Hello world” is 2 tokens, while “understanding” might be split into 2 tokens: “under” and “standing”.
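To make the 4-characters-per-token rule of thumb concrete, here is a minimal sketch in Python (the function name and rounding are illustrative assumptions, not a real tokenizer API). The exact split into tokens always depends on the model’s tokenizer, so this only gives a rough estimate.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: on average one token covers about 4 characters.
    return max(1, round(len(text) / 4))

# "Hello world" has 11 characters, so roughly 3 tokens by this estimate;
# a real tokenizer typically splits it into 2 tokens, so treat this as approximate.
print(estimate_tokens("Hello world"))    # -> 3
print(estimate_tokens("understanding"))  # -> 3 (a real tokenizer might use 2)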
During training, the model processes vast amounts of text data using a technique called “next token prediction.” The model learns statistical relationships between words and concepts by repeatedly predicting what word should come next in a sequence.

[Figure: Example of one of the tasks the models go through in the training phase, in this case filling out a text with missing words]

For example, given the text “The capital of Germany is ___”, the model learns that “Berlin” has a high probability of being the next token. Through billions of these predictions across diverse text, the model builds a sophisticated understanding of language patterns, facts, and reasoning.

Once training completes, the model’s parameters are frozen. The “knowledge cutoff date” marks when training data collection stopped, meaning the model has no knowledge of events after this date. Now let’s explore how these trained models actually generate responses.
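As a loose illustration of the idea behind next token prediction (not how real LLMs are trained, which optimize a neural network over billions of tokens), the sketch below simply counts which word follows which in a tiny made-up corpus and turns those counts into probabilities:

from collections import Counter, defaultdict

# Toy corpus; a real model trains on billions of tokens, not a handful of sentences.
corpus = [
    "the capital of germany is berlin",
    "the capital of france is paris",
    "the capital of germany is berlin",
]

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev_word, next_word in zip(words, words[1:]):
        follow_counts[prev_word][next_word] += 1

def next_token_probabilities(prev_word):
    # Turn raw counts into a probability distribution over the next word.
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# After "is", this toy corpus contains "berlin" twice and "paris" once.
print(next_token_probabilities("is"))  # -> {'berlin': 0.666..., 'paris': 0.333...}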

Using an LLM

What is Inference? Inference is the phase when a trained AI model generates responses to your prompts. Unlike training (when the model learns), during inference the model uses its existing knowledge to predict and generate text. The model cannot learn new information during this phase.
During the usage phase (also known as inference), the model generates responses by sampling from the probability distributions it learned during training. When you ask about Artificial Intelligence, the model assigns much higher probability to related terms like machine learning than to unrelated ones like banana cake.

[Figure: Example of how the model generates each word based on the previous words]

When a user sends a prompt to the model, the model chooses the next word or word piece (token) based on these probabilities. For example, when a user sends Hi, the model assigns high probability to greeting tokens, so it generates Hello as the response. Then it generates the next most likely word based on Hi Hello. This process repeats until the model decides the request has been sufficiently answered. The generation process works token by token (sketched in code after the steps below):
  1. User sends: Hi
  2. Model predicts high probability for greeting tokens like Hello
  3. Model then predicts the next token based on Hi Hello
  4. This continues until the model generates an end-of-sequence token
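A minimal sketch of this loop, assuming a hypothetical, hand-written probability table (a real model computes a fresh distribution over all possible tokens at every step, and sampling settings such as temperature influence which one gets picked):

import random

# Hypothetical next-token probabilities keyed by the text generated so far.
# These numbers are made up for illustration only.
NEXT_TOKEN_PROBS = {
    "Hi":               {"Hello": 0.8, "Hey": 0.2},
    "Hi Hello":         {"!": 0.6, "there": 0.4},
    "Hi Hello !":       {"<eos>": 0.9, "How": 0.1},
    "Hi Hello there":   {"<eos>": 0.7, "!": 0.3},
    "Hi Hello there !": {"<eos>": 1.0},
    "Hi Hello ! How":   {"<eos>": 1.0},
}

def generate(prompt: str, max_tokens: int = 10) -> str:
    # Generate a response token by token until an end-of-sequence token appears.
    context = prompt
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(context, {"<eos>": 1.0})
        token = random.choices(list(probs), weights=probs.values())[0]
        if token == "<eos>":                 # step 4: stop at the end-of-sequence token
            break
        context = context + " " + token      # steps 2-3: append the token, predict again
    return context[len(prompt):].strip()

print(generate("Hi"))  # e.g. "Hello !" or "Hello there"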

Influencing the output of a response

What is a Context Window? The context window is the maximum amount of text (measured in tokens) that an AI model can process in a single request. Think of it as the model’s “working memory” - everything you want the model to consider (your current message, chat history, attached documents, instructions) must fit within this limit.
Since models cannot learn after being deployed, how do they remember previous messages or incorporate new information? The answer lies in the context window.

[Figure: Anatomy of a message sent to the model as part of the context window]

Each request to the model includes everything needed for that specific response: your current message, the entire chat history, attached documents, system instructions, and any relevant knowledge base content. This complete context gets packed into the model’s context window (the maximum amount of text it can process in a single request). The model treats each request as completely independent, but by including all relevant context, it can maintain coherent conversations and reference previous information.
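A minimal sketch of how such a request could be assembled and checked against the context window. The function, field names, and 4-characters-per-token estimate are illustrative assumptions rather than a real API; production systems use the model’s tokenizer and often trim or summarize old history when the limit is exceeded.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token (see the training section above).
    return max(1, len(text) // 4)

def build_request(system_instructions, chat_history, documents, user_message,
                  context_window=8000):
    # Pack everything the model needs for one independent request.
    parts = [system_instructions, *documents, *chat_history, user_message]
    prompt = "\n\n".join(parts)

    # Drop the oldest chat messages until the request fits into the context window.
    while estimate_tokens(prompt) > context_window and chat_history:
        chat_history = chat_history[1:]
        parts = [system_instructions, *documents, *chat_history, user_message]
        prompt = "\n\n".join(parts)

    return prompt

request = build_request(
    system_instructions="You are a helpful assistant.",
    chat_history=["User: Hi", "Assistant: Hello! How can I help?"],
    documents=["(attached document content)"],
    user_message="User: What did I just attach?",
)
print(estimate_tokens(request), "estimated tokens sent in this request")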