LLMs are the most talked-about technology in software right now. Most people interact with them only through chat interfaces (e.g., ChatGPT). But understanding the mechanics underneath changes how you use them and how to set realistic expectations for what they can and can’t do.

Earlier this month, I gave a talk at the covering all of it. This article is the written companion.

The Evolution of Natural Language Processing

Before LLMs, computers trying to understand language went through several distinct eras, each solving problems the previous one couldn’t.

  1. Rule-Based

    1950s-1980s

    Handwritten Rules

  2. Statistical

    1990s-2012

    n-grams, SVMs

  3. Deep Learning

    2013-2018

    Embeddings, RNNs, Attention

  4. LLMs

    2019-Present

    BERT, GPT, LLaMA

The evolution of Natural Language Processing

Rule-based (1950s-1980s). Linguists handwrote if-this-then-that rules to parse sentences. A rule for pluralization, another for verb tense, another for every grammatical edge case. Natural language has more edge cases than anyone can enumerate, so these systems were brittle by design and exhausting to maintain.

Statistical (1990s-early 2010s). Machine learning replaced explicit rules with probabilistic models trained on data. Hidden Markov Models scored likely word sequences for speech recognition, n-grams predicted words from how often they appeared near each other, and support vector machines classified text based on statistical features. This scaled better than rules, but it learned surface co-occurrence rather than meaning.

Deep learning (2013-2018). Word embeddings represented every word as a vector of numbers in a high-dimensional space, where words with related meanings landed close to each other. “King” sat near “Queen”, “Cat” near “Dog”, “Walking” near “Running”. This gave models access to semantic relationships that rules and statistics couldn’t reach. RNNs and LSTMs then layered sequential understanding on top, processing text word-by-word with a hidden state that carried context forward. But RNNs read one token at a time, which made them slow to train since they couldn’t be properly parallelized. Their fixed-size hidden state also lost track of information across long sequences.

Transformers

The attention mechanism was introduced to language models for machine translation in 2014 by Bahdanau et al. , improving alignment between source and translated text. Instead of compressing an entire source sentence into a single fixed vector, attention let the decoder look back at every source token and decide which ones mattered for each output word.

Three years later, Google’s Attention Is All You Need paper pushed the idea to its extreme, throwing out RNNs entirely and building the Transformer architecture around self-attention, which let every token attend to every other token in a sequence in parallel. Combined with positional encoding, Transformers scaled in ways RNNs never could. Every modern LLM, from GPT to Claude to LLaMA, is built on this foundation.

If you want to go deeper on how self-attention works inside a Transformer, I recommend .

How LLMs Work

Training an LLM happens in three main stages, and each one costs a fraction of the step before it.

Pre-Training

Learn language patterns from massive text data

SFT

Fine-tune on tasks with labeled data

RLHF

Refine with human feedback as reward

ICL or Fine-Tuning

Adapt via prompts/RAG or further training

* Pre-Training: Most expensive step (by far)

How LLMs are trained, from raw text to usable model

Pre-training. At this stage, the model learns by playing fill-in-the-blank on massive amounts of raw text (books, websites, code, scientific papers). Words are hidden from sentences and the model has to predict them, over and over, trillions of times. For a sense of scale, , the open web archive that GPT-3 and most modern LLMs pull from, holds petabytes of raw web text. This is self-supervised learning at scale, where the labels come from the data itself. By the end, the model has a statistical representation of how language works, but it can only autocomplete, it doesn’t know how to follow instructions yet. Pre-training a frontier model costs tens of millions of dollars in GPU time, which is why this is by far the most expensive step in the pipeline.

Supervised Fine-Tuning (SFT). The pre-trained model is further trained on a curated dataset of prompt-response pairs. For LLMs, this is usually instruction tuning: teaching the model to answer questions, write code, summarize documents, and behave like an assistant. Far cheaper than pre-training, but still non-trivial to do well.

Reinforcement Learning from Human Feedback (RLHF). The alignment step. Human raters compare pairs of model outputs and pick the better one, and those preferences train a separate reward model. The reward model then steers the LLM via reinforcement learning toward responses people prefer. RLHF is what gets a model closer to what Anthropic calls “helpful, harmless, and honest.” Without SFT and RLHF, base models will happily reproduce toxic or misleading patterns from their training data, because nothing has explicitly penalized those outputs.

The ML community uses “Shoggoth with a smiley mask” as a meme to describe LLMs’ training process. The pre-trained model is drawn as a sprawling Lovecraftian monster, with SFT as a small pink face grafted onto one of its tentacles and RLHF as the smiley mask held up in front. The pipeline really does stack like that. The polished assistant you get to interact with is a thin layer of refinement sitting on top of a much larger and weirder pre-trained model.

The Shoggoth with a smiley mask meme. A Lovecraftian monster labeled 'Unsupervised Learning', a small pink face labeled 'Supervised Fine-tuning', and a smiley mask labeled 'RLHF (cherry on top)'.

The Shoggoth with a smiley mask, the ML community’s favorite metaphor for LLM alignment

Next-Word Prediction

Under the hood, text generation is simpler than it sounds. Given a sequence of tokens, the model calculates a probability distribution over its entire vocabulary for what comes next, driven by what the model learned from its training data, the preferences installed during SFT and RLHF, and the tokens currently in its context window.

The sky is

blue
47%
clear
24%
usually
12%
the
9%
falling
5%
LLMs predict the next word by calculating probability distributions

For the prompt “The sky is”, the model might assign 47% probability to “blue”, 24% to “clear”, and smaller weights to options like “usually”, “the”, and “falling”. The model picks a token, appends it to the sequence, and repeats. That’s it. Every response you’ve ever gotten from an LLM was generated one token at a time, each one conditioned on everything before it, including the tokens the model itself just generated. The sophistication comes from the model’s ability to maintain coherent context over thousands of tokens while doing this.

Tokens vs Words

LLMs don’t operate on words directly. They use tokens, which are sub-word units typically created through a process called Byte-Pair Encoding (BPE). Common words like “the” are single tokens, but longer or rarer words get split into pieces. “Tokenization” might become “Token” + “ization”. This is why LLMs struggle with tasks like counting letters in a word or working with unusual character patterns. They don’t see individual characters the way we do.

You can see this in action using the playground, which visualizes how GPT’s tokenizer breaks text apart.

Limitations

Each of the three forces driving next-token prediction has hard edges. The training data has gaps and a cutoff date, post-training can’t catch every bad output, and the context window is finite. The practical limitations below all trace back to one of those.

Hallucinations. LLMs generate plausible-sounding text, but they’re not designed to prioritize factual accuracy. They produce what statistically follows from the prompt, not what’s true. Ask an LLM how many ‘m’s are in the word “Weather” and it might confidently tell you there’s one, then apologize and correct itself when challenged. The model isn’t lying, it’s doing math on token probabilities, not reasoning about spelling.

Limited context windows. Every LLM has a maximum amount of text it can process at once. In early 2024, GPT-4 Turbo’s window is 128K tokens, Claude 2.1 offers 200K. That may sound like a lot until you’re building an application that needs to process long documents or maintain extended conversations.

Outdated knowledge. LLMs know what was in their training data, and nothing else. A model trained on data up to April 2023 has no idea about events after that date.

Limited to generating text. An LLM can write a perfect email, but it can’t send it. It can describe a database query, but it can’t execute one. Turning that text into action requires additional engineering.

Working Around the Limitations

Each limitation has a practical workaround. Training data and post-training preferences are locked in once the model ships, so every technique below comes down to shaping the one force you still control, the context the model is generating into.

Prompt Engineering

The most direct way to shape the context. A vague prompt gets a vague response. A specific prompt with clear constraints, role-setting, and format requirements gets better output.

Bad prompt: “Suggest some books.” gets you a generic list of classics.

Good prompt: “Recommend contemporary novels that weave themes of AI ethics and human dilemmas, preferably from authors recognized for their deep exploration of technology’s impact on society.” gets you targeted recommendations.

Constraints can backfire, since over-specifying steers the model away from a better answer it would have found with more room to work.

Few-Shot Prompting

Instead of explaining what you want in the abstract, show the model examples. One-shot prompting gives a single example. Few-shot prompting provides two to five examples of input-output pairs. The model picks up on the pattern and generalizes, since the examples are in the context it’s conditioning on. This is particularly effective for formatting, classification, and translation tasks where showing is faster than telling.

The examples take up context on every call, and a skewed set teaches the skew, including the patterns you didn’t mean to show.

Chain of Thought

Complex tasks can overwhelm an LLM when presented all at once. Chain of thought prompting breaks the problem into explicit steps, asking the model to show its reasoning. Because each reasoning step becomes part of the context the next tokens condition on, the final answer is built on top of the scaffolding the model just wrote for itself. This improves accuracy on math, logic, and multi-step reasoning tasks. A model that gets the wrong answer when asked directly often gets the right answer when asked to work through the problem step by step.

The reasoning is generated output, so the improvement comes with more tokens and slower responses, and you pay for those extra tokens.

Prompt Chaining

Prompt Chaining Flow Flowchart showing sequential LLM calls: user input feeds Prompt 1, whose output feeds Prompt 2, continuing through Prompt N, with optional human review between steps. User Input Prompt #1 #1 Output human review Prompt #2 #2 Output human review Prompt #N End
Prompt chaining connects multiple LLM calls, with optional human review between steps

Instead of cramming everything into a single prompt, you break the workflow into a sequence of LLM calls. Each call handles one well-defined step, and the output of one becomes the input to the next. Humans can review intermediate outputs before the next step fires. This pattern is the backbone of most production LLM applications, and it’s exactly how the was built.

Latency stacks call over call, and every step is a place the chain can break unless each boundary validates what it received.

Retrieval Augmented Generation (RAG)

RAG addresses the outdated knowledge and hallucination problems by injecting fresh information into the context. Before sending a prompt to the LLM, you search a knowledge base (using vector embeddings and a vector database like Chroma or Pinecone) to find relevant documents. Those documents get injected into the prompt as context, grounding the model’s response in actual, up-to-date information.

Without RAG

Question

For which club does Lionel Messi play?

Prompt

Answer the following question to the best of your abilities: For which club does Lionel Messi play?

LLM Response

As of my latest update in 2021, Messi plays for Paris Saint-Germain FC.

With RAG

Question

For which club does Lionel Messi play?

Retrieved Context

Messi has signed with Inter Miami CF, joining David Beckham's club until 2025.

source: news article

Prompt

Based on the following context: Messi has signed with Inter Miami CF, joining David Beckham's club until 2025.

Answer the following question: For which club does Lionel Messi play?

LLM Response

Lionel Messi plays for Inter Miami CF.

RAG grounds the model with retrieved context, fixing outdated knowledge

For example, asking “For which club does Lionel Messi play?” on a model with a 2021 training cutoff gets you the confidently wrong answer “Paris Saint-Germain.” Feed it a retrieved news snippet about Messi’s 2023 Inter Miami signing in the prompt, and the same model answers correctly.

RAG shifts the problem rather than solving it: the answer is only as good as what retrieval surfaces, and a stale passage grounds the model just as confidently.

Function Calling

Function calling lets the LLM trigger real actions. Instead of just describing what should happen, the model’s output becomes structured data that your application code can act on, with the tool’s response flowing back into the context for the next generation step. Summarize data and then generate a chart from it. Answer a question and then log the interaction. This bridges the gap between “language, not action” and actual utility.

It moves the failure surface: the model can call the wrong function or invent arguments that don’t fit, and that output now runs as code instead of sitting in a reply.

The LLM Landscape

Choosing an LLM in 2024 means navigating two paths:

  • Open source models (LLaMA, Mistral) give you full control. You can inspect the weights, fine-tune for your domain, run locally, and avoid per-token API costs. The tradeoffs are real though. You need significant compute to run them, they currently lag behind proprietary models on most benchmarks, and community support, while active, is no substitute for enterprise SLAs.
  • Proprietary models (GPT-4, Gemini, Claude) are easier to get started with, offer better raw performance, and come with commercial support. The downsides are cost (API calls add up fast, and fine-tuning is expensive), vendor lock-in, and opacity. You can’t see how the model works or why it made a particular decision.

For evaluating models, a handful of benchmarks do most of the work. (Massive Multitask Language Understanding) covers 57 subjects from STEM to humanities and is the go-to measure of general knowledge. measures functional correctness on 164 programming problems generated from docstrings. The model’s output has to actually run and pass the tests. probes commonsense reasoning with questions humans solve more than 95% of the time but that still trip up state-of-the-art models. And sidesteps static benchmarks entirely, using head-to-head ELO ratings from crowd-sourced blind comparisons. As of February 2024, GPT-4 variants hold the top spots in the Arena, with Gemini Pro, Mistral Medium, and Claude competing in the top 10.

The tooling ecosystem has matured rapidly. LangChain and LlamaIndex provide frameworks for building LLM-powered applications. Chroma and Pinecone handle vector storage for RAG. Ollama makes running open source models locally trivial. Axolotl simplifies fine-tuning. Replicate offers LLMs as a service.

Putting It All Together: Super Bowl Ad Brief Generator

I built a demo for the talk. The Super Bowl Ad Brief Generator takes a one-sentence product description (“dog food that makes dogs able to talk”) and uses prompt chaining to turn it into an ad storyboard.

The pipeline is five GPT-4 Turbo calls wired together via LangChain, running idea description → creative brief → creative concepts → script → storyboard, with DALL-E 3 generating the final visual panels. Every stage parses the model’s output into a typed Pydantic schema, so a bad response fails at the boundary instead of corrupting the next prompt.

The repo has the slides, notebook, and source: . You can run the full pipeline end-to-end in with just an OpenAI API key.

Use Responsibly

LLMs carry real risks that grow with deployment.

Bias is inherited from training data. If the data contains stereotypes, the model reproduces them. Hallucinations mean you need verification layers for anything factual. Privacy is a concern when user data flows through third-party APIs. And the computational cost of training and running these models has a real environmental footprint.

One distinction matters above the rest. LLMs can sound equally confident when they’re right and when they’re wrong, and they rarely signal the difference. Treat output you care about as a draft that needs human review before you rely on it.

The field moves fast, and new models, techniques, and tools land weekly, but the layer underneath moves slower. Transformers still generate one token at a time, and the failure modes still trace back to training data, post-training, and a finite context window. Every workaround still comes down to shaping the one input you control. That’s the part worth learning once.

References
  1. Neural Machine Translation by Jointly Learning to Align and Translate — Bahdanau et al., 2014.
  2. Attention Is All You Need — Vaswani et al., 2017.