Intuition: a curious student
Like a curious student, AI studies examples, forms internal rules, then refines them with feedback.
Playful visuals and mini missions to explore the math behind neural networks
AI builds systems that can sense, decide, and improve with data.
Artificial Intelligence (AI) is the field of creating systems that perceive their world, reason or learn from data, and choose actions to reach goals.
AI studies examples, forms internal rules, then gets feedback to refine them.
AI observes a map, plans a route, acts, then updates the plan after each step.
Different AI families focus on different parts of the loop, but they often work together.
Rules, logic, planning, and graph search to make decisions.
Models that learn patterns from labeled or unlabeled data.
Neural networks for vision, speech, language, and high-dimensional signals.
Agents learn policies through interaction and rewards.
Models that create text, images, or audio from learned distributions.
Perception plus control to act in the physical world.
Most real systems blend categories, such as robots that plan with search and see with deep learning.
A compact roadmap of the math that powers modern AI. Switch tabs to see each idea in action.
Learn how vectors and matrices stretch, rotate, and move space.
Goal: read \( \mathbf{h} = W\mathbf{x} + \mathbf{b} \).
Understand slopes, tiny nudges, and the chain rule.
Goal: follow gradients through a network.
Use gradients to walk downhill and find the best answers.
Goal: tune the learning rate \( \eta \).
Turn scores into probabilities and measure mistakes.
Goal: use softmax and cross-entropy.
Compute gradients for whole batches at once.
Goal: track shapes and vectorized rules.
Neural networks are a stack of math ideas. Each tab below zooms in on one layer of that stack.
Think of a matrix as a magic machine that bends space. Each layer takes a vector and transforms it.
Matrices are essential in deep learning: every layer is a matrix multiply that bundles thousands of weights, so many neuron activations and whole batches are computed in one fast step.
Imagine a grid of points being stretched and tilted. That is what \(W\) does.
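The layer formula \( \mathbf{h} = W\mathbf{x} + \mathbf{b} \) can be checked in a few lines of NumPy; the matrix and bias values below are made up for illustration:

```python
import numpy as np

# A 2x2 matrix W "bends space": it can rotate, stretch, and shear vectors.
W = np.array([[2.0, 0.5],
              [0.0, 1.0]])   # stretch the x-axis by 2, shear by 0.5
b = np.array([1.0, -1.0])    # the bias shifts the result

x = np.array([1.0, 2.0])     # an input vector
h = W @ x + b                # one neural-network layer: h = Wx + b

print(h)  # [4. 1.]
```

Every point on the grid is transformed the same way, which is why a single matrix describes the whole warp.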
Derivatives tell us how fast things change. If the loss is a hill, the derivative tells the slope.
Let \(y = g(x)\) and \(L = f(y)\). A tiny nudge \(dx\) changes \(y\) by \(dy = g'(x)dx\), which changes the loss by \(dL = f'(y)dy\).
Backprop is the chain rule applied again and again through the whole network.
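A quick numerical sketch of the chain rule, using assumed functions \(g(x) = x^2\) and \(f(y) = \sin y\):

```python
import numpy as np

# L = f(g(x)) with g(x) = x**2 and f(y) = sin(y).
# Chain rule: dL/dx = f'(g(x)) * g'(x) = cos(x**2) * 2x
x = 1.5
analytic = np.cos(x**2) * 2 * x

# Numerical check with a tiny nudge dx (central difference)
eps = 1e-6
L = lambda t: np.sin(t**2)
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)

print(analytic, numeric)  # the two values agree to many decimal places
```

Backprop repeats exactly this multiplication of local slopes, layer by layer, from the loss back to every weight.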
Once we know the slope, we take a step downhill to reduce the loss.
For a linear layer \( \mathbf{z} = W\mathbf{x} + \mathbf{b} \) with softmax + cross-entropy, the error at the logits is \( \hat{\mathbf{y}} - \mathbf{y} \).
For classification, the network turns scores into probabilities and compares them to the truth.
Loss derivation: with \( \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}) \) and one-hot \( \mathbf{y} \), the log-softmax derivative gives a simple gradient.
Result: \( \nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y} \), the signal that drives backprop.
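The gradient \( \hat{\mathbf{y}} - \mathbf{y} \) can be verified numerically in NumPy; the logits and label below are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])       # logits
y = np.array([0.0, 1.0, 0.0])        # one-hot true label
y_hat = softmax(z)
loss = -np.sum(y * np.log(y_hat))    # cross-entropy

grad = y_hat - y                     # analytic gradient at the logits

# Numerical check on one coordinate: nudge z[1] and watch the loss
eps = 1e-6
z_plus = z.copy(); z_plus[1] += eps
num = (-np.sum(y * np.log(softmax(z_plus))) - loss) / eps
print(grad[1], num)
```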
To train fast, we compute gradients for a whole batch at once. Think of matrices as a neat way to stack many examples and do one big calculation.
Intuition: each gradient formula is "reverse flow" of the forward pass. We transpose the weight matrix to send signals back to the inputs, and we combine all examples with \(X^T\) to update the weights in one shot.
Shapes are your superpower: if \(X\) is \(n \times d\) and \(W\) is \(d \times m\), then \(Y\) is \(n \times m\). The gradients match those same shapes.
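Those shapes can be verified directly in NumPy; the upstream gradient of ones is just a placeholder for whatever backprop delivers:

```python
import numpy as np

n, d, m = 4, 3, 2                 # batch size, input dim, output dim
X = np.random.randn(n, d)         # inputs, shape (n, d)
W = np.random.randn(d, m)         # weights, shape (d, m)
Y = X @ W                         # outputs, shape (n, m)

dY = np.ones_like(Y)              # upstream gradient, same shape as Y
dW = X.T @ dY                     # gradient w.r.t. W, shape (d, m)
dX = dY @ W.T                     # gradient w.r.t. X, shape (n, d)

print(Y.shape, dW.shape, dX.shape)  # (4, 2) (3, 2) (4, 3)
```

Note how \(X^T\) combines all examples into one weight update, and \(W^T\) sends the signal back toward the inputs.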
Tap a tab to switch the picture.
Open the slide-out panel to practice the math and see live output.
Tip: Use print() to surface intermediate steps.
Explore vectors as data points, similarity scores, and distances. Switch to Deep Dive for norms, projections, and basis intuition.
Think of vector A as a data point \(x\) and vector B as model weights \(w\). Use the buttons to see how scores and distances behave.
Step-by-step with the current \(x\) and \(w\) values:
Step through projections, norms, subspaces, gradients, vector stats, regularization, and attention with visual intuition.
Advanced vector ideas that show up in PCA, optimization, and attention.
Compute dot products, norms, and cosine similarity with live code.
Tip: Update x and w to match the visual slider values.
Matrices reshape space. Determinants measure area scaling, inverses undo transforms, and covariance/precision describe spread along eigenvector axes.
Focus on the three core moves: add, multiply, and reshape space with scaling + shear.
Multiplication mixes rows and columns to transform vectors.
Drag the sliders to stretch or slant the grid.
Explore determinants, inverses, covariance/precision, identity, and eigenvectors with step-by-step formulas.
Determinant is the area scale factor; sign indicates a flip.
Multiply matrices and inspect batch gradients in code.
Tip: Change X and W to see how shapes affect gradients.
Go from events to distributions, and see how probabilities become confidence bars.
Graph. Bars show probability mass per outcome; the curve shows probability density; Bayes view compares prior, likelihood, and posterior.
Venn diagram highlights the region tied to the formula.
Estimate the prior \(P(\text{spam})\) from training data, then combine it with word likelihoods.
Bag-of-words treats each word independently, so likelihoods multiply.
Prior × likelihoods → posterior.
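The prior-times-likelihoods recipe can be sketched in plain Python; the prior and per-word likelihoods below are invented counts for illustration:

```python
import math

# Hypothetical values estimated from a tiny training set (illustrative only).
p_spam = 0.3                         # prior P(spam)
p_ham = 1 - p_spam
# Per-word likelihoods P(word | class), assumed independent (bag-of-words).
like_spam = {"free": 0.6, "meeting": 0.05}
like_ham  = {"free": 0.1, "meeting": 0.4}

words = ["free", "free", "meeting"]

# Work in log space so many small likelihoods do not underflow.
log_spam = math.log(p_spam) + sum(math.log(like_spam[w]) for w in words)
log_ham  = math.log(p_ham)  + sum(math.log(like_ham[w])  for w in words)

# Normalize the two hypotheses to get the posterior P(spam | words).
posterior = 1 / (1 + math.exp(log_ham - log_spam))
print(f"P(spam | message) = {posterior:.3f}")
```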
Training chooses parameters that make the observed labels most probable. This is maximum likelihood estimation.
Low loss means the model assigns high probability to the true labels; overconfident mistakes are penalized most.
Product → log sum → negate so minimization equals maximum likelihood.
Use a Bernoulli model to estimate the probability of success. Step through the derivation, then tweak the data to see the curve move.
Follow the steps, then use the sliders to see how the estimate changes.
Adjust trials and successes. The curve peaks at the most likely \(p\).
The peak of the curve is the best probability estimate.
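A minimal sketch of the same idea: scan the Bernoulli log-likelihood over candidate values of \(p\) and confirm the peak sits at successes/trials:

```python
import numpy as np

trials, successes = 10, 7

# Log-likelihood of the observed data as a function of p
p = np.linspace(0.01, 0.99, 981)
log_lik = successes * np.log(p) + (trials - successes) * np.log(1 - p)

# The curve peaks at the closed-form MLE: p_hat = successes / trials
p_hat = p[np.argmax(log_lik)]
print(p_hat)  # ~0.7
```

Changing `trials` and `successes` moves the peak exactly the way the sliders do.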
Uniform dots are outcomes. A is the left slice, B is the top slice, and the overlap shows \(A \cap B\).
Counts update from simulated dots.
A Hidden Markov Model (HMM) describes a system with hidden states that evolve over time and generate observable data.
Common uses: speech recognition, NLP, bioinformatics, and time-series analysis.
Filters estimate hidden state as new, noisy observations arrive, updating beliefs step by step.
The Kalman filter provides an optimal recursive estimate under those conditions.
Used in radar tracking, navigation, and robotics for position/velocity estimates.
Hidden states (circles) emit observations (squares) over time.
Prediction (line) is corrected toward noisy measurements (dots).
Binary outcomes with parameter \(p\).
Use for clicks, coins, yes/no labels.
Counts successes in \(n\) trials.
Use for pass/fail counts in batches.
Multiple discrete outcomes with probabilities.
Use for class labels or choices.
Continuous bell curve with \(\mu\) and \(\sigma\).
Use for noise, heights, residuals.
Test softmax and cross-entropy with a tiny classifier.
Tip: Swap logits to see the loss change.
Gradient descent is how neural networks learn. It's like walking downhill to find the lowest point:
Run a few gradient steps and watch the loss drop.
Tip: Tune the learning rate to see convergence speed.
Activation functions add non-linearity to neural networks, enabling them to learn complex patterns:
Squashes values to (0, 1). Used for binary classification.
Most popular! Simple and effective. Returns x if positive, 0 otherwise.
Squashes values to (-1, 1). Zero-centered version of sigmoid.
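The three activations can be compared side by side in a minimal NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))     # squashes to (0, 1)

def relu(x):
    return np.maximum(0, x)         # x if positive, else 0

def tanh(x):
    return np.tanh(x)               # squashes to (-1, 1), zero-centered

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # approx [0.12, 0.5, 0.88]
print(relu(x))     # [0. 0. 2.]
print(tanh(x))     # approx [-0.96, 0.0, 0.96]
```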
Machine learning is the study of systems that learn patterns from data to make predictions or decisions without being explicitly programmed for each case.
Given data \(x\) and outcomes \(y\), machine learning finds a function \(f_\theta\) that maps inputs to outputs by optimizing parameters \(\theta\) to reduce error.
Learn from labeled examples to predict targets or classes.
Discover structure or clusters without labels.
Generate labels from data itself to learn useful representations.
Use the cards as a quick baseline picker before tuning deeper models.
Predict continuous targets with a weighted sum.
Best-fit line follows the trend.
Binary classification with a sigmoid link.
Sigmoid separates two classes.
Predict by voting among nearest neighbors.
Nearest neighbors vote on the label.
Strong baseline for small data.
Recursive if-then splits on features.
Splits carve the data space.
Interpretable and fast.
Bagged trees reduce variance.
Many trees vote together.
Robust, less tuning.
Max-margin boundary; kernels add nonlinearity.
Max-margin boundary balances both classes.
Great for medium-sized data.
Probabilistic classifier with conditional independence.
Likelihoods overlap at the decision point.
Fast for text and spam.
Cluster points around K centroids.
Centroids pull points into clusters.
Simple unsupervised grouping.
Project data onto top-variance directions.
Principal axis captures max variance.
Dimensionality reduction.
Short derivations that connect each algorithm to its core objective.
Select an algorithm to see the key steps and a matching visualization.
Training tunes the weights with data and gradients. Inference freezes the weights and just predicts.
Zoom in from a single neuron all the way up to deep networks, CNNs, RNNs, LSTMs, and transformers. Each module walks step-by-step and animates the flow.
A network is stacked math. Inputs become features, hidden layers reshape them, and the output layer turns them into predictions.
Inference stops here. Training continues with loss, backprop, and weight updates.
Different architectures specialize in different kinds of data: images, sequences, or long-range context.
Classic dense layers for structured data.
Stacks many layers to build feature hierarchies.
Filters slide across images to find patterns.
State flows through time for sequences.
Gated memory handles long-range signals.
Self-attention mixes all tokens at once.
Each neuron multiplies inputs by weights, adds a bias, then runs an activation function.
Simple digit recognition starts from pixels, transforms them into hidden features, and predicts a digit.
Backpropagation computes how each weight should change to reduce the loss.
More layers let the model build a hierarchy of features.
Examples: face recognition, speech-to-text, medical scans.
Convolutions scan a small filter across pixels to detect patterns like edges or corners.
Recurrent networks reuse the same weights at every time step, passing a hidden state forward.
LSTMs add gates that decide what to forget, what to write, and what to output.
Transformers process all tokens at once, then use attention to mix information.
Use the stepper to follow the full Transformer pipeline without leaving the visuals.
The Transformer is a neural network architecture that has fundamentally changed the approach to artificial intelligence. It was introduced in the 2017 paper "Attention is All You Need" and now powers models like OpenAI GPT, Meta Llama, and Google Gemini.
Transformers are not limited to text. They also drive audio generation, image recognition, protein structure prediction, and game playing, showing how broadly the architecture applies across domains.
Text-generative Transformers operate on next-token prediction: given a prompt, the model estimates the most probable next token. The core innovation is self-attention, which lets all tokens communicate and capture long-range dependencies.
Transformer Explainer is powered by GPT-2 (small) with 124 million parameters. While it is not the largest model, its components match the structure of more recent systems.
Every text-generative Transformer consists of three components:
Suppose the prompt is: "Data visualization empowers users to". Embedding converts this text into a numerical representation in four steps:
Figure 1. Expanding the embedding view: tokenization, token embedding, positional encoding, and final embedding.
GPT-2 (small) uses 768-dimensional embeddings and a vocabulary of 50,257 tokens. The embedding matrix has shape (50,257, 768) with about 39 million parameters.
The block combines multi-head self-attention and an MLP. GPT-2 (small) stacks 12 blocks, allowing token representations to evolve into higher-level meanings over depth.
Multi-head attention captures context across tokens. The MLP processes each token independently to refine its representation.
Each token is transformed into Query (Q), Key (K), and Value (V) vectors:
Query is like the search text, Key is the result title, and Value is the page content. This analogy helps explain why attention scores route information from relevant tokens.
Figure 2. Computing Q, K, and V from the embedding.
The scores are scaled, then masking sets future positions to negative infinity so each token predicts without looking ahead.
Figure 3. Masked self-attention: dot product, scale + mask, softmax + dropout.
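The pipeline in Figure 3 can be sketched for a single head in NumPy; dropout is omitted and the inputs are random placeholders rather than real GPT-2 activations:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (one head, no dropout)."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                # (T, T) similarity scores
    mask = np.triu(np.ones((T, T)), k=1)           # 1s above the diagonal = future
    scores = np.where(mask == 1, -np.inf, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # weighted mix of values

T, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = causal_self_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because of the mask, the first token can only attend to itself, so its output row equals its own value vector.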
The MLP expands each token from 768 to 3072 dimensions with a GELU activation, then compresses back to 768. This enriches the representation independently per token.
Figure 4. The MLP expands then compresses each token representation.
The final linear layer maps to 50,257 logits, one for each token in the vocabulary. Softmax turns logits into probabilities for the next token.
Figure 5. Each token receives a probability from the output logits.
Temperature controls sharpness: T=1 keeps logits unchanged, T<1 makes outputs more deterministic, and T>1 increases randomness.
Lower values make outputs more deterministic; higher values add creativity.
Restrict sampling to the top k highest-probability tokens.
Sample from the smallest set whose cumulative probability exceeds p.
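One way temperature, top-k, and top-p can be combined, sketched in NumPy; the function name and logits are illustrative, not GPT-2's actual implementation:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sketch of temperature + top-k + top-p (nucleus) sampling over logits."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature   # T<1 sharpens, T>1 flattens
    probs = np.exp(z - z.max()); probs /= probs.sum()   # softmax

    order = np.argsort(probs)[::-1]                     # tokens by descending prob
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False                     # keep only the k best tokens
    if top_p is not None:
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1        # smallest set with mass >= p
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0); probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]
token = sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9)
print(token)  # with these settings, only the top two tokens survive the filters
```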
Layer normalization stabilizes training and is applied twice per block. Dropout regularizes by randomly deactivating units during training and is disabled during inference. Residual connections add skip paths twice per block, helping gradients flow and preventing vanishing gradients.
The overview highlights where you are in the stack; the detail panel shows the math.
A hands-on path that ties prompting, model behavior, and agent building into one workflow.
Define the role, goal, and success criteria before you write prompts.
Goal: a clear, testable objective.
Use context blocks, constraints, and a target format.
Goal: predictable, repeatable outputs.
Adjust temperature, context, and stop rules to stabilize responses.
Goal: reduce variance and hallucinations.
Introduce tool calling for data access or automation.
Goal: verified outputs with traceability.
Write evals, measure regressions, and refine.
Goal: steady improvements with evidence.
ROLE: You are an AI PM.
TASK: Draft a 1-page feature brief.
CONTEXT: """ Include 5 user notes, metrics, and constraints. """
CONSTRAINTS: 250 words max. Bullet format.
OUTPUT FORMAT:
- Problem
- Users
- Constraints
- Proposal
- Risks
CHECKS: List missing inputs.
Use this depth indicator to move from core mechanics to training signals and reasoning behaviors. Each tier adds a new layer of capability.
Mechanics and intuition.
Tokens turn into Q/K/V, attention mixes them, FFN refines, residual + norm stabilize.
Transformers model sequences by letting every token attend to every other token in one step, so computation is parallel instead of strictly sequential.
Encoder-decoder stacks build a memory from the input, while decoder-only stacks generate the next token with a causal mask.
Queries and keys produce similarity scores, softmax turns scores into weights, and values are mixed by a weighted sum.
Scaling by \( \sqrt{d} \) keeps logits in a stable range so softmax does not saturate too early.
Each head attends in its own subspace, the heads are concatenated, and a projection mixes them back together.
Residual paths keep the original signal available, while layer norm stabilizes scale across the block.
Attention is permutation-invariant, so we add position signals to token embeddings before attention.
Sinusoidal encodings give each position a unique mix of frequencies; learned or relative schemes encode distance directly.
Training objectives and variants.
Decoder-only uses causal masks for next-token prediction; encoder-only sees both directions with masked tokens.
A causal mask blocks attention to future tokens, so each position can only use past context.
Training uses teacher forcing: the model sees true previous tokens while learning to predict the next one.
Masked language modeling hides tokens and uses both left and right context to reconstruct them.
The result is a strong bidirectional representation that transfers well to classification and retrieval tasks.
Logits become probabilities via softmax, then decoding picks the next token.
Greedy decoding is stable, while temperature and top-p sampling trade determinism for diversity.
Prompting and tool-use strategies.
CoT adds intermediate steps, self-consistency samples many paths, ReAct loops the model with tools.
CoT encourages explicit intermediate steps, which can improve multi-step reasoning tasks.
Use it when the reasoning path matters, but keep in mind verbosity does not guarantee correctness.
Self-consistency samples multiple reasoning paths and aggregates the final answers.
Majority voting reduces variance and often improves accuracy on reasoning benchmarks.
ReAct alternates reasoning steps with tool calls, grounding answers in external data.
The loop keeps the model honest by injecting retrieved facts before the final response.
Training signals and test-time search.
Step-level feedback and search trees both add compute that shapes reasoning at test time.
Process supervision scores intermediate steps, not just the final answer.
Step-level signals make credit assignment clearer and improve reasoning stability.
Tree-of-Thoughts explores multiple branches, scores them, and expands the best candidates.
Search with backtracking can outperform a single forward pass on hard problems.
Modern reasoning blends prompting patterns, test-time search, and step-aware supervision.
The behavior you see is shaped by where extra compute or feedback is injected.
Design prompts that reduce ambiguity, keep responses grounded, and make iteration fast.
Use clear separators so the model knows what to quote, summarize, or transform.
Add one or two examples that match the output style you want.
Ask for a brief verification step or uncertainties before the final answer.
ROLE: You are a product analyst.
TASK: Summarize customer feedback into 3 themes.
CONTEXT: """ Paste notes here. """
CONSTRAINTS: Max 8 bullets. No speculation.
OUTPUT FORMAT:
- Theme:
- Evidence:
CHECKS: Flag missing data.
Technique: Role + constraints
ROLE: You are a support analyst.
TASK: Summarize the ticket in 3 bullets.
CONSTRAINTS: Neutral tone. No jargon.

Technique: Few-shot style match
TASK: Turn notes into a decision line.
EXAMPLE:
Input: "Latency 120ms, budget ok"
Output: "Decision: proceed with rollout"

Technique: Delimiters + extraction
Extract action items from <notes>...</notes>. Return JSON with owner, task, and due.
Know the mechanics behind next-token prediction so your prompts behave consistently.
Controls randomness; lower is more deterministic.
Limits choices to a probability mass.
Budget for output length and cost.
End outputs at safe boundaries.
See the full playbook at promptingguide.ai.
See how embeddings get indexed, how similarity search works, and how results return to the model.
Chunk documents, embed each chunk, and store vectors with metadata for filtering.
Tags, permissions, and timestamps narrow search.
Balances recall with context length.
Hierarchical graph layers speed up nearest-neighbor search with high recall.
Controls graph connectivity and recall.
Higher values improve recall at a latency cost.
Inverted file indexing narrows search to the closest centroids, then product quantization compresses vectors for fast distance estimates.
Number of coarse clusters to search.
Subvector count and code size per block.
Blend lexical and vector results, then rerank and pack the best chunks into context.
Cross-encoders boost precision on top results.
Keep track of sources for trust and audits.
Switch tabs or cycle the diagram.
A curated reading list with the main themes and ideas captured for quick scanning.
Annotated walkthrough of the Transformer model that uses self-attention for sequence tasks.
Blog essay connecting complexity, computation, and principles relevant to AI foundations.
Karpathy's blog showing why RNNs are powerful for sequence tasks with intuitive examples.
Christopher Olah's visual and intuitive explanation of LSTM mechanisms and gates.
Shows how regularization like dropout improves recurrent architectures.
Explores regularization and model simplicity through information theory principles.
Neural architecture for solving combinatorial problems by learning pointer outputs.
AlexNet paper that sparked the modern deep learning vision revolution.
Explores how ordering impacts seq2seq model performance.
Describes pipeline parallelism for scaling large neural networks.
ResNet paper that introduced residual connections to enable very deep models.
Introduces dilated convolutions for large receptive fields without pooling.
Graph neural network model for learning on structured data.
Introduced the Transformer with self-attention, the foundation of modern NLP and LLMs.
Shows how attention improves neural translation by aligning outputs with inputs.
Improves ResNet training using identity skip connections.
Introduces a module for relational reasoning tasks.
Combines autoencoder with variational losses for generative modeling.
Combines relational reasoning with RNN mechanisms.
Examines complexity measures in closed computational systems.
Early memory-augmented neural network with external controller.
End-to-end speech model demonstrating RNN and CNN integration.
Shows how model and data scaling improve language model performance.
Introductory explanation of MDL principle connecting compression and learning.
Link unavailable.
Discusses theoretical aspects of AGI and intelligence measures.
Link unavailable.
Foundations of algorithmic complexity and information theory.
Link unavailable.
Stanford course notes covering fundamentals of CNNs and vision models.
Agents combine planning, tools, memory, and evals to finish real tasks.
Breaks the goal into steps and picks the next action.
Chooses APIs, search, or code execution based on the task.
Stores notes, intermediate results, and long-term facts.
Checks outputs, runs evals, and flags regressions.
LangChain, LlamaIndex, and OpenAI or Anthropic SDKs.
Capture tool calls, prompts, and outputs for audits.
Use unit tests, golden sets, and regression suites.
Guardrails for tool use, data access, and output policy.
Agents learn by interacting with an environment, collecting rewards, and improving a policy.
Reinforcement learning (RL) is a learning framework where an agent chooses actions in a state, receives rewards, and updates its policy to maximize long-term return.
reset(seed?) -> observation/state
step(action) -> { nextState, reward, done, info }
actions(state) -> action list
// render helpers stay separate from logic
reset(seed?) starts a new episode and returns the initial state (optionally deterministic with a seed).
step(action) applies an action and returns the transition tuple: next state, reward, terminal flag, and any extra info.
actions(state) exposes valid actions so the agent can plan or explore safely.
Render helpers stay separate so learning logic is deterministic and testable.
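The interface above might look like this as a tiny Python class; the corridor environment itself is an invented example:

```python
import random

class CorridorEnv:
    """Minimal environment matching the reset/step/actions interface:
    a 1-D corridor where the agent starts at 0 and earns +1 at position 4."""
    GOAL = 4

    def reset(self, seed=None):
        if seed is not None:
            random.seed(seed)       # optional determinism
        self.state = 0
        return self.state

    def actions(self, state):
        return [-1, +1]             # move left or right

    def step(self, action):
        self.state = max(0, self.state + action)
        done = self.state == self.GOAL
        reward = 1.0 if done else 0.0
        return {"nextState": self.state, "reward": reward, "done": done, "info": {}}

env = CorridorEnv()
s = env.reset(seed=0)
total, done = 0.0, False
while not done:                     # a fixed "always go right" policy
    out = env.step(+1)
    total += out["reward"]
    done = out["done"]
print(total)  # 1.0
```

Keeping rendering out of the class means every episode is reproducible and unit-testable.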
RL formalizes learning by experience: act, observe, update, repeat.
At each step the agent is in a state \(s\), takes an action \(a\), receives a reward \(r\), and lands in a new state \(s'\).
This sequence is a trajectory (episode).
Rewards are immediate feedback. Returns add up future rewards with discounting.
A policy tells the agent how to act:
Value functions measure how good states or actions are under a policy.
V evaluates a state; Q evaluates a decision.
Value is defined recursively: immediate reward plus discounted value of the next state.
If transitions and rewards are known, compute values directly.
Policy iteration alternates evaluation and greedy improvement.
Learn from complete episodes when the model is unknown.
Update values online using one-step bootstrapping.
\(\delta\) is the TD error: how wrong the prediction was.
Learn the best actions, not just values.
SARSA uses the next action taken; Q-learning uses the best possible next action.
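The two update rules, side by side, as minimal Python functions; the dict-based Q-table and default hyperparameters are illustrative:

```python
# One-step updates (alpha = learning rate, gamma = discount factor).
# Q is a dict keyed by (state, action); a_next is the action actually taken
# (SARSA), while Q-learning maximizes over all next actions.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q.get((s_next, a_next), 0.0)    # uses the action taken
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best = max(Q.get((s_next, b), 0.0) for b in actions)  # best possible action
    target = r + gamma * best
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

Q = {}
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])  # 0.1
```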
Tables do not scale to high-dimensional states or continuous actions.
Immediate feedback signal.
Discounted sum of future rewards.
Behavior rule \(\pi(a \mid s)\).
How good a state or action is.
Recursive value definitions.
Learn from full episodes.
Learn step-by-step with bootstrapping.
Learn optimal behavior off-policy.
Scale RL with neural networks.
Explore exploration strategies and track regret as the agent learns.
What is a multi-armed bandit? An agent repeatedly chooses among \(K\) actions (arms) with unknown reward distributions. The goal is to maximize total reward by balancing exploration (learn the arms) and exploitation (use the best-known arm).
Chart 1. True arm means (light) vs estimated means (dark).
Chart 2. Cumulative regret over time.
Chart 3. Action selection frequency by arm.
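An \(\varepsilon\)-greedy loop that produces the quantities in the three charts can be sketched as follows; the arm means and seed are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
true_means = np.array([0.2, 0.5, 0.8])   # unknown to the agent
K, steps, eps = len(true_means), 2000, 0.1

counts = np.zeros(K)
estimates = np.zeros(K)                  # running mean reward per arm
regret = 0.0

for t in range(steps):
    if rng.random() < eps:
        arm = int(rng.integers(K))       # explore: a random arm
    else:
        arm = int(np.argmax(estimates))  # exploit: the best-known arm
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    regret += true_means.max() - true_means[arm]               # expected regret

print(int(np.argmax(counts)), round(regret, 1))  # most-pulled arm, total regret
```

The gap between `true_means` and `estimates` is Chart 1, the running `regret` is Chart 2, and `counts` is Chart 3.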
Solve a Markov Decision Process with Bellman updates and visualize value and policy.
Deep dive: Markov Decision Process. An MDP assumes the future depends only on the current state and action (the Markov property). Transitions \(P(s'|s,a)\) and rewards \(R(s,a,s')\) define how the agent moves. Dynamic programming solves the MDP by repeatedly applying Bellman backups until values and policies converge.
Click a cell to see Q(s,·).
Grid. Values \(V(s)\), heatmap, and policy arrows per state.
Chart. Max \(\Delta V\) and average \(V\) per iteration.
Estimate values from complete episode returns.
Deep dive: Monte Carlo methods. MC methods wait until an episode ends, then use the realized return to update value estimates. They are unbiased but can have high variance, so averaging many episodes stabilizes learning.
Grid. Episode rollout animation and evolving \(V(s)\) heatmap.
Chart. Returns histogram (recent episodes).
Chart. Value estimate of the start state over episodes.
Blend bootstrapping with sampling for faster learning.
Chart. Current value estimates for the random-walk states.
Chart. TD error \(\delta_t\) over a single episode.
Chart. Value estimates per state across episodes.
Learn optimal policies directly from experience.
Grid. Greedy policy arrows from the current Q-table.
Chart. Q-value heatmaps per action (Up/Right/Down/Left).
Chart. Exploration schedule \(\varepsilon\) over time.
Chart. Episode return over training.
When state spaces grow, deep networks approximate value functions or policies.
Increase the probability of actions that led to higher return and decrease others.
Work through the foundations topics in order. Each cell runs in a shared kernel for a practical hands-on workshop.
Cover the Foundations math with a practical hands-on workshop: linear algebra, calculus, optimization, probability, activations, matrix calculus, plus PyTorch fundamentals and tutorials.
Lists, loops, and functions to prep for vector math.
Dot products, norms, matrix-vector products, and matrix multiplication.
Estimate derivatives and follow gradients downhill.
Turn scores into probabilities and compare activation curves.
Compute a vectorized gradient for linear regression.
Tensors, autograd, modules, and optimizers. These cells print install guidance if PyTorch is unavailable.
This web lab runs in the browser and cannot install PyTorch. To run the PyTorch cells, open the notebook in local Jupyter or Colab and run the install cell below. For GPU builds, use the command from the PyTorch get-started page.
Run a small ONNX model directly in the browser using JavaScript.
Loads a tiny MNIST classifier and runs inference on simple 28x28 input patterns. No Python kernel required.
If the model URL fails, use another ONNX model that accepts a 1x1x28x28 float tensor.
Output will appear here.
Idle