🧠 Artificial Intelligence Math Foundations

Playful visuals and mini missions to explore the math behind neural networks


What is AI?

AI builds systems that can sense, decide, and improve with data.

Artificial Intelligence (AI) is the field of creating systems that perceive their world, reason or learn from data, and choose actions to reach goals.

Intuition: a curious student

AI studies examples, forms internal rules, then gets feedback to refine them.

Intuition: a navigator

AI observes a map, plans a route, acts, then updates the plan after each step.

AI categories in practice

Different AI families focus on different parts of the loop, but they often work together.

Symbolic & search

Rules, logic, planning, and graph search to make decisions.

Machine learning

Models that learn patterns from labeled or unlabeled data.

Deep learning

Neural networks for vision, speech, language, and high-dimensional signals.

Reinforcement learning

Agents learn policies through interaction and rewards.

Generative AI

Models that create text, images, or audio from learned distributions.

Embodied & robotics

Perception plus control to act in the physical world.

Most real systems blend categories, such as robots that plan with search and see with deep learning.

Foundations: Core Math Roadmap

A compact roadmap of the math that powers modern AI. Switch tabs to see each idea in action.

Step 1

Linear Algebra: Space Benders

Learn how vectors and matrices stretch, rotate, and move space.

Goal: read \( \mathbf{h} = W\mathbf{x} + \mathbf{b} \).
Step 2

Calculus: Change Detective

Understand slopes, tiny nudges, and the chain rule.

Goal: follow gradients through a network.
Step 3

Optimization: Mountain Hikes

Use gradients to walk downhill and find the best answers.

Goal: tune the learning rate \( \eta \).
Step 4

Probability: Confidence Radar

Turn scores into probabilities and measure mistakes.

Goal: use softmax and cross-entropy.
Step 5

Matrix Calculus: Fast Backprop

Compute gradients for whole batches at once.

Goal: track shapes and vectorized rules.

Neural networks are a stack of math ideas. Each tab below zooms in on one layer of that stack.

Linear algebra: what the network is

Think of a matrix as a magic machine that bends space. Each layer takes a vector and transforms it.

Matrices matter in deep learning because every layer is one big matrix multiply that bundles thousands of weights, letting us compute all of a layer's neuron activations, and whole batches of examples, in a single fast step.

\[ \mathbf{h} = W\mathbf{x} + \mathbf{b} \] \[ \mathbf{a} = \mathrm{ReLU}(\mathbf{h}) \]

Imagine a grid of points being stretched and tilted. That is what \(W\) does.

  • Vectors are arrows (direction + length).
  • Matrices are transformations (stretch, rotate, shear).
  • Layers stack these transformations with a nonlinearity.
Mini mission: pick a vector and predict where the grid sends it.
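The layer equations above can be sketched in a few lines of plain Python; the weights, bias, and input are made-up values for illustration:

```python
# Minimal sketch of one layer: h = W x + b, then ReLU.
# W, x, b are small illustrative values, not from a trained model.

def layer(W, x, b):
    # Each output h_i is the dot product of row i of W with x, plus b_i.
    h = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    # ReLU keeps positive values and zeroes out the rest.
    return [max(0.0, h_i) for h_i in h]

W = [[2.0, 0.0],   # stretches the first coordinate
     [0.0, -1.0]]  # flips the second
x = [1.0, 3.0]
b = [0.5, 0.5]

print(layer(W, x, b))  # first unit: 2*1 + 0.5 = 2.5; second: -3 + 0.5 = -2.5 -> ReLU -> 0.0
```

Try swapping the signs in \(W\) to see how the transformation bends different inputs.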

Calculus: how learning works

Derivatives tell us how fast things change. If the loss is a hill, the derivative tells the slope.

Let \(y = g(x)\) and \(L = f(y)\). A tiny nudge \(dx\) changes \(y\) by \(dy = g'(x)dx\), which changes the loss by \(dL = f'(y)dy\).

\[ y = g(x), \quad L = f(y) \] \[ \frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx} = f'(g(x)) \cdot g'(x) \]

Backprop is the chain rule applied again and again through the whole network.

  • Local slope \(\times\) upstream slope = new slope.
  • Each layer multiplies by its derivative and passes the signal back.
Mini mission: point to where the slope is steepest.
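A quick way to trust the chain rule is to compare it with an actual tiny nudge. Here \(g(x) = x^2\) and \(L = f(y) = \sin y\) are hypothetical functions chosen for illustration:

```python
import math

# Hypothetical choice for illustration: y = g(x) = x**2, L = f(y) = sin(y).
def g(x): return x * x
def f(y): return math.sin(y)

def dL_dx(x):
    # Chain rule: f'(g(x)) * g'(x)
    return math.cos(g(x)) * (2 * x)

# Check against a finite difference (a literal tiny nudge dx).
x, dx = 1.3, 1e-6
numeric = (f(g(x + dx)) - f(g(x))) / dx
print(dL_dx(x), numeric)  # the two slopes should agree closely
```

The analytic slope and the nudge-based slope match to several decimal places, which is exactly what backprop relies on.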

Optimization: gradient descent and learning rate

Once we know the slope, we take a step downhill to reduce the loss.

For a linear layer \( \mathbf{z} = W\mathbf{x} + \mathbf{b} \) with softmax + cross-entropy, the error at the logits is \( \hat{\mathbf{y}} - \mathbf{y} \).

\[ W \leftarrow W - \eta (\hat{\mathbf{y}} - \mathbf{y}) \mathbf{x}^T, \quad \mathbf{b} \leftarrow \mathbf{b} - \eta (\hat{\mathbf{y}} - \mathbf{y}) \]
  • \(\eta\): learning rate (step size)
  • \(\hat{\mathbf{y}} - \mathbf{y}\): probability error signal from the loss
  • \(\mathbf{x}\): input features that scale the step
  • Too big → unstable or diverges
  • Too small → slow progress
Mini mission: choose a step size that reaches the valley without bouncing.
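The update rule above can be sketched in plain Python. The weights, input, and learning rate are made-up values; one step should already lower the loss:

```python
import math

# One gradient-descent step on a softmax layer; all numbers are illustrative.
def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def forward(W, b, x):
    z = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(b))]
    return softmax(z)

def loss(y, y_hat):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

x = [1.0, 2.0]
y = [1.0, 0.0]                      # one-hot true label
W = [[0.1, 0.2], [0.3, 0.4]]
b = [0.0, 0.0]
eta = 0.5                           # learning rate

before = loss(y, forward(W, b, x))
err = [p - t for p, t in zip(forward(W, b, x), y)]   # y_hat - y at the logits

# W <- W - eta * err x^T ; b <- b - eta * err
W = [[W[i][j] - eta * err[i] * x[j] for j in range(2)] for i in range(2)]
b = [b[i] - eta * err[i] for i in range(2)]

after = loss(y, forward(W, b, x))
print(before, after)   # the loss drops after the step
```

Raise `eta` far above 1 and repeat the step a few times to watch the "too big → unstable" failure mode.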

Probability + information theory

For classification, the network turns scores into probabilities and compares them to the truth.

\[ \mathbf{z} = W\mathbf{x} + \mathbf{b} \] \[ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}), \quad \hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}} \] \[ L = -\sum_i y_i \log \hat{y}_i \]
  • Scores \(\mathbf{z}\): raw logits before normalization.
  • \(\hat{\mathbf{y}}\): softmax outputs probabilities that sum to 1.
  • \(\mathbf{y}\): the true label as a one-hot distribution.
  • Loss \(L\): cross-entropy penalizes low probability on the true class; minimizing \(L\) pushes \(\hat{\mathbf{y}}\) toward \(\mathbf{y}\).

Loss derivation: with \( \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}) \) and one-hot \( \mathbf{y} \), the log-softmax derivative gives a simple gradient.

\[ \frac{\partial L}{\partial z_j} = \sum_i y_i (\hat{y}_j - \delta_{ij}) = \hat{y}_j - y_j \]

Result: \( \nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y} \), the signal that drives backprop.

Mini mission: spot the biggest probability bar.
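A minimal softmax plus cross-entropy sketch in plain Python, with arbitrarily chosen logits:

```python
import math

# Softmax turns scores into probabilities; cross-entropy scores them against the truth.
def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def cross_entropy(y, y_hat):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

z = [2.0, 1.0, 0.1]                 # raw logits (illustrative)
y = [1.0, 0.0, 0.0]                 # one-hot truth
p = softmax(z)
print(p, cross_entropy(y, p))
# The gradient at the logits is simply p - y:
print([pi - yi for pi, yi in zip(p, y)])
```

Note how the gradient is negative only at the true class, pushing its logit up and the others down.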

Matrix calculus: the speed path

To train fast, we compute gradients for a whole batch at once. Think of matrices as a neat way to stack many examples and do one big calculation.

  • \(X\): a batch of inputs (rows are examples).
  • \(W\): weights (columns are feature-to-output recipes).
  • \(Y\): outputs for the whole batch.
\[ Y = XW \] \[ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T \] \[ \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y} \]

Intuition: each gradient formula is "reverse flow" of the forward pass. We transpose the weight matrix to send signals back to the inputs, and we combine all examples with \(X^T\) to update the weights in one shot.

  1. Forward pass: multiply the batch \(X\) by weights \(W\) to get outputs \(Y\).
  2. Backprop to inputs: push gradients through \(W^T\) to get \( \partial L / \partial X \).
  3. Backprop to weights: combine \(X^T\) with \( \partial L / \partial Y \) to get \( \partial L / \partial W \).

Shapes are your superpower: if \(X\) is \(n \times d\) and \(W\) is \(d \times m\), then \(Y\) is \(n \times m\). The gradients match those same shapes.

Mini mission: label the shapes of \(X\), \(W\), and \(Y\).
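A shape-checking sketch of the three formulas, using plain-Python matrix helpers and a made-up upstream gradient:

```python
# Shapes: X is n×d, W is d×m, so Y = XW is n×m.
# dL/dX = (dL/dY) W^T gives n×d; dL/dW = X^T (dL/dY) gives d×m.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # n=2 examples, d=3 features
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]               # d=3, m=2 outputs
Y = matmul(X, W)               # 2×2

dL_dY = [[1.0, 0.0],
         [0.0, 1.0]]           # pretend upstream gradient (illustrative)
dL_dX = matmul(dL_dY, transpose(W))   # 2×3, same shape as X
dL_dW = matmul(transpose(X), dL_dY)   # 3×2, same shape as W
print(len(dL_dX), len(dL_dX[0]), len(dL_dW), len(dL_dW[0]))
```

The printed dimensions confirm the rule: every gradient has the same shape as the thing it differentiates.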

Tap a tab to switch the picture.

Python practice

Foundations playground

Open the slide-out panel to practice the math and see live output.

Tip: Use print() to surface intermediate steps.

Vectors: direction, magnitude, and meaning

Explore vectors as data points, similarity scores, and distances. Switch to Deep Dive for norms, projections, and basis intuition.

Vector playground

Think of vector A as a data point \(x\) and vector B as model weights \(w\). Use the buttons to see how scores and distances behave.

  • Data point: the arrow from the origin to \(x\).
  • Modulus (length): \( \lVert x \rVert = \sqrt{x_1^2 + x_2^2} \).
  • Dot product: the shadow of \(x\) on \(w\).
  • Cosine: angle-only similarity.
  • Distance: how far two points are apart.
\[ x = [x_1, x_2], \quad w = [w_1, w_2] \]
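The same quantities the playground displays can be computed directly; the two vectors here are made-up values:

```python
import math

# Vector basics on 2-D lists; x plays the data point, w plays the weights.
x = [3.0, 4.0]
w = [4.0, 3.0]

add = [xi + wi for xi, wi in zip(x, w)]          # sum
dot = sum(xi * wi for xi, wi in zip(x, w))       # dot product
modulus = math.hypot(*x)                         # length of x
cosine = dot / (math.hypot(*x) * math.hypot(*w)) # angle-only similarity
distance = math.dist(x, w)                       # straight-line gap
print(add, dot, modulus, round(cosine, 3), distance)
```

With these values both vectors have length 5, so the cosine of 0.96 says they point in nearly the same direction even though they are distinct points.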

Deep dive: advanced geometry

Step through projections, norms, subspaces, gradients, vector stats, regularization, and attention with visual intuition.

\[ \lVert x \rVert = \sqrt{x_1^2 + x_2^2} \; (\text{modulus}), \quad x \cdot w = \sum_i x_i w_i, \quad x = \sum_i \alpha_i b_i \]

Scene 1: Projections and orthogonality


Deep dive concepts

Advanced vector ideas that show up in PCA, optimization, and attention.

1. Projections & orthogonality

  • Projection onto \(u\): \( \mathrm{proj}_u(x) = \frac{x \cdot u}{u \cdot u} u \).
  • Orthogonal vectors give clean coordinate systems (PCA/SVD).
2. Norms & normalization

  • Modulus (length): \( \lVert x \rVert_2 = \sqrt{\sum_i x_i^2} \).
  • Normalize to compare directions independent of scale.
3. Subspaces, basis, rank

  • Span is the set of all linear combinations of vectors.
  • Rank is the dimension of that span (effective dimensionality).
4. Gradients, Jacobians, Hessians

  • \(\nabla_x L\) is the gradient of a scalar loss with respect to a vector.
  • Jacobian \(J\) stacks partials: \(J_{ij} = \partial f_i / \partial x_j\).
  • If \(f: \mathbb{R}^n \to \mathbb{R}^m\), then \(J\) is \(m \times n\) and \(f(x+\Delta x) \approx f(x) + J\Delta x\).
  • Hessian \(H = \nabla^2 L\) captures curvature; eigenvalues reveal minima vs saddles.
  • Second-order step: \(\Delta x = -H^{-1} \nabla L\) (Newton-style).
5. Vector statistics

  • Mean vector and covariance describe feature distributions.
  • Centering and whitening stabilize training and PCA.
6. Regularization geometry

  • L2 prefers small norms; L1 encourages sparsity.
  • Constraint shapes explain why L1 yields zeros.
7. Attention via dot products

  • Similarity scores become weights after softmax.
  • Output is a weighted sum of value vectors.
Python playground

Vector practice

Compute dot products, norms, and cosine similarity with live code.

Tip: Update x and w to match the visual slider values.

Matrices: transforms, determinants, inverses

Matrices reshape space. Determinants measure area scaling, inverses undo transforms, and covariance/precision describe spread along eigenvector axes.

Matrix playground

Focus on the three core moves: add, multiply, and reshape space with scaling + shear.

  • Addition mixes layers in the same shape.
  • Multiplication blends rows and columns.
  • Scaling stretches; shear slants the grid.
\[ C = AB \]

Multiplication mixes rows and columns to transform vectors.

Matrix deep dive: advanced operators

Explore determinants, inverses, covariance/precision, identity, and eigenvectors with step-by-step formulas.

\[ \det(A) = ad - bc \]

Determinant is the area scale factor; sign indicates a flip.

  • Small determinants signal near-singular matrices.
  • Precision matrices reweight directions in space.
  • Eigenvectors preserve direction under \(A\).

Derivations: covariance, precision, eigen

Determinant (2x2)

\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad \det(A)=ad-bc \]
  1. Expand by the first row: \(a\cdot d - b\cdot c\).
  2. \(|\det(A)|\) is area scale; sign means flip.
  3. \(\det(A)=0\) implies no inverse exists.

Inverse (2x2)

\[ A^{-1}=\frac{1}{\det(A)}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]
  1. Require \(\det(A)\neq 0\).
  2. Swap the diagonal, negate off-diagonals.
  3. Divide by \(\det(A)\) so \(AA^{-1}=I\).

Covariance matrix

\[ \mu=\frac{1}{n}\sum_{i=1}^n x_i,\quad X_c = X-\mathbf{1}\mu^T \] \[ \Sigma=\frac{1}{n-1}X_c^T X_c \]
  1. Center each feature: subtract the mean \(\mu\).
  2. Accumulate outer products: \(x_c x_c^T\).
  3. Average: \(\Sigma_{xy}=\frac{1}{n-1}\sum (x_i-\mu_x)(y_i-\mu_y)\).

Precision matrix

\[ \Lambda = \Sigma^{-1} = \frac{1}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2} \begin{bmatrix} \sigma_y^2 & -\sigma_{xy} \\ -\sigma_{xy} & \sigma_x^2 \end{bmatrix} \]
  1. Compute \(\det(\Sigma)\).
  2. Apply the 2x2 inverse formula.
  3. In a Gaussian, \((x-\mu)^T\Lambda(x-\mu)\) is the squared Mahalanobis distance.

Eigenvalues & eigenvectors

\[ A\mathbf{v}=\lambda \mathbf{v},\quad \det(A-\lambda I)=0 \] \[ \lambda=\frac{\operatorname{tr}A \pm \sqrt{(\operatorname{tr}A)^2-4\det(A)}}{2} \]
  1. Solve the characteristic equation for \(\lambda\).
  2. Plug into \((A-\lambda I)\mathbf{v}=0\) to get \(\mathbf{v}\).
  3. For symmetric \(A\), eigenvectors are orthonormal axes.
Mini mission: spot the operation that flips the grid.
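The 2×2 formulas above translate directly into code. The matrix here is an arbitrary symmetric example, so its eigenvalues are real:

```python
import math

# 2×2 helpers matching the formulas above: det, inverse, eigenvalues via trace/det.
def det(A):
    (a, b), (c, d) = A
    return a * d - b * c

def inverse(A):
    (a, b), (c, d) = A
    s = det(A)
    assert s != 0, "singular matrix has no inverse"
    # Swap the diagonal, negate off-diagonals, divide by det.
    return [[d / s, -b / s], [-c / s, a / s]]

def eigenvalues(A):
    tr, d = A[0][0] + A[1][1], det(A)
    disc = math.sqrt(tr * tr - 4 * d)   # assumes real eigenvalues (e.g. symmetric A)
    return ((tr + disc) / 2, (tr - disc) / 2)

A = [[2.0, 1.0], [1.0, 2.0]]   # symmetric illustrative example
print(det(A))                  # the grid's area is scaled by 3
print(inverse(A))
print(eigenvalues(A))          # (3.0, 1.0)
```

A determinant of 3 with no sign flip means the grid is stretched but not mirrored; the eigenvalues 3 and 1 are the stretch factors along the eigenvector axes.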
Python playground

Matrix practice

Multiply matrices and inspect batch gradients in code.

Tip: Change X and W to see how shapes affect gradients.

Probability: uncertainty and confidence

Go from events to distributions, and see how probabilities become confidence bars.

Events, distributions, Bayes

  • Probabilities sum to 1 across a full set of outcomes.
  • Conditional probability updates beliefs with new evidence.
  • Expectations summarize the average outcome.
\[ P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \] \[ \mathbb{E}[X] = \sum_x x \, P(X=x) \]

Graph: bars show probability mass per outcome, the curve shows probability density, and the Bayes view compares prior, likelihood, and posterior.

Probability review: events & notation

  • \(P(A)\) is between 0 and 1.
  • Sample space \(S\) contains all outcomes; \(P(S)=1\).
  • Complement: \(P(\overline{A}) = 1 - P(A)\).
  • Union: \(P(A \cup B)\), Intersection: \(P(A \cap B)\).
  • Conditional: \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\).
  • Random variable \(X\) maps outcomes to numbers, with \(P(X=x)\).
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \] \[ P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \]

Venn diagram highlights the region tied to the formula.

Bayesian spam filter intuition

Estimate the prior \(P(\text{spam})\) from training data, then combine it with word likelihoods.

\[ P(\text{spam} \mid \text{message}) \propto P(\text{spam}) \prod_i P(\text{word}_i \mid \text{spam}) \]

Bag-of-words treats each word independently, so likelihoods multiply.

\[ P(\text{message}) = P(\text{message} \mid \text{spam})P(\text{spam}) + P(\text{message} \mid \text{ham})P(\text{ham}) \]

Prior × likelihoods → posterior.
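The spam computation can be sketched with made-up prior and word likelihoods; the denominator is exactly the total-probability formula above:

```python
# Toy numbers (made up): combine a prior with per-word likelihoods, then normalize.
prior_spam, prior_ham = 0.3, 0.7
p_words_spam = 0.8 * 0.6        # product of P(word_i | spam) for two words
p_words_ham = 0.1 * 0.2         # product of P(word_i | ham)

joint_spam = prior_spam * p_words_spam
joint_ham = prior_ham * p_words_ham
# P(message) = joint_spam + joint_ham, by total probability.
posterior_spam = joint_spam / (joint_spam + joint_ham)
print(round(posterior_spam, 3))
```

Even with a modest 0.3 prior, two strongly spam-flavored words push the posterior above 0.9, which is the "likelihoods dominate the prior" effect in action.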

Why negative log likelihood becomes the loss

Training chooses parameters that make the observed labels most probable. This is maximum likelihood estimation.

  1. Model the probability of the correct label. For each example \( (x_i, y_i) \), the model returns \( p_\theta(y_i \mid x_i) \).
  2. Likelihood of the dataset. Assuming samples are independent, \( L(\theta) = \prod_{i=1}^N p_\theta(y_i \mid x_i) \).
  3. Log-likelihood simplifies the product. The log is monotonic, so maximizing \(L\) matches maximizing \( \log L = \sum_i \log p_\theta(y_i \mid x_i) \), and it avoids numerical underflow.
  4. Negative log-likelihood is the minimization objective. We minimize \( \mathcal{L}(\theta) = -\log L(\theta) \). For one-hot classification, this is cross-entropy.
\[ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^N \log p_\theta(y_i \mid x_i) = -\frac{1}{N}\sum_{i=1}^N \sum_k y_{ik} \log \hat{y}_{ik} \]

Step 1: Multiply

\[ L = 0.9 \times 0.6 \times 0.2 = 0.108 \]

Step 2: Log-sum

\[ \begin{aligned} \log L &= \log 0.9 + \log 0.6 + \log 0.2 \\ &= -0.105 - 0.511 - 1.609 \\ &= -2.225 \end{aligned} \]

Step 3: Negate

\[ \mathcal{L} = -\log L = 2.225 \]

Low loss means the model assigns high probability to the true labels; overconfident mistakes are penalized most.

1. Likelihood (product)

0.9 × 0.6 × 0.2 = 0.108
Multiply probabilities across samples.
\[ L(\theta) = \prod_i p_\theta(y_i \mid x_i) \]
2. Log-likelihood (sum)

\(\log 0.9\) + \(\log 0.6\) + \(\log 0.2\) = \(-2.225\)
Log turns products into sums and keeps numbers stable.
\[ \log L(\theta) = \sum_i \log p_\theta(y_i \mid x_i) \]
3. Negative log-likelihood (loss)

\(-\log L\) = 2.225
Negate so minimization matches maximum likelihood.
\[ \mathcal{L}(\theta) = -\sum_i \log p_\theta(y_i \mid x_i) \]

Product → log sum → negate so minimization equals maximum likelihood.

Interactive

Log-likelihood playground: estimate \(p\)

Use a Bernoulli model to estimate the probability of success. Step through the derivation, then tweak the data to see the curve move.


Follow the steps, then use the sliders to see how the estimate changes.

Playground

Likelihood explorer

Adjust trials and successes. The curve peaks at the most likely \(p\).


The peak of the curve is the best probability estimate.
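A grid-search sketch of the Bernoulli log-likelihood; the trial counts are illustrative, and the maximizer should land at \( \hat{p} = k/N \):

```python
import math

# Bernoulli log-likelihood for k successes in N trials; the MLE is k/N.
def log_likelihood(p, N, k):
    return k * math.log(p) + (N - k) * math.log(1 - p)

N, k = 10, 7                     # illustrative data
candidates = [i / 100 for i in range(1, 100)]   # grid over (0, 1)
best = max(candidates, key=lambda p: log_likelihood(p, N, k))
print(best)  # peaks at k/N = 0.7
```

Because the log-likelihood is concave in \(p\), the grid maximum sits exactly on the analytic answer \(k/N\) whenever that value is on the grid.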

Sample space simulator

Uniform dots are outcomes. A is the left slice, B is the top slice, and the overlap shows \(A \cap B\).

\[ P(A)=\frac{|A|}{|S|},\quad P(A \mid B)=\frac{P(A \cap B)}{P(B)} \]

Counts update from simulated dots.

Hidden Markov Models & filters

What is a Hidden Markov Model?

A Hidden Markov Model (HMM) describes a system with hidden states that evolve over time and generate observable data.

  • Hidden states transition via a Markov process.
  • Observations are emitted based on the current hidden state.
  • The task is to infer hidden states from observations.
\[ P(X_t \mid X_{t-1}) \quad \text{and} \quad P(E_t \mid X_t) \]

Common uses: speech recognition, NLP, bioinformatics, and time-series analysis.

What filters do

Filters estimate hidden state as new, noisy observations arrive, updating beliefs step by step.

  • Continuous states and observations.
  • Linear dynamics and observation models.
  • Gaussian noise assumptions.

The Kalman filter provides an optimal recursive estimate under those conditions.

Used in radar tracking, navigation, and robotics for position/velocity estimates.

Hidden Markov Model

Hidden states (circles) emit observations (squares) over time.

Kalman filter

Prediction (line) is corrected toward noisy measurements (dots).

Bernoulli

Binary outcomes with parameter \(p\).

Use for clicks, coins, yes/no labels.

\[ P(X=1)=p,\quad P(X=0)=1-p \]

Binomial

Counts successes in \(n\) trials.

Use for pass/fail counts in batches.

\[ P(X=k)=\binom{n}{k}p^k(1-p)^{n-k} \]

Categorical

Multiple discrete outcomes with probabilities.

Use for class labels or choices.

\[ P(X=i)=p_i,\quad \sum_i p_i = 1 \]

Normal

Continuous bell curve with \(\mu\) and \(\sigma\).

Use for noise, heights, residuals.

\[ X \sim \mathcal{N}(\mu,\sigma^2) \] \[ f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)} \]
Python playground

Probability practice

Test softmax and cross-entropy with a tiny classifier.

Tip: Swap logits to see the loss change.

Gradient Descent

Finding the Minimum

Gradient descent is how neural networks learn. It's like walking downhill to find the lowest point:

\[ \theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla J(\theta) \]
  • \(\theta\): Parameter being optimized
  • \(\alpha\): Learning rate (step size)
  • \(\nabla J(\theta)\): Gradient (direction of steepest increase)
Python playground

Optimization practice

Run a few gradient steps and watch the loss drop.

Tip: Tune the learning rate to see convergence speed.

Activation Functions

Non-Linear Transformations

Activation functions add non-linearity to neural networks, enabling them to learn complex patterns:

Sigmoid: \( \sigma(x) = \frac{1}{1+e^{-x}} \)

Squashes values to (0, 1). Used for binary classification.

ReLU: \( f(x) = \max(0, x) \)

Most popular! Simple and effective. Returns x if positive, 0 otherwise.

Tanh: \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)

Squashes values to (-1, 1). Zero-centered version of sigmoid.
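The three activations can be compared side by side in a quick sketch; the sample inputs are arbitrary:

```python
import math

# Sigmoid, ReLU, and tanh evaluated at a few illustrative points.
def sigmoid(x): return 1 / (1 + math.exp(-x))
def relu(x): return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), relu(x), round(math.tanh(x), 3))
```

At 0, sigmoid returns 0.5 while tanh returns 0, which is the "zero-centered" difference mentioned above; ReLU simply clips the negative input to 0.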

What is Machine Learning?

Machine learning is the study of systems that learn patterns from data to make predictions or decisions without being explicitly programmed for each case.

Definition

Given data \(x\) and outcomes \(y\), machine learning finds a function \(f_\theta\) that maps inputs to outputs by optimizing parameters \(\theta\) to reduce error.

\[ \theta^* = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i) \]

How it works (intuitive)

  • Observe: collect examples and features.
  • Learn: fit a model that captures patterns.
  • Evaluate: test on new data to check generalization.
  • Use: deploy the model to make predictions.

Supervised learning

Learn from labeled examples to predict targets or classes.

Unsupervised learning

Discover structure or clusters without labels.

Self-supervised learning

Generate labels from data itself to learn useful representations.

Machine Learning Algorithms

Common algorithm families

  • Supervised: learn from labeled examples to predict or classify.
  • Unsupervised: discover structure without labels.
  • Probabilistic: express uncertainty with likelihoods and priors.
  • Ensembles: combine many models to reduce error.
\[ \theta^* = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i) + \lambda \lVert \theta \rVert^2 \]

Use the cards as a quick baseline picker before tuning deeper models.

Linear Regression

Predict continuous targets with a weighted sum.

Best-fit line follows the trend.

\[ \hat{y} = \mathbf{w}^\top \mathbf{x} + b \]

Logistic Regression

Binary classification with a sigmoid link.

Sigmoid separates two classes.

\[ P(y=1 \mid x) = \sigma(\mathbf{w}^\top \mathbf{x} + b) \]

k-NN

Predict by voting among nearest neighbors.

Nearest neighbors vote on the label.

Strong baseline for small data.

Decision Trees

Recursive if-then splits on features.

Splits carve the data space.

Interpretable and fast.

Random Forest

Bagged trees reduce variance.

Many trees vote together.

Robust, less tuning.

Support Vector Machine

Max-margin boundary; kernels add nonlinearity.

Max-margin boundary balances both classes.

Great for medium-sized data.

Naive Bayes

Probabilistic classifier with conditional independence.

Likelihoods overlap at the decision point.

Fast for text and spam.

k-Means

Cluster points around K centroids.

Centroids pull points into clusters.

Simple unsupervised grouping.

PCA

Project data onto top-variance directions.

Principal axis captures max variance.

Dimensionality reduction.

Formulas and derivations

Short derivations that connect each algorithm to its core objective.

Linear Regression

\[ J(\mathbf{w}) = \frac{1}{n}\lVert X\mathbf{w} - \mathbf{y} \rVert^2 \]
  1. Take gradient: \( \nabla J = \frac{2}{n} X^T (X\mathbf{w} - \mathbf{y}) \).
  2. Set to zero: \( X^T X \mathbf{w} = X^T \mathbf{y} \).
  3. Solve for \( \mathbf{w} \) (normal equation) or use gradient descent.
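For one feature plus a bias term, the normal equation reduces to a 2×2 solve. This sketch uses a noiseless made-up dataset so the recovered line is exact:

```python
# Normal equation for one feature plus bias: solve (X^T X) w = X^T y with the 2×2 inverse.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # exactly y = 2x + 1 (illustrative, noise-free)

# Design matrix columns are [x, 1]; accumulate X^T X and X^T y entry by entry.
sxx = sum(x * x for x in xs); sx = sum(xs); n = len(xs)
sxy = sum(x * y for x, y in zip(xs, ys)); sy = sum(ys)

# Cramer's rule on the 2×2 system [[sxx, sx], [sx, n]] [w, b] = [sxy, sy].
det = sxx * n - sx * sx
w = (n * sxy - sx * sy) / det     # slope
b = (sxx * sy - sx * sxy) / det   # intercept
print(w, b)
```

With noisy data the same formulas return the least-squares fit rather than an exact line.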

Logistic Regression

\[ L = -\sum_i \left[ y_i \log \sigma(z_i) + (1-y_i)\log(1-\sigma(z_i)) \right] \]
  1. Let \( z_i = \mathbf{w}^T \mathbf{x}_i + b \), \( p_i = \sigma(z_i) \).
  2. Derivative: \( \partial L / \partial z_i = p_i - y_i \).
  3. Gradient: \( \nabla_{\mathbf{w}} L = X^T(\mathbf{p} - \mathbf{y}) \).

k-NN

\[ \hat{y} = \text{mode}\{y_i : i \in \mathcal{N}_k(x)\} \]
  1. No training objective; store all labeled points.
  2. Compute distances \( d(x, x_i) \) to all points.
  3. Pick the k closest neighbors and vote (or average for regression).
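The three steps above fit in a minimal sketch with made-up 2-D points:

```python
from collections import Counter
import math

# k-NN: measure distance to every stored point, then vote among the k closest.
def knn_predict(points, labels, query, k=3):
    dists = sorted((math.dist(p, query), lab) for p, lab in zip(points, labels))
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

There is no training step: all the work happens at query time, which is why k-NN is quick to set up but slow on large datasets.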

Decision Trees

\[ \text{Gain} = H(\text{parent}) - \sum_k p_k H(\text{child}_k) \]
  1. Compute impurity (entropy or Gini) at the parent node.
  2. Evaluate candidate splits and compute weighted child impurity.
  3. Choose the split with the largest information gain.

Random Forest

\[ \hat{y} = \frac{1}{T} \sum_{t=1}^T h_t(x) \]
  1. Train many trees on bootstrapped samples with random feature subsets.
  2. Aggregate predictions by averaging or voting.
  3. Averaging reduces variance when trees are diverse.

Support Vector Machine

\[ \min_{\mathbf{w}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) \]
  1. Maximize margin by minimizing \( \lVert \mathbf{w} \rVert^2 \).
  2. Hinge loss penalizes points inside the margin or misclassified.
  3. Only support vectors affect the optimal boundary.

Naive Bayes

\[ P(y \mid x) \propto P(y)\prod_i P(x_i \mid y) \]
  1. Start from Bayes: \( P(y \mid x) = P(x \mid y)P(y)/P(x) \).
  2. Assume conditional independence: \( P(x \mid y) = \prod_i P(x_i \mid y) \).
  3. Choose the class with the largest posterior.

k-Means

\[ \min_{\{\mu_k\}} \sum_i \lVert x_i - \mu_{c_i} \rVert^2 \]
  1. Assign each point to the nearest centroid.
  2. Update centroids by setting \( \partial J / \partial \mu_k = 0 \Rightarrow \mu_k = \text{mean} \).
  3. Repeat until assignments stabilize.

PCA

\[ \max_{\lVert \mathbf{w} \rVert = 1} \mathbf{w}^T \Sigma \mathbf{w} \]
  1. Use Lagrange multiplier: \( \Sigma \mathbf{w} = \lambda \mathbf{w} \).
  2. The top eigenvector gives the max-variance direction.
  3. Project data onto the leading eigenvectors.
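For 2-D data the whole PCA recipe fits in one sketch: center, build the covariance entries, use the 2×2 trace/det formula for the top eigenvalue, then read off its eigenvector from \((\Sigma - \lambda I)\mathbf{v} = 0\). The data values are illustrative:

```python
import math

# PCA on made-up 2-D data via the 2×2 eigenvalue formula.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance entries (divide by n - 1).
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

tr, det = sxx + syy, sxx * syy - sxy * sxy
lam = (tr + math.sqrt(tr * tr - 4 * det)) / 2   # largest eigenvalue
v = (sxy, lam - sxx)                            # solves (Σ - λI)v = 0 when sxy != 0
norm = math.hypot(*v)
print(lam, (v[0] / norm, v[1] / norm))          # variance captured + principal axis
```

Projecting each centered point onto the printed unit vector gives the 1-D PCA representation.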

Algorithm Walkthroughs

Step-by-step algorithm views

Select an algorithm to see the key steps and a matching visualization.


    Training vs Inference

    Step-by-step ML lifecycle

    Training tunes the weights with data and gradients. Inference freezes the weights and just predicts.


      Neural Networks: From Neurons to Transformers

      Zoom in from a single neuron all the way up to deep networks, CNNs, RNNs, LSTMs, and transformers. Each module walks step-by-step and animates the flow.

      Forward propagation refresher

      A network is stacked math. Inputs become features, hidden layers reshape them, and the output layer turns them into predictions.

      1. Input layer: normalized features (pixels, embeddings, sensor readings).
      2. Hidden layers: weighted sums + activations build new features.
      3. Output layer: scores or probabilities for each class.
      \[ \text{hidden} = \text{activation}(\text{input} \times W_1 + b_1) \] \[ \text{output} = \text{activation}(\text{hidden} \times W_2 + b_2) \]

      Inference stops here. Training continues with loss, backprop, and weight updates.

      Neural network architectures at a glance

      Different architectures specialize in different kinds of data: images, sequences, or long-range context.

      • Feedforward nets handle tabular features and classic classification.
      • CNNs are spatial pattern detectors for images and video.
      • RNNs and LSTMs handle sequences over time.
      • Transformers scale to long context across domains.

      Feedforward (MLP)

      Classic dense layers for structured data.

      Deep Neural Network

      Stacks many layers to build feature hierarchies.

      Convolutional Network

      Filters slide across images to find patterns.

      Recurrent Network

      State flows through time for sequences.

      LSTM

      Gated memory handles long-range signals.

      Transformer

      Self-attention mixes all tokens at once.

      Neuron basics: weighted sum → activation

      Each neuron multiplies inputs by weights, adds a bias, then runs an activation function.

      1. Inputs arrive as numbers.
      2. Weights scale each input.
      3. Bias shifts the total.
      4. Activation decides the output.
      \[ z = \sum_i x_i w_i + b, \quad a = f(z) \]
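The formula above becomes a one-function sketch; the inputs and weights are made up, and a sigmoid stands in for the generic activation \(f\):

```python
import math

# One neuron: weighted sum plus bias, then a sigmoid activation (illustrative choice of f).
def neuron(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b   # weighted sum + bias
    return 1 / (1 + math.exp(-z))                  # sigmoid activation

x = [0.5, 0.8, 0.2]    # illustrative inputs
w = [0.4, -0.6, 1.0]   # illustrative weights
b = 0.1
print(neuron(x, w, b))
```

Here \(z = 0.02\), so the output sits just above 0.5: the positive and negative contributions nearly cancel.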


      Digit recognition: inference vs training

      Simple digit recognition starts from pixels, transforms them into hidden features, and predicts a digit.

      Inference (predict)

      1. Read the image.
      2. Flatten + normalize.
      3. Hidden activations.
      4. Output probabilities.

      Training (learn)

      1. Forward pass (same as inference).
      2. Compute loss vs label.
      3. Backpropagate gradients.
      4. Update weights.


      Backpropagation: step-by-step

      Backpropagation computes how each weight should change to reduce the loss.

      1. Forward pass: compute predictions.
      2. Loss: compare prediction to target.
      3. Backward pass: send gradients backward.
      4. Update: adjust weights with the learning rate.
      \[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial w} \]

      Deep neural networks: stacking layers

      More layers let the model build a hierarchy of features.

      • Early layers: edges and simple shapes.
      • Middle layers: textures and parts.
      • Late layers: whole objects or concepts.

      Examples: face recognition, speech-to-text, medical scans.


      CNNs: slide a filter across the image

      Convolutions scan a small filter across pixels to detect patterns like edges or corners.

      • Shared weights detect the same pattern anywhere.
      • Each output cell is a dot product of patch and kernel.
      • Pooling summarizes nearby activations.
      • Stacked filters build complex features.
      \[ Y_{i,j} = \sum_{u=0}^{k-1}\sum_{v=0}^{k-1} X_{i+u, j+v} K_{u,v} + b \] \[ H_{out} = H - k + 1 \]
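The sliding dot product above can be sketched directly. (As in most deep learning code, this computes cross-correlation, i.e. the kernel is not flipped.) The image and kernel values are made up:

```python
# Valid 2-D convolution: slide a k×k kernel over the image, dot product per patch.
def conv2d(X, K, b=0.0):
    k = len(K)
    H, W = len(X), len(X[0])
    return [[sum(X[i + u][j + v] * K[u][v]
                 for u in range(k) for v in range(k)) + b
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]

X = [[1, 2, 3, 0],
     [0, 1, 2, 3],
     [3, 0, 1, 2],
     [2, 3, 0, 1]]
K = [[1, 0],
     [0, -1]]       # a tiny diagonal-difference filter (illustrative)
Y = conv2d(X, K)
print(len(Y), len(Y[0]))   # H_out = H - k + 1 on both axes
print(Y)
```

The same two kernel weights are reused at every position, which is the "shared weights detect the same pattern anywhere" property.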


      RNNs: memory over time

      Recurrent networks reuse the same weights at every time step, passing a hidden state forward.

      • Hidden state stores short-term memory.
      • Same matrices \(W_x, W_h\) are reused at each step.
      • Great for sequences like text or sensor data.
      • Outputs can appear at every step.
      \[ h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \quad y_t = \mathrm{softmax}(W_y h_t) \]
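A scalar version of the recurrence makes the weight reuse easy to see; the weights here are made-up values, and the output projection is omitted for brevity:

```python
import math

# One-number hidden state for clarity:
# h_t = tanh(Wx * x_t + Wh * h_prev + b), reusing the same weights at every step.
def rnn_steps(xs, Wx=0.5, Wh=0.8, b=0.0):
    h = 0.0
    hs = []
    for x in xs:
        h = math.tanh(Wx * x + Wh * h + b)
        hs.append(h)
    return hs

print(rnn_steps([1.0, 0.0, 0.0, 0.0]))  # the first input echoes through later states
```

With zero inputs after step one, the state decays geometrically through \(W_h\): short-term memory that fades, which is exactly the weakness LSTMs address.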


      LSTM: long-short term memory

      LSTMs add gates that decide what to forget, what to write, and what to output.

      • Forget gate keeps or drops old memory.
      • Cell state carries long-range info additively.
      • Input gate writes new information.
      • Output gate reveals the right part.
      \[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \] \[ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \] \[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \] \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t) \]
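A scalar LSTM step keeps the gate logic visible. The weight pairs are made-up values and biases are omitted for brevity:

```python
import math

# One scalar LSTM step: gates are sigmoids, the candidate is a tanh,
# and the cell state mixes old memory with new information.
def sigmoid(x): return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo):
    # Each W is a pair (weight on h_prev, weight on x); biases omitted for brevity.
    f = sigmoid(Wf[0] * h_prev + Wf[1] * x)          # forget gate
    i = sigmoid(Wi[0] * h_prev + Wi[1] * x)          # input gate
    c_tilde = math.tanh(Wc[0] * h_prev + Wc[1] * x)  # candidate memory
    o = sigmoid(Wo[0] * h_prev + Wo[1] * x)          # output gate
    c = f * c_prev + i * c_tilde                     # additive memory update
    h = o * math.tanh(c)                             # revealed output
    return h, c

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.5,
                 Wf=(0.1, 2.0), Wi=(0.2, 1.0), Wc=(0.3, 1.5), Wo=(0.4, 0.5))
print(h, c)
```

Notice the cell state update is additive, not a repeated multiplication: that is what lets gradients survive over long sequences.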


      Transformers: staged pipeline

      Transformers process all tokens at once, then use attention to mix information.

      1. Tokenize and embed.
      2. Add positional encoding.
      3. Self-attention mixes context.
      4. Feedforward + residual.
      5. Output probabilities.


      Advanced Transformer Model: step-by-step visualization

      Use the stepper to follow the full Transformer pipeline without leaving the visuals.


        • Model: GPT-2 (small)
        • Embedding dim: 768
        • Heads: 12
        • Blocks: 12
        • Vocabulary: 50,257

        What is a Transformer?

        Transformer is a neural network architecture that has fundamentally changed the approach to artificial intelligence. It was introduced in the 2017 paper "Attention is All You Need" and now powers models like OpenAI GPT, Meta Llama, and Google Gemini.

        Transformers are not limited to text. They also drive audio generation, image recognition, protein structure prediction, and game playing, showing how broadly the architecture applies across domains.

        Text-generative Transformers operate on next-token prediction: given a prompt, the model estimates the most probable next token. The core innovation is self-attention, which lets all tokens communicate and capture long-range dependencies.

        Transformer Explainer is powered by GPT-2 (small) with 124 million parameters. While it is not the largest model, its components match the structure of more recent systems.

        Transformer architecture

        Every text-generative Transformer consists of three components:

        • Embedding: tokens become vectors and receive positional signals.
        • Transformer block: multi-head attention routes context; an MLP refines each token.
        • Output probabilities: logits are turned into a probability distribution for the next token.

        Embedding pipeline

        Suppose the prompt is: "Data visualization empowers users to". Embedding converts this text into a numerical representation in four steps:

        1. Tokenization: split into word or subword tokens.
        2. Token embedding: map tokens to a vector space.
        3. Positional encoding: add position information.
        4. Final embedding: sum token and positional vectors.

        Figure 1. Expanding the embedding view: tokenization, token embedding, positional encoding, and final embedding.

        GPT-2 (small) uses 768-dimensional embeddings and a vocabulary of 50,257 tokens. The embedding matrix has shape (50,257, 768) with about 39 million parameters.
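The four embedding steps can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the token ids, table sizes, and random weights are made up for illustration, not GPT-2's actual tokenizer or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; GPT-2 (small) uses vocab 50,257, d_model 768, context 1024.
vocab_size, d_model, max_ctx = 1000, 16, 64
token_ids = np.array([5, 42, 7, 42])          # step 1: tokenization (made-up ids)

W_emb = 0.02 * rng.standard_normal((vocab_size, d_model))  # token embedding table
W_pos = 0.02 * rng.standard_normal((max_ctx, d_model))     # learned positional table

tok = W_emb[token_ids]                        # step 2: look up one row per token
pos = W_pos[np.arange(len(token_ids))]        # step 3: positional vectors
x = tok + pos                                 # step 4: final embedding, shape (4, 16)
# the repeated id 42 shares a token vector but gets distinct final embeddings
```

Note how the repeated token receives the same row of the embedding table both times, yet different final embeddings, because the positional vectors differ.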

        Transformer block

        The block combines multi-head self-attention and an MLP. GPT-2 (small) stacks 12 blocks, allowing token representations to evolve into higher-level meanings over depth.

        Multi-head attention captures context across tokens. The MLP processes each token independently to refine its representation.

        Multi-head self-attention

        Each token is transformed into Query (Q), Key (K), and Value (V) vectors:

        \[ QKV_{ij} = \left( \sum_{d=1}^{768} \text{Embedding}_{i,d} \cdot \text{Weights}_{d,j} \right) + \text{Bias}_j \]

        Query is like the search text, Key is the result title, and Value is the page content. This analogy helps explain why attention scores route information from relevant tokens.

        Figure 2. Computing Q, K, and V from the embedding.

        • Split into heads: GPT-2 uses 12 attention heads.
        • Masked attention: prevents peeking at future tokens.
        • Concat + projection: heads are merged for the next stage.

The attention scores are scaled by \( \sqrt{d_k} \) to keep softmax stable, and masking sets future positions to negative infinity so each token predicts without looking ahead.

        Figure 3. Masked self-attention: dot product, scale + mask, softmax + dropout.
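A minimal single-head sketch of masked scaled dot-product attention in NumPy, with random toy inputs; a real block adds multiple heads, dropout, and learned projections.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    causal = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)       # block future positions
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out, w = masked_attention(Q, K, V)
# each row of w sums to 1; entries above the diagonal are exactly 0
```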

        MLP: multi-layer perceptron

        The MLP expands each token from 768 to 3072 dimensions with a GELU activation, then compresses back to 768. This enriches the representation independently per token.

        Figure 4. The MLP expands then compresses each token representation.
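The expand-GELU-compress step can be sketched directly. The weights here are random stand-ins; only the widths (768 and 3072) and the tanh approximation of GELU match GPT-2.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    """Per-token MLP: expand 768 -> 3072, apply GELU, compress back to 768."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff, T = 768, 3072, 5                      # GPT-2 (small) widths, toy sequence
x = rng.standard_normal((T, d))
y = mlp(x,
        0.02 * rng.standard_normal((d, d_ff)), np.zeros(d_ff),
        0.02 * rng.standard_normal((d_ff, d)), np.zeros(d))
# y keeps the shape (5, 768): each token is refined independently
```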

        Output probabilities and sampling

        The final linear layer maps to 50,257 logits, one for each token in the vocabulary. Softmax turns logits into probabilities for the next token.

        Figure 5. Each token receives a probability from the output logits.

        Temperature controls sharpness: T=1 keeps logits unchanged, T<1 makes outputs more deterministic, and T>1 increases randomness.

        Temperature

        Lower values make outputs more deterministic; higher values add creativity.

        Top-k

        Restrict sampling to the top k highest-probability tokens.

        Top-p

        Sample from the smallest set whose cumulative probability exceeds p.
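A sketch of how temperature, top-k, and top-p combine into a single sampling step. The function name and defaults are illustrative, not from any particular library.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Turn logits into one sampled token id with temperature, top-k, top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature   # sharpness control
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                     # softmax
    if top_k is not None:                                    # keep only the k largest
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                                    # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs /= probs.sum()                                     # renormalize survivors
    return int(rng.choice(len(probs), p=probs))

tok = sample_next([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2,
                  rng=np.random.default_rng(0))
# with top_k=2, only token ids 0 and 1 can ever be returned
```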

        Auxiliary architectural features

        Layer normalization stabilizes training and is applied twice per block. Dropout regularizes by randomly deactivating units during training and is disabled during inference. Residual connections add skip paths twice per block, helping gradients flow and preventing vanishing gradients.

        The overview highlights where you are in the stack; the detail panel shows the math.

        Large Language Models: Guided Tutorial

        A hands-on path that ties prompting, model behavior, and agent building into one workflow.

        Step 1

        Frame the task

        Define the role, goal, and success criteria before you write prompts.

        Goal: a clear, testable objective.
        Step 2

        Structure the prompt

        Use context blocks, constraints, and a target format.

        Goal: predictable, repeatable outputs.
        Step 3

        Inspect model behavior

        Adjust temperature, context, and stop rules to stabilize responses.

        Goal: reduce variance and hallucinations.
        Step 4

        Add tools

        Introduce tool calling for data access or automation.

        Goal: verified outputs with traceability.
        Step 5

        Evaluate and iterate

        Write evals, measure regressions, and refine.

        Goal: steady improvements with evidence.

        Quick tutorial checklist

        1. Collect 5 real inputs and draft a baseline prompt.
        2. Define the output format and failure cases.
        3. Create a tool schema for any external data.
        4. Write 5 evals with expected results.
        5. Iterate until the eval pass rate stabilizes.
        Mini mission: run a 5-example eval before you ship.

        Tutorial starter prompt

        ROLE: You are an AI PM.
        TASK: Draft a 1-page feature brief.
        CONTEXT:
        """
        Include 5 user notes, metrics, and constraints.
        """
        CONSTRAINTS: 250 words max. Bullet format.
        OUTPUT FORMAT:
        - Problem
        - Users
        - Constraints
        - Proposal
        - Risks
        CHECKS: List missing inputs.

        Transformer Depth Map: From Mechanics to Reasoning Models

        Use this depth indicator to move from core mechanics to training signals and reasoning behaviors. Each tier adds a new layer of capability.

        Depth 1

        Transformer fundamentals

        Mechanics and intuition.

        Tokens Q K V Attention FFN Residual + Norm

        Tokens turn into Q/K/V, attention mixes them, FFN refines, residual + norm stabilize.

        The big picture

        Input Enc Dec Out

        Transformers model sequences by letting every token attend to every other token in one step, so computation is parallel instead of strictly sequential.

        Encoder-decoder stacks build a memory from the input, while decoder-only stacks generate the next token with a causal mask.

        Scaled dot-product attention

        Q K V QK^T Softmax Sum Out

        Queries and keys produce similarity scores, softmax turns scores into weights, and values are mixed by a weighted sum.

Scaling by \( \sqrt{d_k} \) keeps logits in a stable range so softmax does not saturate too early.

        Multi-head + residuals + layer norm

        In H1 H2 H3 Concat LN

        Each head attends in its own subspace, the heads are concatenated, and a projection mixes them back together.

        Residual paths keep the original signal available, while layer norm stabilizes scale across the block.

        Positional encoding

        t1 t2 t3 t4 PE

        Attention is permutation-invariant, so we add position signals to token embeddings before attention.

        Sinusoidal encodings give each position a unique mix of frequencies; learned or relative schemes encode distance directly.
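A sketch of the sinusoidal scheme from "Attention Is All You Need": even dimensions take sines and odd dimensions take cosines of position-dependent frequencies.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings: PE(pos, 2i) = sin, PE(pos, 2i+1) = cos."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))   # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(16, 8)
# position 0 encodes as [0, 1, 0, 1, ...]; every value lies in [-1, 1]
```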

        Depth 2

        Transformer into an LLM

        Training objectives and variants.

        Decoder-only Encoder-only Next-token Masked LM

        Decoder-only uses causal masks for next-token prediction; encoder-only sees both directions with masked tokens.

        Decoder-only transformers (GPT-style)

        Causal mask

        A causal mask blocks attention to future tokens, so each position can only use past context.

        Training uses teacher forcing: the model sees true previous tokens while learning to predict the next one.

        Encoder-only transformers (BERT-style)

        t1 MASK t3 t4 Bidirectional

        Masked language modeling hides tokens and uses both left and right context to reconstruct them.

        The result is a strong bidirectional representation that transfers well to classification and retrieval tasks.

        Training + inference basics

        logits Softmax Greedy Sample

        Logits become probabilities via softmax, then decoding picks the next token.

        Greedy decoding is stable, while temperature and top-p sampling trade determinism for diversity.

        Depth 3

        Reasoning with transformers

        Prompting and tool-use strategies.

        CoT Vote Self-consistency LLM Tool ReAct

        CoT adds intermediate steps, self-consistency samples many paths, ReAct loops the model with tools.

        Chain-of-Thought (CoT)

        Step 1 Step 2 Step 3 Ans

        CoT encourages explicit intermediate steps, which can improve multi-step reasoning tasks.

        Use it when the reasoning path matters, but keep in mind verbosity does not guarantee correctness.

        Self-consistency

        Vote

        Self-consistency samples multiple reasoning paths and aggregates the final answers.

        Majority voting reduces variance and often improves accuracy on reasoning benchmarks.

        ReAct (reason + act)

        LLM Tool loop

        ReAct alternates reasoning steps with tool calls, grounding answers in external data.

        The loop keeps the model honest by injecting retrieved facts before the final response.

        Depth 4

        Reasoning models today

        Training signals and test-time search.

        Process supervision Test-time search Reasoning behavior

        Step-level feedback and search trees both add compute that shapes reasoning at test time.

        Process vs outcome supervision

        Reward

        Process supervision scores intermediate steps, not just the final answer.

        Step-level signals make credit assignment clearer and improve reasoning stability.

        Tree-of-Thoughts (ToT)

        Score

        Tree-of-Thoughts explores multiple branches, scores them, and expands the best candidates.

        Search with backtracking can outperform a single forward pass on hard problems.

        What "reasoning model" means today

        Prompt Search Supervise Behavior

        Modern reasoning blends prompting patterns, test-time search, and step-aware supervision.

        The behavior you see is shaped by where extra compute or feedback is injected.

        Prompt Engineering: Crafting Clear Instructions

        Design prompts that reduce ambiguity, keep responses grounded, and make iteration fast.

        Prompt anatomy

        • Role: who the model is acting as.
        • Goal: the task and the success criteria.
        • Context: background data, definitions, and scope.
        • Constraints: length, tone, must include, must avoid.
        • Output format: schema, bullets, JSON, table.
        • Checks: ask for missing info or confidence notes.
        Mini mission: rewrite a vague request into a 6-part prompt.

        Pattern: Delimiters + grounding

        Use clear separators so the model knows what to quote, summarize, or transform.

        Pattern: Few-shot exemplars

        Add one or two examples that match the output style you want.

        Pattern: Self-checks

        Ask for a brief verification step or uncertainties before the final answer.

        Prompt template

        ROLE: You are a product analyst.
        TASK: Summarize customer feedback into 3 themes.
        CONTEXT:
        """
        Paste notes here.
        """
        CONSTRAINTS: Max 8 bullets. No speculation.
        OUTPUT FORMAT:
        - Theme:
        - Evidence:
        CHECKS: Flag missing data.

        Prompt technique examples

        Technique: Role + constraints
        ROLE: You are a support analyst.
        TASK: Summarize the ticket in 3 bullets.
        CONSTRAINTS: Neutral tone. No jargon.
        
        Technique: Few-shot style match
        TASK: Turn notes into a decision line.
        EXAMPLE:
        Input: "Latency 120ms, budget ok"
        Output: "Decision: proceed with rollout"
        
        Technique: Delimiters + extraction
        Extract action items from <notes>...</notes>.
        Return JSON with owner, task, and due.

        Prompt refinement loop

        1. Draft the instruction and the output format.
        2. Test with a real example and inspect the gaps.
        3. Add constraints, examples, or definitions.
        4. Repeat until outputs are stable and on-brand.

        LLM Fundamentals: How Models Respond

        Know the mechanics behind next-token prediction so your prompts behave consistently.

        Core behaviors

        • LLMs predict the next token based on all prior context.
        • Context windows limit how much text the model can see at once.
        • Sampling settings trade off creativity vs stability.
        • System instructions set global rules; user prompts set tasks.
        • Hallucinations appear when the prompt is underspecified.
        Temperature

        Controls randomness; lower is more deterministic.

        Top-p

        Limits choices to a probability mass.

        Max tokens

        Budget for output length and cost.

        Stop sequences

        End outputs at safe boundaries.

        Prompting best practices

        See the full playbook at promptingguide.ai.

        • Be explicit about the role, task, and constraints.
        • Use delimiters to separate instructions from data.
        • Provide a target format and a short example.
        • Ask for assumptions, risks, or missing data.
        • Iterate with real inputs, not toy prompts.

        Context hygiene

        • Keep only relevant context; remove stale instructions.
        • Define terms once and reuse them consistently.
        • Chunk long inputs and summarize key decisions.

        Vector Databases: Indexing and Retrieval for LLMs

        See how embeddings get indexed, how similarity search works, and how results return to the model.

        Indexing views

        Indexing pipeline

        Chunk documents, embed each chunk, and store vectors with metadata for filtering.

        • Split docs into overlapping chunks.
        • Embed each chunk into a vector space.
        • Write vectors and metadata to the index.
        Metadata filters

        Tags, permissions, and timestamps narrow search.

        Chunk size

        Balances recall with context length.

        ANN indexing with HNSW

        Hierarchical graph layers speed up nearest-neighbor search with high recall.

        • Insert vectors into a multi-layer graph.
        • Connect each node to its closest neighbors.
        • Search from top layer to dense base layer.
        M parameter

        Controls graph connectivity and recall.

        efSearch

        Higher values improve recall at a latency cost.

        IVF + PQ compression

        Inverted file indexing narrows search to the closest centroids, then product quantization compresses vectors for fast distance estimates.

        • Assign vectors to coarse centroids (IVF).
        • Search only the nearest lists.
        • Quantize subvectors into compact PQ codes.
        nlist

Total number of coarse clusters in the index; a companion parameter, nprobe, sets how many are searched.

        m * bits

        Subvector count and code size per block.

        Search and retrieval for LLMs

        Blend lexical and vector results, then rerank and pack the best chunks into context.

        • Embed the query and run ANN search.
        • Merge lexical and vector candidates.
        • Rerank and assemble a grounded context window.
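The ranking an ANN index approximates can be written exactly as brute-force cosine search. This sketch uses random stand-in embeddings and skips the lexical merge and reranking stages.

```python
import numpy as np

def top_k_cosine(query, index, k=3):
    """Exact cosine top-k; HNSW or IVF+PQ approximate this ranking faster."""
    q = query / np.linalg.norm(query)
    X = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = X @ q                                   # cosine similarity per chunk
    top = np.argsort(sims)[::-1][:k]               # best k, highest first
    return top, sims[top]

rng = np.random.default_rng(0)
chunks = rng.standard_normal((100, 64))            # 100 embedded chunks (stand-ins)
query = chunks[42] + 0.01 * rng.standard_normal(64)  # near-duplicate of chunk 42
ids, scores = top_k_cosine(query, chunks)
# chunk 42 ranks first because the query is a tiny perturbation of it
```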
        Reranking

        Cross-encoders boost precision on top results.

        Citations

        Keep track of sources for trust and audits.

        Mini mission: trace where the top-k chunks enter the prompt.

        Switch tabs or cycle the diagram.

        Sutskever's List: Core Readings

        A curated reading list with the main themes and ideas captured for quick scanning.

        Transformers Year: 2017

        The Annotated Transformer

        Annotated walkthrough of the Transformer model that uses self-attention for sequence tasks.

        • Theme: Attention-based encoder-decoder design.
        • Main idea: Line-by-line implementation clarifies attention, masking, and positional encoding.
        Theory Year: n/a

        The First Law of Complexodynamics

        Blog essay connecting complexity, computation, and principles relevant to AI foundations.

        • Theme: Complexity growth in closed systems.
        • Main idea: Links entropy, computation, and limits of organized structure.
        Sequence Models Year: n/a

        Understanding LSTM Networks

        Christopher Olah's visual and intuitive explanation of LSTM mechanisms and gates.

        • Theme: Gating and long-term memory.
        • Main idea: Input/forget/output gates control information flow and gradients.
        Sequence Models Year: 2014

        Recurrent Neural Network Regularization

        Shows how regularization like dropout improves recurrent architectures.

        • Theme: Regularization for recurrent nets.
        • Main idea: Dropout-style methods improve generalization without breaking sequence learning.
        Sequence Models Year: 2015

        Pointer Networks

        Neural architecture for solving combinatorial problems by learning pointer outputs.

        • Theme: Attention as a pointer mechanism.
        • Main idea: Output indices of input elements for variable-length solutions.
        Sequence Models Year: 2015

        Order Matters: Sequence to Sequence for Sets

        Explores how ordering impacts seq2seq model performance.

        • Theme: Set prediction with seq2seq.
        • Main idea: Learning an output order improves performance on set tasks.
        Vision Year: 2015

        Deep Residual Learning for Image Recognition

        ResNet paper that introduced residual connections to enable very deep models.

        • Theme: Residual learning.
        • Main idea: Identity skip connections allow very deep CNNs to train.
        Graph Networks Year: 2017

        Neural Message Passing for Quantum Chemistry

        Graph neural network model for learning on structured data.

        • Theme: Message-passing GNNs.
        • Main idea: Iterative node updates capture molecular structure for property prediction.
        Transformers Year: 2017

        Attention Is All You Need

        Introduced the Transformer with self-attention, the foundation of modern NLP and LLMs.

        • Theme: Self-attention for sequence modeling.
        • Main idea: Replace recurrence with attention for parallel training and long-range context.
        Vision Year: 2016

        Identity Mappings in Deep Residual Networks

        Improves ResNet training using identity skip connections.

        • Theme: Optimization of deep residual networks.
        • Main idea: Pre-activation and identity skips improve gradient flow.
Generative Models Year: 2016

        Variational Lossy Autoencoder

        Combines autoencoder with variational losses for generative modeling.

        • Theme: Lossy compression in generative models.
        • Main idea: Balance reconstruction with latent codes that capture global structure.
        Sequence Models Year: 2018

        Relational Recurrent Neural Networks

        Combines relational reasoning with RNN mechanisms.

        • Theme: Memory + relational reasoning.
        • Main idea: Memory slots interact through attention to improve sequence modeling.
        Memory Networks Year: 2014

        Neural Turing Machines

        Early memory-augmented neural network with external controller.

        • Theme: Differentiable external memory.
        • Main idea: A controller learns to read and write memory for algorithmic tasks.
        Speech Year: 2015

        Deep Speech 2: End-to-End Speech Recognition

        End-to-end speech model demonstrating RNN and CNN integration.

        • Theme: End-to-end speech recognition.
        • Main idea: CNN + RNN stacks with CTC scale to large speech datasets.
        Scaling Year: 2020

        Scaling Laws for Neural Language Models

        Shows how model and data scaling improve language model performance.

        • Theme: Empirical scaling behavior.
        • Main idea: Performance follows power laws in data, parameters, and compute.
        Theory Year: n/a

        A Tutorial Introduction to the Minimum Description Length Principle

        Introductory explanation of MDL principle connecting compression and learning.

        • Theme: Compression-based model selection.
        • Main idea: Better compression implies better generalization.
        Theory Year: n/a

        Machine Super Intelligence

        Discusses theoretical aspects of AGI and intelligence measures.

        • Theme: Theoretical intelligence measures.
        • Main idea: Frames limits and metrics for machine intelligence.
        Theory Year: n/a

        Kolmogorov Complexity and Algorithmic Randomness

        Foundations of algorithmic complexity and information theory.

        • Theme: Algorithmic information theory.
        • Main idea: Randomness is defined by shortest program length.

        AI Agents: Tool-Using Systems

        Agents combine planning, tools, memory, and evals to finish real tasks.

        Agent loop

        1. Plan the task and choose a strategy.
        2. Select tools and call them with structured inputs.
        3. Observe results and update the plan.
        4. Decide when to stop and respond.

        Planner

        Breaks the goal into steps and picks the next action.

        Tool router

        Chooses APIs, search, or code execution based on the task.

        Memory

        Stores notes, intermediate results, and long-term facts.

        Critic

        Checks outputs, runs evals, and flags regressions.

        Tool calling basics

        • Define clear tool schemas with typed inputs and outputs.
        • Validate tool results and handle failures explicitly.
        • Constrain tool access with allowlists and budgets.
        • Log tool calls for tracing and debugging.

        How to write evals

        1. Define a task set and success criteria.
        2. Collect examples with expected outputs.
        3. Use automatic checks plus human review when needed.
        4. Track regressions and edge cases over time.

        Eval-driven development

        1. Write evals before adding new agent logic.
        2. Run evals on every change to the prompt or tools.
        3. Promote only the changes that improve the score.
        4. Version datasets and prompts alongside code.
        Frameworks

        LangChain, LlamaIndex, and OpenAI or Anthropic SDKs.

        Tracing

        Capture tool calls, prompts, and outputs for audits.

        Evals

        Use unit tests, golden sets, and regression suites.

        Safety

        Guardrails for tool use, data access, and output policy.

        Reinforcement Learning Fundamentals

        Agents learn by interacting with an environment, collecting rewards, and improving a policy.

        Reinforcement learning (RL) is a learning framework where an agent chooses actions in a state, receives rewards, and updates its policy to maximize long-term return.

        \[ \text{Goal: } \max_\pi \; \mathbb{E}_\pi \left[\sum_{t=0}^{\infty} \gamma^t r_t \right] \]

        Core concepts

        • Agent & environment: the learner and the world it interacts with.
        • Episode / trajectory: a sequence of states, actions, and rewards.
        • Reward vs return: immediate feedback \(r_t\) vs discounted sum \(G_t\).
        • Discount \(\gamma\): weighs future rewards.
        • Policy \( \pi(a|s) \): behavior rule.
        • Value \( V^\pi(s) \), Action-value \( Q^\pi(s,a) \).
        • Optimal \( V^*, Q^* \): best achievable values.

        Minimal environment interface

        reset(seed?) -> observation/state
        step(action) -> { nextState, reward, done, info }
        actions(state) -> action list
        // render helpers stay separate from logic

        reset(seed?) starts a new episode and returns the initial state (optionally deterministic with a seed).

        step(action) applies an action and returns the transition tuple: next state, reward, terminal flag, and any extra info.

        actions(state) exposes valid actions so the agent can plan or explore safely.

        Render helpers stay separate so learning logic is deterministic and testable.
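A minimal Python implementation of this interface, as a toy one-dimensional chain with a goal at the right end; the class name and dynamics are illustrative.

```python
class ChainEnv:
    """Minimal environment matching the interface: a 1-D chain with a goal."""

    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def reset(self, seed=None):
        self.state = 0                 # seed unused: dynamics are deterministic
        return self.state

    def actions(self, state):
        return [-1, +1]                # step left or step right

    def step(self, action):
        self.state = min(max(self.state + action, 0), self.n - 1)
        done = self.state == self.n - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.state, reward, done, {}

env = ChainEnv()
s = env.reset()
s, r, done, info = env.step(+1)
# after one right step from state 0: s == 1, r == 0.0, done is False
```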

        \[ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t} \mid s_0 = s \right] \] \[ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t} \mid s_0 = s, a_0 = a \right] \]

        Return calculator

        \[ G = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots \]

        Policy vs value (3-state chain)


        Reinforcement Learning: a gentle mathematical walkthrough

        RL formalizes learning by experience: act, observe, update, repeat.

        1. Interaction over time

        At each step the agent is in a state \(s\), takes an action \(a\), receives a reward \(r\), and lands in a new state \(s'\).

        \[ s_0 \rightarrow a_0 \rightarrow r_1 \rightarrow s_1 \rightarrow a_1 \rightarrow r_2 \rightarrow s_2 \rightarrow \cdots \]

        This sequence is a trajectory (episode).

        2. Rewards vs return

        Rewards are immediate feedback. Returns add up future rewards with discounting.

        \[ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \]
        • \(\gamma = 0\): only care about now.
        • \(\gamma \approx 1\): care about long-term outcomes.
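Because \(G_t = r_t + \gamma G_{t+1}\), the return is a one-line backward recursion:

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., computed back to front."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G              # G_t = r_t + gamma * G_{t+1}
    return G

assert discounted_return([1, 1, 1], gamma=0.0) == 1.0   # gamma = 0: only "now"
g = discounted_return([0, 0, 10], gamma=0.9)            # 0 + 0 + 0.81 * 10 = 8.1
```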

        3. Policy

        A policy tells the agent how to act:

        \[ \pi(a \mid s) \]
        • Deterministic: always take the same action.
        • Stochastic: choose actions by probability.

        4. Value functions

        Value functions measure how good states or actions are under a policy.

        \[ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0=s \right] \] \[ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0=s, a_0=a \right] \]

        V evaluates a state; Q evaluates a decision.

        5. Bellman equations

        Value is defined recursively: immediate reward plus discounted value of the next state.

        \[ V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right] \] \[ V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^*(s')\right] \]

        6. Dynamic programming (model known)

        If transitions and rewards are known, compute values directly.

        \[ V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a)\left[R + \gamma V_k(s')\right] \]

        Policy iteration alternates evaluation and greedy improvement.

        7. Monte Carlo methods

        Learn from complete episodes when the model is unknown.

        • Generate an episode and compute the return.
        • Update the first visit of each state (or state-action).
        • Unbiased but high variance; must wait for episode end.

        8. Temporal Difference (TD)

        Update values online using one-step bootstrapping.

        \[ V(s) \leftarrow V(s) + \alpha \left[r + \gamma V(s') - V(s)\right] \] \[ \delta = r + \gamma V(s') - V(s) \]

        \(\delta\) is the TD error: how wrong the prediction was.
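Once the transition \((s, r, s')\) is observed, the TD(0) update is a single line. A minimal sketch with a hypothetical three-state value table:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V(s) toward the bootstrapped target r + gamma*V(s')."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: how wrong the prediction was
    V[s] += alpha * delta
    return delta

V = np.zeros(3)                            # toy 3-state value table
delta = td0_update(V, s=0, r=1.0, s_next=1)
# target = 1 + 0.9*0 = 1, so delta == 1.0 and V[0] moves to alpha * 1 = 0.1
```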

        9. Learning control

        Learn the best actions, not just values.

        \[ Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma Q(s',a') - Q(s,a)\right] \] \[ Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] \]

        SARSA uses the next action taken; Q-learning uses the best possible next action.
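Both backups fit in one-line functions over a tabular \(Q\); the tiny two-state example below is illustrative, not a full training loop.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy backup: bootstrap from the BEST next action."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy backup: bootstrap from the action actually taken next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

Q = np.zeros((2, 2))                       # toy table: 2 states x 2 actions
q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] = 0.5 * (1 + 0.9 * 0) = 0.5
sarsa_step(Q, s=1, a=0, r=0.0, s_next=0, a_next=1)
# Q[1, 0] = 0.5 * (0 + 0.9 * Q[0, 1]) = 0.225
```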

        10. Why deep RL exists

        Tables do not scale to high-dimensional states or continuous actions.

        • Approximate \(Q(s,a)\) or \(\pi(a \mid s)\) with neural networks.
        • Use experience replay to break correlation.
        • Use target networks to stabilize learning.
        • Optimize parameters with gradient descent.

        Big picture summary

        Reward

        Immediate feedback signal.

        Return

        Discounted sum of future rewards.

        Policy

        Behavior rule \(\pi(a \mid s)\).

        Value

        How good a state or action is.

        Bellman

        Recursive value definitions.

        Monte Carlo

        Learn from full episodes.

        TD learning

        Learn step-by-step with bootstrapping.

        Q-learning

        Learn optimal behavior off-policy.

        Deep RL

        Scale RL with neural networks.

        Multi-Armed Bandits

        Explore exploration strategies and track regret as the agent learns.

        Algorithms

        • \(\varepsilon\)-greedy with optional decay.
• UCB1 applies optimism in the face of uncertainty.
        • Thompson Sampling for Bernoulli rewards.

        What is a multi-armed bandit? An agent repeatedly chooses among \(K\) actions (arms) with unknown reward distributions. The goal is to maximize total reward by balancing exploration (learn the arms) and exploitation (use the best-known arm).

        \(\varepsilon\)-greedy

        1. Initialize action-value estimates and counts.
        2. At each step, explore with probability \(\varepsilon\); otherwise exploit the best estimate.
        3. Pull the chosen arm and observe the reward.
        4. Update that arm’s estimate with an incremental mean.
        5. Optionally decay \(\varepsilon\) to reduce exploration over time.
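The five steps above as a minimal NumPy loop on a Bernoulli bandit; the arm means, step count, and \(\varepsilon\) are made up for illustration, and the decay step is omitted.

```python
import numpy as np

def run_eps_greedy(true_means, steps=2000, eps=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit with incremental-mean updates."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    est, n = np.zeros(K), np.zeros(K)      # value estimates and pull counts
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(K))       # explore: random arm
        else:
            a = int(np.argmax(est))        # exploit: best estimate
        r = float(rng.random() < true_means[a])   # Bernoulli reward
        n[a] += 1
        est[a] += (r - est[a]) / n[a]      # incremental mean
    return est, n

est, n = run_eps_greedy([0.2, 0.5, 0.8])
# the best arm (index 2) should collect most of the pulls
```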

        UCB1

        1. Initialize estimates and counts; try each arm at least once.
        2. Compute the UCB score: \( \hat{\mu}_a + c \sqrt{\log t / n_a} \).
        3. Select the arm with the highest UCB score.
        4. Observe the reward and update the estimate and count.
        5. Repeat to balance mean reward and uncertainty.
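A minimal NumPy sketch of UCB1 on made-up Bernoulli arms; \(c\) is the exploration constant in the confidence bonus.

```python
import numpy as np

def ucb1(true_means, steps=2000, c=1.4, seed=0):
    """UCB1: pick the arm with the highest mean-plus-confidence score."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    est, n = np.zeros(K), np.zeros(K)
    for t in range(1, steps + 1):
        if t <= K:
            a = t - 1                      # try each arm once first
        else:
            ucb = est + c * np.sqrt(np.log(t) / n)   # mean + uncertainty bonus
            a = int(np.argmax(ucb))
        r = float(rng.random() < true_means[a])
        n[a] += 1
        est[a] += (r - est[a]) / n[a]
    return est, n

est, n = ucb1([0.2, 0.5, 0.8])
# pulls concentrate on the best arm as its bonus shrinks
```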

        Thompson Sampling (Bernoulli)

        1. Initialize Beta priors \(\alpha, \beta\) for each arm.
        2. Sample a success probability from each arm’s Beta distribution.
        3. Choose the arm with the highest sampled probability.
        4. Observe reward (success/failure) and update \(\alpha, \beta\).
        5. Repeat to naturally balance exploration and exploitation.
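A minimal sketch of Bernoulli Thompson Sampling with uniform Beta(1, 1) priors; the arm means are illustrative.

```python
import numpy as np

def thompson(true_means, steps=2000, seed=0):
    """Thompson Sampling: sample each arm's Beta posterior, act greedily."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    alpha, beta = np.ones(K), np.ones(K)   # Beta(1, 1) priors
    for _ in range(steps):
        a = int(np.argmax(rng.beta(alpha, beta)))   # one posterior sample per arm
        r = float(rng.random() < true_means[a])
        alpha[a] += r                      # success count
        beta[a] += 1.0 - r                 # failure count
    return alpha, beta

alpha, beta = thompson([0.2, 0.5, 0.8])
# the posterior mean alpha/(alpha+beta) for the best arm ends up near 0.8
```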

        Chart 1. True arm means (light) vs estimated means (dark).

        Chart 2. Cumulative regret over time.

        Chart 3. Action selection frequency by arm.

        MDP + Dynamic Programming (Gridworld)

        Solve a Markov Decision Process with Bellman updates and visualize value and policy.

        MDP setup

        • MDP \( (S, A, P, R, \gamma) \) defines dynamics and rewards.
        • Bellman expectation: \( V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) [R + \gamma V^\pi(s')] \).
        • Bellman optimality: \( V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R + \gamma V^*(s')] \).

        Deep dive: Markov Decision Process. An MDP assumes the future depends only on the current state and action (the Markov property). Transitions \(P(s'|s,a)\) and rewards \(R(s,a,s')\) define how the agent moves. Dynamic programming solves the MDP by repeatedly applying Bellman backups until values and policies converge.


        Policy evaluation (iterative)

        1. Initialize \(V(s)\) for all states.
        2. For each state, compute the Bellman expectation backup under \(\pi\).
        3. Update \(V(s)\) with the new estimate.
        4. Repeat until the max change \(\Delta V\) is below a threshold.

        Policy improvement

        1. Use the current \(V(s)\) to compute \(Q(s,a)\).
        2. For each state, choose the greedy action \( \arg\max_a Q(s,a) \).
        3. Update the policy to be greedy (or \(\varepsilon\)-greedy).
        4. Repeat after re-evaluating the policy.

        Policy iteration

        1. Initialize a policy \(\pi\).
        2. Evaluate \(\pi\) to compute \(V^\pi\).
        3. Improve \(\pi\) by acting greedy with respect to \(V^\pi\).
        4. Stop when the policy stops changing.

        Value iteration

        1. Initialize \(V(s)\) arbitrarily.
        2. Apply the optimality backup \( V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) [R + \gamma V(s')] \).
        3. Repeat until \(V\) converges.
        4. Extract the greedy policy from the converged values.
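Value iteration collapses the two phases into one backup. This sketch reuses the same assumed two-action toy chain rather than the demo gridworld.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Sweep Bellman optimality backups until V stops changing,
    then extract the greedy policy from the converged values."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    pi = {s: max(actions,
                 key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, p in P[(s, a)].items()))
          for s in states}
    return V, pi

# Same toy chain as above: 1 -> 2 under "right" pays +1, state 2 absorbs.
states, actions = [0, 1, 2], ["left", "right"]
P = {(0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
     (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
     (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0}}
R = {k + (s2,): 0.0 for k in P for s2 in P[k]}
R[(1, "right", 2)] = 1.0
V, pi = value_iteration(states, actions, P, R)
```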

        Inspect state

        Click a cell to see Q(s,·).

        Grid. Values \(V(s)\), heatmap, and policy arrows per state.

        Chart. Max \(\Delta V\) and average \(V\) per iteration.

        Monte Carlo Methods

        Estimate values from complete episode returns.

        Algorithms

        • First-visit MC prediction for \(V^\pi\).
        • On-policy MC control with \(\varepsilon\)-soft policies.

        Deep dive: Monte Carlo methods. MC methods wait until an episode ends, then use the realized return to update value estimates. They are unbiased but can have high variance, so averaging many episodes stabilizes learning.

        First-visit MC prediction

        1. Generate a full episode using policy \(\pi\).
        2. For each state, find the first time it appears in the episode.
        3. Compute the return \(G_t\) from that first visit to the end.
        4. Update \(V(s)\) with the average of observed returns.
        5. Repeat with many episodes to reduce variance.
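A minimal first-visit MC sketch, assuming a 5-state random walk (not the demo's gridworld): episodes start in the middle, stepping off the right end pays +1, off the left end pays 0, so the true values are \(V(s) = (s+1)/6\).

```python
import random

def first_visit_mc(episodes=5000, gamma=1.0, seed=0):
    """First-visit Monte Carlo prediction on a 5-state random walk."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(5)}
    counts = {s: 0 for s in range(5)}
    for _ in range(episodes):
        s, traj = 2, []
        while True:
            s2 = s + rng.choice([-1, 1])
            r = 1.0 if s2 == 5 else 0.0   # +1 only when exiting right
            traj.append((s, r))
            if s2 in (-1, 5):
                break
            s = s2
        # Walk backwards accumulating returns; the last write per state
        # is its FIRST visit, which is the one we keep.
        G, returns = 0.0, {}
        for s, r in reversed(traj):
            G = r + gamma * G
            returns[s] = G
        for s, G in returns.items():
            counts[s] += 1
            V[s] += (G - V[s]) / counts[s]   # incremental average
    return V

V = first_visit_mc()
```

With enough episodes the averages settle near the true values; the high per-episode variance is why the histogram in the demo spreads out while the running estimate narrows.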

        On-policy MC control (\(\varepsilon\)-soft)

        1. Initialize \(Q(s,a)\) and a soft policy.
        2. Generate an episode following the current policy.
        3. Compute returns for each state-action first visit.
        4. Update \(Q(s,a)\) with averaged returns.
        5. Improve the policy to be \(\varepsilon\)-greedy w.r.t. \(Q\).

        Grid. Episode rollout animation and evolving \(V(s)\) heatmap.

        Chart. Returns histogram (recent episodes).

        Chart. Value estimate of the start state over episodes.

        Temporal Difference Learning

        Blend bootstrapping with sampling for faster learning.

        TD(0) prediction

        • Update rule: \( V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)] \).
        • Random-walk demo with 5 states.

        TD(0) prediction

        1. Initialize \(V(s)\) for all states.
        2. Observe a transition \( (s, r, s') \).
        3. Compute the TD error \( \delta = r + \gamma V(s') - V(s) \).
        4. Update \(V(s) \leftarrow V(s) + \alpha \delta \).
        5. Repeat over many episodes to converge.
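The five steps map onto the 5-state random walk mentioned above. This is a minimal sketch with assumed parameter values; only the TD(0) update rule itself comes from the text.

```python
import random

def td0_random_walk(episodes=2000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) prediction on the 5-state random walk: start in the
    middle, +1 for exiting right, 0 for exiting left."""
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(5)}   # neutral initial values
    for _ in range(episodes):
        s = 2
        while True:
            s2 = s + rng.choice([-1, 1])
            if s2 == 5:    # terminal on the right: target is just r = 1
                V[s] += alpha * (1.0 - V[s])
                break
            if s2 == -1:   # terminal on the left: target is r = 0
                V[s] += alpha * (0.0 - V[s])
                break
            # TD error delta = r + gamma V(s') - V(s), with r = 0 inside
            V[s] += alpha * (gamma * V[s2] - V[s])
            s = s2
    return V

V = td0_random_walk()
```

Unlike Monte Carlo, each update happens immediately after a single transition, bootstrapping from the current estimate of the next state.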

        Chart. Current value estimates for the random-walk states.

        Chart. TD error \(\delta_t\) over a single episode.

        Chart. Value estimates per state across episodes.

        Model-Free Control

        Learn optimal policies directly from experience.

        Algorithms

        • SARSA: on-policy TD control.
        • Q-learning: off-policy TD control.
        • Expected SARSA: replaces the sampled next action with an expectation over the policy.

        SARSA

        1. Initialize \(Q(s,a)\) and choose a behavior policy.
        2. Take action \(a\), observe \(r, s'\), and pick next action \(a'\) from the same policy.
        3. Update \( Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)] \).
        4. Set \(s \leftarrow s'\), \(a \leftarrow a'\) and repeat.
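The SARSA loop can be sketched on an assumed toy environment: a 5-state chain with start at 0, goal at 4, and reward -1 per step, so shorter paths score higher. The environment and hyperparameters are illustrative.

```python
import random

def sarsa(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """On-policy SARSA on a 5-state chain."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}

    def policy(s):  # epsilon-greedy behavior policy
        if rng.random() < eps:
            return rng.choice(acts)
        return max(acts, key=lambda a: Q[(s, a)])

    def step(s, a):
        s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
        return s2, -1.0, s2 == 4   # next state, reward, done

    for _ in range(episodes):
        s, a, done = 0, policy(0), False
        while not done:
            s2, r, done = step(s, a)
            a2 = policy(s2)           # next action from the SAME policy
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()
```

Because the bootstrapped action \(a'\) is drawn from the behavior policy itself, SARSA learns the value of the policy it actually follows, exploration noise included.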

        Q-learning

        1. Initialize \(Q(s,a)\) and choose a behavior policy (e.g., \(\varepsilon\)-greedy).
        2. Take action \(a\), observe \(r, s'\).
        3. Update \( Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)] \).
        4. Repeat until Q-values converge to \(Q^*\).
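Q-learning differs from SARSA only in the bootstrap target: it uses the greedy \(\max_{a'}\) regardless of which action the behavior policy takes next. Same assumed 5-state chain as the SARSA sketch.

```python
import random

def q_learning(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Off-policy Q-learning on a 5-state chain (start 0, goal 4,
    reward -1 per step)."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behave epsilon-greedily...
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r, done = -1.0, s2 == 4
            # ...but bootstrap from the GREEDY next action (off-policy).
            target = r if done else r + gamma * max(Q[(s2, x)] for x in acts)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```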

        Expected SARSA

        1. Initialize \(Q(s,a)\) and a stochastic policy.
        2. Observe \(r, s'\) after taking action \(a\).
        3. Compute expected next value \( \sum_{a'} \pi(a'|s') Q(s',a') \).
        4. Update \( Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \mathbb{E}_{a'}Q(s',a') - Q(s,a)] \).
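Expected SARSA replaces the sampled \(Q(s',a')\) with the policy-weighted average, which removes the sampling noise from the target. Same assumed chain environment; the \(\varepsilon\)-greedy probabilities are computed in closed form.

```python
import random

def expected_sarsa(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Expected SARSA on a 5-state chain (start 0, goal 4, -1 per step)."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r, done = -1.0, s2 == 4
            if done:
                target = r
            else:
                greedy = max(acts, key=lambda x: Q[(s2, x)])
                # eps-greedy probs: greedy action gets 1 - eps + eps/|A|
                exp_v = sum(((1 - eps + eps / 2) if x == greedy else eps / 2)
                            * Q[(s2, x)] for x in acts)
                target = r + gamma * exp_v
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = expected_sarsa()
```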

        Grid. Greedy policy arrows from the current Q-table.

        Chart. Q-value heatmaps per action (Up/Right/Down/Left).

        Chart. Exploration schedule \(\varepsilon\) over time.

        Chart. Episode return over training.

        Deep RL Concepts

        When state spaces grow, deep networks approximate value functions or policies.

        Core ideas

        • Policy gradients optimize expected return directly.
        • Replay buffers stabilize off-policy learning.
        • Target networks reduce moving-target instability.

        Policy gradient (REINFORCE-style)

        1. Collect trajectories by sampling from \(\pi_\theta(a|s)\).
        2. Compute returns or advantages for each action.
        3. Estimate the gradient \( \nabla_\theta \log \pi_\theta(a|s) \hat{A} \).
        4. Update parameters \( \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \).
        5. Repeat with variance reduction (baseline) for stability.
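A minimal REINFORCE sketch on an assumed 2-armed bandit: the "network" is just two logits under a softmax, and a running mean reward serves as the baseline. Everything except the update rule itself is an illustrative assumption.

```python
import math, random

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """REINFORCE with a softmax policy over two logits and a
    running-mean baseline for variance reduction."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    means = [0.2, 0.8]        # true arm reward probabilities (assumed)
    baseline = 0.0
    for t in range(1, steps + 1):
        z = [math.exp(x) for x in theta]
        p = [x / sum(z) for x in z]          # softmax policy pi_theta
        a = 0 if rng.random() < p[0] else 1  # sample an action
        r = 1.0 if rng.random() < means[a] else 0.0
        baseline += (r - baseline) / t       # running mean reward
        adv = r - baseline                   # advantage estimate A-hat
        # grad of log pi(a) wrt theta_k is 1[k == a] - p_k for softmax
        for k in range(2):
            grad = (1.0 if k == a else 0.0) - p[k]
            theta[k] += alpha * adv * grad
    return theta

theta = reinforce_bandit()
```

After training, the logit of the higher-paying arm dominates, so the policy concentrates its probability mass on it.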

        Deep Q-learning (conceptual)

        1. Collect transitions into a replay buffer.
        2. Sample mini-batches uniformly from the buffer.
        3. Compute target \( r + \gamma \max_{a'} Q_{\text{target}}(s',a') \).
        4. Minimize the TD loss between \(Q_\theta\) and targets.
        5. Periodically sync the target network.
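The training-loop structure above can be sketched without a neural network by using a tabular Q as a stand-in for \(Q_\theta\). The environment (the same assumed 5-state chain as earlier) and all hyperparameters are illustrative; only the replay buffer, mini-batch targets, and periodic target sync mirror the listed steps.

```python
import random
from collections import deque

def q_with_replay(episodes=300, alpha=0.2, gamma=0.95, eps=0.2,
                  batch=16, sync_every=50, seed=0):
    """DQN-style loop with a tabular Q standing in for the network:
    replay buffer, mini-batch TD updates, periodic target sync."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}
    Q_target = dict(Q)                 # frozen copy used for targets
    buf = deque(maxlen=1000)           # replay buffer of transitions
    steps = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(acts) if rng.random() < eps else \
                max(acts, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r, done = -1.0, s2 == 4
            buf.append((s, a, r, s2, done))
            s = s2
            steps += 1
            if len(buf) >= batch:
                # Uniform mini-batch of stored transitions.
                for bs, ba, br, bs2, bdone in rng.sample(list(buf), batch):
                    target = br if bdone else \
                        br + gamma * max(Q_target[(bs2, x)] for x in acts)
                    Q[(bs, ba)] += alpha * (target - Q[(bs, ba)])
            if steps % sync_every == 0:
                Q_target = dict(Q)     # periodic target-network sync
    return Q

Q = q_with_replay()
```

Replay decorrelates consecutive transitions, and the frozen target table keeps the regression target from chasing its own updates, which is exactly the moving-target instability the bullet points describe.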
        Policy gradient estimator:

        \[ \nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) \, \hat{A}(s,a)] \]

        Deep RL diagnostic loop

        1. Collect experience into a replay buffer.
        2. Sample mini-batches for stable updates.
        3. Sync target networks periodically.
        4. Track reward curves and policy entropy.

        Policy gradient intuition

        Increase the probability of actions that led to higher return and decrease others.

        \[ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \, \hat{A} \]

        Jupyter-Style Math Notebook

        Work through the foundations topics in order. Each cell runs in a shared kernel, so results carry over between cells.


        Notebook goal

        Cover the Foundations math with a practical hands-on workshop: linear algebra, calculus, optimization, probability, activations, matrix calculus, plus PyTorch fundamentals and tutorials.

        Section 1: Python for math

        Lists, loops, and functions to prep for vector math.

        In [1] Numbers, lists, and summaries
        In [2] Functions and list math

        Section 2: Linear algebra (vectors + matrices)

        Dot products, norms, matrix-vector products, and matrix multiplication.

        In [3] Vector ops
        In [4] Matrices: matvec + matmul

        Section 3: Calculus + optimization

        Estimate derivatives and follow gradients downhill.

        In [5] Numerical derivative
        In [6] Gradient descent on a quadratic
        In [7] Plot a curve

        Section 4: Probability + activations

        Turn scores into probabilities and compare activation curves.

        In [8] Softmax + cross-entropy
        In [9] Activation functions
        In [10] Plot activation curves

        Section 5: Matrix calculus (batch gradients)

        Compute a vectorized gradient for linear regression.

        In [11] Batch gradient for linear regression

        Section 6: PyTorch fundamentals

        Tensors, autograd, modules, and optimizers. These cells print install guidance if PyTorch is unavailable.

        Install PyTorch for this lab

        This web lab runs in the browser and cannot install PyTorch. To run the PyTorch cells, open the notebook in local Jupyter or Colab and run the install cell below. For GPU builds, use the command from the PyTorch get-started page.

        In [12] Install PyTorch (local or Colab)
        In [13] Setup + device
        In [14] Tensors + matrix multiply
        In [15] Autograd basics
        In [16] Linear layer forward pass
        In [17] Loss + optimizer step

        Section 7: In-browser inference (ONNX Runtime Web)

        Run a small ONNX model directly in the browser using JavaScript.

        What this demo does

        Loads a tiny MNIST classifier and runs inference on simple 28x28 input patterns. No Python kernel required.

        If the model URL fails, use another ONNX model that accepts a 1x1x28x28 float tensor.

        JS MNIST inference demo

