🧠 Artificial Intelligence Math Foundations

Playful visuals and mini missions to explore the math behind neural networks


What is AI?

AI builds systems that can sense, decide, and improve with data.

Artificial Intelligence (AI) is the field of creating systems that perceive their world, reason or learn from data, and choose actions to reach goals.

Intuition: a curious student

AI studies examples, forms internal rules, then gets feedback to refine them.

Intuition: a navigator

AI observes a map, plans a route, acts, then updates the plan after each step.

AI categories in practice

Different AI families focus on different parts of the loop, but they often work together.

Symbolic & search

Rules, logic, planning, and graph search to make decisions.

Machine learning

Models that learn patterns from labeled or unlabeled data.

Deep learning

Neural networks for vision, speech, language, and high-dimensional signals.

Reinforcement learning

Agents learn policies through interaction and rewards.

Generative AI

Models that create text, images, or audio from learned distributions.

Embodied & robotics

Perception plus control to act in the physical world.

Most real systems blend categories, such as robots that plan with search and see with deep learning.

Foundations: Core Math Roadmap

A compact roadmap of the math that powers modern AI. Switch tabs to see each idea in action.

Step 1

Linear Algebra: Space Benders

Learn how vectors and matrices stretch, rotate, and move space.

Goal: read \( \mathbf{h} = W\mathbf{x} + \mathbf{b} \).
Step 2

Calculus: Change Detective

Understand slopes, tiny nudges, and the chain rule.

Goal: follow gradients through a network.
Step 3

Optimization: Mountain Hikes

Use gradients to walk downhill and find the best answers.

Goal: tune the learning rate \( \eta \).
Step 4

Probability: Confidence Radar

Turn scores into probabilities and measure mistakes.

Goal: use softmax and cross-entropy.
Step 5

Matrix Calculus: Fast Backprop

Compute gradients for whole batches at once.

Goal: track shapes and vectorized rules.

Neural networks are a stack of math ideas. Each tab below zooms in on one layer of that stack.

Linear algebra: what the network is

Think of a matrix as a magic machine that bends space. Each layer takes a vector and transforms it.

Matrices matter in deep learning because every layer is one big matrix multiply that bundles thousands of weights, letting us compute all of a layer's neuron activations, and whole batches of examples, in a single fast step.

\[ \mathbf{h} = W\mathbf{x} + \mathbf{b} \] \[ \mathbf{a} = \mathrm{ReLU}(\mathbf{h}) \]

Imagine a grid of points being stretched and tilted. That is what \(W\) does.

  • Vectors are arrows (direction + length).
  • Matrices are transformations (stretch, rotate, shear).
  • Layers stack these transformations with a nonlinearity.
Mini mission: pick a vector and predict where the grid sends it.
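The layer equations above can be sketched in a few lines of plain Python; the weights, bias, and input are made-up values for illustration:

```python
# Minimal sketch of one layer: h = W x + b, then ReLU.
# W, x, b are small illustrative values, not from a trained model.

def layer(W, x, b):
    # Each output h_i is the dot product of row i of W with x, plus b_i.
    h = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    # ReLU keeps positive values and zeroes out the rest.
    return [max(0.0, h_i) for h_i in h]

W = [[2.0, 0.0],   # stretches the first coordinate
     [0.0, -1.0]]  # flips the second
x = [1.0, 3.0]
b = [0.5, 0.5]

print(layer(W, x, b))  # first unit: 2*1 + 0.5 = 2.5; second: -3 + 0.5 = -2.5 -> ReLU -> 0.0
```

Try swapping the signs in \(W\) to see how the transformation bends different inputs.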

Calculus: how learning works

Derivatives tell us how fast things change. If the loss is a hill, the derivative tells the slope.

Let \(y = g(x)\) and \(L = f(y)\). A tiny nudge \(dx\) changes \(y\) by \(dy = g'(x)dx\), which changes the loss by \(dL = f'(y)dy\).

\[ y = g(x), \quad L = f(y) \] \[ \frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx} = f'(g(x)) \cdot g'(x) \]

Backprop is the chain rule applied again and again through the whole network.

  • Local slope \(\times\) upstream slope = new slope.
  • Each layer multiplies by its derivative and passes the signal back.
Mini mission: point to where the slope is steepest.
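A quick way to trust the chain rule is to compare it with an actual tiny nudge. Here \(g(x) = x^2\) and \(L = f(y) = \sin y\) are hypothetical functions chosen for illustration:

```python
import math

# Hypothetical choice for illustration: y = g(x) = x**2, L = f(y) = sin(y).
def g(x): return x * x
def f(y): return math.sin(y)

def dL_dx(x):
    # Chain rule: f'(g(x)) * g'(x)
    return math.cos(g(x)) * (2 * x)

# Check against a finite difference (a literal tiny nudge dx).
x, dx = 1.3, 1e-6
numeric = (f(g(x + dx)) - f(g(x))) / dx
print(dL_dx(x), numeric)  # the two slopes should agree closely
```

The analytic slope and the nudge-based slope match to several decimal places, which is exactly what backprop relies on.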

Optimization: gradient descent and learning rate

Once we know the slope, we take a step downhill to reduce the loss.

For a linear layer \( \mathbf{z} = W\mathbf{x} + \mathbf{b} \) with softmax + cross-entropy, the error at the logits is \( \hat{\mathbf{y}} - \mathbf{y} \).

\[ W \leftarrow W - \eta (\hat{\mathbf{y}} - \mathbf{y}) \mathbf{x}^T, \quad \mathbf{b} \leftarrow \mathbf{b} - \eta (\hat{\mathbf{y}} - \mathbf{y}) \]
  • \(\eta\): learning rate (step size)
  • \(\hat{\mathbf{y}} - \mathbf{y}\): probability error signal from the loss
  • \(\mathbf{x}\): input features that scale the step
  • Too big → unstable or diverges
  • Too small → slow progress
Mini mission: choose a step size that reaches the valley without bouncing.
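The update rule above can be sketched in plain Python. The weights, input, and learning rate are made-up values; one step should already lower the loss:

```python
import math

# One gradient-descent step on a softmax layer; all numbers are illustrative.
def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def forward(W, b, x):
    z = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(b))]
    return softmax(z)

def loss(y, y_hat):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

x = [1.0, 2.0]
y = [1.0, 0.0]                      # one-hot true label
W = [[0.1, 0.2], [0.3, 0.4]]
b = [0.0, 0.0]
eta = 0.5                           # learning rate

before = loss(y, forward(W, b, x))
err = [p - t for p, t in zip(forward(W, b, x), y)]   # y_hat - y at the logits

# W <- W - eta * err x^T ; b <- b - eta * err
W = [[W[i][j] - eta * err[i] * x[j] for j in range(2)] for i in range(2)]
b = [b[i] - eta * err[i] for i in range(2)]

after = loss(y, forward(W, b, x))
print(before, after)   # the loss drops after the step
```

Raise `eta` far above 1 and repeat the step a few times to watch the "too big → unstable" failure mode.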

Probability + information theory

For classification, the network turns scores into probabilities and compares them to the truth.

\[ \mathbf{z} = W\mathbf{x} + \mathbf{b} \] \[ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}), \quad \hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}} \] \[ L = -\sum_i y_i \log \hat{y}_i \]
  • Scores \(\mathbf{z}\): raw logits before normalization.
  • \(\hat{\mathbf{y}}\): softmax outputs probabilities that sum to 1.
  • \(\mathbf{y}\): the true label as a one-hot distribution.
  • Loss \(L\): cross-entropy penalizes low probability on the true class; minimizing \(L\) pushes \(\hat{\mathbf{y}}\) toward \(\mathbf{y}\).

Loss derivation: with \( \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}) \) and one-hot \( \mathbf{y} \), the log-softmax derivative gives a simple gradient.

\[ \frac{\partial L}{\partial z_j} = \sum_i y_i (\hat{y}_j - \delta_{ij}) = \hat{y}_j - y_j \]

Result: \( \nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y} \), the signal that drives backprop.

Mini mission: spot the biggest probability bar.
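A minimal softmax plus cross-entropy sketch in plain Python, with arbitrarily chosen logits:

```python
import math

# Softmax turns scores into probabilities; cross-entropy scores them against the truth.
def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def cross_entropy(y, y_hat):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

z = [2.0, 1.0, 0.1]                 # raw logits (illustrative)
y = [1.0, 0.0, 0.0]                 # one-hot truth
p = softmax(z)
print(p, cross_entropy(y, p))
# The gradient at the logits is simply p - y:
print([pi - yi for pi, yi in zip(p, y)])
```

Note how the gradient is negative only at the true class, pushing its logit up and the others down.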

Matrix calculus: the speed path

To train fast, we compute gradients for a whole batch at once. Think of matrices as a neat way to stack many examples and do one big calculation.

  • \(X\): a batch of inputs (rows are examples).
  • \(W\): weights (columns are feature-to-output recipes).
  • \(Y\): outputs for the whole batch.
\[ Y = XW \] \[ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T \] \[ \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y} \]

Intuition: each gradient formula is "reverse flow" of the forward pass. We transpose the weight matrix to send signals back to the inputs, and we combine all examples with \(X^T\) to update the weights in one shot.

  1. Forward pass: multiply the batch \(X\) by weights \(W\) to get outputs \(Y\).
  2. Backprop to inputs: push gradients through \(W^T\) to get \( \partial L / \partial X \).
  3. Backprop to weights: combine \(X^T\) with \( \partial L / \partial Y \) to get \( \partial L / \partial W \).

Shapes are your superpower: if \(X\) is \(n \times d\) and \(W\) is \(d \times m\), then \(Y\) is \(n \times m\). The gradients match those same shapes.

Mini mission: label the shapes of \(X\), \(W\), and \(Y\).
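A shape-checking sketch of the three formulas, using plain-Python matrix helpers and a made-up upstream gradient:

```python
# Shapes: X is n×d, W is d×m, so Y = XW is n×m.
# dL/dX = (dL/dY) W^T gives n×d; dL/dW = X^T (dL/dY) gives d×m.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # n=2 examples, d=3 features
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]               # d=3, m=2 outputs
Y = matmul(X, W)               # 2×2

dL_dY = [[1.0, 0.0],
         [0.0, 1.0]]           # pretend upstream gradient (illustrative)
dL_dX = matmul(dL_dY, transpose(W))   # 2×3, same shape as X
dL_dW = matmul(transpose(X), dL_dY)   # 3×2, same shape as W
print(len(dL_dX), len(dL_dX[0]), len(dL_dW), len(dL_dW[0]))
```

The printed dimensions confirm the rule: every gradient has the same shape as the thing it differentiates.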

Tap a tab to switch the picture.

Python practice

Foundations playground

Open the slide-out panel to practice the math and see live output.

Tip: Use print() to surface intermediate steps.

Vectors: direction, magnitude, and meaning

Explore vectors as data points, similarity scores, and distances. Switch to Deep Dive for norms, projections, and basis intuition.

Vector playground

Think of vector A as a data point \(x\) and vector B as model weights \(w\). Use the buttons to see how scores and distances behave.

  • Data point: the arrow from the origin to \(x\).
  • Modulus (length): \( \lVert x \rVert = \sqrt{x_1^2 + x_2^2} \).
  • Dot product: the shadow of \(x\) on \(w\).
  • Cosine: angle-only similarity.
  • Distance: how far two points are apart.
\[ x = [x_1, x_2], \quad w = [w_1, w_2] \]
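The same quantities the playground displays can be computed directly; the two vectors here are made-up values:

```python
import math

# Vector basics on 2-D lists; x plays the data point, w plays the weights.
x = [3.0, 4.0]
w = [4.0, 3.0]

add = [xi + wi for xi, wi in zip(x, w)]          # sum
dot = sum(xi * wi for xi, wi in zip(x, w))       # dot product
modulus = math.hypot(*x)                         # length of x
cosine = dot / (math.hypot(*x) * math.hypot(*w)) # angle-only similarity
distance = math.dist(x, w)                       # straight-line gap
print(add, dot, modulus, round(cosine, 3), distance)
```

With these values both vectors have length 5, so the cosine of 0.96 says they point in nearly the same direction even though they are distinct points.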

Deep dive: advanced geometry

Step through projections, norms, subspaces, gradients, vector stats, regularization, and attention with visual intuition.

\[ \lVert x \rVert = \sqrt{x_1^2 + x_2^2} \; (\text{modulus}), \quad x \cdot w = \sum_i x_i w_i, \quad x = \sum_i \alpha_i b_i \]

Scene 1: Projections and orthogonality


Deep dive concepts

Advanced vector ideas that show up in PCA, optimization, and attention.

1. Projections & orthogonality

  • Projection onto \(u\): \( \mathrm{proj}_u(x) = \frac{x \cdot u}{u \cdot u} u \).
  • Orthogonal vectors give clean coordinate systems (PCA/SVD).
2. Norms & normalization

  • Modulus (length): \( \lVert x \rVert_2 = \sqrt{\sum_i x_i^2} \).
  • Normalize to compare directions independent of scale.
3. Subspaces, basis, rank

  • Span is the set of all linear combinations of vectors.
  • Rank is the dimension of that span (effective dimensionality).
4. Gradients, Jacobians, Hessians

  • \(\nabla_x L\) is the gradient of a scalar loss with respect to a vector.
  • Jacobian \(J\) stacks partials: \(J_{ij} = \partial f_i / \partial x_j\).
  • If \(f: \mathbb{R}^n \to \mathbb{R}^m\), then \(J\) is \(m \times n\) and \(f(x+\Delta x) \approx f(x) + J\Delta x\).
  • Hessian \(H = \nabla^2 L\) captures curvature; eigenvalues reveal minima vs saddles.
  • Second-order step: \(\Delta x = -H^{-1} \nabla L\) (Newton-style).
5. Vector statistics

  • Mean vector and covariance describe feature distributions.
  • Centering and whitening stabilize training and PCA.
6. Regularization geometry

  • L2 prefers small norms; L1 encourages sparsity.
  • Constraint shapes explain why L1 yields zeros.
7. Attention via dot products

  • Similarity scores become weights after softmax.
  • Output is a weighted sum of value vectors.
Python playground

Vector practice

Compute dot products, norms, and cosine similarity with live code.

Tip: Update x and w to match the visual slider values.

Matrices: transforms, determinants, inverses

Matrices reshape space. Determinants measure area scaling, inverses undo transforms, and covariance/precision describe spread along eigenvector axes.

Matrix playground

Focus on the three core moves: add, multiply, and reshape space with scaling + shear.

  • Addition mixes layers in the same shape.
  • Multiplication blends rows and columns.
  • Scaling stretches; shear slants the grid.
\[ C = AB \]

Multiplication mixes rows and columns to transform vectors.

Matrix deep dive: advanced operators

Explore determinants, inverses, covariance/precision, identity, and eigenvectors with step-by-step formulas.

\[ \det(A) = ad - bc \]

Determinant is the area scale factor; sign indicates a flip.

  • Small determinants signal near-singular matrices.
  • Precision matrices reweight directions in space.
  • Eigenvectors preserve direction under \(A\).

Derivations: covariance, precision, eigen

Determinant (2x2)

\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad \det(A)=ad-bc \]
  1. Expand by the first row: \(a\cdot d - b\cdot c\).
  2. \(|\det(A)|\) is area scale; sign means flip.
  3. \(\det(A)=0\) implies no inverse exists.

Inverse (2x2)

\[ A^{-1}=\frac{1}{\det(A)}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]
  1. Require \(\det(A)\neq 0\).
  2. Swap the diagonal, negate off-diagonals.
  3. Divide by \(\det(A)\) so \(AA^{-1}=I\).

Covariance matrix

\[ \mu=\frac{1}{n}\sum_{i=1}^n x_i,\quad X_c = X-\mathbf{1}\mu^T \] \[ \Sigma=\frac{1}{n-1}X_c^T X_c \]
  1. Center each feature: subtract the mean \(\mu\).
  2. Accumulate outer products: \(x_c x_c^T\).
  3. Average: \(\Sigma_{xy}=\frac{1}{n-1}\sum (x_i-\mu_x)(y_i-\mu_y)\).

Precision matrix

\[ \Lambda = \Sigma^{-1} = \frac{1}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2} \begin{bmatrix} \sigma_y^2 & -\sigma_{xy} \\ -\sigma_{xy} & \sigma_x^2 \end{bmatrix} \]
  1. Compute \(\det(\Sigma)\).
  2. Apply the 2x2 inverse formula.
  3. In a Gaussian, \((x-\mu)^T\Lambda(x-\mu)\) is the squared Mahalanobis distance.

Eigenvalues & eigenvectors

\[ A\mathbf{v}=\lambda \mathbf{v},\quad \det(A-\lambda I)=0 \] \[ \lambda=\frac{\operatorname{tr}A \pm \sqrt{(\operatorname{tr}A)^2-4\det(A)}}{2} \]
  1. Solve the characteristic equation for \(\lambda\).
  2. Plug into \((A-\lambda I)\mathbf{v}=0\) to get \(\mathbf{v}\).
  3. For symmetric \(A\), eigenvectors are orthonormal axes.
Mini mission: spot the operation that flips the grid.
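The 2×2 formulas above translate directly into code. The matrix here is an arbitrary symmetric example, so its eigenvalues are real:

```python
import math

# 2×2 helpers matching the formulas above: det, inverse, eigenvalues via trace/det.
def det(A):
    (a, b), (c, d) = A
    return a * d - b * c

def inverse(A):
    (a, b), (c, d) = A
    s = det(A)
    assert s != 0, "singular matrix has no inverse"
    # Swap the diagonal, negate off-diagonals, divide by det.
    return [[d / s, -b / s], [-c / s, a / s]]

def eigenvalues(A):
    tr, d = A[0][0] + A[1][1], det(A)
    disc = math.sqrt(tr * tr - 4 * d)   # assumes real eigenvalues (e.g. symmetric A)
    return ((tr + disc) / 2, (tr - disc) / 2)

A = [[2.0, 1.0], [1.0, 2.0]]   # symmetric illustrative example
print(det(A))                  # the grid's area is scaled by 3
print(inverse(A))
print(eigenvalues(A))          # (3.0, 1.0)
```

A determinant of 3 with no sign flip means the grid is stretched but not mirrored; the eigenvalues 3 and 1 are the stretch factors along the eigenvector axes.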
Python playground

Matrix practice

Multiply matrices and inspect batch gradients in code.

Tip: Change X and W to see how shapes affect gradients.

Probability: uncertainty and confidence

Go from events to distributions, and see how probabilities become confidence bars.

Events, distributions, Bayes

  • Probabilities sum to 1 across a full set of outcomes.
  • Conditional probability updates beliefs with new evidence.
  • Expectations summarize the average outcome.
\[ P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \] \[ \mathbb{E}[X] = \sum_x x \, P(X=x) \]

Graph: bars show probability mass per outcome, the curve shows probability density, and the Bayes view compares prior, likelihood, and posterior.

Probability review: events & notation

  • \(P(A)\) is between 0 and 1.
  • Sample space \(S\) contains all outcomes; \(P(S)=1\).
  • Complement: \(P(\overline{A}) = 1 - P(A)\).
  • Union: \(P(A \cup B)\), Intersection: \(P(A \cap B)\).
  • Conditional: \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\).
  • Random variable \(X\) maps outcomes to numbers, with \(P(X=x)\).
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \] \[ P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \]

Venn diagram highlights the region tied to the formula.

Bayesian spam filter intuition

Estimate the prior \(P(\text{spam})\) from training data, then combine it with word likelihoods.

\[ P(\text{spam} \mid \text{message}) \propto P(\text{spam}) \prod_i P(\text{word}_i \mid \text{spam}) \]

Bag-of-words treats each word independently, so likelihoods multiply.

\[ P(\text{message}) = P(\text{message} \mid \text{spam})P(\text{spam}) + P(\text{message} \mid \text{ham})P(\text{ham}) \]

Prior × likelihoods → posterior.
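The spam computation can be sketched with made-up prior and word likelihoods; the denominator is exactly the total-probability formula above:

```python
# Toy numbers (made up): combine a prior with per-word likelihoods, then normalize.
prior_spam, prior_ham = 0.3, 0.7
p_words_spam = 0.8 * 0.6        # product of P(word_i | spam) for two words
p_words_ham = 0.1 * 0.2         # product of P(word_i | ham)

joint_spam = prior_spam * p_words_spam
joint_ham = prior_ham * p_words_ham
# P(message) = joint_spam + joint_ham, by total probability.
posterior_spam = joint_spam / (joint_spam + joint_ham)
print(round(posterior_spam, 3))
```

Even with a modest 0.3 prior, two strongly spam-flavored words push the posterior above 0.9, which is the "likelihoods dominate the prior" effect in action.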

Why negative log likelihood becomes the loss

Training chooses parameters that make the observed labels most probable. This is maximum likelihood estimation.

  1. Model the probability of the correct label. For each example \( (x_i, y_i) \), the model returns \( p_\theta(y_i \mid x_i) \).
  2. Likelihood of the dataset. Assuming samples are independent, \( L(\theta) = \prod_{i=1}^N p_\theta(y_i \mid x_i) \).
  3. Log-likelihood simplifies the product. The log is monotonic, so maximizing \(L\) matches maximizing \( \log L = \sum_i \log p_\theta(y_i \mid x_i) \), and it avoids numerical underflow.
  4. Negative log-likelihood is the minimization objective. We minimize \( \mathcal{L}(\theta) = -\log L(\theta) \). For one-hot classification, this is cross-entropy.
\[ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^N \log p_\theta(y_i \mid x_i) = -\frac{1}{N}\sum_{i=1}^N \sum_k y_{ik} \log \hat{y}_{ik} \]

Step 1: Multiply

\[ L = 0.9 \times 0.6 \times 0.2 = 0.108 \]

Step 2: Log-sum

\[ \begin{aligned} \log L &= \log 0.9 + \log 0.6 + \log 0.2 \\ &= -0.105 - 0.511 - 1.609 \\ &= -2.225 \end{aligned} \]

Step 3: Negate

\[ \mathcal{L} = -\log L = 2.225 \]

Low loss means the model assigns high probability to the true labels; overconfident mistakes are penalized most.

1. Likelihood (product)

0.9 × 0.6 × 0.2 = 0.108
Multiply probabilities across samples.
\[ L(\theta) = \prod_i p_\theta(y_i \mid x_i) \]
2. Log-likelihood (sum)

\(\log 0.9\) + \(\log 0.6\) + \(\log 0.2\) = \(-2.225\)
Log turns products into sums and keeps numbers stable.
\[ \log L(\theta) = \sum_i \log p_\theta(y_i \mid x_i) \]
3. Negative log-likelihood (loss)

\(-\log L\) = 2.225
Negate so minimization matches maximum likelihood.
\[ \mathcal{L}(\theta) = -\sum_i \log p_\theta(y_i \mid x_i) \]

Product → log sum → negate so minimization equals maximum likelihood.

Interactive

Log-likelihood playground: estimate \(p\)

Use a Bernoulli model to estimate the probability of success. Step through the derivation, then tweak the data to see the curve move.


Follow the steps, then use the sliders to see how the estimate changes.

Playground

Likelihood explorer

Adjust trials and successes. The curve peaks at the most likely \(p\).


The peak of the curve is the best probability estimate.
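A grid-search sketch of the Bernoulli log-likelihood; the trial counts are illustrative, and the maximizer should land at \( \hat{p} = k/N \):

```python
import math

# Bernoulli log-likelihood for k successes in N trials; the MLE is k/N.
def log_likelihood(p, N, k):
    return k * math.log(p) + (N - k) * math.log(1 - p)

N, k = 10, 7                     # illustrative data
candidates = [i / 100 for i in range(1, 100)]   # grid over (0, 1)
best = max(candidates, key=lambda p: log_likelihood(p, N, k))
print(best)  # peaks at k/N = 0.7
```

Because the log-likelihood is concave in \(p\), the grid maximum sits exactly on the analytic answer \(k/N\) whenever that value is on the grid.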

Sample space simulator

Uniform dots are outcomes. A is the left slice, B is the top slice, and the overlap shows \(A \cap B\).

\[ P(A)=\frac{|A|}{|S|},\quad P(A \mid B)=\frac{P(A \cap B)}{P(B)} \]

Counts update from simulated dots.

Hidden Markov Models & filters

What is a Hidden Markov Model?

A Hidden Markov Model (HMM) describes a system with hidden states that evolve over time and generate observable data.

  • Hidden states transition via a Markov process.
  • Observations are emitted based on the current hidden state.
  • The task is to infer hidden states from observations.
\[ P(X_t \mid X_{t-1}) \quad \text{and} \quad P(E_t \mid X_t) \]

Common uses: speech recognition, NLP, bioinformatics, and time-series analysis.

What filters do

Filters estimate hidden state as new, noisy observations arrive, updating beliefs step by step.

  • Continuous states and observations.
  • Linear dynamics and observation models.
  • Gaussian noise assumptions.

The Kalman filter provides an optimal recursive estimate under those conditions.

Used in radar tracking, navigation, and robotics for position/velocity estimates.

Hidden Markov Model

Hidden states (circles) emit observations (squares) over time.

Kalman filter

Prediction (line) is corrected toward noisy measurements (dots).

Bernoulli

Binary outcomes with parameter \(p\).

Use for clicks, coins, yes/no labels.

\[ P(X=1)=p,\quad P(X=0)=1-p \]

Binomial

Counts successes in \(n\) trials.

Use for pass/fail counts in batches.

\[ P(X=k)=\binom{n}{k}p^k(1-p)^{n-k} \]

Categorical

Multiple discrete outcomes with probabilities.

Use for class labels or choices.

\[ P(X=i)=p_i,\quad \sum_i p_i = 1 \]

Normal

Continuous bell curve with \(\mu\) and \(\sigma\).

Use for noise, heights, residuals.

\[ X \sim \mathcal{N}(\mu,\sigma^2) \] \[ f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)} \]
Python playground

Probability practice

Test softmax and cross-entropy with a tiny classifier.

Tip: Swap logits to see the loss change.

Gradient Descent

Finding the Minimum

Gradient descent is how neural networks learn. It's like walking downhill to find the lowest point:

\[ \theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla J(\theta) \]
  • \(\theta\): Parameter being optimized
  • \(\alpha\): Learning rate (step size)
  • \(\nabla J(\theta)\): Gradient (direction of steepest increase)
Python playground

Optimization practice

Run a few gradient steps and watch the loss drop.

Tip: Tune the learning rate to see convergence speed.

Activation Functions

Non-Linear Transformations

Activation functions add non-linearity to neural networks, enabling them to learn complex patterns:

Sigmoid: \( \sigma(x) = \frac{1}{1+e^{-x}} \)

Squashes values to (0, 1). Used for binary classification.

ReLU: \( f(x) = \max(0, x) \)

Most popular! Simple and effective. Returns x if positive, 0 otherwise.

Tanh: \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)

Squashes values to (-1, 1). Zero-centered version of sigmoid.
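The three activations can be compared side by side in a quick sketch; the sample inputs are arbitrary:

```python
import math

# Sigmoid, ReLU, and tanh evaluated at a few illustrative points.
def sigmoid(x): return 1 / (1 + math.exp(-x))
def relu(x): return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), relu(x), round(math.tanh(x), 3))
```

At 0, sigmoid returns 0.5 while tanh returns 0, which is the "zero-centered" difference mentioned above; ReLU simply clips the negative input to 0.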

What is Machine Learning?

Machine learning is the study of systems that learn patterns from data to make predictions or decisions without being explicitly programmed for each case.

Definition

Given data \(x\) and outcomes \(y\), machine learning finds a function \(f_\theta\) that maps inputs to outputs by optimizing parameters \(\theta\) to reduce error.

\[ \theta^* = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i) \]

How it works (intuitive)

  • Observe: collect examples and features.
  • Learn: fit a model that captures patterns.
  • Evaluate: test on new data to check generalization.
  • Use: deploy the model to make predictions.

Supervised learning

Learn from labeled examples to predict targets or classes.

Unsupervised learning

Discover structure or clusters without labels.

Self-supervised learning

Generate labels from data itself to learn useful representations.

Machine Learning Algorithms

Common algorithm families

  • Supervised: learn from labeled examples to predict or classify.
  • Unsupervised: discover structure without labels.
  • Probabilistic: express uncertainty with likelihoods and priors.
  • Ensembles: combine many models to reduce error.
\[ \theta^* = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i) + \lambda \lVert \theta \rVert^2 \]

Use the cards as a quick baseline picker before tuning deeper models.

Linear Regression

Predict continuous targets with a weighted sum.

Best-fit line follows the trend.

\[ \hat{y} = \mathbf{w}^\top \mathbf{x} + b \]

Logistic Regression

Binary classification with a sigmoid link.

Sigmoid separates two classes.

\[ P(y=1 \mid x) = \sigma(\mathbf{w}^\top \mathbf{x} + b) \]

k-NN

Predict by voting among nearest neighbors.

Nearest neighbors vote on the label.

Strong baseline for small data.

Decision Trees

Recursive if-then splits on features.

Splits carve the data space.

Interpretable and fast.

Random Forest

Bagged trees reduce variance.

Many trees vote together.

Robust, less tuning.

Support Vector Machine

Max-margin boundary; kernels add nonlinearity.

Max-margin boundary balances both classes.

Great for medium-sized data.

Naive Bayes

Probabilistic classifier with conditional independence.

Likelihoods overlap at the decision point.

Fast for text and spam.

k-Means

Cluster points around K centroids.

Centroids pull points into clusters.

Simple unsupervised grouping.

PCA

Project data onto top-variance directions.

Principal axis captures max variance.

Dimensionality reduction.

Formulas and derivations

Short derivations that connect each algorithm to its core objective.

Linear Regression

\[ J(\mathbf{w}) = \frac{1}{n}\lVert X\mathbf{w} - \mathbf{y} \rVert^2 \]
  1. Take gradient: \( \nabla J = \frac{2}{n} X^T (X\mathbf{w} - \mathbf{y}) \).
  2. Set to zero: \( X^T X \mathbf{w} = X^T \mathbf{y} \).
  3. Solve for \( \mathbf{w} \) (normal equation) or use gradient descent.
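For one feature plus a bias term, the normal equation reduces to a 2×2 solve. This sketch uses a noiseless made-up dataset so the recovered line is exact:

```python
# Normal equation for one feature plus bias: solve (X^T X) w = X^T y with the 2×2 inverse.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # exactly y = 2x + 1 (illustrative, noise-free)

# Design matrix columns are [x, 1]; accumulate X^T X and X^T y entry by entry.
sxx = sum(x * x for x in xs); sx = sum(xs); n = len(xs)
sxy = sum(x * y for x, y in zip(xs, ys)); sy = sum(ys)

# Cramer's rule on the 2×2 system [[sxx, sx], [sx, n]] [w, b] = [sxy, sy].
det = sxx * n - sx * sx
w = (n * sxy - sx * sy) / det     # slope
b = (sxx * sy - sx * sxy) / det   # intercept
print(w, b)
```

With noisy data the same formulas return the least-squares fit rather than an exact line.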

Logistic Regression

\[ L = -\sum_i \left[ y_i \log \sigma(z_i) + (1-y_i)\log(1-\sigma(z_i)) \right] \]
  1. Let \( z_i = \mathbf{w}^T \mathbf{x}_i + b \), \( p_i = \sigma(z_i) \).
  2. Derivative: \( \partial L / \partial z_i = p_i - y_i \).
  3. Gradient: \( \nabla_{\mathbf{w}} L = X^T(\mathbf{p} - \mathbf{y}) \).

k-NN

\[ \hat{y} = \text{mode}\{y_i : i \in \mathcal{N}_k(x)\} \]
  1. No training objective; store all labeled points.
  2. Compute distances \( d(x, x_i) \) to all points.
  3. Pick the k closest neighbors and vote (or average for regression).
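The three steps above fit in a minimal sketch with made-up 2-D points:

```python
from collections import Counter
import math

# k-NN: measure distance to every stored point, then vote among the k closest.
def knn_predict(points, labels, query, k=3):
    dists = sorted((math.dist(p, query), lab) for p, lab in zip(points, labels))
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

There is no training step: all the work happens at query time, which is why k-NN is quick to set up but slow on large datasets.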

Decision Trees

\[ \text{Gain} = H(\text{parent}) - \sum_k p_k H(\text{child}_k) \]
  1. Compute impurity (entropy or Gini) at the parent node.
  2. Evaluate candidate splits and compute weighted child impurity.
  3. Choose the split with the largest information gain.

Random Forest

\[ \hat{y} = \frac{1}{T} \sum_{t=1}^T h_t(x) \]
  1. Train many trees on bootstrapped samples with random feature subsets.
  2. Aggregate predictions by averaging or voting.
  3. Averaging reduces variance when trees are diverse.

Support Vector Machine

\[ \min_{\mathbf{w}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) \]
  1. Maximize margin by minimizing \( \lVert \mathbf{w} \rVert^2 \).
  2. Hinge loss penalizes points inside the margin or misclassified.
  3. Only support vectors affect the optimal boundary.

Naive Bayes

\[ P(y \mid x) \propto P(y)\prod_i P(x_i \mid y) \]
  1. Start from Bayes: \( P(y \mid x) = P(x \mid y)P(y)/P(x) \).
  2. Assume conditional independence: \( P(x \mid y) = \prod_i P(x_i \mid y) \).
  3. Choose the class with the largest posterior.

k-Means

\[ \min_{\{\mu_k\}} \sum_i \lVert x_i - \mu_{c_i} \rVert^2 \]
  1. Assign each point to the nearest centroid.
  2. Update centroids by setting \( \partial J / \partial \mu_k = 0 \Rightarrow \mu_k = \text{mean} \).
  3. Repeat until assignments stabilize.

PCA

\[ \max_{\lVert \mathbf{w} \rVert = 1} \mathbf{w}^T \Sigma \mathbf{w} \]
  1. Use Lagrange multiplier: \( \Sigma \mathbf{w} = \lambda \mathbf{w} \).
  2. The top eigenvector gives the max-variance direction.
  3. Project data onto the leading eigenvectors.
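For 2-D data the whole PCA recipe fits in one sketch: center, build the covariance entries, use the 2×2 trace/det formula for the top eigenvalue, then read off its eigenvector from \((\Sigma - \lambda I)\mathbf{v} = 0\). The data values are illustrative:

```python
import math

# PCA on made-up 2-D data via the 2×2 eigenvalue formula.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance entries (divide by n - 1).
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

tr, det = sxx + syy, sxx * syy - sxy * sxy
lam = (tr + math.sqrt(tr * tr - 4 * det)) / 2   # largest eigenvalue
v = (sxy, lam - sxx)                            # solves (Σ - λI)v = 0 when sxy != 0
norm = math.hypot(*v)
print(lam, (v[0] / norm, v[1] / norm))          # variance captured + principal axis
```

Projecting each centered point onto the printed unit vector gives the 1-D PCA representation.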

Algorithm Walkthroughs

Step-by-step algorithm views

Select an algorithm to see the key steps and a matching visualization.


    Training vs Inference

    Step-by-step ML lifecycle

    Training tunes the weights with data and gradients. Inference freezes the weights and just predicts.


      Neural Networks: From Neurons to Transformers

      Zoom in from a single neuron all the way up to deep networks, CNNs, RNNs, LSTMs, and transformers. Each module walks step-by-step and animates the flow.

      Forward propagation refresher

      A network is stacked math. Inputs become features, hidden layers reshape them, and the output layer turns them into predictions.

      1. Input layer: normalized features (pixels, embeddings, sensor readings).
      2. Hidden layers: weighted sums + activations build new features.
      3. Output layer: scores or probabilities for each class.
      \[ \text{hidden} = \text{activation}(\text{input} \times W_1 + b_1) \] \[ \text{output} = \text{activation}(\text{hidden} \times W_2 + b_2) \]

      Inference stops here. Training continues with loss, backprop, and weight updates.

      Neural network architectures at a glance

      Different architectures specialize in different kinds of data: images, sequences, or long-range context.

      • Feedforward nets handle tabular features and classic classification.
      • CNNs are spatial pattern detectors for images and video.
      • RNNs and LSTMs handle sequences over time.
      • Transformers scale to long context across domains.

      Feedforward (MLP)

      Classic dense layers for structured data.

      Deep Neural Network

      Stacks many layers to build feature hierarchies.

      Convolutional Network

      Filters slide across images to find patterns.

      Recurrent Network

      State flows through time for sequences.

      LSTM

      Gated memory handles long-range signals.

      Transformer

      Self-attention mixes all tokens at once.

      Neuron basics: weighted sum → activation

      Each neuron multiplies inputs by weights, adds a bias, then runs an activation function.

      1. Inputs arrive as numbers.
      2. Weights scale each input.
      3. Bias shifts the total.
      4. Activation decides the output.
      \[ z = \sum_i x_i w_i + b, \quad a = f(z) \]
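The formula above becomes a one-function sketch; the inputs and weights are made up, and a sigmoid stands in for the generic activation \(f\):

```python
import math

# One neuron: weighted sum plus bias, then a sigmoid activation (illustrative choice of f).
def neuron(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b   # weighted sum + bias
    return 1 / (1 + math.exp(-z))                  # sigmoid activation

x = [0.5, 0.8, 0.2]    # illustrative inputs
w = [0.4, -0.6, 1.0]   # illustrative weights
b = 0.1
print(neuron(x, w, b))
```

Here \(z = 0.02\), so the output sits just above 0.5: the positive and negative contributions nearly cancel.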


      Digit recognition: inference vs training

      Simple digit recognition starts from pixels, transforms them into hidden features, and predicts a digit.

      Inference (predict)

      1. Read the image.
      2. Flatten + normalize.
      3. Hidden activations.
      4. Output probabilities.

      Training (learn)

      1. Forward pass (same as inference).
      2. Compute loss vs label.
      3. Backpropagate gradients.
      4. Update weights.


      Backpropagation: step-by-step

      Backpropagation computes how each weight should change to reduce the loss.

      1. Forward pass: compute predictions.
      2. Loss: compare prediction to target.
      3. Backward pass: send gradients backward.
      4. Update: adjust weights with the learning rate.
      \[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial w} \]

      Deep neural networks: stacking layers

      More layers let the model build a hierarchy of features.

      • Early layers: edges and simple shapes.
      • Middle layers: textures and parts.
      • Late layers: whole objects or concepts.

      Examples: face recognition, speech-to-text, medical scans.


      CNNs: slide a filter across the image

      Convolutions scan a small filter across pixels to detect patterns like edges or corners.

      • Shared weights detect the same pattern anywhere.
      • Each output cell is a dot product of patch and kernel.
      • Pooling summarizes nearby activations.
      • Stacked filters build complex features.
      \[ Y_{i,j} = \sum_{u=0}^{k-1}\sum_{v=0}^{k-1} X_{i+u, j+v} K_{u,v} + b \] \[ H_{out} = H - k + 1 \]
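The sliding dot product above can be sketched directly. (As in most deep learning code, this computes cross-correlation, i.e. the kernel is not flipped.) The image and kernel values are made up:

```python
# Valid 2-D convolution: slide a k×k kernel over the image, dot product per patch.
def conv2d(X, K, b=0.0):
    k = len(K)
    H, W = len(X), len(X[0])
    return [[sum(X[i + u][j + v] * K[u][v]
                 for u in range(k) for v in range(k)) + b
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]

X = [[1, 2, 3, 0],
     [0, 1, 2, 3],
     [3, 0, 1, 2],
     [2, 3, 0, 1]]
K = [[1, 0],
     [0, -1]]       # a tiny diagonal-difference filter (illustrative)
Y = conv2d(X, K)
print(len(Y), len(Y[0]))   # H_out = H - k + 1 on both axes
print(Y)
```

The same two kernel weights are reused at every position, which is the "shared weights detect the same pattern anywhere" property.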


      RNNs: memory over time

      Recurrent networks reuse the same weights at every time step, passing a hidden state forward.

      • Hidden state stores short-term memory.
      • Same matrices \(W_x, W_h\) are reused at each step.
      • Great for sequences like text or sensor data.
      • Outputs can appear at every step.
      \[ h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \quad y_t = \mathrm{softmax}(W_y h_t) \]
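A scalar version of the recurrence makes the weight reuse easy to see; the weights here are made-up values, and the output projection is omitted for brevity:

```python
import math

# One-number hidden state for clarity:
# h_t = tanh(Wx * x_t + Wh * h_prev + b), reusing the same weights at every step.
def rnn_steps(xs, Wx=0.5, Wh=0.8, b=0.0):
    h = 0.0
    hs = []
    for x in xs:
        h = math.tanh(Wx * x + Wh * h + b)
        hs.append(h)
    return hs

print(rnn_steps([1.0, 0.0, 0.0, 0.0]))  # the first input echoes through later states
```

With zero inputs after step one, the state decays geometrically through \(W_h\): short-term memory that fades, which is exactly the weakness LSTMs address.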


      LSTM: long-short term memory

      LSTMs add gates that decide what to forget, what to write, and what to output.

      • Forget gate keeps or drops old memory.
      • Cell state carries long-range info additively.
      • Input gate writes new information.
      • Output gate reveals the right part.
      \[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \] \[ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \] \[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \] \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t) \]
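A scalar LSTM step keeps the gate logic visible. The weight pairs are made-up values and biases are omitted for brevity:

```python
import math

# One scalar LSTM step: gates are sigmoids, the candidate is a tanh,
# and the cell state mixes old memory with new information.
def sigmoid(x): return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo):
    # Each W is a pair (weight on h_prev, weight on x); biases omitted for brevity.
    f = sigmoid(Wf[0] * h_prev + Wf[1] * x)          # forget gate
    i = sigmoid(Wi[0] * h_prev + Wi[1] * x)          # input gate
    c_tilde = math.tanh(Wc[0] * h_prev + Wc[1] * x)  # candidate memory
    o = sigmoid(Wo[0] * h_prev + Wo[1] * x)          # output gate
    c = f * c_prev + i * c_tilde                     # additive memory update
    h = o * math.tanh(c)                             # revealed output
    return h, c

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.5,
                 Wf=(0.1, 2.0), Wi=(0.2, 1.0), Wc=(0.3, 1.5), Wo=(0.4, 0.5))
print(h, c)
```

Notice the cell state update is additive, not a repeated multiplication: that is what lets gradients survive over long sequences.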


      Transformers: staged pipeline

      Transformers process all tokens at once, then use attention to mix information.

      1. Tokenize and embed.
      2. Add positional encoding.
      3. Self-attention mixes context.
      4. Feedforward + residual.
      5. Output probabilities.


      Advanced Transformer Model: step-by-step visualization

      Use the stepper to follow the full Transformer pipeline without leaving the visuals.


        • Model: GPT-2 (small)
        • Embedding dim: 768
        • Heads: 12
        • Blocks: 12
        • Vocabulary: 50,257

        What is a Transformer?

        Transformer is a neural network architecture that has fundamentally changed the approach to artificial intelligence. It was introduced in the 2017 paper "Attention is All You Need" and now powers models like OpenAI GPT, Meta Llama, and Google Gemini.

        Transformers are not limited to text. They also drive audio generation, image recognition, protein structure prediction, and game playing, showing how broadly the architecture applies across domains.

        Text-generative Transformers operate on next-token prediction: given a prompt, the model estimates the most probable next token. The core innovation is self-attention, which lets all tokens communicate and capture long-range dependencies.

        Transformer Explainer is powered by GPT-2 (small) with 124 million parameters. While it is not the largest model, its components match the structure of more recent systems.

        Transformer architecture

        Every text-generative Transformer consists of three components:

        • Embedding: tokens become vectors and receive positional signals.
        • Transformer block: multi-head attention routes context; an MLP refines each token.
        • Output probabilities: logits are turned into a probability distribution for the next token.

        Embedding pipeline

        Suppose the prompt is: "Data visualization empowers users to". Embedding converts this text into a numerical representation in four steps:

        1. Tokenization: split into word or subword tokens.
        2. Token embedding: map tokens to a vector space.
        3. Positional encoding: add position information.
        4. Final embedding: sum token and positional vectors.

        Figure 1. Expanding the embedding view: tokenization, token embedding, positional encoding, and final embedding.

        GPT-2 (small) uses 768-dimensional embeddings and a vocabulary of 50,257 tokens. The embedding matrix has shape (50,257, 768) with about 39 million parameters.
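The four embedding steps can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the token ids, table sizes, and random weights are made up for illustration, not GPT-2's actual tokenizer or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; GPT-2 (small) uses vocab 50,257, d_model 768, context 1024.
vocab_size, d_model, max_ctx = 1000, 16, 64
token_ids = np.array([5, 42, 7, 42])          # step 1: tokenization (made-up ids)

W_emb = 0.02 * rng.standard_normal((vocab_size, d_model))  # token embedding table
W_pos = 0.02 * rng.standard_normal((max_ctx, d_model))     # learned positional table

tok = W_emb[token_ids]                        # step 2: look up one row per token
pos = W_pos[np.arange(len(token_ids))]        # step 3: positional vectors
x = tok + pos                                 # step 4: final embedding, shape (4, 16)
# the repeated id 42 shares a token vector but gets distinct final embeddings
```

Note how the repeated token receives the same row of the embedding table both times, yet different final embeddings, because the positional vectors differ.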

        Transformer block

        The block combines multi-head self-attention and an MLP. GPT-2 (small) stacks 12 blocks, allowing token representations to evolve into higher-level meanings over depth.

        Multi-head attention captures context across tokens. The MLP processes each token independently to refine its representation.

        Multi-head self-attention

        Each token is transformed into Query (Q), Key (K), and Value (V) vectors:

        \[ QKV_{ij} = \left( \sum_{d=1}^{768} \text{Embedding}_{i,d} \cdot \text{Weights}_{d,j} \right) + \text{Bias}_j \]

        Query is like the search text, Key is the result title, and Value is the page content. This analogy helps explain why attention scores route information from relevant tokens.

        Figure 2. Computing Q, K, and V from the embedding.

        • Split into heads: GPT-2 uses 12 attention heads.
        • Masked attention: prevents peeking at future tokens.
        • Concat + projection: heads are merged for the next stage.

The attention scores are scaled by \( \sqrt{d_k} \) to keep softmax stable, and masking sets future positions to negative infinity so each token predicts without looking ahead.

        Figure 3. Masked self-attention: dot product, scale + mask, softmax + dropout.
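A minimal single-head sketch of masked scaled dot-product attention in NumPy, with random toy inputs; a real block adds multiple heads, dropout, and learned projections.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    causal = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)       # block future positions
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out, w = masked_attention(Q, K, V)
# each row of w sums to 1; entries above the diagonal are exactly 0
```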

        MLP: multi-layer perceptron

        The MLP expands each token from 768 to 3072 dimensions with a GELU activation, then compresses back to 768. This enriches the representation independently per token.

        Figure 4. The MLP expands then compresses each token representation.
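The expand-GELU-compress step can be sketched directly. The weights here are random stand-ins; only the widths (768 and 3072) and the tanh approximation of GELU match GPT-2.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    """Per-token MLP: expand 768 -> 3072, apply GELU, compress back to 768."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff, T = 768, 3072, 5                      # GPT-2 (small) widths, toy sequence
x = rng.standard_normal((T, d))
y = mlp(x,
        0.02 * rng.standard_normal((d, d_ff)), np.zeros(d_ff),
        0.02 * rng.standard_normal((d_ff, d)), np.zeros(d))
# y keeps the shape (5, 768): each token is refined independently
```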

        Output probabilities and sampling

        The final linear layer maps to 50,257 logits, one for each token in the vocabulary. Softmax turns logits into probabilities for the next token.

        Figure 5. Each token receives a probability from the output logits.

        Temperature controls sharpness: T=1 keeps logits unchanged, T<1 makes outputs more deterministic, and T>1 increases randomness.

        Temperature

        Lower values make outputs more deterministic; higher values add creativity.

        Top-k

        Restrict sampling to the top k highest-probability tokens.

        Top-p

        Sample from the smallest set whose cumulative probability exceeds p.
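A sketch of how temperature, top-k, and top-p combine into a single sampling step. The function name and defaults are illustrative, not from any particular library.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Turn logits into one sampled token id with temperature, top-k, top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature   # sharpness control
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                     # softmax
    if top_k is not None:                                    # keep only the k largest
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                                    # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs /= probs.sum()                                     # renormalize survivors
    return int(rng.choice(len(probs), p=probs))

tok = sample_next([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2,
                  rng=np.random.default_rng(0))
# with top_k=2, only token ids 0 and 1 can ever be returned
```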

        Auxiliary architectural features

        Layer normalization stabilizes training and is applied twice per block. Dropout regularizes by randomly deactivating units during training and is disabled during inference. Residual connections add skip paths twice per block, helping gradients flow and preventing vanishing gradients.

        The overview highlights where you are in the stack; the detail panel shows the math.

        Large Language Models: Guided Tutorial

        A hands-on path that ties prompting, model behavior, and agent building into one workflow.

        Step 1

        Frame the task

        Define the role, goal, and success criteria before you write prompts.

        Goal: a clear, testable objective.
        Step 2

        Structure the prompt

        Use context blocks, constraints, and a target format.

        Goal: predictable, repeatable outputs.
        Step 3

        Inspect model behavior

        Adjust temperature, context, and stop rules to stabilize responses.

        Goal: reduce variance and hallucinations.
        Step 4

        Add tools

        Introduce tool calling for data access or automation.

        Goal: verified outputs with traceability.
        Step 5

        Evaluate and iterate

        Write evals, measure regressions, and refine.

        Goal: steady improvements with evidence.

        Quick tutorial checklist

        1. Collect 5 real inputs and draft a baseline prompt.
        2. Define the output format and failure cases.
        3. Create a tool schema for any external data.
        4. Write 5 evals with expected results.
        5. Iterate until the eval pass rate stabilizes.
        Mini mission: run a 5-example eval before you ship.

        Tutorial starter prompt

        ROLE: You are an AI PM.
        TASK: Draft a 1-page feature brief.
        CONTEXT:
        """
        Include 5 user notes, metrics, and constraints.
        """
        CONSTRAINTS: 250 words max. Bullet format.
        OUTPUT FORMAT:
        - Problem
        - Users
        - Constraints
        - Proposal
        - Risks
        CHECKS: List missing inputs.

        Transformer Depth Map: From Mechanics to Reasoning Models

        Use this depth indicator to move from core mechanics to training signals and reasoning behaviors. Each tier adds a new layer of capability.

        Depth 1

        Transformer fundamentals

        Mechanics and intuition.

        Tokens Q K V Attention FFN Residual + Norm

        Tokens turn into Q/K/V, attention mixes them, FFN refines, residual + norm stabilize.

        The big picture

        Input Enc Dec Out

        Transformers model sequences by letting every token attend to every other token in one step, so computation is parallel instead of strictly sequential.

        Encoder-decoder stacks build a memory from the input, while decoder-only stacks generate the next token with a causal mask.

        Scaled dot-product attention

        Q K V QK^T Softmax Sum Out

        Queries and keys produce similarity scores, softmax turns scores into weights, and values are mixed by a weighted sum.

Scaling by \( \sqrt{d_k} \) keeps logits in a stable range so softmax does not saturate too early.

        Multi-head + residuals + layer norm

        In H1 H2 H3 Concat LN

        Each head attends in its own subspace, the heads are concatenated, and a projection mixes them back together.

        Residual paths keep the original signal available, while layer norm stabilizes scale across the block.

        Positional encoding

        t1 t2 t3 t4 PE

        Attention is permutation-invariant, so we add position signals to token embeddings before attention.

        Sinusoidal encodings give each position a unique mix of frequencies; learned or relative schemes encode distance directly.
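A sketch of the sinusoidal scheme from "Attention Is All You Need": even dimensions take sines and odd dimensions take cosines of position-dependent frequencies.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings: PE(pos, 2i) = sin, PE(pos, 2i+1) = cos."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))   # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(16, 8)
# position 0 encodes as [0, 1, 0, 1, ...]; every value lies in [-1, 1]
```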

        Depth 2

        Transformer into an LLM

        Training objectives and variants.

        Decoder-only Encoder-only Next-token Masked LM

        Decoder-only uses causal masks for next-token prediction; encoder-only sees both directions with masked tokens.

        Decoder-only transformers (GPT-style)

        Causal mask

        A causal mask blocks attention to future tokens, so each position can only use past context.

        Training uses teacher forcing: the model sees true previous tokens while learning to predict the next one.

        Encoder-only transformers (BERT-style)

        t1 MASK t3 t4 Bidirectional

        Masked language modeling hides tokens and uses both left and right context to reconstruct them.

        The result is a strong bidirectional representation that transfers well to classification and retrieval tasks.

        Training + inference basics

        logits Softmax Greedy Sample

        Logits become probabilities via softmax, then decoding picks the next token.

        Greedy decoding is stable, while temperature and top-p sampling trade determinism for diversity.

        Depth 3

        Reasoning with transformers

        Prompting and tool-use strategies.

        CoT Vote Self-consistency LLM Tool ReAct

        CoT adds intermediate steps, self-consistency samples many paths, ReAct loops the model with tools.

        Chain-of-Thought (CoT)

        Step 1 Step 2 Step 3 Ans

        CoT encourages explicit intermediate steps, which can improve multi-step reasoning tasks.

        Use it when the reasoning path matters, but keep in mind verbosity does not guarantee correctness.

        Self-consistency

        Vote

        Self-consistency samples multiple reasoning paths and aggregates the final answers.

        Majority voting reduces variance and often improves accuracy on reasoning benchmarks.

        ReAct (reason + act)

        LLM Tool loop

        ReAct alternates reasoning steps with tool calls, grounding answers in external data.

        The loop keeps the model honest by injecting retrieved facts before the final response.

        Depth 4

        Reasoning models today

        Training signals and test-time search.

        Process supervision Test-time search Reasoning behavior

        Step-level feedback and search trees both add compute that shapes reasoning at test time.

        Process vs outcome supervision

        Reward

        Process supervision scores intermediate steps, not just the final answer.

        Step-level signals make credit assignment clearer and improve reasoning stability.

        Tree-of-Thoughts (ToT)

        Score

        Tree-of-Thoughts explores multiple branches, scores them, and expands the best candidates.

        Search with backtracking can outperform a single forward pass on hard problems.

        What "reasoning model" means today

        Prompt Search Supervise Behavior

        Modern reasoning blends prompting patterns, test-time search, and step-aware supervision.

        The behavior you see is shaped by where extra compute or feedback is injected.

        Prompt Engineering: Crafting Clear Instructions

        Design prompts that reduce ambiguity, keep responses grounded, and make iteration fast.

        Prompt anatomy

        • Role: who the model is acting as.
        • Goal: the task and the success criteria.
        • Context: background data, definitions, and scope.
        • Constraints: length, tone, must include, must avoid.
        • Output format: schema, bullets, JSON, table.
        • Checks: ask for missing info or confidence notes.
        Mini mission: rewrite a vague request into a 6-part prompt.

        Pattern: Delimiters + grounding

        Use clear separators so the model knows what to quote, summarize, or transform.

        Pattern: Few-shot exemplars

        Add one or two examples that match the output style you want.

        Pattern: Self-checks

        Ask for a brief verification step or uncertainties before the final answer.

        Prompt template

        ROLE: You are a product analyst.
        TASK: Summarize customer feedback into 3 themes.
        CONTEXT:
        """
        Paste notes here.
        """
        CONSTRAINTS: Max 8 bullets. No speculation.
        OUTPUT FORMAT:
        - Theme:
        - Evidence:
        CHECKS: Flag missing data.

        Prompt technique examples

        Technique: Role + constraints
        ROLE: You are a support analyst.
        TASK: Summarize the ticket in 3 bullets.
        CONSTRAINTS: Neutral tone. No jargon.
        
        Technique: Few-shot style match
        TASK: Turn notes into a decision line.
        EXAMPLE:
        Input: "Latency 120ms, budget ok"
        Output: "Decision: proceed with rollout"
        
        Technique: Delimiters + extraction
        Extract action items from <notes>...</notes>.
        Return JSON with owner, task, and due.

        Prompt refinement loop

        1. Draft the instruction and the output format.
        2. Test with a real example and inspect the gaps.
        3. Add constraints, examples, or definitions.
        4. Repeat until outputs are stable and on-brand.

        LLM Fundamentals: How Models Respond

        Know the mechanics behind next-token prediction so your prompts behave consistently.

        Core behaviors

        • LLMs predict the next token based on all prior context.
        • Context windows limit how much text the model can see at once.
        • Sampling settings trade off creativity vs stability.
        • System instructions set global rules; user prompts set tasks.
        • Hallucinations appear when the prompt is underspecified.
        Temperature

        Controls randomness; lower is more deterministic.

        Top-p

        Limits choices to a probability mass.

        Max tokens

        Budget for output length and cost.

        Stop sequences

        End outputs at safe boundaries.

        Prompting best practices

        See the full playbook at promptingguide.ai.

        • Be explicit about the role, task, and constraints.
        • Use delimiters to separate instructions from data.
        • Provide a target format and a short example.
        • Ask for assumptions, risks, or missing data.
        • Iterate with real inputs, not toy prompts.

        Context hygiene

        • Keep only relevant context; remove stale instructions.
        • Define terms once and reuse them consistently.
        • Chunk long inputs and summarize key decisions.

        Vector Databases: Indexing and Retrieval for LLMs

        See how embeddings get indexed, how similarity search works, and how results return to the model.

        Indexing views

        Indexing pipeline

        Chunk documents, embed each chunk, and store vectors with metadata for filtering.

        • Split docs into overlapping chunks.
        • Embed each chunk into a vector space.
        • Write vectors and metadata to the index.
        Metadata filters

        Tags, permissions, and timestamps narrow search.

        Chunk size

        Balances recall with context length.

        ANN indexing with HNSW

        Hierarchical graph layers speed up nearest-neighbor search with high recall.

        • Insert vectors into a multi-layer graph.
        • Connect each node to its closest neighbors.
        • Search from top layer to dense base layer.
        M parameter

        Controls graph connectivity and recall.

        efSearch

        Higher values improve recall at a latency cost.

        IVF + PQ compression

        Inverted file indexing narrows search to the closest centroids, then product quantization compresses vectors for fast distance estimates.

        • Assign vectors to coarse centroids (IVF).
        • Search only the nearest lists.
        • Quantize subvectors into compact PQ codes.
        nlist

Total number of coarse clusters in the index; a companion parameter, nprobe, sets how many are searched.

        m * bits

        Subvector count and code size per block.

        Search and retrieval for LLMs

        Blend lexical and vector results, then rerank and pack the best chunks into context.

        • Embed the query and run ANN search.
        • Merge lexical and vector candidates.
        • Rerank and assemble a grounded context window.
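The ranking an ANN index approximates can be written exactly as brute-force cosine search. This sketch uses random stand-in embeddings and skips the lexical merge and reranking stages.

```python
import numpy as np

def top_k_cosine(query, index, k=3):
    """Exact cosine top-k; HNSW or IVF+PQ approximate this ranking faster."""
    q = query / np.linalg.norm(query)
    X = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = X @ q                                   # cosine similarity per chunk
    top = np.argsort(sims)[::-1][:k]               # best k, highest first
    return top, sims[top]

rng = np.random.default_rng(0)
chunks = rng.standard_normal((100, 64))            # 100 embedded chunks (stand-ins)
query = chunks[42] + 0.01 * rng.standard_normal(64)  # near-duplicate of chunk 42
ids, scores = top_k_cosine(query, chunks)
# chunk 42 ranks first because the query is a tiny perturbation of it
```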
        Reranking

        Cross-encoders boost precision on top results.

        Citations

        Keep track of sources for trust and audits.

        Mini mission: trace where the top-k chunks enter the prompt.

        Switch tabs or cycle the diagram.

        Sutskever's List: Core Readings

        A curated reading list with the main themes and ideas captured for quick scanning.

        Transformers Year: 2017

        The Annotated Transformer

        Annotated walkthrough of the Transformer model that uses self-attention for sequence tasks.

        • Theme: Attention-based encoder-decoder design.
        • Main idea: Line-by-line implementation clarifies attention, masking, and positional encoding.
        Theory Year: n/a

        The First Law of Complexodynamics

        Blog essay connecting complexity, computation, and principles relevant to AI foundations.

        • Theme: Complexity growth in closed systems.
        • Main idea: Links entropy, computation, and limits of organized structure.
        Sequence Models Year: n/a

        Understanding LSTM Networks

        Christopher Olah's visual and intuitive explanation of LSTM mechanisms and gates.

        • Theme: Gating and long-term memory.
        • Main idea: Input/forget/output gates control information flow and gradients.
        Sequence Models Year: 2014

        Recurrent Neural Network Regularization

        Shows how regularization like dropout improves recurrent architectures.

        • Theme: Regularization for recurrent nets.
        • Main idea: Dropout-style methods improve generalization without breaking sequence learning.
        Sequence Models Year: 2015

        Pointer Networks

        Neural architecture for solving combinatorial problems by learning pointer outputs.

        • Theme: Attention as a pointer mechanism.
        • Main idea: Output indices of input elements for variable-length solutions.
        Sequence Models Year: 2015

        Order Matters: Sequence to Sequence for Sets

        Explores how ordering impacts seq2seq model performance.

        • Theme: Set prediction with seq2seq.
        • Main idea: Learning an output order improves performance on set tasks.
        Vision Year: 2015

        Deep Residual Learning for Image Recognition

        ResNet paper that introduced residual connections to enable very deep models.

        • Theme: Residual learning.
        • Main idea: Identity skip connections allow very deep CNNs to train.
        Graph Networks Year: 2017

        Neural Message Passing for Quantum Chemistry

        Graph neural network model for learning on structured data.

        • Theme: Message-passing GNNs.
        • Main idea: Iterative node updates capture molecular structure for property prediction.
        Transformers Year: 2017

        Attention Is All You Need

        Introduced the Transformer with self-attention, the foundation of modern NLP and LLMs.

        • Theme: Self-attention for sequence modeling.
        • Main idea: Replace recurrence with attention for parallel training and long-range context.
        Vision Year: 2016

        Identity Mappings in Deep Residual Networks

        Improves ResNet training using identity skip connections.

        • Theme: Optimization of deep residual networks.
        • Main idea: Pre-activation and identity skips improve gradient flow.
Generative Models Year: 2016

        Variational Lossy Autoencoder

        Combines autoencoder with variational losses for generative modeling.

        • Theme: Lossy compression in generative models.
        • Main idea: Balance reconstruction with latent codes that capture global structure.
        Sequence Models Year: 2018

        Relational Recurrent Neural Networks

        Combines relational reasoning with RNN mechanisms.

        • Theme: Memory + relational reasoning.
        • Main idea: Memory slots interact through attention to improve sequence modeling.
        Memory Networks Year: 2014

        Neural Turing Machines

        Early memory-augmented neural network with external controller.

        • Theme: Differentiable external memory.
        • Main idea: A controller learns to read and write memory for algorithmic tasks.
        Speech Year: 2015

        Deep Speech 2: End-to-End Speech Recognition

        End-to-end speech model demonstrating RNN and CNN integration.

        • Theme: End-to-end speech recognition.
        • Main idea: CNN + RNN stacks with CTC scale to large speech datasets.
        Scaling Year: 2020

        Scaling Laws for Neural Language Models

        Shows how model and data scaling improve language model performance.

        • Theme: Empirical scaling behavior.
        • Main idea: Performance follows power laws in data, parameters, and compute.
        Theory Year: n/a

        A Tutorial Introduction to the Minimum Description Length Principle

        Introductory explanation of MDL principle connecting compression and learning.

        • Theme: Compression-based model selection.
        • Main idea: Better compression implies better generalization.
        Theory Year: n/a

        Machine Super Intelligence

        Discusses theoretical aspects of AGI and intelligence measures.

        • Theme: Theoretical intelligence measures.
        • Main idea: Frames limits and metrics for machine intelligence.
        Theory Year: n/a

        Kolmogorov Complexity and Algorithmic Randomness

        Foundations of algorithmic complexity and information theory.

        • Theme: Algorithmic information theory.
        • Main idea: Randomness is defined by shortest program length.

        AI Agents: Tool-Using Systems

        Agents combine planning, tools, memory, and evals to finish real tasks.

        Agent loop

        1. Plan the task and choose a strategy.
        2. Select tools and call them with structured inputs.
        3. Observe results and update the plan.
        4. Decide when to stop and respond.

        Planner

        Breaks the goal into steps and picks the next action.

        Tool router

        Chooses APIs, search, or code execution based on the task.

        Memory

        Stores notes, intermediate results, and long-term facts.

        Critic

        Checks outputs, runs evals, and flags regressions.

        Tool calling basics

        • Define clear tool schemas with typed inputs and outputs.
        • Validate tool results and handle failures explicitly.
        • Constrain tool access with allowlists and budgets.
        • Log tool calls for tracing and debugging.

        How to write evals

        1. Define a task set and success criteria.
        2. Collect examples with expected outputs.
        3. Use automatic checks plus human review when needed.
        4. Track regressions and edge cases over time.

        Eval-driven development

        1. Write evals before adding new agent logic.
        2. Run evals on every change to the prompt or tools.
        3. Promote only the changes that improve the score.
        4. Version datasets and prompts alongside code.
        Frameworks

        LangChain, LlamaIndex, and OpenAI or Anthropic SDKs.

        Tracing

        Capture tool calls, prompts, and outputs for audits.

        Evals

        Use unit tests, golden sets, and regression suites.

        Safety

        Guardrails for tool use, data access, and output policy.

        Reinforcement Learning Fundamentals

        Agents learn by interacting with an environment, collecting rewards, and improving a policy.

        Reinforcement learning (RL) is a learning framework where an agent chooses actions in a state, receives rewards, and updates its policy to maximize long-term return.

        \[ \text{Goal: } \max_\pi \; \mathbb{E}_\pi \left[\sum_{t=0}^{\infty} \gamma^t r_t \right] \]

        Core concepts

        • Agent & environment: the learner and the world it interacts with.
        • Episode / trajectory: a sequence of states, actions, and rewards.
        • Reward vs return: immediate feedback \(r_t\) vs discounted sum \(G_t\).
        • Discount \(\gamma\): weighs future rewards.
        • Policy \( \pi(a|s) \): behavior rule.
        • Value \( V^\pi(s) \), Action-value \( Q^\pi(s,a) \).
        • Optimal \( V^*, Q^* \): best achievable values.

        Minimal environment interface

        reset(seed?) -> observation/state
        step(action) -> { nextState, reward, done, info }
        actions(state) -> action list
        // render helpers stay separate from logic

        reset(seed?) starts a new episode and returns the initial state (optionally deterministic with a seed).

        step(action) applies an action and returns the transition tuple: next state, reward, terminal flag, and any extra info.

        actions(state) exposes valid actions so the agent can plan or explore safely.

        Render helpers stay separate so learning logic is deterministic and testable.
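A minimal Python implementation of this interface, as a toy one-dimensional chain with a goal at the right end; the class name and dynamics are illustrative.

```python
class ChainEnv:
    """Minimal environment matching the interface: a 1-D chain with a goal."""

    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def reset(self, seed=None):
        self.state = 0                 # seed unused: dynamics are deterministic
        return self.state

    def actions(self, state):
        return [-1, +1]                # step left or step right

    def step(self, action):
        self.state = min(max(self.state + action, 0), self.n - 1)
        done = self.state == self.n - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.state, reward, done, {}

env = ChainEnv()
s = env.reset()
s, r, done, info = env.step(+1)
# after one right step from state 0: s == 1, r == 0.0, done is False
```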

        \[ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t} \mid s_0 = s \right] \] \[ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t} \mid s_0 = s, a_0 = a \right] \]

        Return calculator

        \[ G = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots \]

        Policy vs value (3-state chain)


        Reinforcement Learning: a gentle mathematical walkthrough

        RL formalizes learning by experience: act, observe, update, repeat.

        1. Interaction over time

        At each step the agent is in a state \(s\), takes an action \(a\), receives a reward \(r\), and lands in a new state \(s'\).

        \[ s_0 \rightarrow a_0 \rightarrow r_1 \rightarrow s_1 \rightarrow a_1 \rightarrow r_2 \rightarrow s_2 \rightarrow \cdots \]

        This sequence is a trajectory (episode).

        2. Rewards vs return

        Rewards are immediate feedback. Returns add up future rewards with discounting.

        \[ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \]
        • \(\gamma = 0\): only care about now.
        • \(\gamma \approx 1\): care about long-term outcomes.
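Because \(G_t = r_t + \gamma G_{t+1}\), the return is a one-line backward recursion:

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., computed back to front."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G              # G_t = r_t + gamma * G_{t+1}
    return G

assert discounted_return([1, 1, 1], gamma=0.0) == 1.0   # gamma = 0: only "now"
g = discounted_return([0, 0, 10], gamma=0.9)            # 0 + 0 + 0.81 * 10 = 8.1
```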

        3. Policy

        A policy tells the agent how to act:

        \[ \pi(a \mid s) \]
        • Deterministic: always take the same action.
        • Stochastic: choose actions by probability.

        4. Value functions

        Value functions measure how good states or actions are under a policy.

        \[ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0=s \right] \] \[ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0=s, a_0=a \right] \]

        V evaluates a state; Q evaluates a decision.

        5. Bellman equations

        Value is defined recursively: immediate reward plus discounted value of the next state.

        \[ V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right] \] \[ V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^*(s')\right] \]

        6. Dynamic programming (model known)

        If transitions and rewards are known, compute values directly.

        \[ V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a)\left[R + \gamma V_k(s')\right] \]

        Policy iteration alternates evaluation and greedy improvement.

        7. Monte Carlo methods

        Learn from complete episodes when the model is unknown.

        • Generate an episode and compute the return.
        • Update the first visit of each state (or state-action).
        • Unbiased but high variance; must wait for episode end.

        8. Temporal Difference (TD)

        Update values online using one-step bootstrapping.

        \[ V(s) \leftarrow V(s) + \alpha \left[r + \gamma V(s') - V(s)\right] \] \[ \delta = r + \gamma V(s') - V(s) \]

        \(\delta\) is the TD error: how wrong the prediction was.
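Once the transition \((s, r, s')\) is observed, the TD(0) update is a single line. A minimal sketch with a hypothetical three-state value table:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V(s) toward the bootstrapped target r + gamma*V(s')."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: how wrong the prediction was
    V[s] += alpha * delta
    return delta

V = np.zeros(3)                            # toy 3-state value table
delta = td0_update(V, s=0, r=1.0, s_next=1)
# target = 1 + 0.9*0 = 1, so delta == 1.0 and V[0] moves to alpha * 1 = 0.1
```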

        9. Learning control

        Learn the best actions, not just values.

        \[ Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma Q(s',a') - Q(s,a)\right] \] \[ Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] \]

        SARSA uses the next action taken; Q-learning uses the best possible next action.
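Both backups fit in one-line functions over a tabular \(Q\); the tiny two-state example below is illustrative, not a full training loop.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy backup: bootstrap from the BEST next action."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy backup: bootstrap from the action actually taken next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

Q = np.zeros((2, 2))                       # toy table: 2 states x 2 actions
q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] = 0.5 * (1 + 0.9 * 0) = 0.5
sarsa_step(Q, s=1, a=0, r=0.0, s_next=0, a_next=1)
# Q[1, 0] = 0.5 * (0 + 0.9 * Q[0, 1]) = 0.225
```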

        10. Why deep RL exists

        Tables do not scale to high-dimensional states or continuous actions.

        • Approximate \(Q(s,a)\) or \(\pi(a \mid s)\) with neural networks.
        • Use experience replay to break correlation.
        • Use target networks to stabilize learning.
        • Optimize parameters with gradient descent.

        Big picture summary

        Reward

        Immediate feedback signal.

        Return

        Discounted sum of future rewards.

        Policy

        Behavior rule \(\pi(a \mid s)\).

        Value

        How good a state or action is.

        Bellman

        Recursive value definitions.

        Monte Carlo

        Learn from full episodes.

        TD learning

        Learn step-by-step with bootstrapping.

        Q-learning

        Learn optimal behavior off-policy.

        Deep RL

        Scale RL with neural networks.

        Multi-Armed Bandits

        Explore exploration strategies and track regret as the agent learns.

        Algorithms

        • \(\varepsilon\)-greedy with optional decay.
• UCB1 applies optimism in the face of uncertainty.
        • Thompson Sampling for Bernoulli rewards.

        What is a multi-armed bandit? An agent repeatedly chooses among \(K\) actions (arms) with unknown reward distributions. The goal is to maximize total reward by balancing exploration (learn the arms) and exploitation (use the best-known arm).

        \(\varepsilon\)-greedy

        1. Initialize action-value estimates and counts.
        2. At each step, explore with probability \(\varepsilon\); otherwise exploit the best estimate.
        3. Pull the chosen arm and observe the reward.
        4. Update that arm’s estimate with an incremental mean.
        5. Optionally decay \(\varepsilon\) to reduce exploration over time.
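The five steps above as a minimal NumPy loop on a Bernoulli bandit; the arm means, step count, and \(\varepsilon\) are made up for illustration, and the decay step is omitted.

```python
import numpy as np

def run_eps_greedy(true_means, steps=2000, eps=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit with incremental-mean updates."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    est, n = np.zeros(K), np.zeros(K)      # value estimates and pull counts
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(K))       # explore: random arm
        else:
            a = int(np.argmax(est))        # exploit: best estimate
        r = float(rng.random() < true_means[a])   # Bernoulli reward
        n[a] += 1
        est[a] += (r - est[a]) / n[a]      # incremental mean
    return est, n

est, n = run_eps_greedy([0.2, 0.5, 0.8])
# the best arm (index 2) should collect most of the pulls
```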

        UCB1

        1. Initialize estimates and counts; try each arm at least once.
        2. Compute the UCB score: \( \hat{\mu}_a + c \sqrt{\log t / n_a} \).
        3. Select the arm with the highest UCB score.
        4. Observe the reward and update the estimate and count.
        5. Repeat to balance mean reward and uncertainty.
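A minimal NumPy sketch of UCB1 on made-up Bernoulli arms; \(c\) is the exploration constant in the confidence bonus.

```python
import numpy as np

def ucb1(true_means, steps=2000, c=1.4, seed=0):
    """UCB1: pick the arm with the highest mean-plus-confidence score."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    est, n = np.zeros(K), np.zeros(K)
    for t in range(1, steps + 1):
        if t <= K:
            a = t - 1                      # try each arm once first
        else:
            ucb = est + c * np.sqrt(np.log(t) / n)   # mean + uncertainty bonus
            a = int(np.argmax(ucb))
        r = float(rng.random() < true_means[a])
        n[a] += 1
        est[a] += (r - est[a]) / n[a]
    return est, n

est, n = ucb1([0.2, 0.5, 0.8])
# pulls concentrate on the best arm as its bonus shrinks
```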

        Thompson Sampling (Bernoulli)

        1. Initialize Beta priors \(\alpha, \beta\) for each arm.
        2. Sample a success probability from each arm’s Beta distribution.
        3. Choose the arm with the highest sampled probability.
        4. Observe reward (success/failure) and update \(\alpha, \beta\).
        5. Repeat to naturally balance exploration and exploitation.
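A minimal sketch of Bernoulli Thompson Sampling with uniform Beta(1, 1) priors; the arm means are illustrative.

```python
import numpy as np

def thompson(true_means, steps=2000, seed=0):
    """Thompson Sampling: sample each arm's Beta posterior, act greedily."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    alpha, beta = np.ones(K), np.ones(K)   # Beta(1, 1) priors
    for _ in range(steps):
        a = int(np.argmax(rng.beta(alpha, beta)))   # one posterior sample per arm
        r = float(rng.random() < true_means[a])
        alpha[a] += r                      # success count
        beta[a] += 1.0 - r                 # failure count
    return alpha, beta

alpha, beta = thompson([0.2, 0.5, 0.8])
# the posterior mean alpha/(alpha+beta) for the best arm ends up near 0.8
```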

        Chart 1. True arm means (light) vs estimated means (dark).

        Chart 2. Cumulative regret over time.

        Chart 3. Action selection frequency by arm.

        MDP + Dynamic Programming (Gridworld)

        Solve a Markov Decision Process with Bellman updates and visualize value and policy.

        MDP setup

        • MDP \( (S, A, P, R, \gamma) \) defines dynamics and rewards.
        • Bellman expectation: \( V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) [R + \gamma V^\pi(s')] \).
        • Bellman optimality: \( V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R + \gamma V^*(s')] \).

        Deep dive: Markov Decision Process. An MDP assumes the future depends only on the current state and action (the Markov property). Transitions \(P(s'|s,a)\) and rewards \(R(s,a,s')\) define how the agent moves. Dynamic programming solves the MDP by repeatedly applying Bellman backups until values and policies converge.


        Policy evaluation (iterative)

        1. Initialize \(V(s)\) for all states.
        2. For each state, compute the Bellman expectation backup under \(\pi\).
        3. Update \(V(s)\) with the new estimate.
        4. Repeat until the max change \(\Delta V\) is below a threshold.

        Policy improvement

        1. Use the current \(V(s)\) to compute \(Q(s,a)\).
        2. For each state, choose the greedy action \( \arg\max_a Q(s,a) \).
        3. Update the policy to be greedy (or \(\varepsilon\)-greedy).
        4. Repeat after re-evaluating the policy.

        Policy iteration

        1. Initialize a policy \(\pi\).
        2. Evaluate \(\pi\) to compute \(V^\pi\).
        3. Improve \(\pi\) by acting greedy with respect to \(V^\pi\).
        4. Stop when the policy stops changing.

        Value iteration

        1. Initialize \(V(s)\) arbitrarily.
        2. Apply the optimality backup \( V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) [R + \gamma V(s')] \).
        3. Repeat until \(V\) converges.
        4. Extract the greedy policy from the converged values.
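Value iteration collapses the two phases into one backup. This sketch reuses the same assumed two-action toy chain rather than the demo gridworld.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Sweep Bellman optimality backups until V stops changing,
    then extract the greedy policy from the converged values."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    pi = {s: max(actions,
                 key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, p in P[(s, a)].items()))
          for s in states}
    return V, pi

# Same toy chain as above: 1 -> 2 under "right" pays +1, state 2 absorbs.
states, actions = [0, 1, 2], ["left", "right"]
P = {(0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
     (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
     (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0}}
R = {k + (s2,): 0.0 for k in P for s2 in P[k]}
R[(1, "right", 2)] = 1.0
V, pi = value_iteration(states, actions, P, R)
```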

        Inspect state

        Click a cell to see Q(s,·).

        Grid. Values \(V(s)\), heatmap, and policy arrows per state.

        Chart. Max \(\Delta V\) and average \(V\) per iteration.

        Monte Carlo Methods

        Estimate values from complete episode returns.

        Algorithms

        • First-visit MC prediction for \(V^\pi\).
        • On-policy MC control with \(\varepsilon\)-soft policies.

        Deep dive: Monte Carlo methods. MC methods wait until an episode ends, then use the realized return to update value estimates. They are unbiased but can have high variance, so averaging many episodes stabilizes learning.

        First-visit MC prediction

        1. Generate a full episode using policy \(\pi\).
        2. For each state, find the first time it appears in the episode.
        3. Compute the return \(G_t\) from that first visit to the end.
        4. Update \(V(s)\) with the average of observed returns.
        5. Repeat with many episodes to reduce variance.
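A minimal first-visit MC sketch, assuming a 5-state random walk (not the demo's gridworld): episodes start in the middle, stepping off the right end pays +1, off the left end pays 0, so the true values are \(V(s) = (s+1)/6\).

```python
import random

def first_visit_mc(episodes=5000, gamma=1.0, seed=0):
    """First-visit Monte Carlo prediction on a 5-state random walk."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(5)}
    counts = {s: 0 for s in range(5)}
    for _ in range(episodes):
        s, traj = 2, []
        while True:
            s2 = s + rng.choice([-1, 1])
            r = 1.0 if s2 == 5 else 0.0   # +1 only when exiting right
            traj.append((s, r))
            if s2 in (-1, 5):
                break
            s = s2
        # Walk backwards accumulating returns; the last write per state
        # is its FIRST visit, which is the one we keep.
        G, returns = 0.0, {}
        for s, r in reversed(traj):
            G = r + gamma * G
            returns[s] = G
        for s, G in returns.items():
            counts[s] += 1
            V[s] += (G - V[s]) / counts[s]   # incremental average
    return V

V = first_visit_mc()
```

With enough episodes the averages settle near the true values; the high per-episode variance is why the histogram in the demo spreads out while the running estimate narrows.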

        On-policy MC control (\(\varepsilon\)-soft)

        1. Initialize \(Q(s,a)\) and a soft policy.
        2. Generate an episode following the current policy.
        3. Compute returns for each state-action first visit.
        4. Update \(Q(s,a)\) with averaged returns.
        5. Improve the policy to be \(\varepsilon\)-greedy w.r.t. \(Q\).

        Grid. Episode rollout animation and evolving \(V(s)\) heatmap.

        Chart. Returns histogram (recent episodes).

        Chart. Value estimate of the start state over episodes.

        Temporal Difference Learning

        Blend bootstrapping with sampling for faster learning.

        TD(0) prediction

        • Update rule: \( V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)] \).
        • Random-walk demo with 5 states.

        TD(0) prediction

        1. Initialize \(V(s)\) for all states.
        2. Observe a transition \( (s, r, s') \).
        3. Compute the TD error \( \delta = r + \gamma V(s') - V(s) \).
        4. Update \(V(s) \leftarrow V(s) + \alpha \delta \).
        5. Repeat over many episodes to converge.
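The five steps map onto the 5-state random walk mentioned above. This is a minimal sketch with assumed parameter values; only the TD(0) update rule itself comes from the text.

```python
import random

def td0_random_walk(episodes=2000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) prediction on the 5-state random walk: start in the
    middle, +1 for exiting right, 0 for exiting left."""
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(5)}   # neutral initial values
    for _ in range(episodes):
        s = 2
        while True:
            s2 = s + rng.choice([-1, 1])
            if s2 == 5:    # terminal on the right: target is just r = 1
                V[s] += alpha * (1.0 - V[s])
                break
            if s2 == -1:   # terminal on the left: target is r = 0
                V[s] += alpha * (0.0 - V[s])
                break
            # TD error delta = r + gamma V(s') - V(s), with r = 0 inside
            V[s] += alpha * (gamma * V[s2] - V[s])
            s = s2
    return V

V = td0_random_walk()
```

Unlike Monte Carlo, each update happens immediately after a single transition, bootstrapping from the current estimate of the next state.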

        Chart. Current value estimates for the random-walk states.

        Chart. TD error \(\delta_t\) over a single episode.

        Chart. Value estimates per state across episodes.

        Model-Free Control

        Learn optimal policies directly from experience.

        Algorithms

        • SARSA: on-policy TD control.
        • Q-learning: off-policy TD control.
        • Expected SARSA: replaces the sampled next action with an expectation over the policy.

        SARSA

        1. Initialize \(Q(s,a)\) and choose a behavior policy.
        2. Take action \(a\), observe \(r, s'\), and pick next action \(a'\) from the same policy.
        3. Update \( Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)] \).
        4. Set \(s \leftarrow s'\), \(a \leftarrow a'\) and repeat.
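The SARSA loop can be sketched on an assumed toy environment: a 5-state chain with start at 0, goal at 4, and reward -1 per step, so shorter paths score higher. The environment and hyperparameters are illustrative.

```python
import random

def sarsa(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """On-policy SARSA on a 5-state chain."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}

    def policy(s):  # epsilon-greedy behavior policy
        if rng.random() < eps:
            return rng.choice(acts)
        return max(acts, key=lambda a: Q[(s, a)])

    def step(s, a):
        s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
        return s2, -1.0, s2 == 4   # next state, reward, done

    for _ in range(episodes):
        s, a, done = 0, policy(0), False
        while not done:
            s2, r, done = step(s, a)
            a2 = policy(s2)           # next action from the SAME policy
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()
```

Because the bootstrapped action \(a'\) is drawn from the behavior policy itself, SARSA learns the value of the policy it actually follows, exploration noise included.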

        Q-learning

        1. Initialize \(Q(s,a)\) and choose a behavior policy (e.g., \(\varepsilon\)-greedy).
        2. Take action \(a\), observe \(r, s'\).
        3. Update \( Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)] \).
        4. Repeat until Q-values converge to \(Q^*\).
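Q-learning differs from SARSA only in the bootstrap target: it uses the greedy \(\max_{a'}\) regardless of which action the behavior policy takes next. Same assumed 5-state chain as the SARSA sketch.

```python
import random

def q_learning(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Off-policy Q-learning on a 5-state chain (start 0, goal 4,
    reward -1 per step)."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behave epsilon-greedily...
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r, done = -1.0, s2 == 4
            # ...but bootstrap from the GREEDY next action (off-policy).
            target = r if done else r + gamma * max(Q[(s2, x)] for x in acts)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```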

        Expected SARSA

        1. Initialize \(Q(s,a)\) and a stochastic policy.
        2. Observe \(r, s'\) after taking action \(a\).
        3. Compute expected next value \( \sum_{a'} \pi(a'|s') Q(s',a') \).
        4. Update \( Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \mathbb{E}_{a'}Q(s',a') - Q(s,a)] \).
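Expected SARSA replaces the sampled \(Q(s',a')\) with the policy-weighted average, which removes the sampling noise from the target. Same assumed chain environment; the \(\varepsilon\)-greedy probabilities are computed in closed form.

```python
import random

def expected_sarsa(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Expected SARSA on a 5-state chain (start 0, goal 4, -1 per step)."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r, done = -1.0, s2 == 4
            if done:
                target = r
            else:
                greedy = max(acts, key=lambda x: Q[(s2, x)])
                # eps-greedy probs: greedy action gets 1 - eps + eps/|A|
                exp_v = sum(((1 - eps + eps / 2) if x == greedy else eps / 2)
                            * Q[(s2, x)] for x in acts)
                target = r + gamma * exp_v
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = expected_sarsa()
```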

        Grid. Greedy policy arrows from the current Q-table.

        Chart. Q-value heatmaps per action (Up/Right/Down/Left).

        Chart. Exploration schedule \(\varepsilon\) over time.

        Chart. Episode return over training.

        Deep RL Concepts

        When state spaces grow, deep networks approximate value functions or policies.

        Core ideas

        • Policy gradients optimize expected return directly.
        • Replay buffers stabilize off-policy learning.
        • Target networks reduce moving-target instability.

        Policy gradient (REINFORCE-style)

        1. Collect trajectories by sampling from \(\pi_\theta(a|s)\).
        2. Compute returns or advantages for each action.
        3. Estimate the gradient \( \nabla_\theta \log \pi_\theta(a|s) \hat{A} \).
        4. Update parameters \( \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \).
        5. Repeat with variance reduction (baseline) for stability.
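A minimal REINFORCE sketch on an assumed 2-armed bandit: the "network" is just two logits under a softmax, and a running mean reward serves as the baseline. Everything except the update rule itself is an illustrative assumption.

```python
import math, random

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """REINFORCE with a softmax policy over two logits and a
    running-mean baseline for variance reduction."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    means = [0.2, 0.8]        # true arm reward probabilities (assumed)
    baseline = 0.0
    for t in range(1, steps + 1):
        z = [math.exp(x) for x in theta]
        p = [x / sum(z) for x in z]          # softmax policy pi_theta
        a = 0 if rng.random() < p[0] else 1  # sample an action
        r = 1.0 if rng.random() < means[a] else 0.0
        baseline += (r - baseline) / t       # running mean reward
        adv = r - baseline                   # advantage estimate A-hat
        # grad of log pi(a) wrt theta_k is 1[k == a] - p_k for softmax
        for k in range(2):
            grad = (1.0 if k == a else 0.0) - p[k]
            theta[k] += alpha * adv * grad
    return theta

theta = reinforce_bandit()
```

After training, the logit of the higher-paying arm dominates, so the policy concentrates its probability mass on it.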

        Deep Q-learning (conceptual)

        1. Collect transitions into a replay buffer.
        2. Sample mini-batches uniformly from the buffer.
        3. Compute target \( r + \gamma \max_{a'} Q_{\text{target}}(s',a') \).
        4. Minimize the TD loss between \(Q_\theta\) and targets.
        5. Periodically sync the target network.
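The training-loop structure above can be sketched without a neural network by using a tabular Q as a stand-in for \(Q_\theta\). The environment (the same assumed 5-state chain as earlier) and all hyperparameters are illustrative; only the replay buffer, mini-batch targets, and periodic target sync mirror the listed steps.

```python
import random
from collections import deque

def q_with_replay(episodes=300, alpha=0.2, gamma=0.95, eps=0.2,
                  batch=16, sync_every=50, seed=0):
    """DQN-style loop with a tabular Q standing in for the network:
    replay buffer, mini-batch TD updates, periodic target sync."""
    rng = random.Random(seed)
    acts = ("left", "right")
    Q = {(s, a): 0.0 for s in range(5) for a in acts}
    Q_target = dict(Q)                 # frozen copy used for targets
    buf = deque(maxlen=1000)           # replay buffer of transitions
    steps = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(acts) if rng.random() < eps else \
                max(acts, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r, done = -1.0, s2 == 4
            buf.append((s, a, r, s2, done))
            s = s2
            steps += 1
            if len(buf) >= batch:
                # Uniform mini-batch of stored transitions.
                for bs, ba, br, bs2, bdone in rng.sample(list(buf), batch):
                    target = br if bdone else \
                        br + gamma * max(Q_target[(bs2, x)] for x in acts)
                    Q[(bs, ba)] += alpha * (target - Q[(bs, ba)])
            if steps % sync_every == 0:
                Q_target = dict(Q)     # periodic target-network sync
    return Q

Q = q_with_replay()
```

Replay decorrelates consecutive transitions, and the frozen target table keeps the regression target from chasing its own updates, which is exactly the moving-target instability the bullet points describe.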
        Policy gradient estimator:

        \[ \nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) \, \hat{A}(s,a)] \]

        Deep RL diagnostic loop

        1. Collect experience into a replay buffer.
        2. Sample mini-batches for stable updates.
        3. Sync target networks periodically.
        4. Track reward curves and policy entropy.

        Policy gradient intuition

        Increase the probability of actions that led to higher return and decrease others.

        \[ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \, \hat{A} \]

        Jupyter-Style Math Notebook

        Work through the foundations topics in order. Each cell runs in a shared kernel, so results carry over between cells.


        Notebook goal

        Cover the Foundations math with a practical hands-on workshop: linear algebra, calculus, optimization, probability, activations, matrix calculus, plus PyTorch fundamentals and tutorials.

        Section 1: Python for math

        Lists, loops, and functions to prep for vector math.

        In [1] Numbers, lists, and summaries
        In [2] Functions and list math

        Section 2: Linear algebra (vectors + matrices)

        Dot products, norms, matrix-vector products, and matrix multiplication.

        In [3] Vector ops
        In [4] Matrices: matvec + matmul

        Section 3: Calculus + optimization

        Estimate derivatives and follow gradients downhill.

        In [5] Numerical derivative
        In [6] Gradient descent on a quadratic
        In [7] Plot a curve

        Section 4: Probability + activations

        Turn scores into probabilities and compare activation curves.

        In [8] Softmax + cross-entropy
        In [9] Activation functions
        In [10] Plot activation curves

        Section 5: Matrix calculus (batch gradients)

        Compute a vectorized gradient for linear regression.

        In [11] Batch gradient for linear regression

        Section 6: PyTorch fundamentals

        Tensors, autograd, modules, and optimizers. These cells print install guidance if PyTorch is unavailable.

        Install PyTorch for this lab

        This web lab runs in the browser and cannot install PyTorch. To run the PyTorch cells, open the notebook in local Jupyter or Colab and run the install cell below. For GPU builds, use the command from the PyTorch get-started page.

        In [12] Install PyTorch (local or Colab)
        In [13] Setup + device
        In [14] Tensors + matrix multiply
        In [15] Autograd basics
        In [16] Linear layer forward pass
        In [17] Loss + optimizer step

        Section 7: In-browser inference (ONNX Runtime Web)

        Run a small ONNX model directly in the browser using JavaScript.

        What this demo does

        Loads a tiny MNIST classifier and runs inference on simple 28x28 input patterns. No Python kernel required.

        If the model URL fails, use another ONNX model that accepts a 1x1x28x28 float tensor.

        JS MNIST inference demo

