📐 Partial Derivatives and Gradients
Why Derivatives Matter
In this chapter we train models via gradient-based optimization. Derivatives turn model errors into actionable parameter updates. For linear regression with prediction $\hat{y}=\theta_0+\theta_1 x$ and mean squared error
$$ J(\theta_0,\theta_1) = \frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2, $$

the partial derivatives are
$$ \frac{\partial J}{\partial \theta_0} = -\frac{2}{N} \sum_n (y_n-\hat{y}_n), \qquad \frac{\partial J}{\partial \theta_1} = -\frac{2}{N} \sum_n x_n\,(y_n-\hat{y}_n). $$

Gradient descent updates parameters by stepping opposite the gradient: $\theta \leftarrow \theta - \eta\,\nabla J$.
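The update rule above can be sketched directly in code. This is a minimal illustration, not a production optimizer; the synthetic data, learning rate, and step count are all assumptions chosen for the example:

```python
# Sketch of gradient descent for 1-D linear regression with MSE,
# using the two partial derivatives derived above.
import numpy as np

def mse_gradients(theta0, theta1, x, y):
    """Partial derivatives of J with respect to theta0 and theta1."""
    residuals = y - (theta0 + theta1 * x)          # y_n - y_hat_n
    d_theta0 = -2.0 / len(x) * residuals.sum()
    d_theta1 = -2.0 / len(x) * (x * residuals).sum()
    return d_theta0, d_theta1

# Illustrative synthetic data: y ≈ 1 + 2x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.05, 100)

theta0, theta1, eta = 0.0, 0.0, 0.1
for _ in range(500):
    g0, g1 = mse_gradients(theta0, theta1, x, y)
    theta0 -= eta * g0                             # theta <- theta - eta * grad J
    theta1 -= eta * g1

print(theta0, theta1)  # close to the generating parameters (1, 2)
```

Because $J$ is quadratic in $(\theta_0, \theta_1)$, this loop converges to the least-squares solution for any sufficiently small $\eta$.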
What Is a Partial Derivative?
For a function $f(x_1,\dots,x_d)$, the partial derivative with respect to $x_k$ measures how $f$ changes when only $x_k$ varies while the rest are held constant:
$$ \frac{\partial f}{\partial x_k}(\mathbf{x}) = \lim_{h\to 0} \frac{f(x_1,\dots, x_k+h,\dots, x_d) - f(x_1,\dots,x_k,\dots,x_d)}{h}. $$

The Gradient Vector
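The limit definition suggests a simple numerical approximation: evaluate the difference quotient with a small finite $h$. A minimal sketch (the function and evaluation point are illustrative):

```python
# Numerical partial derivative from the limit definition,
# using a forward difference with a small step h.

def partial(f, x, k, h=1e-6):
    """Approximate df/dx_k at the point x (a list of coordinates)."""
    x_plus = list(x)
    x_plus[k] += h           # perturb only coordinate k; the rest stay fixed
    return (f(x_plus) - f(x)) / h

f = lambda v: v[0]**2 * v[1]          # f(x, y) = x^2 y
print(partial(f, [3.0, 2.0], 0))      # ≈ df/dx = 2xy = 12
print(partial(f, [3.0, 2.0], 1))      # ≈ df/dy = x^2 = 9
```

The forward difference has $O(h)$ truncation error; a central difference $\bigl(f(x_k+h)-f(x_k-h)\bigr)/2h$ improves this to $O(h^2)$.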
The gradient collects all partial derivatives in a single vector:
$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \tfrac{\partial f}{\partial x_1} & \tfrac{\partial f}{\partial x_2} & \cdots & \tfrac{\partial f}{\partial x_d} \end{bmatrix}^{\!\top}. $$

It points in the direction of the steepest increase of $f$, and its magnitude equals the maximal directional derivative at $\mathbf{x}$.
Geometric intuition
Level sets (contours) of $f$ are orthogonal to the gradient: moving along $-\nabla f$ decreases $f$ most rapidly.
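The steepest-ascent property can be checked empirically: among many unit directions, the directional derivative $\nabla f \cdot \mathbf{u}$ is largest when $\mathbf{u}$ points along the gradient. A small sketch, with an illustrative function and sample count:

```python
# Empirical check: the unit direction with the largest directional
# derivative is (approximately) the normalized gradient.
import numpy as np

def grad_f(v):
    # Gradient of f(x, y) = x^2 + 3y^2
    return np.array([2 * v[0], 6 * v[1]])

x = np.array([1.0, 1.0])
g = grad_f(x)

rng = np.random.default_rng(1)
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit vectors

dd = dirs @ g                       # directional derivative along each direction
best = dirs[np.argmax(dd)]          # direction maximizing the derivative
print(best, g / np.linalg.norm(g))  # nearly the same unit vector
```

With enough sampled directions, `best` lands arbitrarily close to $\nabla f / \lVert \nabla f \rVert$, matching the geometric picture above.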
Why Gradients Matter in ML
Optimization methods such as gradient descent update parameters $\theta$ by following $-\nabla J(\theta)$, where $J$ is the objective. Accurate partial derivatives ensure stable learning and help diagnose vanishing or exploding sensitivity across dimensions.
Rules You’ll Use Often
- Linearity across coordinates: $\tfrac{\partial}{\partial x_k}(af + bg) = a\,\tfrac{\partial f}{\partial x_k} + b\,\tfrac{\partial g}{\partial x_k}$.
- Product (per coordinate): $\tfrac{\partial}{\partial x_k}(fg) = (\tfrac{\partial f}{\partial x_k})g + f(\tfrac{\partial g}{\partial x_k})$.
- Chain rule (vector form): if $g(\mathbf{x})\in\mathbb{R}$ and $h(t)\in\mathbb{R}$, then $\nabla(h\circ g) = h'(g(\mathbf{x}))\,\nabla g(\mathbf{x})$.
- Standard derivative tables: the familiar power, exp/log, and trig rules apply coordinate by coordinate.
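The vector chain rule from the list above can be verified numerically: for scalar $h$ and $g$, the gradient of the composition should equal $h'(g(\mathbf{x}))\,\nabla g(\mathbf{x})$. A sketch with illustrative choices $h = \exp$ and $g(x,y) = x^2 + y^2$:

```python
# Numerical check of the vector chain rule:
# grad(h∘g)(x) == h'(g(x)) * grad g(x), with h = exp, g(x,y) = x^2 + y^2.
import math

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function of a list x."""
    g = []
    for k in range(len(x)):
        xp = list(x); xp[k] += h
        xm = list(x); xm[k] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

g_fun = lambda v: v[0]**2 + v[1]**2
comp = lambda v: math.exp(g_fun(v))           # h∘g with h = exp

x = [0.5, -0.3]
lhs = num_grad(comp, x)                        # numerical grad of h∘g
scale = math.exp(g_fun(x))                     # h'(g(x)), since h' = exp
rhs = [scale * 2 * x[0], scale * 2 * x[1]]     # h'(g(x)) * grad g(x)
print(lhs, rhs)                                # the two vectors agree closely
```

The same pattern extends to deeper compositions, which is exactly what backpropagation automates.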
Worked Example
Let $f(x,y) = x^2 y + \sin(xy)$. Then
$$ \frac{\partial f}{\partial x} = 2xy + y\cos(xy), \qquad \frac{\partial f}{\partial y} = x^2 + x\cos(xy), $$

and

$$ \nabla f(x,y) = \begin{bmatrix} 2xy + y\cos(xy) \\ x^2 + x\cos(xy) \end{bmatrix}. $$

Practical Considerations
- Notation: $\nabla f$ is a column vector; be consistent when matching library conventions.
- Domains: respect domain restrictions (e.g., $\ln x$ requires $x>0$).
- Scaling: feature scaling affects the relative magnitude of partials and can improve optimization stability.
- Verification: cross-check symbolic results numerically via finite differences if needed.
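The verification tip applies directly to the worked example $f(x,y) = x^2 y + \sin(xy)$: the analytic gradient should match a finite-difference estimate to within the truncation error of the scheme. A minimal sketch (the test point is arbitrary):

```python
# Cross-checking the worked example's analytic gradient against
# central finite differences.
import math

def f(x, y):
    return x**2 * y + math.sin(x * y)

def analytic_grad(x, y):
    # Partial derivatives computed in the worked example above
    return (2 * x * y + y * math.cos(x * y),
            x**2 + x * math.cos(x * y))

def numeric_grad(x, y, h=1e-6):
    # Central differences in each coordinate, O(h^2) accurate
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

ga = analytic_grad(1.2, -0.7)
gn = numeric_grad(1.2, -0.7)
print(ga, gn)   # the two pairs agree to several decimal places
```

A disagreement here usually signals a sign error or a dropped term in the hand-computed derivative, which is exactly what this kind of gradient check is designed to catch.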