Dual Numbers and Automatic Differentiation
I recently came across a curious algebraic structure called the Dual Number system, denoted as $\mathbb{R}(\epsilon)$. The dual numbers are an extension of the real numbers, with a very interesting property: functions of dual numbers can be differentiated automatically. This is a powerful tool in numerical analysis and machine learning, as it allows for efficient computation of derivatives.
Definition
Let $\epsilon \neq 0$ be a new "number" such that $\epsilon^2 = 0$. The dual numbers are defined as:
$$\mathbb{R}(\epsilon) = \{ a + b\epsilon \mid a, b \in \mathbb{R} \}$$
where $a$ is the real part and $b$ is the dual part. The dual number $a + b\epsilon$ can be thought of as a point in a two-dimensional space, where the real part $a$ represents the x-coordinate and the dual part $b$ represents the y-coordinate.
Stepping away from the formal definition, the number $\epsilon$ is very strange. How can a non-zero number give zero when squared? The answer is that $\epsilon$ is not a real number; we can think of it as an infinitesimal: a quantity smaller in magnitude than any positive real number, yet not zero. Infinitesimals are studied in non-standard analysis, a branch of mathematics that extends the real number system to include them (strictly speaking, a nilpotent $\epsilon$ like ours does not appear there, but the intuition carries over). TL;DR: we just made up a new object completely, so we can endow it with whatever rules we want.
Basic Arithmetic and Properties
Given two dual numbers $x = a + b\epsilon$ and $y = c + d\epsilon$, we can perform basic arithmetic operations as follows:
Addition: $$x + y = (a + c) + (b + d)\epsilon$$
Subtraction: $$x - y = (a - c) + (b - d)\epsilon$$
Multiplication: $$x \cdot y = (a \cdot c) + (a \cdot d + b \cdot c)\epsilon$$
Division (assuming $c \neq 0$): $$\frac{x}{y} = \frac{(a + b\epsilon)}{(c + d\epsilon)} = \frac{a}{c} + \frac{(b \cdot c - a \cdot d)}{c^2} \epsilon$$
Note that $\mathbb{R}(\epsilon)$ is a ring, but not a field. A ring is a set of objects where addition and multiplication are well defined and behave like the operations we're familiar with, but division is not always possible. Here, division by a dual number whose real part is zero is undefined (notice the $c \neq 0$ condition above). Additionally, this structure is strange because it has zero divisors: non-zero elements $x$ and $y$ such that $x \cdot y = 0$. For example, any dual number of the form $0 + b\epsilon$ with $b \neq 0$ is a zero divisor, because $$ (0 + b\epsilon) \cdot (0 + d\epsilon) = 0 + 0 \epsilon = 0 $$
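As a quick sanity check of these rules, here is a minimal sketch that represents a dual number $a + b\epsilon$ as a plain (real, dual) pair of floats. The helper names dual_mul and dual_div are just illustrative, not part of any library.

# A dual number a + b*eps represented as the pair (a, b).
def dual_mul(x, y):
    a, b = x
    c, d = y
    return (a * c, a * d + b * c)  # the b*d*eps^2 term vanishes

def dual_div(x, y):
    a, b = x
    c, d = y
    assert c != 0, "real part of the divisor must be non-zero"
    return (a / c, (b * c - a * d) / (c ** 2))

print(dual_mul((0, 2), (0, 5)))  # (0, 0): two non-zero elements multiply to zero
print(dual_div((1, 1), (2, 3)))  # (0.5, -0.25), matching the formula above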
Automatic Differentiation
Let's say we have a function that describes the error of a machine learning model with respect to its parameters. To keep things simple, suppose there is only one parameter $w$ and the error function is $E(w)$. We want to minimize this error and find the optimal value of $w$. One way to do this is to compute the derivative of the error function with respect to $w$, denoted $E'(w)$, and then change $w$ in the direction of the negative slope of the error function. If $E$ is a simple function, we can compute the derivative by hand. In practice, however, the error function can be very complex, and computing the derivative by hand is tedious and error-prone. For example, imagine $E$ rolled random numbers to decide which mathematical operation to perform next (e.g., if we roll a 0, add 3; if we roll a 1, multiply by 2; if we roll a 2, square the number; and so on). In this case, it would be very difficult to compute the derivative by hand.
This is where automatic differentiation comes in. Using dual numbers, we can compute the derivative of a function automatically while evaluating the function itself: a single evaluation at some input yields both the value of the function and its derivative at that input.
Using Dual Numbers for Automatic Differentiation
Let's say we have the polynomial function $f(x) = x^2$. We can extend this function to accept dual numbers as input by replacing the real number $x = a$ with the dual number $x = a + \epsilon$: $$ f(a + \epsilon) = (a + \epsilon)^2 = a^2 + 2a\epsilon + \epsilon^2 = a^2 + 2a\epsilon $$ Here we make a very interesting observation: the coefficient of $\epsilon$, namely $2a$, is the derivative of $f$ evaluated at $x = a$.

This holds for any polynomial $f(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0$: to compute the derivative at $x = a$, evaluate $f$ at the dual number $a + \epsilon$ and take the coefficient of $\epsilon$ in the result. Expanding each power with the binomial theorem, every term containing $\epsilon^i$ with $i \geq 2$ vanishes because $\epsilon^2 = 0$, so only the first two terms of each expansion survive: $$ \begin{align*} f(a + \epsilon) &= a_n (a + \epsilon)^n + a_{n-1} (a + \epsilon)^{n-1} + \ldots + a_1 (a + \epsilon) + a_0 \\ &= a_n \sum_{i=0}^{n} \binom{n}{i} a^{n-i} \epsilon^i + a_{n-1} \sum_{i=0}^{n-1} \binom{n-1}{i} a^{n-1-i} \epsilon^i + \ldots + a_1 (a + \epsilon) + a_0 \\ &= a_n (a^n + n a^{n-1} \epsilon) + a_{n-1} (a^{n-1} + (n-1) a^{n-2} \epsilon) + \ldots + a_1 (a + \epsilon) + a_0 \\ &= a_n a^n + a_n n a^{n-1} \epsilon + a_{n-1} a^{n-1} + a_{n-1} (n-1) a^{n-2} \epsilon + \ldots + a_1 a + a_1 \epsilon + a_0 \\ &= (a_n a^n + a_{n-1} a^{n-1} + \ldots + a_1 a + a_0) + (a_n n a^{n-1} + a_{n-1} (n-1) a^{n-2} + \ldots + a_1) \epsilon \\ &= f(a) + f'(a) \epsilon \end{align*} $$ The real part of the result is $f(a)$ and the coefficient of $\epsilon$ is exactly $f'(a)$. The Python class below implements this scheme: the value field holds the real part, the derivative field holds the dual part, and the operator overloads encode the arithmetic rules from the previous section.
class AutoDiff:
    """A dual number a + b*eps: `value` holds the real part (the function
    value) and `derivative` holds the dual part (the carried derivative).
    The default derivative of 1.0 seeds an input variable x = a + eps;
    plain Python numbers mixed into the arithmetic act as constants
    (dual part 0)."""
    def __init__(self, value, derivative=1.0):
        self.value = value
        self.derivative = derivative
    def __add__(self, other):
        if isinstance(other, AutoDiff):
            return AutoDiff(self.value + other.value,
                            self.derivative + other.derivative)
        else:
            # Adding a constant shifts the value, not the derivative.
            return AutoDiff(self.value + other, self.derivative)
    def __radd__(self, other):
        return self.__add__(other)
    def __sub__(self, other):
        if isinstance(other, AutoDiff):
            return AutoDiff(self.value - other.value,
                            self.derivative - other.derivative)
        else:
            return AutoDiff(self.value - other, self.derivative)
    def __rsub__(self, other):
        # Only reached when `other` is a plain number (constant - x).
        return AutoDiff(other - self.value, -self.derivative)
    def __mul__(self, other):
        if isinstance(other, AutoDiff):
            # Product rule: (uv)' = u'v + uv'.
            return AutoDiff(self.value * other.value,
                            self.derivative * other.value +
                            self.value * other.derivative)
        else:
            return AutoDiff(self.value * other, self.derivative * other)
    def __rmul__(self, other):
        return self.__mul__(other)
    def __truediv__(self, other):
        if isinstance(other, AutoDiff):
            # Quotient rule: (u/v)' = (u'v - uv') / v^2.
            return AutoDiff(self.value / other.value,
                            (self.derivative * other.value -
                             self.value * other.derivative)
                            / (other.value ** 2))
        else:
            return AutoDiff(self.value / other, self.derivative / other)
    def __rtruediv__(self, other):
        # Only reached when `other` is a plain number (constant / x).
        return AutoDiff(other / self.value,
                        -other * self.derivative / (self.value ** 2))
    def __pow__(self, power):
        # Power rule: (x^n)' = n * x^(n-1) * x'.
        return AutoDiff(self.value ** power,
                        power * self.value ** (power - 1) * self.derivative)
    def __repr__(self):
        return f"AutoDiff(value={self.value}, derivative={self.derivative})"
# Example polynomial functions and their hand-computed derivatives for comparison.
def f(x: AutoDiff) -> AutoDiff:
return x**3 + 2*x**2 + 3*x + 4
def g(x: AutoDiff) -> AutoDiff:
return x**2 + 2*x + 1
def h(x: AutoDiff) -> AutoDiff:
return x**5 + 3*x**3 + 2*x + 1
def j(x: AutoDiff) -> AutoDiff:
return x**4 + 2*x**3 + 3*x**2 + 4*x + 5
def f_derivative(x: float) -> float: # for comparison
return 3*x**2 + 4*x + 3
def g_derivative(x: float) -> float: # for comparison
return 2*x + 2
def h_derivative(x: float) -> float: # for comparison
return 5*x**4 + 9*x**2 + 2
def j_derivative(x: float) -> float: # for comparison
return 4*x**3 + 6*x**2 + 6*x + 4
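As a quick sanity check (expected output computed by hand): evaluating $f$ at the dual number $2 + \epsilon$ should give $f(2) = 26$ in the real part and $f'(2) = 23$ in the dual part.

print(f(AutoDiff(2.0)))  # AutoDiff(value=26.0, derivative=23.0)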
import numpy as np
import matplotlib.pyplot as plt

x_vals = np.linspace(-3, 3, 100)
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
# One panel per function: (function, exact derivative, curve label, name).
panels = [
    (f, f_derivative, '$f(x) = x^3 + 2x^2 + 3x + 4$', 'f'),
    (g, g_derivative, '$g(x) = x^2 + 2x + 1$', 'g'),
    (h, h_derivative, '$h(x) = x^5 + 3x^3 + 2x + 1$', 'h'),
    (j, j_derivative, '$j(x) = x^4 + 2x^3 + 3x^2 + 4x + 5$', 'j'),
]
for ax, (func, exact_deriv, label, name) in zip(axs.flat, panels):
    # One evaluation per point yields both the value and the derivative.
    results = [func(AutoDiff(x)) for x in x_vals]
    ax.plot(x_vals, [r.value for r in results], label=label)
    ax.plot(x_vals, [r.derivative for r in results], label=f"${name}'(x)$ autodiff")
    ax.plot(x_vals, [exact_deriv(x) for x in x_vals], label=f"${name}'(x)$ exact",
            linestyle='--')
    ax.axhline(0, color='black', lw=0.5, ls='--')
    ax.axvline(0, color='black', lw=0.5, ls='--')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend()
    ax.grid()
plt.tight_layout()
plt.show()
Non-Polynomial Functions
The above method works for polynomial functions, but what about non-polynomial functions? The good news is that we can still use dual numbers to compute the derivative of non-polynomial functions. The key is to use the Taylor series expansion of the function around the point $x = a$. The Taylor series expansion of a function $f(x)$ around the point $x = a$ is given by:
$$ f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n $$
If we replace $x$ with the dual number $x = a + \epsilon$, we get:
$$ f(a + \epsilon) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} ((a + \epsilon) - a)^n = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} \epsilon^n $$
Since $\epsilon^2 = 0$, all terms with $n \geq 2$ vanish, and we are left with:
$$ f(a + \epsilon) = f(a) + f'(a) \epsilon $$
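This identity suggests a direct way to extend primitive functions to dual numbers: define $f(a + b\epsilon) = f(a) + f'(a)\,b\epsilon$. Here is a minimal sketch for the exponential, assuming the AutoDiff class above (exp here is a free function we define ourselves, not a library API):

import math

def exp(x: AutoDiff) -> AutoDiff:
    # e^(a + b*eps) = e^a + e^a * b * eps, since (e^x)' = e^x.
    return AutoDiff(math.exp(x.value), math.exp(x.value) * x.derivative)

print(exp(AutoDiff(2.0)))  # value = e^2 ≈ 7.389, derivative = e^2 ≈ 7.389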
More generally, we can compute the derivative of any non-polynomial function whose Taylor series we can evaluate around the point $x = a$. For example, the Taylor series expansion of the exponential function $e^x$ around the point $x = 0$ is:
$$ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} $$
We can compute this sum to an arbitrary degree of accuracy by truncating it to a finite polynomial. For example, to approximate $e^x$ to second order, we use the first three terms of the series:
$$ e^x \approx 1 + x + \frac{x^2}{2} $$
If we replace $x$ with the dual number $x = a + \epsilon$, we get:
$$ e^{a + \epsilon} \approx 1 + (a + \epsilon) + \frac{(a + \epsilon)^2}{2} = 1 + a + \epsilon + \frac{a^2}{2} + a \epsilon + \frac{\epsilon^2}{2} $$
Since $\epsilon^2 = 0$, we can ignore the last term, and we are left with:
$$ e^{a + \epsilon} \approx (1 + a + \frac{a^2}{2}) + (1 + a) \epsilon $$
The coefficient of $\epsilon$, namely $1 + a$, is exactly the derivative of the truncated polynomial $1 + x + \frac{x^2}{2}$ at $x = a$, and it approximates the true derivative $e^a$ (indeed, $e^a \approx 1 + a$ for small $a$). Keeping more terms of the series makes both the value and the derivative as accurate as we like. Very cool!
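Here is a sketch of this idea in code, again using the AutoDiff class from before. The name exp_taylor and the number of terms are illustrative choices: summing the truncated series with dual-number arithmetic carries the derivative along for free.

import math

def exp_taylor(x: AutoDiff, terms: int = 20) -> AutoDiff:
    # Sum the first `terms` terms of e^x = sum_n x^n / n! using
    # dual-number arithmetic; the dual part accumulates the derivative.
    result = AutoDiff(0.0, 0.0)
    term = AutoDiff(1.0, 0.0)  # x^0 / 0! (a constant: dual part 0)
    for n in range(terms):
        result = result + term
        term = term * x * (1.0 / (n + 1))  # next term: x^(n+1) / (n+1)!
    return result

x = AutoDiff(1.5)
print(exp_taylor(x))  # both parts converge to e^1.5 ≈ 4.4817
print(math.exp(1.5))  # reference value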