Metric Spaces and Embeddings¶
Metric Spaces¶
A metric space $(X, d)$ is a set $X$ with a distance function $d: X \times X \to \mathbb{R}$. The distance function satisfies the following properties:
- $d(x, y) \geq 0$ for all $x, y \in X$.
- $d(x, y) = 0$ if and only if $x = y$.
- $d(x, y) = d(y, x)$ for all $x, y \in X$.
- $d(x, y) \leq d(x, z) + d(z, y)$ for all $x, y, z \in X$.
Some examples of metric spaces (the first two are sketched in code below):
- A connected weighted graph $G = (V, E)$ with positive edge weights is a metric space, with $X = V$ and $d(x, y) =$ the length of the shortest path between $x$ and $y$.
- The DNA space with $X = \{A, C, G, T\}^n$ and $d(x, y) =$ the number of positions where $x$ and $y$ differ (the Hamming distance).
- The Euclidean space $\mathbb{R}^n$ with $X = \mathbb{R}^n$ and $d(x, y) = \|x - y\|_2$.
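To make these examples concrete, here is a minimal standalone sketch of the first two metrics: the shortest-path metric on a small weighted graph (computed with Floyd–Warshall) and the Hamming distance on DNA strings. The graph weights and the strings are made-up illustrative values.
import itertools
import numpy as np

def hamming(x, y):
    # Number of positions where strings x and y differ
    return sum(a != b for a, b in zip(x, y))

# Shortest-path metric on a 4-vertex weighted graph via Floyd-Warshall;
# INF marks a missing edge, and the edge weights are arbitrary examples.
INF = float('inf')
w = np.array([[0, 2, INF, 7],
              [2, 0, 3, INF],
              [INF, 3, 0, 1],
              [7, INF, 1, 0]], dtype=float)
d = w.copy()
for k, i, j in itertools.product(range(4), repeat=3):  # k varies slowest
    d[i, j] = min(d[i, j], d[i, k] + d[k, j])

print(hamming("ACGT", "AGGA"))  # 2
print(d[0, 3])                  # 6.0, via the path 0 -> 1 -> 2 -> 3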
Maps¶
A map $f: X \to Y$ between two metric spaces $(X, d_X)$ and $(Y, d_Y)$ is called an embedding of $X$ into $Y$. The embedding is said to be distance-preserving (isometric) if $d_Y(f(x), f(y)) = d_X(x, y)$ for all $x, y \in X$. However, distance-preserving embeddings rarely exist: for instance, three points at pairwise distance $1$ cannot be placed on the real line, since on a line one pairwise distance must equal the sum of the other two. Instead, we often consider embeddings that are "almost" distance-preserving.
Embeddings and Distortions¶
An embedding with distortion $\alpha$ of a metric space $(X, d_X)$ into another metric space $(Y, d_Y)$ is a map $f: X \to Y$ such that there exists a constant $r > 0$ for which $r \cdot d_X(x, y) \leq d_Y(f(x), f(y)) \leq \alpha r \cdot d_X(x, y)$ for all $x, y \in X$. The distortion of $f$ is the smallest $\alpha$ for which such an $r$ exists.
We can equivalently define the distortion in terms of the expansion and contraction. Given a map $f: X \to Y$, let:
Expansion($f$) $= \max_{x \neq y \in X} \frac{d_Y(f(x), f(y))}{d_X(x, y)}$
Contraction($f$) $= \max_{x \neq y \in X} \frac{d_X(x, y)}{d_Y(f(x), f(y))}$.
Define the distortion of $f$ as $\alpha = \text{Expansion}(f)\cdot\text{Contraction}(f)$.
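As a sanity check on these definitions, here is a small sketch that measures the expansion, contraction, and distortion of a map between two finite metric spaces given as pairwise distance matrices. The three-point example and the map onto the real line are made up for illustration.
import numpy as np

def map_distortion(d_X, d_Y):
    # d_X[i, j] = d_X(x_i, x_j); d_Y[i, j] = d_Y(f(x_i), f(x_j))
    n = len(d_X)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    expansion = max(d_Y[i, j] / d_X[i, j] for i, j in pairs)
    contraction = max(d_X[i, j] / d_Y[i, j] for i, j in pairs)
    return expansion, contraction, expansion * contraction

# Uniform metric on three points, mapped to 0, 1, 2 on the real line
d_X = np.ones((3, 3)) - np.eye(3)
pts = np.array([0.0, 1.0, 2.0])
d_Y = np.abs(pts[:, None] - pts[None, :])
print(map_distortion(d_X, d_Y))  # (2.0, 1.0, 2.0)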
Bourgain Embedding¶
Oftentimes we will have a finite metric space $(X, d)$ where the notion of distance is abstract. An example of such a metric space is a categorical dataset where each point is a category, the distance between two different categories is $1$, and the distance between identical categories is $0$. Another example is the Hamming cube, where each point is a binary string of length $n$ and the distance between two points is the number of positions where the strings differ.
When working with such metric spaces, many machine learning techniques fail because the distance function is not Euclidean. One way to address this issue is to embed the metric space into $\mathbb{R}^k$ for some $k$ such that the distance function is approximately preserved. The Bourgain embedding is a technique that does just that.
Given an arbitrary finite metric space $(X, d)$, a Bourgain embedding is a way to embed the metric space into the Euclidean space $\mathbb{R}^k$ with a distortion of at most $O(\log n)$, where $n$ is the number of points in the metric space and $k \in O(\log^2 n)$. The proof of this theorem is beyond the scope of this notebook, but we will see how to implement Bourgain embeddings in practice. This is a randomized algorithm, with the distortion bound holding with high probability.
The Bourgain embedding algorithm is as follows:
Let $c$ be a sufficiently large constant. For each point $x \in X$, define its embedding vector $f(x)$ in the following steps:
- For each $i \in \{1, 2, \dots, \lceil \log_2(n) \rceil \}$ and each $j \in \{1, 2, \dots, c \cdot \lceil \log_2(n) \rceil \}$:
    - Choose a random subset $S_{ij} \subseteq X$, where each $y \in X$ is included in $S_{ij}$ independently with probability $2^{-i}$. The same subsets $S_{ij}$ are shared by all points $x$.
    - Compute $d(x, S_{ij}) = \min_{y \in S_{ij}} d(x, y)$, the minimum distance from $x$ to any point in $S_{ij}$.
- Construct the embedding vector $f(x) = \langle d(x, S_{1,1}), d(x, S_{1,2}), \dots, d(x, S_{\lceil \log_2(n) \rceil,\, c \lceil \log_2(n) \rceil}) \rangle$, which has $\lceil \log_2(n) \rceil \cdot c \lceil \log_2(n) \rceil = O(\log^2 n)$ coordinates.
import numpy as np
import math
import random
class BourgainEmbedding:
    def __init__(self, X, d):
        """
        Initializes the Bourgain Embedding object using the Bourgain
        Embedding algorithm
        Args:
            X (sequence): The n points of the metric space. Only len(X)
                is used here; all geometry comes from the distance matrix.
            d (np.ndarray): Distance matrix of size (n, n).
        """
        self.X = X
        self.d = d
        self.n = len(X)
        self.log_n = int(math.ceil(math.log2(self.n)))
        self.m = 576 * self.log_n  # Chosen c = 576 as per the paper
        # Sample the random subsets S_ij once, shared by every point;
        # resampling them per point would break the embedding guarantee.
        self.S = []
        for i in range(self.log_n):
            for _ in range(self.m):
                # Code index i corresponds to i+1 in the description,
                # so each point is included with probability 2^{-(i+1)}
                mask = np.random.rand(self.n) < 2.0 ** -(i + 1)
                S_ij = np.flatnonzero(mask)
                if len(S_ij) == 0:
                    # Ensure that S_ij is non-empty
                    S_ij = np.array([np.random.randint(self.n)])
                self.S.append(S_ij)
        self.f_X = np.zeros((self.n, self.log_n * self.m))
        self.d_prime = self.__embed()
    def __map(self, x_idx):
        """
        Given an index x_idx, returns the value of f(x) for the point X[x_idx]
        Args:
            x_idx (int): Index of the point in X
        """
        f_x = np.zeros(self.log_n * self.m)
        for t, S_ij in enumerate(self.S):
            # Coordinate t is d(x, S_ij), the minimum distance from x to S_ij
            f_x[t] = self.d[x_idx, S_ij].min()
        return f_x
    # Embeds the points in X into a Euclidean space of dimension log_n * m
    def __embed(self):
        # Compute the value of f(x) for each x in X
        f_X = np.zeros((self.n, self.log_n * self.m))
        for i in range(self.n):
            f_X[i] = self.__map(i)
        self.f_X = f_X
        # Compute the pairwise distances between the points in
        # the embedded space using the L2 (Euclidean) norm
        d_prime = np.zeros((self.n, self.n))
        for i in range(self.n):
            for j in range(self.n):
                d_prime[i][j] = np.linalg.norm(f_X[i] - f_X[j])
        return d_prime
def distance(self, i, j):
"""
Returns the pairwise distance between the points X[i] and X[j]
in the embedded space
Args:
i (int): Index of the first point in X
j (int): Index of the second point in X
"""
return self.d_prime[i][j]
def get_distance_matrix(self):
"""
Returns the pairwise distance matrix of the embedded points
"""
return self.d_prime
def get_embedding(self):
"""
Returns the embedded points
"""
return self.f_X
Testing the Bourgain embedding¶
Example: DNA space with the Hamming metric¶
We will test the Bourgain embedding algorithm on the DNA space with the Hamming metric: the set of all DNA sequences of length $n$, where the distance between two sequences is the number of positions in which they differ. We will embed a random sample of distinct sequences into $\mathbb{R}^k$ using the Bourgain embedding algorithm and compute the distortion of the embedding.
def random_dna(length):
return ''.join(random.choice('ACGT') for _ in range(length))
def ham_dist(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)
def gen_seqs(num_sequences, sequence_length):
sequences = [random_dna(sequence_length)]
for _ in range(num_sequences - 1):
new_sequence = random_dna(sequence_length)
while any(ham_dist(new_sequence, seq) == 0 for seq in sequences):
new_sequence = random_dna(sequence_length)
sequences.append(new_sequence)
return sequences
num_sequences = 100
sequence_length = 200
dnaX = gen_seqs(num_sequences, sequence_length)
d = np.zeros((num_sequences, num_sequences))
for i in range(num_sequences):
for j in range(num_sequences):
d[i][j] = ham_dist(dnaX[i], dnaX[j])
# Perform the embedding
embedding = BourgainEmbedding(dnaX, d)
# Print statistics (expansion/contraction and distortion of embedding)
expansions = []
contractions = []
for i in range(num_sequences):
for j in range(num_sequences):
if i != j:
expansions.append(embedding.distance(i, j) / d[i, j])
contractions.append(d[i, j] / embedding.distance(i, j))
expansion = np.max(expansions)
contraction = np.max(contractions)
print("Embedding Statistics:")
print("=====================")
print(f"Old Dimension: {sequence_length}, New Dimension: {embedding.m}")
print(f"Expansion: {expansion}, Contraction: {contraction}")
print("Distortion: ", expansion*contraction)
Embedding Statistics:
=====================
Old Dimension: 200, New Dimension: 7000
Expansion: 121.65237319510048, Contraction: 0.011089042951659453
Distortion:  1.3490083915317743
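Although the expansion and contraction are individually far from $1$ (the embedding rescales all distances by a large factor), their product, the distortion, is only about $1.35$, well below the $O(\log n)$ guarantee for $n = 100$ points.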
# Perform multidimensional scaling and plot the old and new distances, side by side
from sklearn.manifold import MDS
import matplotlib.pyplot as plt
dpr = embedding.get_distance_matrix()
mdsEmbed = MDS(n_components=2, dissimilarity='precomputed')
mdsOld = mdsEmbed.fit_transform(d)
mdsNew = mdsEmbed.fit_transform(dpr)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].scatter(mdsOld[:, 0], mdsOld[:, 1])
axs[0].set_title("MDS of Hamming Distances")
axs[1].scatter(mdsNew[:, 0], mdsNew[:, 1])
axs[1].set_title("MDS of Embedded Distances (L2 norm)")
plt.show()
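If the embedding preserves the metric well, the two scatter plots should show roughly the same configuration of points up to rotation and reflection, since MDS recovers positions only up to such transformations.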