Clustering DNA Sequences with Bourgain Embedding¶

This blog post is a continuation of the previous blog post on Bourgain Embedding. TL;DR: Bourgain Embedding is a technique for embedding an arbitrary finite metric space (e.g. DNA sequences under an edit-distance metric) into $\mathbb{R}^{O(\log^2 n)}$ equipped with the $l^2$-norm, with distortion at most $O(\log n)$. The previous blog post explains the Bourgain Embedding in detail and provides a Python implementation. In this blog post, we will use Bourgain Embedding to cluster DNA sequences.

The dataset consists of 1000 DNA sequences of varying lengths. Each DNA sequence has a corresponding "Target" label indicating whether the sequence is non-coding (0) or coding (1) for a gene. The goal is to cluster the sequences based on their similarity. We will use Bourgain Embedding to embed the DNA sequences into an $\mathbb{R}^{O(\log^2 n)}$ vector space and then use K-means clustering to make predictions.

Why Bourgain Embedding?¶

The embedding is necessary because K-means clustering requires the input data to live in the metric space $(\mathbb{R}^k, ||\cdot||_{2})$. The DNA sequences are not vectors in a Euclidean space, so we first need to embed them into $\mathbb{R}^k$ using Bourgain Embedding.

Recently, there has been a lot of interest in embedding strings (e.g. texts) into a latent space using deep learning techniques. However, not every problem requires such complicated machinery. Bourgain Embedding is a simple and elegant technique for embedding strings into Euclidean space, and it comes with theoretical guarantees on the distortion!
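
For intuition, here is a minimal sketch of one common presentation of the construction (a hedged illustration only; the embedding actually used below is the BourgainEmbedding class from the previous post, and the helper name bourgain_sketch, the exact subset sizes, and the base-2 logs are assumptions of this sketch). Each coordinate of the embedding is the distance from a point to one randomly sampled subset of the data: for each scale $t = 1, \ldots, \lceil \log_2 n \rceil$ we draw $c \cdot \lceil \log_2 n \rceil$ random subsets $S$ of size $2^t$ and set $f(x)_S = d(x, S) = \min_{s \in S} d(x, s)$, giving $c \cdot \lceil \log_2 n \rceil^2$ coordinates in total.

In [ ]:
import math
import random

def bourgain_sketch(points, d, c=10, seed=0):
    """Illustrative Bourgain-style embedding: one coordinate per sampled subset."""
    rng = random.Random(seed)
    n = len(points)
    L = math.ceil(math.log2(n))
    # For each scale t = 1..L, draw c*L random subsets of size min(2^t, n)
    subsets = [rng.sample(range(n), min(2 ** t, n))
               for t in range(1, L + 1)
               for _ in range(c * L)]
    # Each point maps to its vector of distances to the sampled subsets
    return [[min(d(points[i], points[j]) for j in S) for S in subsets]
            for i in range(n)]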

In [35]:
import numpy as np
import pandas as pd
import Levenshtein
from tqdm import tqdm
import pickle
%run ../BourgainEmbeddings/BourgainEmbedding.ipynb

Load the data into a metric space¶

Define $(X, d)$, where $X \subset \{A, G, C, T\}^*$ is our set of DNA sequences (strings of varying lengths over the nucleotide alphabet) and $d$ is the Levenshtein distance (edit distance) between sequences. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another (convince yourself that this is indeed a metric on $X$).
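
As a quick illustration (the toy strings below are arbitrary examples, not from the dataset), the metric axioms can be spot-checked directly:

In [ ]:
a, b, c = 'GATTACA', 'GACTATA', 'ACGT'
assert Levenshtein.distance(a, a) == 0  # identity: d(x, x) = 0
assert Levenshtein.distance(a, b) == Levenshtein.distance(b, a)  # symmetry
# Triangle inequality: d(a, c) <= d(a, b) + d(b, c)
assert Levenshtein.distance(a, c) <= Levenshtein.distance(a, b) + Levenshtein.distance(b, c)
print(Levenshtein.distance(a, b))  # 2: two substitutions turn GATTACA into GACTATA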

In [ ]:
data = pd.read_csv('./dna_seqs.csv')
X = data['DNA_sequence'].sample(1000)

# Look up the targets for the chosen samples by their original indices
targets = data['Target'].loc[X.index].to_numpy(dtype=float)

# Convert X to a numpy array
X = X.to_numpy()

# Generate the pairwise distance matrix; the Levenshtein distance is
# symmetric, so each pair only needs to be computed once
distance = np.zeros((len(X), len(X)))
for i in tqdm(range(len(X))):
    for j in range(i + 1, len(X)):
        distance[i, j] = Levenshtein.distance(X[i], X[j])
        distance[j, i] = distance[i, j]
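
Before embedding, it is worth a quick sanity check that the matrix behaves like a metric's distance matrix (symmetric, with a zero diagonal):

In [ ]:
assert np.allclose(distance, distance.T)  # symmetry
assert np.all(np.diag(distance) == 0)     # d(x, x) = 0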

Embed the data into $(\mathbb{R}^{O(\log^2 n)}, ||\cdot||_2)$¶

This class is imported from the previous blog post. It contains the Bourgain Embedding implementation, with $c = 10$. Therefore, for this dataset (using base-2 logarithms), the embedding will be in $\mathbb{R}^{10 \cdot \lceil \log_2^2(1000) \rceil} = \mathbb{R}^{1000}$.
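
As a quick arithmetic sanity check of that dimension (assuming base-2 logarithms, as above):

In [ ]:
import math
c, n = 10, 1000
print(c * math.ceil(math.log2(n) ** 2))  # 1000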


In [ ]:
be = BourgainEmbedding(X, distance)

Cluster the embedded data¶

Now that we've transformed the metric space $(X, d) \to (\mathbb{R}^{1000}, ||\cdot||_2)$, we can use K-means clustering to group the sequences, using the KMeans class from sklearn.cluster. Since K-means assigns its cluster IDs arbitrarily, we align the predicted cluster labels with the targets before printing the accuracy.

In [201]:
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(be.get_embedding(), targets, test_size=0.25, random_state=42)

kmeans = KMeans(n_clusters=2).fit(X_train)
y_pred = kmeans.predict(X_test)

# K-means assigns its two cluster IDs arbitrarily, so flip the labeling if
# the flipped version agrees better with the targets
acc = accuracy_score(y_test, y_pred)
acc = max(acc, 1 - acc)

print(f'Accuracy: {100*acc}%')
Accuracy: 62.8%

Voila! $62.8\%$ accuracy is not bad for a simple clustering algorithm with basic hyper-parameters. The accuracy could be improved with more sophisticated features and clustering algorithms. This technique goes to show that there is no need to train a deep neural network for every problem. Sometimes, simple algorithms can do the job just as well.

Epilogue¶

For comparison, we can train a deep learning model to classify the DNA sequences and compare its accuracy with that of the K-means clustering. The DNA sequences are one-hot-encoded and fed into a Convolutional Neural Network (CNN) with a convolution layer, a max-pooling layer, and a dense head. The model is trained for 50 epochs with a batch size of 32, and its test accuracy is printed at the bottom.

In [250]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

# Define one-hot encoding for nucleotides
mapping = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0], 'G': [0, 0, 1, 0], 'C': [0, 0, 0, 1]}


# One-hot encode every sequence
encoded_sequences = [
    [mapping[char] for char in seq] for seq in X
]

# Pad Sequences
max_len = max(len(seq) for seq in X)
padded_sequences = np.array([
    np.pad(seq, ((0, max_len - len(seq)), (0, 0)), mode='constant')
    for seq in encoded_sequences
])


X_train, X_test, y_train, y_test = train_test_split(padded_sequences, targets, test_size=0.25, random_state=42)
In [251]:
class DNADataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.float32)  # One-hot encoded sequences
        self.labels = torch.tensor(labels, dtype=torch.float32)       # Labels (binary classification)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

# Create DataLoader for batch processing
batch_size = 32
train_dataset = DNADataset(X_train, y_train)
test_dataset = DNADataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
In [252]:
class DNAClassifierOneHot(nn.Module):
    def __init__(self, input_dim, num_filters, kernel_size, max_len):
        super(DNAClassifierOneHot, self).__init__()
        # Adaptive pooling makes the network length-agnostic, so max_len is
        # kept only for interface symmetry with the data-preparation step
        self.conv = nn.Conv1d(input_dim, num_filters, kernel_size, padding='same')
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Sequential(
            nn.Linear(num_filters, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = x.permute(0, 2, 1)        # (batch, length, channels) -> (batch, channels, length)
        x = self.conv(x)
        x = self.pool(x).squeeze(-1)  # global max-pool over the sequence dimension
        x = self.fc(x)
        return x
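
As a quick smoke test of the architecture (the all-zero dummy batch below is an illustrative assumption, not real data), two padded sequences of shape (max_len, 4) should map to two sigmoid outputs:

In [ ]:
dummy = torch.zeros(2, max_len, 4)  # (batch, length, one-hot channels)
out = DNAClassifierOneHot(input_dim=4, num_filters=64, kernel_size=3, max_len=max_len)(dummy)
print(out.shape)  # torch.Size([2, 1])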
In [253]:
input_dim = 4
num_filters = 64
kernel_size = 3


model = DNAClassifierOneHot(input_dim, num_filters, kernel_size, max_len)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


num_epochs = 50
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for sequences, labels in train_loader:
        sequences, labels = sequences.to(device), labels.to(device)

        # Forward pass
        outputs = model(sequences).squeeze()
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate for optional per-epoch logging
        total_loss += loss.item()
In [246]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for sequences, labels in test_loader:
        sequences, labels = sequences.to(device), labels.to(device)
        outputs = model(sequences).squeeze()
        predictions = (outputs > 0.5).float()  # threshold the sigmoid outputs
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Test Accuracy: {100*accuracy:.4f}%")
Test Accuracy: 70.0000%

The CNN model reaches an accuracy of $70.0\%$, which is better than the K-means clustering. However, the CNN model is more complex and requires more hyper-parameter tuning. K-means clustering is a simple algorithm that clusters the DNA sequences with decent accuracy. Sometimes, simple is better!