Clustering DNA Sequences with Bourgain Embedding¶
This blog post is a continuation of the previous blog post on Bourgain Embedding. TL;DR: Bourgain Embedding is a technique for embedding an arbitrary finite metric space (e.g. DNA sequences) into $\mathbb{R}^{O(\log^2 n)}$ equipped with the $\ell^2$-norm, with distortion at most $O(\log n)$. The previous blog post explains the Bourgain Embedding in detail and provides a Python implementation. In this blog post, we will use Bourgain Embedding to cluster DNA sequences.
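Concretely, the distortion guarantee says that (after rescaling) the embedding $f : X \to \mathbb{R}^{O(\log^2 n)}$ preserves all pairwise distances up to a logarithmic factor:
$$\frac{d(x, y)}{O(\log n)} \;\leq\; \lVert f(x) - f(y) \rVert_2 \;\leq\; d(x, y) \quad \text{for all } x, y \in X.$$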
The dataset consists of 1000 DNA sequences of various lengths. Each DNA sequence has a corresponding "Target" attached to it, which indicates whether the sequence is non-coding (0) or coding (1) for a gene. The goal is to cluster the sequences based on their similarity. We will use Bourgain Embedding to embed the DNA sequences into the $\mathbb{R}^{O(\log^2 n)}$ vector space and then use K-means clustering to make predictions.
Why Bourgain Embedding?¶
The embedding is necessary because K-means clustering requires the input data to live in the metric space $(\mathbb{R}^k, \lVert\cdot\rVert_{2})$: its centroid-update step takes arithmetic means of points, which only makes sense in a vector space. DNA sequences are not vectors in Euclidean space, so we first embed them as vectors in $\mathbb{R}^k$ using Bourgain Embedding.
Recently, there has been a lot of interest in embedding strings (e.g. texts) into a latent space using deep learning techniques. However, not every problem requires such complicated machinery. Bourgain Embedding is a simple and elegant technique for embedding strings into Euclidean space, and it comes with theoretical guarantees on the distortion!
import numpy as np
import pandas as pd
import Levenshtein
from tqdm import tqdm
import pickle
%run ../BourgainEmbeddings/BourgainEmbedding.ipynb
Load the data into a metric space¶
Define $(X, d)$, where $X \subset \{A, G, C, T\}^*$ is our set of DNA sequences (strings of varying length over the nucleotide alphabet) and $d$ is the Levenshtein distance (edit distance) between sequences. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another (convince yourself why this is indeed a metric on $X$).
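As a quick sanity check (illustrative, not a proof), we can verify the metric axioms on a few toy strings; the strings below are arbitrary examples.
# Check the metric axioms on toy strings
a, b, c = 'GATTACA', 'GACTATA', 'ACGT'
assert Levenshtein.distance(a, a) == 0                           # d(x, x) = 0
assert Levenshtein.distance(a, b) == Levenshtein.distance(b, a)  # symmetry
assert Levenshtein.distance(a, c) <= (
    Levenshtein.distance(a, b) + Levenshtein.distance(b, c)      # triangle inequality
)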
data = pd.read_csv('./dna_seqs.csv')
X = data['DNA_sequence'].sample(1000)
# Look up the targets for the chosen samples (based on their indices)
targets = data['Target'].loc[X.index].to_numpy().astype(float)
# Convert X to numpy array
X = X.to_numpy()
# Generate the pairwise Levenshtein distance matrix
# (edit distance is symmetric, so only the upper triangle is computed)
distance = np.zeros((len(X), len(X)))
for i in tqdm(range(len(X))):
    for j in range(i + 1, len(X)):
        distance[i, j] = distance[j, i] = Levenshtein.distance(X[i], X[j])
Embed the data into $(\mathbb{R}^{O(\log^2 n)}, ||\cdot||_2)$¶
This class is imported from the previous blog post. It contains the Bourgain Embedding implementation, with $c = 10$. For this dataset ($n = 1000$), the embedding therefore lands in $\mathbb{R}^{10 \cdot \lceil (\log_2 1000)^2 \rceil} = \mathbb{R}^{10 \cdot 100} = \mathbb{R}^{1000}$.
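As a quick sanity check of that dimension count (assuming the $c \cdot \lceil (\log_2 n)^2 \rceil$ formula from the previous post's implementation):
import math

# Embedding dimension for n = 1000 points with c = 10
n, c = 1000, 10
dim = c * math.ceil(math.log2(n) ** 2)  # ceil(9.9658...**2) = ceil(99.32) = 100
print(dim)  # 1000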
be = BourgainEmbedding(X, distance)
Cluster the embedded data¶
Now that we've transformed the metric space $(X, d) \to (\mathbb{R}^{1000}, \lVert\cdot\rVert_2)$, we can use K-means clustering to cluster the sequences. We will use the KMeans class from sklearn.cluster to cluster the sequences, and then print the accuracy of the clustering against the targets (see the caveat on scoring after the result below).
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(be.get_embedding(), targets, test_size=0.25, random_state=42)
kmeans = KMeans(n_clusters=2).fit(X_train)
y_pred = kmeans.predict(X_test)
print(f'Accuracy: {100*accuracy_score(y_test, y_pred)}%')
Accuracy: 62.8%
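A caveat on the score above: K-means assigns arbitrary cluster indices, so an accuracy below $50\%$ would simply mean the cluster labels came out flipped relative to the targets; here they happen to line up. A label-invariant score for the binary case (a minimal sketch; best_acc is an illustrative name) is:
# Accuracy invariant to the arbitrary 0/1 cluster labeling (binary case only)
best_acc = max(accuracy_score(y_test, y_pred), accuracy_score(y_test, 1 - y_pred))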
Voila! $62.8\%$ accuracy is not bad for a simple clustering algorithm with basic hyper-parameters. The accuracy can be improved by using more sophisticated features and clustering algorithms. This technique goes to show that there is no need to train a deep neural network for every problem. Sometimes, simple algorithms can do the job just as well.
Epilogue¶
For a comparison, we can build a deep learning model to classify the DNA sequences and compare its accuracy with the K-means clustering. The DNA sequences are one-hot-encoded and fed into a Convolutional Neural Network (CNN) with a convolution layer, a max-pooling layer, and a dense head. The model is trained for 50 epochs with a batch size of 32, and its test accuracy is printed at the bottom.
import torch
import torch.nn as nn  # needed for nn.Module, nn.Conv1d, etc. below
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
# Define one-hot encoding for nucleotides
mapping = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0], 'G': [0, 0, 1, 0], 'C': [0, 0, 0, 1]}
encoded_sequences = [
[mapping[char] for char in seq] for seq in X
]
# Pad Sequences
max_len = max(len(seq) for seq in X)
padded_sequences = np.array([
np.pad(seq, ((0, max_len - len(seq)), (0, 0)), mode='constant')
for seq in encoded_sequences
])
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, targets, test_size=0.25, random_state=42)
class DNADataset(Dataset):
def __init__(self, sequences, labels):
self.sequences = torch.tensor(sequences, dtype=torch.float32) # One-hot encoded sequences
self.labels = torch.tensor(labels, dtype=torch.float32) # Labels (binary classification)
def __len__(self):
return len(self.sequences)
def __getitem__(self, idx):
return self.sequences[idx], self.labels[idx]
# Create DataLoader for batch processing
batch_size = 32
train_dataset = DNADataset(X_train, y_train)
test_dataset = DNADataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
class DNAClassifierOneHot(nn.Module):
    def __init__(self, input_dim, num_filters, kernel_size):
        super().__init__()
self.conv = nn.Conv1d(input_dim, num_filters, kernel_size, padding='same')
self.pool = nn.AdaptiveMaxPool1d(1)
self.fc = nn.Sequential(
nn.Linear(num_filters, 32),
nn.ReLU(),
nn.Linear(32, 1),
nn.Sigmoid()
)
def forward(self, x):
x = x.permute(0, 2, 1)
x = self.conv(x)
x = self.pool(x).squeeze(-1)
x = self.fc(x)
return x
input_dim = 4
num_filters = 64
kernel_size = 3
model = DNAClassifierOneHot(input_dim, num_filters, kernel_size)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 50
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
for epoch in range(num_epochs):
model.train()
total_loss = 0
for sequences, labels in train_loader:
sequences, labels = sequences.to(device), labels.to(device)
outputs = model(sequences).squeeze()
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
        total_loss += loss.item()  # running epoch loss (not printed here, but handy for debugging)
model.eval()
correct = 0
total = 0
with torch.no_grad():
for sequences, labels in test_loader:
sequences, labels = sequences.to(device), labels.to(device)
outputs = model(sequences).squeeze()
predictions = (outputs > 0.5).float()
correct += (predictions == labels).sum().item()
total += labels.size(0)
accuracy = correct / total
print(f"Test Accuracy: {100*accuracy:.4f}%")
Test Accuracy: 70.0000%
The CNN model reaches an accuracy of $70.0\%$, which is better than the K-means clustering. However, the CNN is more complex and requires more hyper-parameter tuning. K-means is a simple algorithm that clusters the DNA sequences with decent accuracy. Sometimes, simple is better!