1. The Machinery Within
Inside every cell, life depends on assemblies of proteins coming together like small, deliberate machines. They twist, fold, lock, release. They behave with a kind of calm engineering — a natural machinery hidden in plain sight.
For decades, these molecular machines were largely invisible. Even with powerful experimental methods, it could take years to uncover a single structure. Many in the field believed it would take centuries to map the full landscape of protein complexes.
2. A Profound Shift
Then came a transformation.
Advances in machine learning and artificial intelligence — especially the breakthrough of AlphaFold at the CASP competition — changed the timeline of an entire discipline. What once felt 200 years away began to take shape in front of us, almost overnight.
AI didn't just help us predict structures; it opened doors to scientific inquiries we never imagined we would be able to attempt in our lifetime.
3. Our Work
While success in predicting single protein structures is groundbreaking, the next challenge is to predict how proteins come together to form complexes, where the true biological action often takes place.
Building a deep learning model to predict quaternary structure, the way multiple protein subunits assemble into functional complexes, presents challenges beyond those of predicting individual protein structures. Single-protein models focus on determining the precise 3D shape of an isolated chain; a model of quaternary structure must also capture how multiple proteins interact, assemble, and function as a unit.
We approach this work with attention, and with respect for the hidden patterns that sustain life.
4. Diving Deeper: Our Approach
If we have a set of protein structures from a model organism, \((x_1, x_2, \dots, x_n)\), where each \(x_i \in \mathbb{R}^d\) is a vector representation of a protein chain or domain, then we can group these into \(k\) functional or structural "buckets," \(\mathcal{S} = \{ S_1, S_2, \dots, S_k \}\), so that proteins in the same bucket are structurally similar.
The objective is to minimize the within-cluster sum of squares (WCSS):
$$ \underset{\mathcal{S}}{\arg\min} \; \sum_{i=1}^k \sum_{x \in S_i} \| x - \mu_i \|^2 $$
Here, \(\mu_i\) is the centroid (average embedding) of cluster \(S_i\):
$$ \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x $$
where \(|S_i|\) is the number of proteins in cluster \(S_i\). Intuitively, each cluster represents a structural "theme." The same objective can also be viewed as minimizing the pairwise squared distances between proteins inside each cluster, scaled by cluster size:
$$ \underset{\mathcal{S}}{\arg\min} \; \sum_{i=1}^k \frac{1}{|S_i|} \sum_{x,y \in S_i} \| x - y \|^2 $$
with the equivalence given by the following identity, where the right-hand sum runs over all ordered pairs \(x, y \in S_i\):
$$ |S_i| \sum_{x \in S_i} \| x - \mu_i \|^2 = \tfrac{1}{2} \sum_{x,y \in S_i} \| x - y \|^2 $$
The pairwise objective is therefore exactly twice the WCSS, so both are minimized by the same partition. This becomes a way of organizing the raw geometry of protein space, turning thousands of structures into a handful of interpretable groups we can later test for possible interactions.
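The WCSS objective above is the one that k-means clustering minimizes. The text does not say which algorithm or which embeddings are used, so the following is only a minimal sketch under stated assumptions: it uses Lloyd's heuristic (which finds a local, not necessarily global, minimum), random synthetic vectors in place of real protein embeddings, and hypothetical helper names (`wcss`, `pairwise_form`, `lloyd_kmeans`). It also checks numerically that the pairwise form equals twice the WCSS.

```python
import numpy as np


def wcss(X, labels, k):
    """Within-cluster sum of squares for an assignment of rows of X to k clusters."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        mu = members.mean(axis=0)  # centroid mu_i
        total += float(((members - mu) ** 2).sum())
    return total


def pairwise_form(X, labels, k):
    """Sum over clusters of (1/|S_i|) * sum over ordered pairs of ||x - y||^2."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue
        diffs = members[:, None, :] - members[None, :, :]
        total += float((diffs ** 2).sum()) / len(members)
    return total


def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's heuristic: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points (simple, not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic stand-in for protein embeddings (an assumption): three blobs in R^8.
rng = np.random.default_rng(42)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 8)) for c in centers])

labels, centroids = lloyd_kmeans(X, k=3)
print("WCSS:", wcss(X, labels, 3))
print("pairwise form / 2:", pairwise_form(X, labels, 3) / 2)
```

The two printed values should agree up to floating-point error, whatever partition the heuristic lands on, because the identity holds for any assignment. In practice one would likely run several random initializations and keep the partition with the lowest WCSS, since a single run of Lloyd's heuristic can get stuck in a poor local minimum.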
A bonfire in a strong wind is not blown out, but blazes even brighter.