The question we are trying to answer is deceptively simple to state: given a set of proteins, which ones form complexes, how do those complexes assemble, and what do they look like? The difficulty is that the answer depends on geometry, chemistry, evolutionary history, and cellular context simultaneously. No single representation captures all of it. Our approach is to build representations that capture enough — and to be rigorous about what "enough" means at each stage of the problem.

Protein embeddings

The foundation of our technical work is the protein embedding: a dense, continuous vector that encodes the structural and functional properties of a protein in a form amenable to computation. We draw on large pretrained language models — trained on hundreds of millions of protein sequences — and adapt their representations through fine-tuning on structure-annotated datasets. The resulting embeddings place structurally similar proteins nearby in vector space, even when their sequences have diverged beyond recognition. They encode not just fold type, but surface chemistry, predicted binding propensity, and evolutionary family membership. This is the substrate on which everything else is built.
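
To make "nearby in vector space" concrete, here is a minimal sketch of a nearest-neighbour lookup over embeddings. It assumes cosine similarity as the closeness measure, which the text does not specify. The three-dimensional vectors and the protein names are hypothetical toy values, not outputs of any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(query_id, embeddings, k=2):
    """Return the k proteins most similar to query_id as (id, similarity) pairs."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings: A and B point in similar directions, C does not.
toy_embeddings = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.8, 0.2, 0.1],
    "protein_C": [0.0, 0.1, 0.9],
}

print(nearest_neighbours("protein_A", toy_embeddings, k=1))
```

In this toy setting protein_B is returned as the closest neighbour of protein_A. Real embeddings would have hundreds or thousands of dimensions, and a production system would likely use an approximate nearest-neighbour index rather than a full scan.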

Critically, we embed not just individual proteins but protein regions: domains, surface patches, and interface-adjacent residues. Interaction happens at surfaces, and a whole-protein embedding that averages over every surface loses the spatial specificity that determines whether two proteins can bind. Our embedding pipeline produces hierarchical representations at the protein, domain, and patch levels, each of which can be queried at the resolution appropriate to a given prediction task.
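
The three levels can be illustrated with a small sketch. It assumes per-residue vectors are available and that each level is obtained by mean pooling over a set of residues. Both are assumptions made for illustration, as are the function names, the domain boundaries, and the patch indices.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    if not vectors:
        raise ValueError("cannot pool an empty set of residues")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hierarchical_embeddings(residue_vectors, domains, patches):
    """Build protein-, domain-, and patch-level embeddings from per-residue vectors.

    domains maps a name to a half-open residue range (start, end).
    patches maps a name to a list of residue indices, which need not be contiguous.
    """
    return {
        "protein": mean_pool(residue_vectors),
        "domain": {
            name: mean_pool(residue_vectors[start:end])
            for name, (start, end) in domains.items()
        },
        "patch": {
            name: mean_pool([residue_vectors[i] for i in indices])
            for name, indices in patches.items()
        },
    }


# Hypothetical four-residue protein with two-dimensional residue vectors.
residue_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
levels = hierarchical_embeddings(
    residue_vectors,
    domains={"domain_1": (0, 2), "domain_2": (2, 4)},
    patches={"patch_a": [1, 2]},
)

print(levels["protein"])             # [0.5, 0.5]
print(levels["domain"]["domain_1"])  # [1.0, 0.0]
print(levels["domain"]["domain_2"])  # [0.0, 1.0]
```

The toy output shows the loss of specificity described above: the two domains have clearly different embeddings, but the protein-level average sits halfway between them and distinguishes neither.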

The goal is not a single model that does everything. It is a series of models that, together, make the problem tractable — each one reducing the search space the next must cover.

Clustering structural space

With embeddings in hand, we apply clustering methods to organise the space of known proteins into interpretable groups. The motivation is twofold. First, clustering imposes structure on a search problem that would otherwise be combinatorially intractable: rather than considering all pairs of proteins as potential interaction candidates, we can prioritise pairs that sit in clusters with known interaction histories. Second, clusters serve as interpretable units of analysis — they correspond, roughly, to protein families and superfamilies, and studying which clusters interact with which gives us a coarse-grained map of the protein interaction network that guides model design.
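
Both uses of clustering can be sketched in a few lines. The example assumes cluster labels have already been assigned by some clustering method (the text does not name one) and that a list of known interacting pairs is available. The protein identifiers, the cluster names, and the ranking by raw interaction count are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def cluster_interaction_map(cluster_of, known_interactions):
    """Count known interactions between each unordered pair of clusters."""
    counts = Counter()
    for a, b in known_interactions:
        key = tuple(sorted((cluster_of[a], cluster_of[b])))
        counts[key] += 1
    return counts


def prioritised_candidates(cluster_of, known_interactions):
    """List untested protein pairs whose clusters have at least one known interaction.

    Pairs are ranked by how many known interactions link their two clusters.
    """
    counts = cluster_interaction_map(cluster_of, known_interactions)
    already_known = {frozenset(pair) for pair in known_interactions}
    candidates = []
    for a, b in combinations(sorted(cluster_of), 2):
        if frozenset((a, b)) in already_known:
            continue
        key = tuple(sorted((cluster_of[a], cluster_of[b])))
        support = counts.get(key, 0)
        if support > 0:
            candidates.append((a, b, support))
    candidates.sort(key=lambda item: item[2], reverse=True)
    return candidates


# Hypothetical cluster assignments and a single known interaction.
cluster_of = {
    "P1": "cluster_x",
    "P2": "cluster_x",
    "P3": "cluster_y",
    "P4": "cluster_y",
    "P5": "cluster_z",
}
known_interactions = [("P1", "P3")]

for a, b, support in prioritised_candidates(cluster_of, known_interactions):
    print(a, b, support)
```

Of the ten possible pairs among five proteins, only the three untested pairs that bridge cluster_x and cluster_y are proposed. The `Counter` returned by `cluster_interaction_map` is the coarse-grained cluster-to-cluster map in its simplest form.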

Complex structure prediction

Our model architecture is designed to predict quaternary structure from these representations. Given two or more proteins, represented at the appropriate embedding resolution, the model predicts interface geometry: which surfaces come into contact, in what relative orientation, and with what confidence. We train on experimentally determined complex structures from the PDB, including the growing set resolved by cryo-EM, with particular attention to symmetrical homo-oligomers and well-studied hetero-complexes where ground truth is reliable. The architecture attends both to the embedding-level representations and to explicit structural coordinates where available, reasoning jointly about sequence, structure, and the geometric constraints that physical binding imposes.
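
One way to picture the "which surfaces come into contact" target is as a set of residue pairs that lie close together in an experimentally determined complex. The sketch below assumes one representative coordinate per residue (for example the C-alpha atom) and a distance cutoff of 8 Å, a commonly used convention. Neither choice is stated in the text, and this is not a description of the model itself, only of how contact labels could be derived from coordinates.

```python
import math


def interface_contacts(chain_a, chain_b, cutoff=8.0):
    """Return (i, j) index pairs where residue i of chain_a lies within cutoff of residue j of chain_b.

    Each chain is a list of (x, y, z) coordinates in angstroms, one per residue.
    The 8.0 default is an assumed convention, not a value taken from the text.
    """
    contacts = []
    for i, coord_a in enumerate(chain_a):
        for j, coord_b in enumerate(chain_b):
            if math.dist(coord_a, coord_b) <= cutoff:
                contacts.append((i, j))
    return contacts


# Hypothetical coordinates: chain_b's first residue sits beside the start of chain_a,
# and its second residue is far from everything.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_b = [(0.0, 6.0, 0.0), (30.0, 0.0, 0.0)]

contacts = interface_contacts(chain_a, chain_b)
print(contacts)  # [(0, 0), (1, 0)]
```

The all-pairs loop is quadratic in chain length, which is acceptable for a sketch. Larger complexes would normally use a spatial index to find neighbours.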

We do not expect to solve the full problem in a single model. Our near-term goal is accurate prediction of binary interactions and approximate interface geometry for proteins within known structural families. This is a tractable sub-problem that is useful for drug discovery in its own right and that generates the training signal needed to extend coverage to novel folds. Each iteration expands the scope, and each expansion reveals the next question the model is not yet equipped to answer.