Optimization and historical contingency in protein sequences
Protein sequences are shaped both by functional optimization and by evolutionary history, i.e. phylogeny. A multiple sequence alignment of homologous proteins contains sequences that evolved from a common ancestral sequence and share similar structure and function. In such an alignment, correlations in amino-acid usage between sites can arise from coevolution driven by structural and functional constraints, but also from historical contingency.
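As a concrete illustration of a correlation between two alignment sites, the sketch below computes the mutual information between two columns of a toy alignment. The function name `column_mutual_information` and the four-sequence alignment are hypothetical and chosen only for illustration. The sketch uses raw frequencies, with no pseudocounts, sequence reweighting or phylogenetic correction, and it is not the method discussed in the talk.

```python
from collections import Counter
from math import log


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences (strings).
    Frequencies are raw counts divided by the number of sequences.
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): columns 0 and 1 covary perfectly (A with K,
# G with R), column 2 is fully conserved, column 3 varies independently.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

mi_covarying = column_mutual_information(toy_msa, 0, 1)
mi_independent = column_mutual_information(toy_msa, 0, 3)
mi_conserved = column_mutual_information(toy_msa, 0, 2)
```

In this toy case, columns 0 and 1 reach the maximum value for two equally frequent states (log 2 nats), while the other pairs give zero. The statistic alone cannot tell whether the A/K versus G/R pattern reflects a constraint between the two sites or simply two clades that each inherited their residues from a different ancestor.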
Correlations arising from phylogeny often confound the coevolutionary signal from functional or structural optimization, impairing the inference of structural contacts from sequences. I will show that inferred Potts models are more robust to these effects than local statistics, which may explain their success. I will argue that phylogenetic correlations can also provide useful information for some inference tasks, in particular for inferring interaction partners among the paralogs of two protein families from their sequences. In this case, signal from phylogeny and signal from constraints combine constructively.
Protein language models have recently been applied to sequence data, greatly advancing structure, function and mutational effect prediction. Language models trained on multiple sequence alignments capture coevolution and structural contacts, but also phylogenetic relationships. I will discuss a method we recently proposed that leverages these models to predict which proteins interact among the paralogs of two protein families, and improves the prediction of the structure of some protein complexes. Finally, I will show that these models have promising generative properties.