Generating new antibody sequences

The mixture model used by AntPack has over 1800 clusters for heavy chains and 1300 for light. Each cluster is itself a distribution that you can sample from. By retrieving the cluster that’s most similar to an input sequence, you can generate new sequences that contain specific features and use this approach to build small synthetic libraries.

First, create a scoring tool::

    from antpack import SequenceScoringTool
    scoring_tool = SequenceScoringTool()

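Next, take your sequence of interest and find the closest cluster(s) to it in the mixture models in the scoring tool. To do this, call score_seqs with mode='assign', which returns the most likely cluster number for each input sequence. Then pass that cluster number to retrieve_cluster to get the parameters for that cluster.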
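The two calls described above can be wrapped in a small helper. This is a minimal sketch, not part of AntPack: the helper name closest_cluster_params is hypothetical, and it assumes that score_seqs with mode='assign' returns one cluster number per input sequence, in input order, as the method description below suggests.

```python
def closest_cluster_params(scoring_tool, sequence, chain_type):
    """Assign `sequence` to its most likely cluster and fetch that cluster.

    `scoring_tool` is expected to behave like antpack.SequenceScoringTool.
    Assumption: with mode="assign", score_seqs returns one cluster number
    per input sequence, in the same order as the input list.
    """
    cluster_ids = scoring_tool.score_seqs([sequence], mode="assign")
    cluster_id = int(cluster_ids[0])
    # retrieve_cluster returns the per-position probabilities, the mixture
    # weight of the cluster, and the amino acid order of the last axis.
    mu_mix, mixweight, aas = scoring_tool.retrieve_cluster(cluster_id, chain_type)
    return cluster_id, mu_mix, mixweight, aas


# Intended usage (requires antpack; `my_heavy_chain` is a placeholder string):
# from antpack import SequenceScoringTool
# scoring_tool = SequenceScoringTool()
# cluster_id, mu_mix, mixweight, aas = closest_cluster_params(
#     scoring_tool, my_heavy_chain, "H")
```

The chain type is passed explicitly because retrieve_cluster needs to know whether to look in the heavy or the light chain mixture.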

class antpack.SequenceScoringTool(offer_classifier_option=False, normalization='none')

Tool for scoring sequences.

retrieve_cluster(cluster_id, chain_type)

A convenience function to get the per-position probabilities associated with a particular cluster.

Parameters:

cluster_id (int) – The id number of the cluster to retrieve. Can be generated by calling self.get_closest_clusters or self.score_seqs with mode = “assign”.
chain_type (str) – One of “H”, “L”.

Returns:

mu_mix (np.ndarray) – An array of shape (1, sequence_length, 21), where 21 is the number of possible AAs. The clusters are sorted in order from most to least likely given the input sequence.
mixweights (float) – The probability of this cluster in the mixture.
aas (list) – A list of amino acids in standard order. The last dimension of mu_mix corresponds to these aas in the order given.

score_seqs(seq_list, mask_cdr3: bool = False, custom_light_mask: list | None = None, custom_heavy_mask: list | None = None, mask_terminal_dels: bool = False, mask_gaps: bool = False, mode: str = 'score')

Scores a list of sequences in batches or assigns them to clusters. Can be used in conjunction with a user-supplied mask (for positions to ignore). Substantially faster than single-sequence scoring, but does not offer the option to retrieve diagnostic info. Can also be used to assign a large number of sequences to clusters.

Parameters:

seq_list (list) – The list of input sequences, each a string. May contain both heavy and light chains.
mask_cdr3 (bool) – If True, ignore IMGT-defined CDR3 when assigning a score. CDR3 is not distinctive across species so this is often useful. Ignored if mode is ‘assign’ or ‘assign_no_weights’.
custom_light_mask (list) – Either None or a list of strings indicating IMGT positions to ignore. Use self.get_standard_positions and/or self.get_standard_mask to construct a mask. This can be useful if you just want to score a specific region, or if there is a large deletion that should be ignored.
custom_heavy_mask (list) – Either None or a list of strings indicating IMGT positions to ignore. Use self.get_standard_positions and/or self.get_standard_mask to construct a mask. This can be useful if you just want to score a specific region, or if there is a large deletion that should be ignored.
mask_terminal_dels (bool) – If True, N and C-terminal deletions are masked when calculating a score or assigning to a cluster. Useful if there are large unusual deletions at either end of the sequence that you would like to ignore when scoring.
mask_gaps (bool) – If True, all non-filled IMGT positions in the sequence are ignored when calculating the score. This is useful when your sequence has unusual deletions and you would like to ignore these.
mode (str) – One of ‘score’, ‘assign’, ‘assign_no_weights’, ‘classifier’. If ‘score’, returns the human generative model score. If ‘assign’, provides the most likely cluster number for each input sequence. If ‘assign_no_weights’, assigns the closest cluster ignoring mixture weights, so that the closest cluster is assigned even if that cluster is a low-probability one. If ‘classifier’, assigns a score using the Bayes’ rule classifier, which also takes into account some info regarding other species. ‘classifier’ is not a good way to score sequences in general because it only works well for sequences of known origin, so it should only be used for testing.

Returns:

output_scores (np.ndarray) – log( p(x) ) for all input sequences.
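To make the difference between the ‘score’, ‘assign’ and ‘assign_no_weights’ modes concrete, here is a toy sketch in plain Python. It assumes each cluster is a set of independent per-position categorical distributions, as the description of retrieve_cluster suggests. It is an illustration of the idea only, not AntPack's implementation: the function names and the two-letter alphabet are made up, and it ignores masking and numbering.

```python
import math


def log_likelihoods(seq, clusters, aas):
    """Log-likelihood of an aligned sequence under each cluster.

    clusters[k][i][j] is the probability of amino acid aas[j] at position i
    in cluster k (toy stand-in for the model parameters; all values > 0).
    """
    index = {aa: j for j, aa in enumerate(aas)}
    return [
        sum(math.log(cluster[i][index[aa]]) for i, aa in enumerate(seq))
        for cluster in clusters
    ]


def score(seq, clusters, weights, aas):
    """log p(x): log-sum-exp over clusters of log weight + log-likelihood."""
    lls = log_likelihoods(seq, clusters, aas)
    terms = [math.log(w) + ll for w, ll in zip(weights, lls)]
    top = max(terms)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def assign(seq, clusters, weights, aas, use_weights=True):
    """Index of the best cluster; use_weights=False ignores mixture weights."""
    lls = log_likelihoods(seq, clusters, aas)
    if use_weights:
        lls = [math.log(w) + ll for w, ll in zip(weights, lls)]
    return max(range(len(lls)), key=lambda k: lls[k])


# Toy model: two clusters, two positions, alphabet of two letters.
toy_aas = ["A", "C"]
toy_clusters = [
    [[0.9, 0.1], [0.9, 0.1]],  # cluster 0, common (weight 0.9)
    [[0.2, 0.8], [0.2, 0.8]],  # cluster 1, rare (weight 0.1)
]
toy_weights = [0.9, 0.1]

print(assign("AC", toy_clusters, toy_weights, toy_aas))                     # weighted
print(assign("AC", toy_clusters, toy_weights, toy_aas, use_weights=False))  # unweighted
print(math.exp(score("AC", toy_clusters, toy_weights, toy_aas)))            # p(x)
```

In this toy model the sequence "AC" fits the rare cluster slightly better position by position, so the unweighted assignment picks cluster 1, while the weighted assignment picks the much more common cluster 0.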

Each cluster you retrieve is a distribution and you can easily sample from it or use it to visualize which sections of your input sequence are least human / most problematic. See the example notebooks on the main page to see how to do this.
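As a rough illustration of both uses, the sketch below samples sequences from a per-position probability table and reports the log-probability of each residue of an input sequence under that table. It is not taken from the example notebooks. The function names are hypothetical, position_probs stands for mu_mix[0] (one row of probabilities per position), and the input sequence is assumed to be already aligned to the same positions. If aas contains a gap symbol, sampled sequences may contain it and it would need to be removed before further use.

```python
import math
import random


def sample_sequences(position_probs, aas, n_samples, seed=None):
    """Draw sequences by sampling each position independently."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(aas, weights=probs, k=1)[0] for probs in position_probs)
        for _ in range(n_samples)
    ]


def per_position_log_probs(seq, position_probs, aas):
    """Log-probability of each residue of an aligned sequence.

    Very negative values flag positions that are unusual for this cluster.
    Probabilities are floored at 1e-300 to avoid taking log(0).
    """
    index = {aa: j for j, aa in enumerate(aas)}
    return [
        math.log(max(probs[index[aa]], 1e-300))
        for aa, probs in zip(seq, position_probs)
    ]


# Toy table: three positions, alphabet of three letters.
toy_aas = ["A", "C", "D"]
toy_probs = [
    [1.0, 0.0, 0.0],  # position 0 is always A
    [0.0, 1.0, 0.0],  # position 1 is always C
    [0.5, 0.5, 0.0],  # position 2 is A or C, never D
]

samples = sample_sequences(toy_probs, toy_aas, n_samples=5, seed=0)
log_probs = per_position_log_probs("ACD", toy_probs, toy_aas)
worst_position = min(range(len(log_probs)), key=lambda i: log_probs[i])
```

With a real cluster, plotting the output of per_position_log_probs against position is one simple way to see which parts of an input sequence the cluster considers least likely.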

It is important to realize that sampling from a specified cluster can generate low-probability sequences, i.e. sequences that are unlikely to be human, in the same way that sampling from a normal distribution can occasionally generate outliers. It may be useful to score any sequences generated in this fashion using the scoring tool to ensure they are highly human. Alternatively, you may want to use the highest-probability amino acid at each position in a selected cluster except for the CDRs and sample from the distribution for those regions specifically.
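Both suggestions can be sketched in a few lines. The code below is an illustration under stated assumptions, not an AntPack feature: the function names are hypothetical, position_probs stands for mu_mix[0], cdr_positions is a user-supplied list of row indices that fall inside the CDRs (which you would need to derive from the IMGT numbering yourself), and min_score is a threshold you choose. The filter assumes score_seqs returns one score per input sequence; any gap symbols should be removed from generated sequences before scoring.

```python
import random


def sample_cdrs_only(position_probs, aas, cdr_positions, n_samples, seed=None):
    """Use the most probable residue outside the CDRs and sample inside them."""
    rng = random.Random(seed)
    # Highest-probability amino acid at every position.
    consensus = [
        aas[max(range(len(aas)), key=lambda j: probs[j])]
        for probs in position_probs
    ]
    sequences = []
    for _ in range(n_samples):
        residues = list(consensus)
        for i in cdr_positions:
            residues[i] = rng.choices(aas, weights=position_probs[i], k=1)[0]
        sequences.append("".join(residues))
    return sequences


def keep_high_scoring(scoring_tool, sequences, min_score):
    """Keep only sequences whose score is at least `min_score`.

    `scoring_tool` is expected to behave like antpack.SequenceScoringTool.
    """
    scores = scoring_tool.score_seqs(sequences, mode="score")
    return [seq for seq, s in zip(sequences, scores) if s >= min_score]


# Toy table: position 1 plays the role of a CDR position.
toy_aas = ["A", "C", "D"]
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.1, 0.1, 0.8],
]
library = sample_cdrs_only(toy_probs, toy_aas, cdr_positions=[1], n_samples=4, seed=0)
```

Fixing the framework to the most probable residues keeps the generated sequences close to the center of the cluster, so the diversity of the library comes only from the positions you chose to sample.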