Sequence scoring and clustering

In this notebook, we illustrate how to use the SequenceScoringTool to score sequences for humanness, assign them to clusters in the humanness model, and retrieve the corresponding cluster for a sequence.

[1]:

from antpack import SequenceScoringTool, SingleChainAnnotator
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

score_tool = SequenceScoringTool(normalization="none")

my_sequence = "DIELTQSPASLSASVGETVTITCQASENIYSYLAWHQQKQGKSPQLLVYNAKTLAGGVSSRFSGSGSGTHFSLKIKSLQPEDFGIYYCQHHYGILPTFGGGTKLEIK"

Note that we don’t have to tell score_tool whether the sequence is heavy or light – it will figure that out using SingleChainAnnotator.

There are various flags we can set to exclude gaps, terminal deletions, CDR3, or a user-defined list of positions. Excluding cdr3 especially for heavy chain improves the model’s ability to distinguish between human and nonhuman very slightly. Excluding other regions can be useful if 1) the sequence has large deletions we’d like to ignore or 2) we just want to score one specific region, e.g. a framework region. We can get a mask that excludes all except a specific region by calling get_standard_mask, e.g. score_tool.get_standard_mask(chain_type="H", region="framework_1") will get a mask that excludes all regions of the sequence except IMGT-defined framework 1. This can then be passed to score_seqs.

[2]:

scores = score_tool.score_seqs([my_sequence], mask_terminal_dels = False, mask_gaps = False,
                             mask_cdr3 = False, mode="score", custom_light_mask = None,
                              custom_heavy_mask = None)
print(scores)

[-144.48811766]

The sequence scoring tool can be set to normalize scores, but this is mostly useful if combining scores across different chains or scoring specific regions. Here we are using an unnormalized score. Refer to the “Humanness evaluation” section of the docs (see main page) and you’ll notice that -144 is low. This sequence is not very human. That makes sense because this is the light chain of Abagovamab, a mouse antibody. Next, we’ll find the closest cluster to the sequence in the light chain model (since this is a light chain) and see what this can tell us about what regions of the sequence are less human.

[3]:

best_cluster_idx, chain_name = score_tool.get_closest_clusters(my_sequence, nclusters = 1, return_ids_only = True)

It looks like the closest cluster to this sequence is cluster 343.

[4]:

print(best_cluster_idx)

[343]

Let’s pull cluster 343.

[5]:

cluster_probs, cluster_weight, aa_list = score_tool.retrieve_cluster(343, chain_type = chain_name)

Each cluster in the mixture model is itself a probability distribution of length 141 for light chains and length 173 for heavy (the length of the IMGT numbered input sequence, after incorporating positions for insertions that are common in the training set even if not present in this particular sequence).

[6]:

cluster_probs.shape

[6]:

(1, 141, 21)

If we visualize this particular cluster, it looks like this. Note again that it’s a distribution and we could in principle sample from it to generate more sequences.

[7]:

plt.imshow(cluster_probs[0,:,:], aspect="auto", interpolation="none")
plt.ylabel("position")
plt.xlabel("Amino acid")
plt.xticks(np.arange(21), aa_list)
plt.colorbar()

[7]:

<matplotlib.colorbar.Colorbar at 0x76bf209e4d00>

We want to see the probability of each amino acid in the input sequence given this cluster, ignoring the IMGT positions that are common in some training sequences but are not filled here. To do that, we’ll use score_tool to convert the input sequence to an array, and mask any position in this array that has the value 20 (20 corresponds to gap).

[8]:

chain_name, seq_arr = score_tool.convert_sequence_to_array(my_sequence)
print(seq_arr)

seq_arr = seq_arr.flatten()
mask = seq_arr != 20
relevant_probs = cluster_probs[0,mask,:]
seq_arr = seq_arr[mask]

[[ 2  7  3  9 16 13 15 12  0 15  9 15  0 15 17  5  3 16 17 16  7 16  1 13
   0 15  3 11  7 20 20 20 20 20 20 20 20 20 20 19 15 19  9  0 18  6 20 13
  13  8 13  5  8 15 12 13  9  9 17 19 11  0 20 20 20 20 20 20 20  8 16  9
   0  5  5 17 15 20 15 14  4 15  5 15  5 20 20 15  5 16  6  4 15  9  8  7
   8 15  9 13 12  3  2  4  5  7 19 19 20  1 13  6  6 19  5 20 20 20 20 20
  20 20 20 20 20 20 20  7  9 12 16  4  5  5  5 16  8  9  3  7  8]]

[ ]:

[9]:

plt.imshow(relevant_probs, aspect="auto", interpolation="none")
plt.ylabel("position")
plt.xlabel("Amino acid")
plt.xticks(np.arange(21), aa_list)
plt.colorbar()

[9]:

<matplotlib.colorbar.Colorbar at 0x76bf1f764a00>

Better. Now let’s see the individual score for each amino acid in the sequence.

[10]:

plt.style.use("bmh")
plt.bar(np.arange(seq_arr.shape[0]), relevant_probs[np.arange(seq_arr.shape[0]), seq_arr])
plt.ylabel("Probability")
plt.xlabel("Position in sequence")
plt.title("Probability of each amino acid in my_sequence\ngiven cluster 343")
plt.show()

plt.bar(np.arange(seq_arr.shape[0]), np.log(relevant_probs[np.arange(seq_arr.shape[0]), seq_arr]), color="red")
plt.ylabel("Log probability")
plt.xlabel("Position in sequence")
plt.title("Log probability of each amino acid in my_sequence\ngiven cluster 343")
plt.show()

[11]:

numbering, _, _, _ = score_tool.aligner.analyze_seq(my_sequence)

[12]:

worst_position = np.argmin(np.log(relevant_probs[np.arange(seq_arr.shape[0]), seq_arr]))
print(my_sequence[73])
print(numbering[73])

K
90

The most unfavorable position in the sequence is 73 (numbering from zero) in the original sequence, which corresponds to IMGT 90. Here the sequence has K, while T is much more common in humans both in general and in this cluster.

[13]:

_ = plt.bar(np.arange(21), relevant_probs[73,:])
_ = plt.xticks(np.arange(21), aa_list)

[ ]:

Notice that some positions have very high probability, others do not. One way to humanize this sequence would be to make it more like cluster 343, i.e. to change some of the amino acids in the sequence to an amino acid that’s more likely at that position given cluster 343. We could use any other cluster in the light mixture model, of course, and it’s often a good idea to look at what some other clusters close to this sequence look like as well. We can call get_closest_clusters to get e.g. the 2 or 3 clusters closest to this sequence.

Alternatively, we can use the built-in humanization tool, which uses a very generic procedure that we will illustrate here.

Of these two approaches, we recommend the more tedious DIY extract close clusters and figure out which residues you’d like to change approach. Humanization involves some important tradeoffs, and it’s a good idea to think about these carefully. Still, we’ll illustrate the build-in generic approach here just for fun.

[14]:

# Any value <= 1 for s_thresh will yield a straight CDR graft. That will improve humanness as much as
# possible, but will change large regions of the sequence.
from antpack import HumanizationTool

h_tool = HumanizationTool()
score1, mutations1, sequence1 = h_tool.suggest_mutations(my_sequence, s_thresh = 1)

score2, mutations2, sequence2 = h_tool.suggest_mutations(my_sequence, s_thresh = 1.1)
score3, mutations3, sequence3 = h_tool.suggest_mutations(my_sequence, s_thresh = 1.25)

[15]:

print(f"Straight graft: {len(mutations1)} changes to original sequence, score {score1}")
print(f"1.10 threshold: {len(mutations2)} changes to original sequence, score {score2}")
print(f"1.25 threshold: {len(mutations3)} changes to original sequence, score {score3}")

Straight graft: 19 changes to original sequence, score -77.4538374417135
1.10 threshold: 15 changes to original sequence, score -83.44778045374068
1.25 threshold: 10 changes to original sequence, score -96.68713332179915

Any of these options yields a large improvement in the humanness score for this chain, but with different numbers of mutations to the input.

[ ]:

[ ]: