VJ Gene Assignment

In this example, we’ll see how to find the closest V and J genes in AntPack’s database for an input sequence, get the sequences of those VJ genes, and see the date when AntPack’s database was last updated.

It’s important that the VJ tool and the SingleChainAnnotator both use the same scheme. Using different schemes for each could result in bad assignments. If you’re interested in looking at TCRs, you don’t have to pass anything different to this class – it can automatically look at TCRs as well.

[1]:
from antpack import VJGeneTool, SingleChainAnnotator

vj_tool = VJGeneTool(scheme="imgt")
sc_aligner = SingleChainAnnotator(scheme="imgt")
[2]:
test_sequence = "VQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARYDGIYGELDFWGQGTLVTVSS"

We must first number the sequence first using SingleChainAnnotator, and then pass that numbering to vj_tool.assign_vj_genes(). Note that if we enter a sequence with invalid amino acids (e.g. -), or we supply an invalid species (not one of ‘human’, ‘mouse’, ‘alpaca’), or we supply other invalid input, the tool will return “” for both V and J genes.

assign_vj_genes takes four arguments. The third indicates the species whose vj genes we would like to use. You can pass "unknown" if you want it to check all available species. Note that this is slower unsurprisingly, so if you’re working with a large dataset and speed is your goal, you’re better off passing a specific species if you already know which species to look for. The fourth indicates whether we would like to use percent identity or e-value to determine which V- or J-genes are most similar. Calculating the e-value is done using a scoring matrix (in this case, BLOSUM62), and for simplicity, the v_identity and j_identity that are returned when using e-value are the BLOSUM scores (which can easily be converted to e-values). We’ll use identity here for simplicity.

[3]:
annotation = sc_aligner.analyze_seq(test_sequence)
v_gene, j_gene, v_percent_identity, j_percent_identity, species = vj_tool.assign_vj_genes(annotation, test_sequence,
                                                                                 "human", "identity")
[4]:
print(v_gene)
print(j_gene)
print(v_percent_identity)
print(j_percent_identity)
print(species)
IGHV1-69*01_IGHV1-69*12_IGHV1-69*13_IGHV1-69*19_IGHV1-69D*01
IGHJ4*01_IGHJ4*02_IGHJ4*03_IGHJ5*01_IGHJ5*02
0.8877551020408163
0.8571428571428571
human

Notice that AntPack here returned multiple v-genes, all separated by a “_” delimiter. That’s because these v-genes all had the same percent identity, so there is no reason to assume one is “more correct” than the other. It’s less likely to have a tie like this when using e-value (but still possible).

If we need to, we can see what the sequence of these genes are. AntPack stores those sequences in its internal db pre-aligned using the IMGT numbering scheme, so each sequence will be length 128 with gaps inserted as appropriate. We can use SingleChainAnnotator to number the sequence, then we can do some simple manipulation to convert the numbered sequence to the same format so we can see how well it lines up. The IMGT scheme contains 128 positions (any letters above and beyond this are designated with a letter), so when annotating our input sequence to get it to match up to the V-gene, we just extract numbered positions where the number is in 1 through 128 as illustrated below.

[5]:
# Get the V and J gene sequences
vgene_seq = vj_tool.get_vj_gene_sequence("IGHV1-69*01", "human")
jgene_seq = vj_tool.get_vj_gene_sequence("IGHJ4*01", "human")

# Now let's prep our input sequence so it can be directly compared to the V and J gene.

ntool = SingleChainAnnotator()
numbering, _, _, _ = ntool.analyze_seq(test_sequence)

formatted_seq = ["-" for i in range(128)]
expected_positions = {str(i) for i in range(128)}

for ntoken, letter in zip(numbering, test_sequence):
    if ntoken in expected_positions:
        # We have to subtract 1 here because Python numbers from 0, IMGT numbers from 1.
        formatted_seq[int(ntoken) - 1] = letter

print(vgene_seq)
print(jgene_seq)
print("".join(formatted_seq))
QVQLVQSGA-EVKKPGSSVKVSCKASGGTF----SSYAISWVRQAPGQGLEWMGGIIPI--FGTANYAQKFQ-GRVTITADESTSTAYMELSSLRSEDTAVYYCAR----------------------
------------------------------------------------------------------------------------------------------------------FDYWGQGTLVTVSS
-VQLVQSGA-EVKKPGSSVKVSCKASGGTF----SSYAISWVRQAPGQGLEWMGGI--------IPIFQKFQ-GRVTITADESTSTAYMELSSLRSEDTAVYYCARYDGI-YGELDFWGQGTLVTVS-

If for some reason it is desired, we can pull AntPack’s full IMGT database to do some additional manipulations as illustrated below. This will be a tuple where the second element is lists of gene names by receptor type and the first element is lists of gene sequences by receptor type.

[6]:
full_db = vj_tool.get_seq_lists()
print(full_db[0]["human_IGHJ"])
print(full_db[1]["human_IGHJ"])
['------------------------------------------------------------------------------------------------------------------FQHWGQGTLVTVSS', '------------------------------------------------------------------------------------------------------------------FDLWGRGTLVTVSS', '------------------------------------------------------------------------------------------------------------------FDVWGQGTMVTVSS', '------------------------------------------------------------------------------------------------------------------FDIWGQGTMVTVSS', '------------------------------------------------------------------------------------------------------------------FDYWGQGTLVTVSS', '------------------------------------------------------------------------------------------------------------------FDYWGQGTLVTVSS', '------------------------------------------------------------------------------------------------------------------FDYWGQGTLVTVSS', '------------------------------------------------------------------------------------------------------------------FDSWGQGTLVTVSS', '------------------------------------------------------------------------------------------------------------------FDPWGQGTLVTVSS', '------------------------------------------------------------------------------------------------------------------MDVWGQGTTVTVSS', '------------------------------------------------------------------------------------------------------------------MDVWGQGTTVTVSS', '------------------------------------------------------------------------------------------------------------------MDVWGKGTTVTVSS', '------------------------------------------------------------------------------------------------------------------MDVWGKGTTVTVSS', '------------------------------------------------------------------------------------------------------------------MDVWGKGTTVTVAS']
['IGHJ1*01', 'IGHJ2*01', 'IGHJ3*01', 'IGHJ3*02', 'IGHJ4*01', 'IGHJ4*02', 'IGHJ4*03', 'IGHJ5*01', 'IGHJ5*02', 'IGHJ6*01', 'IGHJ6*02', 'IGHJ6*03', 'IGHJ6*04', 'IGHJ6*05']

Finally, let’s see when AntPack’s VJ database was last updated. Note that AntPack’s VJ database is pulled from IMGT’s but with some exclusions – we exclude for example genes where the functionality is not “F” or the gene is partial. This also indicates which species and receptors are currently supported.

[7]:
vj_tool.retrieve_db_dates()
[7]:
{'mouse': {'IGHJ': '2025-03-14',
  'TRGJ': '2025-03-14',
  'IGKV': '2025-03-14',
  'TRBJ': '2025-03-14',
  'IGKJ': '2025-03-14',
  'TRDV': '2025-03-14',
  'TRAJ': '2025-03-14',
  'TRDJ': '2025-03-14',
  'IGHV': '2025-03-14',
  'TRAV': '2025-03-14',
  'IGLJ': '2025-03-14',
  'TRBV': '2025-03-14',
  'IGLV': '2025-03-14',
  'TRGV': '2025-03-14'},
 'human': {'TRBV': '2025-03-14',
  'IGHV': '2025-03-14',
  'IGKV': '2025-03-14',
  'IGLV': '2025-03-14',
  'TRAJ': '2025-03-14',
  'TRGV': '2025-03-14',
  'TRDV': '2025-03-14',
  'TRGJ': '2025-03-14',
  'IGHJ': '2025-03-14',
  'TRBJ': '2025-03-14',
  'TRAV': '2025-03-14',
  'IGLJ': '2025-03-14',
  'IGKJ': '2025-03-14',
  'TRDJ': '2025-03-14'},
 'alpaca': {'IGHV': '2025-03-14', 'IGHJ': '2025-03-14'},
 'rabbit': {'IGLJ': '2025-03-14',
  'IGKV': '2025-03-14',
  'IGKJ': '2025-03-14',
  'IGLV': '2025-03-14',
  'IGHV': '2025-03-14',
  'IGHJ': '2025-03-14'}}
[ ]:

[ ]:

[ ]: