Assigning V and J genes
AntPack can find the most similar V and J genes by looking for the V and J amino acid sequences from a species of interest that have the highest percent identity. Some other tools instead generate a probability for each possible recombination scenario; for a detailed analysis, especially of repertoire data, those tools may be better. If you just want the V and J genes most similar to your amino acid sequence, AntPack can do this easily.
To do this, use the VJGeneTool. The tool reports the names of the most similar V and J genes for an input sequence. It can determine similarity using either percent identity or e-value (using the assigned numbering as the alignment). It can search a prespecified species or, if preferred, all species in its database (human, alpaca, mouse, rabbit). You will need to number the sequence first; this can be done with another tool but is most easily done with AntPack. If desired, you can then retrieve the sequences of the assigned V and J genes.
Many tools try to assign a single V-gene and J-gene. This is not always appropriate: it is not uncommon for more than one V-gene or J-gene to have the same percent identity or very similar e-values. Some germline genes also have different DNA sequences but the same amino acid sequence. In these cases, AntPack returns all of the V-genes and J-genes that achieved the same (or essentially equivalent) score as a single string, with the names separated by the character “_”.
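Because tied genes come back as one string, downstream code usually needs to split it into individual names. A minimal sketch, assuming that gene names themselves never contain “_” (this holds for IMGT-style names such as IGHV3-23*01, but it is an assumption here). split_gene_names is a hypothetical helper, not part of AntPack:

```python
def split_gene_names(assigned: str) -> list[str]:
    """Split an '_'-delimited gene string into individual gene names."""
    # Dropping empty pieces means an empty input string yields an empty
    # list rather than [''].
    return [name for name in assigned.split("_") if name]


# Two alleles that tied for the best score, then a single unambiguous gene.
print(split_gene_names("IGHV3-23*01_IGHV3-23D*01"))
print(split_gene_names("IGKJ1*01"))
```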
In some versions of AntPack, there was a supported option to use the OGRDB database in place of IMGT. This option is deprecated as of v0.3.8. Note that some sequences found in IMGT VQuest are excluded (pseudogenes, partial sequences, sequences that do not have ‘F’ in the functionality section etc.), so not all V and J genes in the IMGT db are in AntPack. You can find out when AntPack’s db was last updated using the VJGeneTool (see below).
Also see the example, which illustrates how to use these capabilities.
- class antpack.VJGeneTool(scheme='imgt')
Contains functionality needed to find the closest VJ genes for a given amino acid sequence and retrieve the sequence for those VJ genes. The date of the database used for these assignments is stored and can be retrieved if needed.
- __init__(scheme='imgt')
Class constructor.
- Parameters:
scheme (str) – One of ‘aho’, ‘imgt’, ‘kabat’ or ‘martin’. Determines the numbering scheme that will be used when assigning V and J genes based on alignments.
- assign_vj_genes
Assigns V and J genes for a sequence which has already been numbered, preferably by AntPack but potentially by some other tool. The database and numbering scheme specified when creating the object are used. CAUTION: Make sure the scheme used to number the sequence is the same as the scheme used to create the VJGeneTool.
- Parameters:
alignment (tuple) – A tuple containing (numbering, percent_identity, chain_name, error_message). This tuple is what you will get as output if you pass sequences to the analyze_seq method of SingleChainAnnotator or PairedChainAnnotator.
sequence (str) – A sequence containing the usual 20 amino acids – no gaps. X is also allowed but should be used sparingly.
species (str) – Currently must be one of ‘human’, ‘mouse’, ‘alpaca’, ‘rabbit’ or ‘unknown’. For TCRs only ‘human’, ‘mouse’, ‘unknown’ are allowed. If ‘unknown’, all species are checked to find the closest match. Note that ‘unknown’ will be slightly slower for this reason.
mode (str) – One of ‘identity’, ‘evalue’. If ‘identity’ the highest percent identity sequence(s) are identified. If ‘evalue’ the lowest e-value (effectively best BLOSUM score) sequence(s) are identified.
- Returns:
v_gene (str) – The closest V-gene name(s). If more than one V-gene achieves the same score, the names are returned as a single string delimited with ‘_’.
j_gene (str) – The closest J-gene name(s). If more than one J-gene achieves the same score, the names are returned as a single string delimited with ‘_’.
v_pident (float) – If mode is ‘identity’, the number of positions at which the numbered sequence matches the V-gene divided by the total number of non-blank positions in the V-gene. If mode is ‘evalue’, the best BLOSUM score (this can be converted to an e-value).
j_pident (float) – If mode is ‘identity’, the number of positions at which the numbered sequence matches the J-gene divided by the total number of non-blank positions in the J-gene. If mode is ‘evalue’, the best BLOSUM score (this can be converted to an e-value).
species (str) – The species. This will be the same as the input species UNLESS your specified input species is unknown, in which case the species that was identified will be returned.
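The number-then-assign workflow described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive recipe: assign_genes and demo are hypothetical helper names, the positional argument order for assign_vj_genes simply follows the parameter list above, and the scheme keyword passed to SingleChainAnnotator is assumed to be accepted by your installed version. The helper takes the annotator and tool as arguments so that it does not depend on how they were constructed.

```python
def assign_genes(annotator, vj_tool, sequence, species="human", mode="identity"):
    """Number `sequence`, then assign its closest V and J genes.

    `annotator` must provide analyze_seq(sequence); `vj_tool` must provide
    assign_vj_genes(alignment, sequence, species, mode).
    """
    # analyze_seq returns (numbering, percent_identity, chain_name, error_message).
    alignment = annotator.analyze_seq(sequence)
    v_gene, j_gene, v_score, j_score, found_species = vj_tool.assign_vj_genes(
        alignment, sequence, species, mode
    )
    return {
        # Tied genes arrive as one '_'-delimited string; split them apart.
        "v_genes": v_gene.split("_"),
        "j_genes": j_gene.split("_"),
        "v_score": v_score,
        "j_score": j_score,
        # Differs from the input species only when the input was 'unknown'.
        "species": found_species,
    }


def demo(sequence):
    """Run the helper against a real AntPack install (not called automatically)."""
    from antpack import SingleChainAnnotator, VJGeneTool

    scheme = "imgt"  # must be identical for the annotator and the VJGeneTool
    # Assumption: both constructors accept `scheme` as a keyword argument.
    annotator = SingleChainAnnotator(scheme=scheme)
    vj_tool = VJGeneTool(scheme=scheme)
    return assign_genes(annotator, vj_tool, sequence, species="unknown")
```

Passing species='unknown' searches every species in the database, at the cost of a somewhat slower lookup; the species actually matched is then available under the "species" key.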
- get_vj_gene_sequence
Retrieves the amino acid sequence of a specified V or J gene, if that gene is present in the version of the database included with this version of AntPack. You can use assign_vj_genes together with this function to see what the assigned V and J sequences are (if needed).
- Parameters:
query_name (str) – A valid V or J gene name, as generated by, for example, assign_vj_genes.
species (str) – One of ‘human’, ‘mouse’, ‘alpaca’, ‘rabbit’.
- Returns:
sequence (str) – The amino acid sequence of the requested V or J gene, gapped to length 128, consistent with the IMGT numbering scheme. If the gene name does not match anything, None is returned.
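When several genes tie, it can be useful to retrieve each germline sequence and compare it with your own sequence. The sketch below is illustrative only. fetch_germlines and percent_identity are hypothetical helpers, not AntPack functions; the gap character '-' is an assumption about how blank positions are represented; and percent_identity only mirrors the identity definition given for v_pident, so it is not guaranteed to reproduce AntPack's values exactly. Laying your query out on the same 128 positions is not shown here.

```python
def fetch_germlines(vj_tool, assigned: str, species: str) -> dict:
    """Look up the gapped sequence for every gene in an '_'-delimited string.

    `vj_tool` must provide get_vj_gene_sequence(query_name, species).
    Names that match nothing map to None.
    """
    return {
        name: vj_tool.get_vj_gene_sequence(name, species)
        for name in assigned.split("_")
    }


def percent_identity(query_gapped: str, germline_gapped: str, gap: str = "-") -> float:
    """Matches divided by the number of non-blank germline positions."""
    if len(query_gapped) != len(germline_gapped):
        raise ValueError("sequences must be laid out on the same positions")
    # Only positions where the germline has a residue are counted.
    germline_positions = [i for i, aa in enumerate(germline_gapped) if aa != gap]
    if not germline_positions:
        raise ValueError("germline sequence has no non-blank positions")
    matches = sum(query_gapped[i] == germline_gapped[i] for i in germline_positions)
    return matches / len(germline_positions)
```

Since get_vj_gene_sequence needs a specific species, pass the species returned by assign_vj_genes rather than 'unknown'.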
- retrieve_db_dates()
Returns the dates when each VJ gene database used for this assignment was last updated by downloading from IMGT or OGRDB.