Antibody and TCR numbering in AntPack

To start, use the SingleChainAnnotator or PairedChainAnnotator tools. You can use these to number sequences, “trim” the resulting alignment and convert a list of sequences into a fixed-length multiple sequence alignment (MSA)::

from antpack import SingleChainAnnotator, PairedChainAnnotator
aligner = SingleChainAnnotator(chains=["H", "K", "L"], scheme="imgt")
paired_aligner = PairedChainAnnotator(scheme="imgt",
                      receptor_type="mab")

Both tools have the same basic methods / functions available.

For single chains: if you don’t know what type of chain you’re working with, leave chains as default and SingleChainAnnotator will figure out the chain type. If you DO know that all of your chains are either heavy [“H”] or light [“K”, “L”], set SingleChainAnnotator to look only for that chain type. If you are interested in TCRs, supply chains as ["A", "B", "D", "G"]. Note that TCR numbering is somewhat slower than antibody numbering in current versions of AntPack (we will likely improve TCR numbering speed in future).

PairedChainAnnotator is designed to work with sequences that contain both a light and a heavy chain (in any order) but can also handle single chains. Keep in mind that PairedChainAnnotator will always try to find two chains in the input sequence, so if only one chain is present, your clue will be a very low percent identity and/or an error message for one of the two chains. It is a little slower than SingleChainAnnotator because it has to do some additional operations. If you want to look at TCRs instead of antibodies, change receptor_type to "tcr".

Some prior versions accepted an option called compress_init_gaps. This option is deprecated as of v0.3.6.

class antpack.SingleChainAnnotator(chains=['H', 'K', 'L'], scheme='imgt')
__init__(chains=['H', 'K', 'L'], scheme='imgt')

Class constructor.

Parameters:
  • chains (list) – A list of chains. Each must EITHER be one of “H”, “K”, “L” for antibodies or one of “A”, “B”, “D”, “G” for TCRs. If [“H”, “K”, “L”] (default) or [“A”, “B”, “D”, “G”] the annotator will automatically determine the most appropriate chain type for each input sequence. You cannot supply a mixture of TCR and antibody chains (e.g. [“H”, “A”]) – the list you supply must contain either TCR or antibody chains but not both.

  • scheme (str) – The numbering scheme. Must be one of “imgt”, “martin”, “kabat”, “aho”. If TCR chains are supplied, only “imgt” is accepted.

Raises:

ValueError – A ValueError is raised if unacceptable inputs are supplied.

analyze_seq

Numbers a single input sequence. A list of outputs from this function can be passed to build_msa if desired. The output from this function can also be passed to trim_alignment, to assign_cdr_labels and to the VJGeneTool as well.

Parameters:

sequence (str) – A string which is a sequence containing the usual 20 amino acids. X is also allowed but should be used sparingly.

Returns:

sequence_results (tuple) – A tuple of (sequence_numbering, percent_identity, chain_name, error_message). If no error was encountered, the error message is “”. An alignment with low percent identity (e.g. < 0.85) may indicate a sequence that is not really an antibody, that contains a large deletion, or is not of the selected chain type.

analyze_seqs

Numbers a list of input sequences. The outputs can be passed to other functions like build_msa, trim_alignment, assign_cdr_labels and the VJGeneTool if desired.

Parameters:

sequences (list) – A list of strings, each of which is a sequence containing the usual 20 amino acids. X is also allowed (although X should be used with caution; it would be impossible to correctly align a sequence consisting mostly of X for example).

Returns:

sequence_results (list) – A list of tuples of (sequence_numbering, percent_identity, chain_name, error_message). If no error was encountered, the error message is “”. An alignment with low percent identity (e.g. < 0.85) may indicate a sequence that is not really an antibody, that contains a large deletion, or is not of the selected chain type.
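As a minimal sketch of how you might screen these tuples, the helper below separates results that look reliable from those that do not. The name `split_by_quality` and the example tuples are hypothetical (they are hand-written stand-ins, not real AntPack output), and the 0.85 cutoff simply mirrors the example threshold mentioned above; tune it for your own data.

```python
def split_by_quality(sequences, results, min_identity=0.85):
    """Separate sequences whose numbering looks reliable from the rest.

    `results` is assumed to be a list of
    (numbering, percent_identity, chain_name, error_message) tuples,
    i.e. the shape documented for analyze_seqs.
    """
    accepted, rejected = [], []
    for seq, result in zip(sequences, results):
        _numbering, percent_identity, _chain_name, error_message = result
        if error_message == "" and percent_identity >= min_identity:
            accepted.append((seq, result))
        else:
            rejected.append((seq, result))
    return accepted, rejected


# Hand-written stand-ins for analyze_seqs output (hypothetical values).
example_seqs = ["EVQL", "XXXX"]
example_results = [
    (["1", "2", "3", "4"], 0.97, "H", ""),
    (["-", "-", "-", "-"], 0.12, "", "hypothetical error text"),
]
accepted, rejected = split_by_quality(example_seqs, example_results)
```

Keeping the rejected pairs (rather than discarding them) makes it easy to inspect the error messages afterwards.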

assign_cdr_labels

Assigns a list of labels “-”, “fmwk1”, “cdr1”, “fmwk2”, “cdr2”, “fmwk3”, “cdr3”, “fmwk4” to each amino acid in a sequence already annotated using the “analyze_seq” or “analyze_seqs” commands. The labels indicate which framework region or CDR each amino acid / position is in. This function can be used to assign CDRs with a different scheme than the one used to number the sequence if desired.

Parameters:
  • numbering (list) – A list containing valid codes for the scheme that was selected when this object was created. If you pass a sequence to the analyze_seq method of SingleChainAnnotator or PairedChainAnnotator, the numbering will be the first element of the tuple that is returned (or the first element of both tuples that are returned for PairedChainAnnotator).

  • chain (str) – A valid chain (e.g. ‘H’, ‘K’, ‘L’, ‘A’). The assigned chain is the third element of the tuple returned by analyze_seq. For this function only, ‘K’ and ‘L’ are equivalent since they both refer to a light chain, so if your chain is light you can supply either for the same result.

  • scheme (str) – Either “” or a valid scheme. If “” (default), the scheme that is used is the same as the one selected when the annotator was constructed. Using a different scheme can enable you to “cross-assign” CDRs and number with one scheme while assigning CDRs with another. So if you create an annotator with “imgt” as the scheme then you are numbering using “imgt”, but by passing e.g. “kabat” to this function, you can use the kabat CDR definitions instead of the IMGT ones. Valid schemes for this function only are ‘imgt’, ‘aho’, ‘kabat’, ‘martin’, ‘north’. For TCRs only “” and “imgt” are accepted.

Returns:

region_labels (list) – A list of strings, each of which is one of “fmwk1”, “fmwk2”, “fmwk3”, “fmwk4”, “cdr1”, “cdr2”, “cdr3” or “-”. This list will be of the same length as the input alignment.
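Once you have the region labels, pulling out each CDR or framework region is a matter of grouping residues by label. The sketch below assumes the labels line up one-to-one with the residues of the sequence that was numbered; `extract_regions` and the toy inputs are hypothetical and not part of AntPack.

```python
def extract_regions(sequence, region_labels):
    """Group residues by region label, skipping positions labeled "-"."""
    if len(sequence) != len(region_labels):
        raise ValueError("sequence and region_labels must have equal length")
    regions = {}
    for residue, label in zip(sequence, region_labels):
        if label == "-":
            continue  # residue lies outside the numbered variable region
        regions[label] = regions.get(label, "") + residue
    return regions


# Toy inputs for illustration only (not a real antibody sequence).
example_seq = "MAQVGYTW"
example_labels = ["-", "fmwk1", "fmwk1", "cdr1", "cdr1", "fmwk2", "fmwk2", "-"]
regions = extract_regions(example_seq, example_labels)
```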

build_msa

Builds a multiple sequence alignment using a list of sequences and a corresponding list of tuples output by analyze_seq or analyze_seqs (e.g. from PairedChainAnnotator or SingleChainAnnotator).

Parameters:
  • sequences (list) – A list of sequences.

  • annotations (list) – A list of tuples, each containing (numbering, percent_identity, chain_name, error_message). These tuples are what you will get as output if you pass sequences to the analyze_seq or analyze_seqs methods of SingleChainAnnotator or PairedChainAnnotator.

  • add_unobserved_positions (bool) – If False, only positions observed for one or more sequences appear in the output. If True, by contrast, not just observed positions but all expected positions for a given numbering scheme appear in the output. If IMGT expected position 9 does not occur in any dataset sequence, for example, it will not appear in the output if this argument is False but will be added (and will be blank for all sequences) if True.

Returns:
  • position_codes (list) – A list of position codes from the appropriate numbering scheme.

  • aligned_seqs (list) – A list of strings – the input sequences all aligned to form an MSA.
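The two returned lists can be written out as a simple table, with one column per position code and one row per sequence. This is only a sketch: `msa_to_csv` is a hypothetical helper, and it assumes each aligned string contains exactly one character per position code.

```python
import csv
import io


def msa_to_csv(position_codes, aligned_seqs, handle):
    """Write an MSA as CSV: header row of position codes, one row per sequence."""
    writer = csv.writer(handle)
    writer.writerow(position_codes)
    for aligned in aligned_seqs:
        if len(aligned) != len(position_codes):
            raise ValueError("aligned sequence length does not match position codes")
        writer.writerow(list(aligned))


# Toy stand-ins for build_msa output (hypothetical values).
buffer = io.StringIO()
msa_to_csv(["1", "2", "3"], ["EVQ", "E-Q"], buffer)
csv_text = buffer.getvalue()
```

In practice you would pass an open file handle instead of the in-memory buffer used here.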

sort_position_codes

Takes an input list of position codes for a specified scheme and sorts them. This is useful since for some schemes (e.g. IMGT) sorting is nontrivial, e.g. 112A goes before 112.

Parameters:

position_code_list (list) – A list of position codes. If ‘-’ is present it is filtered out and is not included in the returned list.

Returns:

sorted_codes (list) – A list of sorted position codes.
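To see why a plain alphabetical or numeric sort is not enough for IMGT, consider the CDR3 insertion positions: insertions at 111 ascend (111, 111A, 111B) while insertions at 112 descend (112B, 112A, 112). The sort key below is an illustrative sketch of that one convention only. It is not how sort_position_codes is implemented and it ignores other special cases, so use sort_position_codes for real work.

```python
def imgt_sort_key(code):
    """Illustrative sort key covering only the IMGT 111/112 insertion rule."""
    digits = "".join(ch for ch in code if ch.isdigit())
    letter = code[len(digits):]  # assumes the digits come first, e.g. "112A"
    number = int(digits)
    if number == 112:
        # Insertion letters descend and the plain position comes last.
        rank = -ord(letter) if letter else 0
    else:
        # Plain position first, then insertion letters in ascending order.
        rank = ord(letter) if letter else 0
    return (number, rank)


codes = ["112", "111A", "112A", "111", "112B", "113"]
ordered = sorted(codes, key=imgt_sort_key)
```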

trim_alignment

Takes as input a sequence and a tuple produced by analyze_seq and trims off any gap regions at either end, which result when there are amino acids on either end that are not part of the numbered variable region. The output from analyze_seq can be fed directly to this function.

Parameters:
  • sequence (str) – The input sequence.

  • alignment (tuple) – A tuple containing (numbering, percent_identity, chain_name, error_message). This tuple is what you will get as output if you pass sequences to the analyze_seq method of SingleChainAnnotator or PairedChainAnnotator.

Returns:
  • trimmed_seq (str) – The trimmed input sequence.

  • trimmed_numbering (list) – The first element of the input tuple, the numbering, but with all gap regions trimmed off both ends.

  • exstart (int) – The first untrimmed position in the input sequence.

  • exend (int) – The last untrimmed position in the input sequence. The trimmed sequence is sequence[exstart:exend].
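The relationship between the four return values can be illustrated with a small pure-Python sketch. This is not the library's implementation: `trim_gaps` is a hypothetical function, and it treats exend as exclusive so that sequence[exstart:exend] reproduces the trimmed sequence, which is an assumption you should verify against real output.

```python
def trim_gaps(sequence, numbering):
    """Drop leading and trailing positions whose position code is "-"."""
    numbered = [i for i, code in enumerate(numbering) if code != "-"]
    if not numbered:
        return "", [], 0, 0  # nothing was numbered
    exstart = numbered[0]
    exend = numbered[-1] + 1  # exclusive end, assumed for slicing
    return sequence[exstart:exend], numbering[exstart:exend], exstart, exend


# Toy inputs for illustration only.
example_seq = "MKEVQLGS"
example_numbering = ["-", "-", "1", "2", "3", "4", "-", "-"]
trimmed_seq, trimmed_numbering, exstart, exend = trim_gaps(
    example_seq, example_numbering
)
```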

class antpack.PairedChainAnnotator(scheme='imgt', receptor_type='mab')
__init__(scheme='imgt', receptor_type='mab')

Class constructor.

Parameters:
  • scheme (str) – The numbering scheme. Must be one of “imgt”, “martin”, “kabat”, “aho”. If receptor_type is ‘tcr’ only “imgt” is accepted.

  • receptor_type (str) – One of “mab”, “tcr”. Default is “mab” (antibody).

Raises:

RuntimeError – A RuntimeError is raised if unacceptable inputs are supplied.

analyze_seq

Extracts and numbers the variable chain regions from a sequence that may contain both a light (‘K’, ‘L’ for antibodies, ‘B’ or ‘D’ for TCRs) region and a heavy (‘H’ for antibodies, ‘A’ or ‘G’ for TCRs) region. The extracted light or heavy chains that are returned can be passed to other tools like build_msa, trim_alignment, assign_cdr_labels and the VJGeneTool.

Parameters:

sequence (str) – A string which is a sequence containing the usual 20 amino acids. X is also allowed but should be used sparingly.

Returns:
  • heavy_chain_result (tuple) – A tuple of (numbering, percent_identity, chain_name, error_message). Numbering is the same length as the input sequence. A low percent identity or an error message may indicate a problem with the input sequence. The error_message is “” unless some error occurred.

  • light_chain_result (tuple) – A tuple of (numbering, percent_identity, chain_name, error_message). Numbering is the same length as the input sequence. A low percent identity or an error message may indicate a problem with the input sequence. The error_message is “” unless some error occurred.
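Because PairedChainAnnotator always reports both a heavy and a light result, you may want to check which of the two actually looks present. A minimal sketch follows; `chains_found` is a hypothetical helper, the example tuples are hand-written, and the 0.85 cutoff is an assumed threshold that you should adjust for your data.

```python
def chains_found(heavy_result, light_result, min_identity=0.85):
    """Report which chains look present, based on identity and error message."""
    found = {}
    for label, result in (("heavy", heavy_result), ("light", light_result)):
        _numbering, percent_identity, _chain_name, error_message = result
        found[label] = error_message == "" and percent_identity >= min_identity
    return found


# Hand-written stand-ins for analyze_seq output (hypothetical values).
heavy = (["1", "2", "3"], 0.95, "H", "")
light = (["-", "-", "-"], 0.20, "", "hypothetical error text")
status = chains_found(heavy, light)
```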

analyze_seqs

Extracts and numbers the variable chain regions from a list of sequences, each of which may contain both a light (‘K’, ‘L’ for antibodies or ‘B’, ‘D’ for TCRs) region and a heavy (‘H’ for antibodies or ‘A’, ‘G’ for TCRs) region. The extracted light or heavy chains that are returned can be passed to other tools like build_msa, trim_alignment, assign_cdr_labels and the VJGeneTool.

Parameters:

sequences (list) – A list of strings, each of which is a sequence containing the usual 20 amino acids. X is also allowed but should be used sparingly.

Returns:
  • heavy_chain_results (list) – A list of tuples of (numbering, percent_identity, chain_name, error_message). Numbering is the same length as the corresponding sequence. A low percent identity or an error message may indicate a problem with an input sequence. Each error_message is “” unless some error occurred for that sequence.

  • light_chain_results (list) – A list of tuples of (numbering, percent_identity, chain_name, error_message). Numbering is the same length as the corresponding sequence. A low percent identity or an error message may indicate a problem with an input sequence. Each error_message is “” unless some error occurred for that sequence.

assign_cdr_labels

Assigns a list of labels “-”, “fmwk1”, “cdr1”, “fmwk2”, “cdr2”, “fmwk3”, “cdr3”, “fmwk4” to each amino acid in a sequence already annotated using the “analyze_seq” or “analyze_seqs” commands. The labels indicate which framework region or CDR each amino acid / position is in. This function can be used to assign CDRs with a different scheme than the one used to number the sequence if desired.

Parameters:
  • numbering (list) – A list containing valid codes for the scheme that was selected when this object was created. If you pass a sequence to the analyze_seq method of SingleChainAnnotator or PairedChainAnnotator, the numbering will be the first element of the tuple that is returned (or the first element of both tuples that are returned for PairedChainAnnotator).

  • chain (str) – A valid chain (e.g. ‘H’, ‘K’, ‘L’, ‘A’). The assigned chain is the third element of the tuple returned by analyze_seq. For this function only, ‘K’ and ‘L’ are equivalent since they both refer to a light chain, so if your chain is light you can supply either for the same result.

  • scheme (str) – Either “” or a valid scheme. If “” (default), the scheme that is used is the same as the one selected when the annotator was constructed. Using a different scheme can enable you to “cross-assign” CDRs and number with one scheme while assigning CDRs with another. So if you create an annotator with “imgt” as the scheme then you are numbering using “imgt”, but by passing e.g. “kabat” to this function, you can use the kabat CDR definitions instead of the IMGT ones. Valid schemes for this function only are ‘imgt’, ‘aho’, ‘kabat’, ‘martin’, ‘north’. For TCRs only “” and “imgt” are accepted.

Returns:

region_labels (list) – A list of strings, each of which is one of “fmwk1”, “fmwk2”, “fmwk3”, “fmwk4”, “cdr1”, “cdr2”, “cdr3” or “-”. This list will be of the same length as the input alignment.

build_msa

Builds a multiple sequence alignment using a list of sequences and a corresponding list of tuples output by analyze_seq or analyze_seqs (e.g. from PairedChainAnnotator or SingleChainAnnotator).

Parameters:
  • sequences (list) – A list of sequences.

  • annotations (list) – A list of tuples, each containing (numbering, percent_identity, chain_name, error_message). These tuples are what you will get as output if you pass sequences to the analyze_seq or analyze_seqs methods of SingleChainAnnotator or PairedChainAnnotator.

  • add_unobserved_positions (bool) – If False, only positions observed for one or more sequences appear in the output. If True, by contrast, not just observed positions but all expected positions for a given numbering scheme appear in the output. If IMGT expected position 9 does not occur in any dataset sequence, for example, it will not appear in the output if this argument is False but will be added (and will be blank for all sequences) if True.

Returns:
  • position_codes (list) – A list of position codes from the appropriate numbering scheme.

  • aligned_seqs (list) – A list of strings – the input sequences all aligned to form an MSA.

sort_position_codes

Takes an input list of position codes for a specified scheme and sorts them. This is useful since for some schemes (e.g. IMGT) sorting is nontrivial, e.g. 112A goes before 112.

Parameters:

position_code_list (list) – A list of position codes. If ‘-’ is present it is filtered out and is not included in the returned list.

Returns:

sorted_codes (list) – A list of sorted position codes.

trim_alignment

Takes as input a sequence and a tuple produced by analyze_seq and trims off any gap regions at either end, which result when there are amino acids on either end that are not part of the numbered variable region. The output from analyze_seq can be fed directly to this function.

Parameters:
  • sequence (str) – The input sequence.

  • alignment (tuple) – A tuple containing (numbering, percent_identity, chain_name, error_message). This tuple is what you will get as output if you pass sequences to the analyze_seq method of SingleChainAnnotator or PairedChainAnnotator.

Returns:
  • trimmed_seq (str) – The trimmed input sequence.

  • trimmed_numbering (list) – The first element of the input tuple, the numbering, but with all gap regions trimmed off both ends.

  • exstart (int) – The first untrimmed position in the input sequence.

  • exend (int) – The last untrimmed position in the input sequence. The trimmed sequence is sequence[exstart:exend].

Notice that these tools do not have a multithreading option. That’s because the best way to do multithreading / multiprocessing for antibody numbering for a large number of sequences is to split the antibody sequences up into batches, and you can do this easily by using Python multiprocessing. In each process, create a SingleChainAnnotator and use that to number one batch of the sequences.
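The batching approach described above might look like the following sketch. The helper names (`make_batches`, `number_batch`, `number_in_parallel`) and the default batch size and process count are hypothetical; the only AntPack calls used are the SingleChainAnnotator constructor and analyze_seqs as documented above. It assumes the returned tuples can be pickled and sent back to the parent process.

```python
from multiprocessing import Pool


def make_batches(sequences, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [sequences[i:i + batch_size]
            for i in range(0, len(sequences), batch_size)]


def number_batch(batch):
    """Worker: create an annotator in this process and number one batch."""
    # Import and construct inside the worker so the annotator is created
    # in the worker process rather than pickled from the parent.
    from antpack import SingleChainAnnotator
    annotator = SingleChainAnnotator(chains=["H", "K", "L"], scheme="imgt")
    return annotator.analyze_seqs(batch)


def number_in_parallel(sequences, batch_size=1000, n_processes=4):
    """Number all sequences using one annotator per worker call."""
    batches = make_batches(sequences, batch_size)
    with Pool(processes=n_processes) as pool:
        per_batch = pool.map(number_batch, batches)
    # Pool.map preserves batch order, so flattening preserves input order.
    return [result for batch_results in per_batch for result in batch_results]
```

Call `number_in_parallel` from under an `if __name__ == "__main__":` guard so that it also works on platforms where worker processes are spawned rather than forked.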

Aho, IMGT, Martin (“modern Chothia”) and Kabat are supported numbering schemes.

Notice that it’s easy to convert the output of either annotator into a fixed-length MSA by using build_msa. If you have a large set of sequences that’s too large to work with in memory, however, build_msa obviously will not help. In this case, you can instead loop over the sequences, number each of them and keep track of all the unique position codes you’ve seen (e.g. by adding them to a set). The sort_position_codes function can then sort the resulting list of unique position codes to the correct ordering for that scheme. You can then create a dictionary mapping position codes to positions in a fixed length array, e.g.::

position_dict = {k:i for i, k in enumerate(sorted_position_codes)}

and then use this when writing sequences to file or e.g. one-hot encoding them in an array. For examples of how to do this, see the numbering example on the main page. If “-” is present, sort_position_codes will always remove it when sorting; thus, if the list you pass to this function contains ‘-’, that character will be removed before sorting.
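The two-pass procedure described above can be sketched as follows. The helpers `collect_position_codes` and `to_fixed_length` are hypothetical, and the annotation tuples are hand-written stand-ins. In real use the sorting step should call the annotator's sort_position_codes; plain integer sorting is used here only because the toy codes contain no insertion letters.

```python
def collect_position_codes(annotations):
    """First pass: gather every distinct position code, ignoring gaps."""
    seen = set()
    for numbering, _identity, _chain, _error in annotations:
        seen.update(code for code in numbering if code != "-")
    return seen


def to_fixed_length(sequence, numbering, position_dict, fill="-"):
    """Second pass: place each numbered residue in its fixed column."""
    row = [fill] * len(position_dict)
    for residue, code in zip(sequence, numbering):
        if code != "-":
            row[position_dict[code]] = residue
    return "".join(row)


# Hand-written stand-ins for annotator output (hypothetical values).
annotations = [
    (["1", "2", "4"], 0.9, "H", ""),
    (["-", "1", "3", "4"], 0.9, "H", ""),
]
unique_codes = collect_position_codes(annotations)
# Replace this line with the annotator's sort_position_codes in real use.
sorted_position_codes = sorted(unique_codes, key=int)
position_dict = {k: i for i, k in enumerate(sorted_position_codes)}
row = to_fixed_length("EVL", ["1", "2", "4"], position_dict)
```

Since only the set of position codes is held in memory during the first pass, the sequences themselves can be streamed from disk in both passes.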

assign_cdr_labels is useful to figure out which portions of a numbered sequence are CDR or framework. As of v0.3.8, you can use a different set of CDR definitions than the numbering scheme. You can, for example, number using IMGT by creating a PairedChainAnnotator or SingleChainAnnotator with "imgt" as the scheme and then call assign_cdr_labels, passing it "kabat" as the scheme argument for CDR assignment. This will number your sequences with IMGT but use Kabat CDR definitions.