Handling DNA sequences
All of the existing antibody numbering schemes were designed to work with amino acid sequences. Consequently, while it’s possible to align germline genes to DNA sequences, this is slower and for most (though not all) workflows does not offer a significant advantage.
AntPack provides the DNASeqTranslator class to translate DNA sequences into amino acid sequences. If you already know the reading frame and whether the sequence is on the forward strand or is the reverse complement, DNASeqTranslator can translate it directly (which is fairly trivial). More importantly, if you do not know these, DNASeqTranslator can quickly determine the correct reading frame and orientation. It does this by checking all three possible reading frames (and the three reading frames of the reverse complement, if you indicate that the sequence may be on the reverse complement) for kmers that are common in antibody / TCR sequences. This check is very fast, so the correct reading frame can be determined without any expensive alignments.
Currently DNASeqTranslator supports DNA sequences consisting of uppercase A, C, T, G or N (any codon containing N is translated to X, which is an allowed letter for the AntPack numbering tools). It does not accept sequences containing gaps.
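Because only uppercase A, C, T, G and N are supported, it can help to normalize and validate raw reads before translating them. The following is a minimal sketch; `prepare_dna` is a hypothetical helper written for illustration and is not part of AntPack.

```python
# Hypothetical pre-processing helper (not part of AntPack).
ALLOWED_BASES = frozenset("ACGTN")


def prepare_dna(raw: str) -> str:
    """Strip surrounding whitespace, uppercase the sequence, and reject
    anything other than A, C, G, T or N (including gap characters)."""
    seq = raw.strip().upper()
    bad = sorted(set(seq) - ALLOWED_BASES)
    if bad:
        raise ValueError(f"Unsupported characters in DNA sequence: {bad}")
    return seq
```

Whether to reject a read outright or repair it (for example by removing gap characters) depends on where the data came from, so this sketch only raises an error.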
IMPORTANT: DNASeqTranslator may return amino acid sequences that contain stop codons, which are not allowed inputs for the AntPack numbering / humanization tools. It is therefore a good idea to check the returned sequences for stop codons and, if any are found, decide how to handle them (depending on the kind of data you’re working with and your application).
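One simple way to run this check is to separate translated sequences into those with and without a stop codon before passing them on. The sketch below assumes stop codons appear in the translated string as `*`; that symbol is an assumption, so confirm it against real output first. `split_by_stop` is a hypothetical helper, not an AntPack function.

```python
def split_by_stop(translated: list[str], stop_symbol: str = "*") -> tuple[list[str], list[str]]:
    """Partition translated sequences into (clean, flagged) lists.

    `stop_symbol` is an assumed representation of a stop codon; adjust it
    if the translator you use emits a different character.
    """
    clean, flagged = [], []
    for seq in translated:
        # Any sequence containing the stop symbol is set aside for review.
        (flagged if stop_symbol in seq else clean).append(seq)
    return clean, flagged
```

Flagged sequences can then be discarded, truncated at the first stop, or inspected manually, depending on the application.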
For more on how to use DNASeqTranslator, see below.
- class antpack.DNASeqTranslator
Contains functions that determine the correct reading frame and complement to use for an input DNA sequence and translates the input DNA sequence to protein / AA.
- __init__(self, arg: collections.abc.Set[str], /) -> None
- translate_dna_known_rf
Converts a DNA sequence to protein when both the reading frame and the orientation (forward strand or reverse complement) are known. This function translates the input sequence using the reading frame and orientation you specify. This is faster than translate_dna_unknown_rf, but translate_dna_unknown_rf may be more useful in situations (e.g. PacBio reads) where you do not know the reading frame or the orientation.
CAUTION: The amino acid sequence generated by this function may contain stop codons, which are currently rejected by AntPack numbering tools. You may want to check output sequences for the presence of stop codons.
- Parameters:
sequence (str) – An uppercase sequence containing A, C, G and T. N is also allowed but should be used sparingly.
reading_frame (int) – One of 0, 1, or 2. Indicates how many positions forward to “slide” before starting translation.
reverse_complement (bool) – If True, the reverse complement is translated instead of the forward strand.
- Returns:
translated_seq (str) – The translated AA sequence from the input.
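To make the meaning of reading_frame and reverse_complement concrete, here is a pure-Python sketch of the same kind of translation. It is an illustration only and not AntPack's implementation: `translate_sketch` is a hypothetical name, the standard genetic code is assumed, stop codons are assumed to be written as `*`, and a trailing partial codon is assumed to be dropped.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G ('*' marks a stop codon).
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate_sketch(sequence: str, reading_frame: int = 0,
                     reverse_complement: bool = False) -> str:
    """Translate DNA starting `reading_frame` bases in, optionally after
    taking the reverse complement. Codons containing N become X."""
    if reading_frame not in (0, 1, 2):
        raise ValueError("reading_frame must be 0, 1 or 2")
    if reverse_complement:
        sequence = sequence.translate(_COMPLEMENT)[::-1]
    # Step through complete codons only; a trailing partial codon is dropped.
    codons = (sequence[i:i + 3] for i in range(reading_frame, len(sequence) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

For instance, with `reading_frame=1` the first base is skipped and translation starts at the second base.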
- translate_dna_unknown_rf
Converts a DNA sequence to protein when the reading frame is unknown and/or the antibody sequence may be in the forward or reverse complement. This function checks each possible reading frame (and if indicated the possible reading frames in the reverse complement) to see which is most likely to contain the mAb sequence based on the presence / absence of kmers common in mAbs and TCRs. The DNA sequence must consist of only A, C, T, G and N (any codon containing N will be translated to X, which is an allowed letter in the AntPack numbering tools) and should be uppercase letters. If you know which reading frame and/or complement the mAb sequence is in, it is generally faster to use translate_dna_known_rf instead, since it does not check multiple reading frames as this function does.
CAUTION: The amino acid sequence generated by this function may contain stop codons, which are currently rejected by AntPack numbering tools. You may want to check output sequences for the presence of stop codons.
- Parameters:
sequence (str) – An uppercase sequence containing A, C, G and T. N is also allowed but should be used sparingly.
check_reverse_complement (bool) – If True, the reverse complement is also checked for a possible heavy / light chain.
- Returns:
translated_seq (str) – The function checks the possible reading frames and reverse complement (if indicated) for kmers common in heavy / light chains and returns the translated AA sequence corresponding to the best match found.