Handling DNA sequences

All of the existing antibody numbering schemes were designed to work with amino acid sequences. Consequently, while it’s possible to align germline genes to DNA sequences, this is slower and for most (though not all) workflows does not offer a significant advantage.

AntPack provides the DNASeqTranslator for translating DNA sequences into amino acids. If you already know the reading frame and whether the sequence is on the forward strand or the reverse complement, the DNASeqTranslator can translate it for you (which is fairly trivial). More importantly, if you don’t know these, the DNASeqTranslator can quickly determine them. It does this by checking all three possible reading frames (and the three reading frames of the reverse complement, if you indicate that the reverse complement may contain the sequence) for kmers which are common in antibody / TCR sequences. This check is fast, so the correct reading frame can be found without any expensive alignments.
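The idea behind the kmer check can be sketched in plain Python. This is only an illustration of the general technique, not AntPack's implementation: the kmer set `EXAMPLE_KMERS`, the kmer length of 4 and the function names are all hypothetical, and the real kmer list and scoring are not described here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Hypothetical kmers chosen for illustration; NOT the list AntPack uses.
EXAMPLE_KMERS = {"WVRQ", "YYCA", "WGQG"}


def translate(dna, frame):
    """Translate dna starting at offset frame; codons containing N become X."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def count_kmer_hits(protein, kmers, k=4):
    """Count positions in protein where a kmer from the set begins."""
    return sum(protein[i:i + k] in kmers for i in range(len(protein) - k + 1))


def guess_frame(dna, kmers=EXAMPLE_KMERS, check_reverse_complement=True):
    """Return (frame, is_reverse_complement, protein) for the best-scoring frame."""
    strands = [(False, dna)]
    if check_reverse_complement:
        strands.append((True, dna.translate(COMPLEMENT)[::-1]))
    candidates = []
    for is_rc, strand in strands:
        for frame in range(3):
            protein = translate(strand, frame)
            candidates.append((count_kmer_hits(protein, kmers), frame, is_rc, protein))
    # On ties, max() keeps the first candidate (forward strand, lowest frame).
    best = max(candidates, key=lambda c: c[0])
    return best[1], best[2], best[3]
```

Because each frame only needs one translation pass and a set lookup per position, scoring all six frames stays cheap compared with aligning each candidate.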

Currently DNASeqTranslator supports DNA sequences consisting of uppercase A, C, T, G or N (any codon containing N is translated to X, which is an allowed letter for the AntPack numbering tools). It does not accept sequences containing gaps.
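Since only uppercase A, C, G, T and N are accepted, it can help to screen inputs before translating them. The following is a minimal sketch; the helper names are hypothetical, and whether it is appropriate to uppercase a sequence or strip gap characters from it depends on where your data came from.

```python
import re

# One or more of the letters the translator is documented to accept.
_ALLOWED = re.compile(r"[ACGTN]+")


def is_valid_dna(sequence):
    """True if sequence is non-empty and contains only uppercase A, C, G, T, N."""
    return _ALLOWED.fullmatch(sequence) is not None


def clean_dna(sequence):
    """Uppercase and drop '-' and '.' gap characters; None if still invalid."""
    cleaned = sequence.upper().replace("-", "").replace(".", "")
    return cleaned if is_valid_dna(cleaned) else None
```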

IMPORTANT: DNASeqTranslator may return amino acid sequences which contain stop codons, which are not allowed inputs for the AntPack numbering / humanization tools. It’s a good idea to check the returned sequences for stop codons and, if any are found, decide how to handle them (this depends on the kind of data you’re working with and your application).
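One way to do this check is sketched below. It assumes stop codons appear as `*` in the translated string, which is a common convention but is an assumption here; confirm which symbol your AntPack version emits and pass it as `stop_symbol`. The function names are hypothetical.

```python
def partition_by_stop(translated, stop_symbol="*"):
    """Split translated sequences into (stop-free, containing a stop)."""
    clean, flagged = [], []
    for seq in translated:
        (flagged if stop_symbol in seq else clean).append(seq)
    return clean, flagged


def longest_stop_free_segment(seq, stop_symbol="*"):
    """Return the longest stretch of seq between stop symbols."""
    return max(seq.split(stop_symbol), key=len)
```

Discarding flagged sequences is the conservative choice. Keeping the longest stop-free segment may be reasonable when the stop codon lies outside the variable region, but that is a judgment call for your data.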

For more on how to use DNASeqTranslator, see below.

class antpack.DNASeqTranslator

Contains functions that determine the correct reading frame and complement to use for an input DNA sequence and translates the input DNA sequence to protein / AA.

__init__(self, arg: collections.abc.Set[str], /) → None

translate_dna_known_rf

Converts a DNA sequence to protein when both the reading frame and the orientation (forward strand or reverse complement) are known. This function translates the input sequence using the reading frame and orientation you specify. This is faster than translate_dna_unknown_rf, but translate_dna_unknown_rf may be more useful in situations (e.g. PacBio reads) where you do not know the reading frame or the orientation.

CAUTION: The amino acid sequence generated by this function may contain stop codons, which are currently rejected by AntPack numbering tools. You may want to check output sequences for the presence of stop codons.

Parameters:

sequence (str) – An uppercase sequence containing A, C, G and T. N is also allowed but should be used sparingly.
reading_frame (int) – One of 0, 1, or 2. Indicates how many positions forward to “slide” before starting translation.
reverse_complement (bool) – If True, the reverse complement is translated instead of the forward strand.

Returns:

translated_seq (str) – The translated AA sequence from the input.
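To make the reading_frame and reverse_complement parameters concrete, the sketch below lists the codons each setting would select. It is a plain-Python illustration, not AntPack code: `codons_for_frame` is a hypothetical helper, and it assumes the reverse complement is taken first and the offset is then applied to the resulting strand.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def codons_for_frame(sequence, reading_frame, reverse_complement=False):
    """List the codons read when starting reading_frame bases into the strand."""
    if reading_frame not in (0, 1, 2):
        raise ValueError("reading_frame must be 0, 1 or 2")
    # Assumption: reverse-complement first, then slide forward by reading_frame.
    strand = sequence.translate(_COMPLEMENT)[::-1] if reverse_complement else sequence
    # Trailing bases that do not fill a whole codon are dropped.
    return [strand[i:i + 3] for i in range(reading_frame, len(strand) - 2, 3)]


print(codons_for_frame("ATGGCCTAA", 0))        # ['ATG', 'GCC', 'TAA']
print(codons_for_frame("ATGGCCTAA", 1))        # ['TGG', 'CCT']
print(codons_for_frame("ATGGCCTAA", 0, True))  # ['TTA', 'GGC', 'CAT']
```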

translate_dna_unknown_rf

Converts a DNA sequence to protein when the reading frame is unknown and/or the antibody sequence may be on either the forward strand or the reverse complement. This function checks each possible reading frame (and, if indicated, the possible reading frames in the reverse complement) to see which is most likely to contain the mAb sequence, based on the presence / absence of kmers common in mAbs and TCRs. The DNA sequence must consist of only uppercase A, C, T, G and N (any codon containing N will be translated to X, which is an allowed letter in the AntPack numbering tools). If you know which reading frame and orientation the mAb sequence is in, it is generally faster to use translate_dna_known_rf instead, since it does not check multiple reading frames as this function does.

CAUTION: The amino acid sequence generated by this function may contain stop codons, which are currently rejected by AntPack numbering tools. You may want to check output sequences for the presence of stop codons.

Parameters:

sequence (str) – An uppercase sequence containing A, C, G and T. N is also allowed but should be used sparingly.
check_reverse_complement (bool) – If True, the reverse complement is also checked for a possible heavy / light chain.

Returns:

translated_seq (str) – The function checks the possible reading frames and reverse complement (if indicated) for kmers common in heavy / light chains and returns the translated AA sequence corresponding to the best match found.