Sequence numbering
In this notebook, we illustrate how to use the SingleChainAnnotator tool to determine whether a sequence is heavy or light (lambda, kappa) chain and number it. We next demonstrate how to use the PairedChainAnnotator, which can be used if your data is paired or single chains (but in general if you know your data is single chain prefer SingleChainAnnotator).
We also illustrate how to use AntPack to extract specific CDRs or framework regions and how to merge numbering for antibody sequences of different lengths so that you can write your sequences to a csv file as an MSA or encode them as a fixed-length array for ML.
[1]:
from antpack import SingleChainAnnotator, PairedChainAnnotator
my_sequence = "AAAAAAAEVHLQQSGAELMKPGASVKISCKASGYTFITYWIEWVKQRPGHGLEWIGDILPGSGSTNYNENFKGKATFTADSSSNTAYMQLSSLTSEDSAVYYCARSGYYGNSGFAYWGQGTLVTVSA"
additional_sequences = ["EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGIIYPGDSDTRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCARPHYYDSLDAFDIWGQGTMVTVSS",
"ITLKESGPTLVKPTQTLTLTCTFSGFSLSTSGMGVSWIRQPPGKALEWLAHIYWDDDKRYNPSLKSRLTITKDTSKNQVVLTMTNMDPVDTATYYCARLYGFTYGFAYWGQGTLVTVSS",
"EVKLQESGPGKLQPSQTLSLTCSFSGFSLTTSGIGVGWIRQPSGKGLEWLAHIWWSASKYYNTALKSRLTISKDTSNNQVFLKIASVDTADTATYYCARAYYGNYGGYYFDYWGQGTTLTVSS"]
light_chain = "DIELTQSPASLSASVGETVTITCQASENIYSYLAWHQQKQGKSPQLLVYNAKTLAGGVSSRFSGSGSGTHFSLKIKSLQPEDFGIYYCQHHYGILPTFGGGTKLEIK"
# Let's make up a paired chain sequence by merging a heavy and light chain with a fairly arbitrary spacer between them.
paired_chain = "".join( [my_sequence, "GGGGGGGG", light_chain] )
First, let’s use SingleChainAnnotator to process sequences that likely only contain a single chain.
By passing [“H”, “K”, “L”] (the default) we ensure the annotator will align each sequence to heavy, kappa and lambda (kappa and lambda are different variants of light chains) and will determine the type of the chain based on which option returns the best alignment. If we KNOW our chains are all “H”, we could pass “H” as the only option, and this will improve speed slightly. In general however this isn’t necessary.
If we wanted to look at TCRs we could pass [“A”, “D”, “B”, “G”] (in any order) instead.
[2]:
sc_annotator = SingleChainAnnotator(["H", "K", "L"], scheme = "imgt")
[3]:
numbering, percent_identity, chain_type, err_message = sc_annotator.analyze_seq(my_sequence)
The numbering is a list of the same length as the input sequence where each element is either “-” (a gap, meaning there’s no numbering assignment that corresponds to that amino acid or a numbering assignment.
[4]:
print(f"{numbering}\n\n")
print(f"{my_sequence}\n\n")
print([(a, z) for a, z in zip(numbering, my_sequence)])
['-', '-', '-', '-', '-', '-', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128']
AAAAAAAEVHLQQSGAELMKPGASVKISCKASGYTFITYWIEWVKQRPGHGLEWIGDILPGSGSTNYNENFKGKATFTADSSSNTAYMQLSSLTSEDSAVYYCARSGYYGNSGFAYWGQGTLVTVSA
[('-', 'A'), ('-', 'A'), ('-', 'A'), ('-', 'A'), ('-', 'A'), ('-', 'A'), ('-', 'A'), ('1', 'E'), ('2', 'V'), ('3', 'H'), ('4', 'L'), ('5', 'Q'), ('6', 'Q'), ('7', 'S'), ('8', 'G'), ('9', 'A'), ('11', 'E'), ('12', 'L'), ('13', 'M'), ('14', 'K'), ('15', 'P'), ('16', 'G'), ('17', 'A'), ('18', 'S'), ('19', 'V'), ('20', 'K'), ('21', 'I'), ('22', 'S'), ('23', 'C'), ('24', 'K'), ('25', 'A'), ('26', 'S'), ('27', 'G'), ('28', 'Y'), ('29', 'T'), ('30', 'F'), ('35', 'I'), ('36', 'T'), ('37', 'Y'), ('38', 'W'), ('39', 'I'), ('40', 'E'), ('41', 'W'), ('42', 'V'), ('43', 'K'), ('44', 'Q'), ('45', 'R'), ('46', 'P'), ('47', 'G'), ('48', 'H'), ('49', 'G'), ('50', 'L'), ('51', 'E'), ('52', 'W'), ('53', 'I'), ('54', 'G'), ('55', 'D'), ('56', 'I'), ('57', 'L'), ('58', 'P'), ('59', 'G'), ('62', 'S'), ('63', 'G'), ('64', 'S'), ('65', 'T'), ('66', 'N'), ('67', 'Y'), ('68', 'N'), ('69', 'E'), ('70', 'N'), ('71', 'F'), ('72', 'K'), ('74', 'G'), ('75', 'K'), ('76', 'A'), ('77', 'T'), ('78', 'F'), ('79', 'T'), ('80', 'A'), ('81', 'D'), ('82', 'S'), ('83', 'S'), ('84', 'S'), ('85', 'N'), ('86', 'T'), ('87', 'A'), ('88', 'Y'), ('89', 'M'), ('90', 'Q'), ('91', 'L'), ('92', 'S'), ('93', 'S'), ('94', 'L'), ('95', 'T'), ('96', 'S'), ('97', 'E'), ('98', 'D'), ('99', 'S'), ('100', 'A'), ('101', 'V'), ('102', 'Y'), ('103', 'Y'), ('104', 'C'), ('105', 'A'), ('106', 'R'), ('107', 'S'), ('108', 'G'), ('109', 'Y'), ('110', 'Y'), ('111', 'G'), ('112', 'N'), ('113', 'S'), ('114', 'G'), ('115', 'F'), ('116', 'A'), ('117', 'Y'), ('118', 'W'), ('119', 'G'), ('120', 'Q'), ('121', 'G'), ('122', 'T'), ('123', 'L'), ('124', 'V'), ('125', 'T'), ('126', 'V'), ('127', 'S'), ('128', 'A')]
[ ]:
If we want to get just the part of the numbering that corresponds to the numbered region with c- and n-terminal nonnumbered regions removed, we can use trim_alignment, which returns the trimmed sequence, the trimmed numbering, and the start and end of the extracted region. trim_alignment expects a tuple – the tuple that is returned by analyze_seq for instance.
[5]:
my_sequence[7:127]
[5]:
'EVHLQQSGAELMKPGASVKISCKASGYTFITYWIEWVKQRPGHGLEWIGDILPGSGSTNYNENFKGKATFTADSSSNTAYMQLSSLTSEDSAVYYCARSGYYGNSGFAYWGQGTLVTVSA'
[6]:
alignment = (numbering, percent_identity, chain_type, err_message)
[7]:
print(sc_annotator.trim_alignment(my_sequence, alignment))
('EVHLQQSGAELMKPGASVKISCKASGYTFITYWIEWVKQRPGHGLEWIGDILPGSGSTNYNENFKGKATFTADSSSNTAYMQLSSLTSEDSAVYYCARSGYYGNSGFAYWGQGTLVTVSA', ['1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128'], 7, 127)
SingleChainAnnotator determined this was a heavy chain (“H”). “K” and “L” both correspond to light chains (kappa and lambda).
[8]:
print(chain_type)
H
A low percent identity (<< 0.8) could mean this isn’t an antibody sequence, contains large deletions, or some other issue.
[9]:
print(percent_identity)
0.9659090909090909
Finally, if the error message is anything other than “”, something went wrong. Sometimes this is something minor, sometimes it’s not. The most common error message indicates an unexpected amino acid was found at a highly conserved position – this can occur if there is a very large N or C terminal deletion, which may be a problem. It can also occur because the highly conserved F / W G x G motif in the J-gene region is occasionally (rarely) modified, which is a less serious issue.
[10]:
print(err_message)
Note that above we used analyze_seq
to analyze a single sequence. If we have a list of sequences we can use analyze_seqs
, or we can loop over our list and feed each sequence to analyze_seq
as we go.
Once we have the numbering, we can use it to extract CDRs using the CDR definitions for an appropriate numbering scheme. Alternatively, we can ask AntPack to tell us what the CDR and framework regions are for that scheme at the same time that we number the sequence. To do so, we just pass the numbering (the first element of alignment
) and the chain (the third element of alignment
) to assign_cdr_labels
, which returns a list which indicates for each numbered position whether it is “-“,
“fmwk1”, “cdr1”, “fmwk2”, “cdr2”, “fmwk3”, “cdr3”, or “fmwk4”.
An example is below:
[11]:
region_labels = sc_annotator.assign_cdr_labels(alignment[0], alignment[2])
print(region_labels)
['-', '-', '-', '-', '-', '-', '-', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4']
In some cases, you may want to number using one scheme, e.g. IMGT, but use a set of CDR definitions from a different scheme, e.g. Kabat. To do this you can pass a third argument scheme
to assign_cdr_labels
. This argument is the scheme whose cdr definitions you would like to use. You can pass north
, kabat
, imgt
, aho
, or martin
. If you leave this argument blank, the definitions used are the ones for the scheme that was used to number the sequence (so if you created
SingleChainAnnotator with scheme imgt
and don’t pass a scheme
argument to assign_cdr_labels
, it keeps using IMGT definitions for the CDR labels as well).
[12]:
region_labels = sc_annotator.assign_cdr_labels(alignment[0], alignment[2], scheme="north")
print(region_labels)
['-', '-', '-', '-', '-', '-', '-', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'fmwk1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'cdr1', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'fmwk2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'cdr2', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'fmwk3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'cdr3', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4', 'fmwk4']
Notice that the list region_labels
is the same length as my_sequence
or as numbering
. This provides an easy way to extract a specific cdr or framework if desired.
[ ]:
Now let’s see how we can use PairedChainAnnotator to number a sequence that contains a paired heavy chain and light chain (in any order). Note that using SingleChainAnnotator will always be faster, because it can afford to make some additional assumptions. We can use PairedChainAnnotator on single chains but this is less efficient.
If you want to use PairedChainAnnotator to look at TCRs, pass the argument receptor_type="tcr"
when constructing it.
[13]:
pc_annotator = PairedChainAnnotator(scheme = "imgt")
[14]:
print(pc_annotator.analyze_seq(paired_chain))
((['-', '-', '-', '-', '-', '-', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-'], 0.9659090909090909, 'H', ''), (['-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127'], 0.945054945054945, 'K', ''))
Notice that PairedChainAnnotator returns one annotation for the heavy chain and one for the light. The heavy is always returned first, the light second, regardless of what order they appeared in the sequence. All of the information that is returned is the same as for SingleChainAnnotator. Since the numbering for the heavy and light chain is the same length as the original sequence, it can often be useful to pass each output to pc_annotator.trim_alignment
as illustrated below.
Also notice that just as for SingleChainAnnotator, we can call analyze_seqs
for PairedChainAnnotator
if we have a list of input sequences. In this case, it will return two lists, one of the heavy chain alignments and one of the light.
[15]:
halignment, _ = pc_annotator.analyze_seq(paired_chain)
print(pc_annotator.trim_alignment(paired_chain, halignment))
('EVHLQQSGAELMKPGASVKISCKASGYTFITYWIEWVKQRPGHGLEWIGDILPGSGSTNYNENFKGKATFTADSSSNTAYMQLSSLTSEDSAVYYCARSGYYGNSGFAYWGQGTLVTVSA', ['1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128'], 7, 127)
[ ]:
We can also use PairedChainAnnotator on single chains. It will still return two results. Your clue that only one chain is present will be a low percent identity or an error message for one of the two chains. This is less efficient than SingleChainAnnotator so prefer SCA if you know you have only one chain. Notice below that when we feed in a heavy chain pc_annotator returns an error (which may vary) for the light chain. Low percent identity (e.g. < 0.75) is another strong indicator.
[16]:
print(pc_annotator.analyze_seq(my_sequence))
((['-', '-', '-', '-', '-', '-', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128'], 0.9659090909090909, 'H', ''), (['-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-'], -1.0, '', 'Alignment error; chain not found.'))
[ ]:
We can also use a SingleChainAnnotator to extract a specific type of chain from paired chain sequences. Here we take the paired chain and run it through light-chain specific and heavy-chain specific SingleChainAnnotators.
[17]:
sc_heavy = SingleChainAnnotator(["H"], scheme = "imgt")
sc_light = SingleChainAnnotator(["K", "L"], scheme = "imgt")
[18]:
print(sc_heavy.analyze_seq(paired_chain))
(['-', '-', '-', '-', '-', '-', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-'], 0.9659090909090909, 'H', '')
[19]:
print(sc_light.analyze_seq(paired_chain))
(['-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127'], 0.945054945054945, 'K', '')
[ ]:
What if we number a list of sequences, some have more positions than others, and now we want to generate a list that contains all of the positions observed in one or more sequences in the correct order? This is useful if for example you want to write all the sequences to say a csv file where each row is the same length, or convert all your sequences to a fixed-length array for a machine learning task (e.g. as input to a gradient boosted trees algorithm).
The easiest way to do this is if your list of sequences is short enough it can fit in memory. In this case, you can just use the build_msa method provided by both Paired and SingleChainAnnotators. If the list of sequences is too large to fit in memory, there’s another way you can do it we’ll illustrate shortly.
When calling build_msa, you can indicate that you want all positions that are expected in a given numbering scheme to be included (True) or only include the ones that are observed in at least one sequence (False). For example, position 10 is expected in IMGT. If no sequence in your dataset has a position 10, passing True would cause all sequences to have a gap here in the resulting MSA, while if False position 10 would not be included.
[20]:
my_hchains = sc_annotator.analyze_seqs(additional_sequences)
numbering, msa = sc_annotator.build_msa(additional_sequences, my_hchains, True)
print(numbering)
print(msa)
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '111A', '112A', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128']
['EVQLVQSGA-EVKKPGESLKISCKGSGYSF----TSYWIGWVRQMPGKGLEWMGIIYPG--DSDTRYSPSFQ-GQVTISADKSISTAYLQWSSLKASDTAMYYCARPHYYD-SLDAFDIWGQGTMVTVSS', '-ITLKESGP-TLVKPTQTLTLTCTFSGFSLS--TSGMGVSWIRQPPGKALEWLAHIYWD---DDKRYNPSLK-SRLTITKDTSKNQVVLTMTNMDPVDTATYYCARLYGF---TYGFAYWGQGTLVTVSS', 'EVKLQESGP-GKLQPSQTLSLTCSFSGFSLT--TSGIGVGWIRQPSGKGLEWLAHIWWS---ASKYYNTALK-SRLTISKDTSNNQVFLKIASVDTADTATYYCARAYYGNYGGYYFDYWGQGTTLTVSS']
Notice that two lists are returned. The first is the position codes (column headings). You could pass these to assign_cdr_labels
if you wanted to find out which regions are inside CDRs. The second is the list of input sequences with blanks added in the right locations so they are all the same length and form an MSA whose column headings are the numbering.
Nice, but…what if the list of sequences we have is too large to fit in memory? AntPack makes it easy to deal with this. First, we’ll number the sequences, then we’ll create a set containing all the positions observed in one or more sequences. AntPack can sort that set so the positions are in the correct order for the numbering scheme we’re using. Then we can easily convert that to a dict and use it to process our sequences as illustrated below. The sequences we’re using here are all heavy chain; if we had both heavy and light, of course, we might want to handle them separately.
[ ]:
[21]:
all_observed_position_codes = set()
numbering = []
for seq in additional_sequences:
results = sc_annotator.analyze_seq(seq)
# We could write the numbering to a temporary file if there are too many sequences to store
#this in memory.
numbering.append(results[0])
if results[3] != "":
# If there is an error message indicating a bad alignment, do something here
pass
# Add all position codes to the set
all_observed_position_codes.update(results[0])
all_observed_position_codes = list(all_observed_position_codes)
print(all_observed_position_codes)
['74', '22', '91', '20', '14', '35', '124', '21', '112A', '78', '98', '113', '127', '5', '104', '4', '15', '115', '81', '48', '94', '122', '75', '18', '118', '28', '84', '57', '97', '41', '87', '90', '96', '26', '80', '30', '46', '126', '25', '103', '45', '95', '36', '7', '79', '67', '70', '92', '2', '110', '111', '63', '27', '119', '13', '9', '65', '39', '102', '100', '16', '88', '109', '117', '38', '6', '85', '99', '3', '40', '31', '112', '62', '69', '1', '101', '86', '19', '37', '71', '58', '83', '52', '111A', '82', '17', '49', '8', '77', '11', '89', '29', '123', '47', '53', '50', '93', '105', '116', '59', '76', '42', '56', '24', '44', '128', '43', '68', '34', '12', '108', '106', '23', '64', '55', '51', '107', '54', '66', '72', '125', '120', '114', '121']
Notice that the list of observed position codes isn’t in order. We can fix that with a single call.
[22]:
all_observed_position_codes = sc_annotator.sort_position_codes(all_observed_position_codes)
print(all_observed_position_codes)
['1', '2', '3', '4', '5', '6', '7', '8', '9', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '111A', '112A', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128']
And now all the position codes are in order. This makes it easy to turn our sequences into a multiple sequence alignment in a file, or encode them as a fixed-length array as illustrated below:
[23]:
position_dict = {k:i for i, k in enumerate(all_observed_position_codes)}
fixed_length_seqs = []
for seq, pos_codes in zip(additional_sequences, numbering):
fixed_length_seq = ["-" for code in all_observed_position_codes]
for letter, code in zip(seq, pos_codes):
fixed_length_seq[position_dict[code]] = letter
fixed_length_seqs.append("".join(fixed_length_seq))
print(fixed_length_seqs)
['EVQLVQSGAEVKKPGESLKISCKGSGYSF--TSYWIGWVRQMPGKGLEWMGIIYPGDSDTRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCARPHYYD-SLDAFDIWGQGTMVTVSS', '-ITLKESGPTLVKPTQTLTLTCTFSGFSLSTSGMGVSWIRQPPGKALEWLAHIYWD-DDKRYNPSLKSRLTITKDTSKNQVVLTMTNMDPVDTATYYCARLYGF---TYGFAYWGQGTLVTVSS', 'EVKLQESGPGKLQPSQTLSLTCSFSGFSLTTSGIGVGWIRQPSGKGLEWLAHIWWS-ASKYYNTALKSRLTISKDTSNNQVFLKIASVDTADTATYYCARAYYGNYGGYYFDYWGQGTTLTVSS']
[ ]:
[ ]:
[ ]: