Encoding Definitions
This module defines the standard encoding schemes (tokens and per-token atom layouts) used in the atomworks.ml package.
- atomworks.ml.encoding_definitions.AF2_ATOM14_ENCODING = Encoding(n_tokens=21, n_atoms_per_token=14)
Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13
---------------------------------------------------------------------------------------------
 0 : ALA | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
 1 : ARG | N   | CA  | C   | O   | CB  | CG  | CD  | NE  | CZ  | NH1 | NH2 |     |     |
 2 : ASN | N   | CA  | C   | O   | CB  | CG  | OD1 | ND2 |     |     |     |     |     |
 3 : ASP | N   | CA  | C   | O   | CB  | CG  | OD1 | OD2 |     |     |     |     |     |
 4 : CYS | N   | CA  | C   | O   | CB  | SG  |     |     |     |     |     |     |     |
 5 : GLN | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | NE2 |     |     |     |     |
 6 : GLU | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | OE2 |     |     |     |     |
 7 : GLY | N   | CA  | C   | O   |     |     |     |     |     |     |     |     |     |
 8 : HIS | N   | CA  | C   | O   | CB  | CG  | ND1 | CD2 | CE1 | NE2 |     |     |     |
 9 : ILE | N   | CA  | C   | O   | CB  | CG1 | CG2 | CD1 |     |     |     |     |     |
10 : LEU | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 |     |     |     |     |     |
11 : LYS | N   | CA  | C   | O   | CB  | CG  | CD  | CE  | NZ  |     |     |     |     |
12 : MET | N   | CA  | C   | O   | CB  | CG  | SD  | CE  |     |     |     |     |     |
13 : PHE | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  |     |     |
14 : PRO | N   | CA  | C   | O   | CB  | CG  | CD  |     |     |     |     |     |     |
15 : SER | N   | CA  | C   | O   | CB  | OG  |     |     |     |     |     |     |     |
16 : THR | N   | CA  | C   | O   | CB  | OG1 | CG2 |     |     |     |     |     |     |
17 : TRP | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | NE1 | CE2 | CE3 | CZ2 | CZ3 | CH2
18 : TYR | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  | OH  |     |
19 : VAL | N   | CA  | C   | O   | CB  | CG1 | CG2 |     |     |     |     |     |     |
20 : UNK |     |     |     |     |     |     |     |     |     |     |     |     |     |
AF2’s atom14 encoding.
- Reference:
- atomworks.ml.encoding_definitions.AF2_ATOM37_WITH_ATOMIZATION = Encoding(n_tokens=22, n_atoms_per_token=37)
Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  | 32  | 33  | 34  | 35  | 36
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0 : ALA | N   | CA  | C   | CB  | O   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 1 : ARG | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     | NE  |     |     |     |     |     | NH1 | NH2 |     | CZ  |     |     |     | OXT
 2 : ASN | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     |     |     | ND2 | OD1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 3 : ASP | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     |     |     |     | OD1 | OD2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 4 : CYS | N   | CA  | C   | CB  | O   |     |     |     |     |     | SG  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 5 : GLN | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     | NE2 | OE1 |     |     |     |     |     |     |     |     |     | OXT
 6 : GLU | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OE1 | OE2 |     |     |     |     |     |     |     |     | OXT
 7 : GLY | N   | CA  | C   |     | O   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 8 : HIS | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     | CD2 | ND1 |     |     |     |     |     | CE1 |     |     |     |     | NE2 |     |     |     |     |     |     |     |     |     |     | OXT
 9 : ILE | N   | CA  | C   | CB  | O   |     | CG1 | CG2 |     |     |     |     | CD1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
10 : LEU | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
11 : LYS | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     | CE  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | NZ  | OXT
12 : MET | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     |     |     |     |     |     | SD  | CE  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
13 : PHE | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     | CE1 | CE2 |     |     |     |     |     |     |     |     |     |     | CZ  |     |     |     | OXT
14 : PRO | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
15 : SER | N   | CA  | C   | CB  | O   |     |     |     | OG  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
16 : THR | N   | CA  | C   | CB  | O   |     |     | CG2 |     | OG1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
17 : TRP | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     |     | CE2 | CE3 |     | NE1 |     |     |     | CH2 |     |     |     |     | CZ2 | CZ3 |     | OXT
18 : TYR | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     | CE1 | CE2 |     |     |     |     |     |     |     |     |     | OH  | CZ  |     |     |     | OXT
19 : VAL | N   | CA  | C   | CB  | O   |     | CG1 | CG2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
20 : UNK |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
21 : 0   |     | 0   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
AF2’s atom37 encoding with atomization support.
- Reference:
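To make the difference between the two AF2 layouts above concrete, the following sketch scatters atom14-ordered coordinates into atom37 slots. The slot order is read off the column positions in the AF2_ATOM37_WITH_ATOMIZATION table, and the two atom14 rows are copied from AF2_ATOM14_ENCODING. The helper name atom14_to_atom37 and the use of NaN for empty slots are assumptions made for illustration; they are not part of atomworks.

```python
import numpy as np

# Atom37 slot order, read off the column positions in the table above.
ATOM37_ORDER = [
    "N", "CA", "C", "CB", "O", "CG", "CG1", "CG2", "OG", "OG1", "SG", "CD",
    "CD1", "CD2", "ND1", "ND2", "OD1", "OD2", "SD", "CE", "CE1", "CE2", "CE3",
    "NE", "NE1", "NE2", "OE1", "OE2", "CH2", "NH1", "NH2", "OH", "CZ", "CZ2",
    "CZ3", "NZ", "OXT",
]
ATOM37_SLOT = {name: slot for slot, name in enumerate(ATOM37_ORDER)}

# Two rows copied from AF2_ATOM14_ENCODING (dense, residue-specific order).
ATOM14_ROWS = {
    "ALA": ["N", "CA", "C", "O", "CB"],
    "ARG": ["N", "CA", "C", "O", "CB", "CG", "CD", "NE", "CZ", "NH1", "NH2"],
}


def atom14_to_atom37(res_name, coords14):
    """Hypothetical helper: scatter (14, 3) coordinates into a (37, 3) array."""
    coords37 = np.full((37, 3), np.nan)  # NaN marks an empty slot (assumption)
    for i, atom_name in enumerate(ATOM14_ROWS[res_name]):
        coords37[ATOM37_SLOT[atom_name]] = coords14[i]
    return coords37


# Dummy coordinates: row i of coords14 belongs to the i-th atom14 slot.
coords14 = np.arange(14 * 3, dtype=float).reshape(14, 3)
coords37 = atom14_to_atom37("ARG", coords14)

# Slots that ended up filled for ARG.
filled_slots = np.flatnonzero(~np.isnan(coords37[:, 0]))
print(filled_slots)
```

In atom14 the meaning of a slot depends on the residue (slot 5 is CG for ARG but SG for CYS), whereas in atom37 every atom name has one fixed slot, which is why the atom37 rows are sparse.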
- class atomworks.ml.encoding_definitions.AF3SequenceEncoding
Bases: object
Encodes and decodes sequence tokens for AlphaFold 3.
This class converts between residue names and their corresponding integer encodings as used in AlphaFold 3. It handles standard amino acids, RNA, DNA, and unknown residues.
- property idx_to_token: ndarray
- property n_tokens: int
- property token_to_idx: dict[str, int]
- property tokens: list[str]
- atomworks.ml.encoding_definitions.AF3_TOKENS = ('ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL', 'UNK', 'A', 'C', 'G', 'U', 'N', 'DA', 'DC', 'DG', 'DT', 'DN', '<G>')
Sequence tokens in AF3.
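As a rough illustration of what token_to_idx and idx_to_token expose, the sketch below rebuilds both mappings from the AF3_TOKENS tuple listed above. It assumes that a token's index is simply its position in the tuple, which this page does not state explicitly. The encode helper and its fall-back to 'UNK' for every unrecognized name are hypothetical simplifications, not atomworks behavior.

```python
import numpy as np

# Copied from the AF3_TOKENS entry above.
AF3_TOKENS = (
    'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE',
    'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL',
    'UNK', 'A', 'C', 'G', 'U', 'N', 'DA', 'DC', 'DG', 'DT', 'DN', '<G>',
)

# Assumption: the integer encoding of a token is its position in the tuple.
token_to_idx = {token: i for i, token in enumerate(AF3_TOKENS)}
idx_to_token = np.array(AF3_TOKENS)


def encode(res_names, unknown="UNK"):
    """Hypothetical helper: map residue names to indices, unknowns to `unknown`."""
    fallback = token_to_idx[unknown]
    return np.array([token_to_idx.get(name, fallback) for name in res_names])


indices = encode(["MET", "GLY", "DA", "U", "XYZ"])
decoded = idx_to_token[indices]  # numpy indexing decodes the whole array at once
print(indices, decoded)
```

A real pipeline would presumably pick the unknown token by molecule type (UNK, N, or DN) rather than always using 'UNK'.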
- atomworks.ml.encoding_definitions.NA_ATOM37_ENCODING = Encoding(n_tokens=10, n_atoms_per_token=37)
Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  | 32  | 33  | 34  | 35  | 36
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0 : DA  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  | N6  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 1 : DC  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  | N4  |     |     |     |     |
 2 : DG  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  |     | N2  | O6  |     |     |     |     |     |     |     |     |     |     |     |     |
 3 : DT  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  |     | O4  | C7  |     |     |
 4 : DN  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 5 : A   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  | N6  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 6 : C   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  | N4  |     |     |     |     |
 7 : G   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  |     | N2  | O6  |     |     |     |     |     |     |     |     |     |     |     |     |
 8 : U   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  |     | O4  |     |     |     |
 9 : N   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
Nucleic acid atom37-like encoding for DNA and RNA.
Provides a unified 37-slot encoding for both DNA and RNA nucleotides, analogous to the protein atom37 encoding. Key features:
- Slot 0: P (phosphate backbone)
- Slot 1: C1' (C1 prime; anomeric carbon, analogous to CA in proteins)
- Slot 3: O2' (O2 prime; present in RNA, empty in DNA)
- Slots 12-23: Purine base atoms (A, G, DA, DG)
- Slots 24-33: Pyrimidine base atoms (C, U, T, DC, DT)
- No hydrogens included (heavy atoms only)
This encoding ensures that structurally equivalent atoms across different nucleotides occupy the same slot, while maintaining unique positions for purine vs pyrimidine atoms that have different structural roles despite sharing atom names.
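The slot-sharing rule described above can be checked with a small sketch. The three rows below are copied from the NA_ATOM37_ENCODING table, with an empty string standing in for an empty slot; the helpers slot_of and shared_slots are hypothetical names introduced only for this illustration.

```python
# Rows copied from the NA_ATOM37_ENCODING table above; "" marks an empty slot.
BACKBONE_RNA = ["P", "C1'", "C2'", "O2'", "C3'", "O3'", "C4'", "O4'",
                "C5'", "O5'", "OP1", "OP2"]                       # slots 0-11
BACKBONE_DNA = ["" if atom == "O2'" else atom for atom in BACKBONE_RNA]
PURINE_A = ["N9", "C8", "N7", "C5", "C4", "N3", "C2", "N1", "C6", "N6",
            "", ""]                                                # slots 12-23
PYRIMIDINE_C = ["N1", "C2", "O2", "N3", "C4", "C5", "C6", "N4",
                "", ""]                                            # slots 24-33

ROWS = {
    "A": BACKBONE_RNA + PURINE_A + [""] * 13,
    "C": BACKBONE_RNA + [""] * 12 + PYRIMIDINE_C + [""] * 3,
    "DA": BACKBONE_DNA + PURINE_A + [""] * 13,
}


def slot_of(token, atom_name):
    """Hypothetical helper: slot index of an atom within a token's row."""
    return ROWS[token].index(atom_name)


def shared_slots(token_a, token_b):
    """Hypothetical helper: slots where both tokens hold the same atom name."""
    pairs = zip(ROWS[token_a], ROWS[token_b])
    return [i for i, (a, b) in enumerate(pairs) if a and a == b]


# N1 exists in both purines and pyrimidines but occupies different slots.
print(slot_of("A", "N1"), slot_of("C", "N1"))
# A and C only share the sugar-phosphate backbone slots.
print(shared_slots("A", "C"))
```

Comparing "A" with "DA" instead shows every slot shared except slot 3, which holds O2' in RNA and stays empty in DNA.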
- atomworks.ml.encoding_definitions.RF2AA_ATOM36_ENCODING = Encoding(n_tokens=80, n_atoms_per_token=36) Token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 : ALA | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 3HB | | | | | | | | 1 : ARG | N | CA | C | O | CB | CG | CD | NE | CZ | NH1 | NH2 | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HD | 2HD | HE | 1HH1 | 2HH1 | 1HH2 | 2HH2 2 : ASN | N | CA | C | O | CB | CG | OD1 | ND2 | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HD2 | 2HD2 | | | | | | | 3 : ASP | N | CA | C | O | CB | CG | OD1 | OD2 | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | | | | | | | | | 4 : CYS | N | CA | C | O | CB | SG | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | HG | | | | | | | | 5 : GLN | N | CA | C | O | CB | CG | CD | OE1 | NE2 | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HE2 | 2HE2 | | | | | 6 : GLU | N | CA | C | O | CB | CG | CD | OE1 | OE2 | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | | | | | | | 7 : GLY | N | CA | C | O | | | | | | | | | | | | | | | | | | | | H | 1HA | 2HA | | | | | | | | | | 8 : HIS | N | CA | C | O | CB | CG | ND1 | CD2 | CE1 | NE2 | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 2HD | 1HE | 2HE | | | | | | 9 : ILE | N | CA | C | O | CB | CG1 | CG2 | CD1 | | | | | | | | | | | | | | | | H | HA | HB | 1HG2 | 2HG2 | 3HG2 | 1HG1 | 2HG1 | 1HD1 | 2HD1 | 3HD1 | | 10 : LEU | N | CA | C | O | CB | CG | CD1 | CD2 | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | HG | 1HD1 | 2HD1 | 3HD1 | 1HD2 | 2HD2 | 3HD2 | | 11 : LYS | N | CA | C | O | 
CB | CG | CD | CE | NZ | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HD | 2HD | 1HE | 2HE | 1HZ | 2HZ | 3HZ 12 : MET | N | CA | C | O | CB | CG | SD | CE | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HE | 2HE | 3HE | | | | 13 : PHE | N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HD | 2HD | 1HE | 2HE | HZ | | | | 14 : PRO | N | CA | C | O | CB | CG | CD | | | | | | | | | | | | | | | | | HA | 1HB | 2HB | 1HG | 2HG | 1HD | 2HD | | | | | | 15 : SER | N | CA | C | O | CB | OG | | | | | | | | | | | | | | | | | | H | HG | HA | 1HB | 2HB | | | | | | | | 16 : THR | N | CA | C | O | CB | OG1 | CG2 | | | | | | | | | | | | | | | | | H | HG1 | HA | HB | 1HG2 | 2HG2 | 3HG2 | | | | | | 17 : TRP | N | CA | C | O | CB | CG | CD1 | CD2 | CE2 | CE3 | NE1 | CZ2 | CZ3 | CH2 | | | | | | | | | | H | HA | 1HB | 2HB | 1HD | 1HE | HZ2 | HH2 | HZ3 | HE3 | | | 18 : TYR | N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | OH | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HD | 1HE | 2HE | 2HD | HH | | | | 19 : VAL | N | CA | C | O | CB | CG1 | CG2 | | | | | | | | | | | | | | | | | H | HA | HB | 1HG1 | 2HG1 | 3HG1 | 1HG2 | 2HG2 | 3HG2 | | | | 20 : UNK | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 3HB | | | | | | | | 21 : <M> | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 3HB | | | | | | | | 22 : DA | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9 | C4 | N3 | C2 | N1 | C6 | C5 | N7 | C8 | N6 | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | H2 | H61 | H62 | H8 | | 23 : DC | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1 | C2 | O2 | N3 | C4 | N4 | C5 | C6 | | | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | H42 | H41 | H5 | H6 | | 24 : DG | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9 | C4 | N3 | C2 | N1 | C6 | C5 | N7 | C8 | N2 | O6 | | H5'' | H5' | H4' | H3' | H2'' | 
H2' | H1' | H1 | H22 | H21 | H8 | | 25 : DT | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1 | C2 | O2 | N3 | C4 | O4 | C5 | C7 | C6 | | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | H3 | H71 | H72 | H73 | H6 | 26 : DN | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | | | | | | | | | | | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | | | | | | 27 : A | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | N3 | C4 | C5 | C6 | N6 | N7 | C8 | N9 | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H2 | H61 | H62 | H8 | | 28 : C | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | O2 | N3 | C4 | N4 | C5 | C6 | | | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H42 | H41 | H5 | H6 | | 29 : G | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | N2 | N3 | C4 | C5 | C6 | O6 | N7 | C8 | N9 | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H1 | H22 | H21 | H8 | | 30 : U | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | O2 | N3 | C4 | O4 | C5 | C6 | | | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H3 | H5 | H6 | | | 31 : N | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | | | | | | | | | | | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | | | | | | 32 : HIS_D | N | CA | C | O | CB | CG | NE2 | CD2 | CE1 | ND1 | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 2HD | 1HE | 1HD | | | | | | 33 : 13 | | 13 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 34 : 33 | | 33 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 35 : 79 | | 79 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 36 : 5 | | 5 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 37 : 4 | | 4 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 38 : 35 | | 35 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 39 : 6 | | 6 | | | | | | | | | | | | | | | | 
| | | | | | | | | | | | | | | | | | 40 : 20 | | 20 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 41 : 17 | | 17 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 42 : 27 | | 27 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 43 : 24 | | 24 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 44 : 29 | | 29 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 45 : 9 | | 9 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 46 : 26 | | 26 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 47 : 80 | | 80 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 48 : 53 | | 53 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 49 : 77 | | 77 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 50 : 19 | | 19 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 51 : 3 | | 3 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 52 : 12 | | 12 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 53 : 25 | | 25 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 54 : 42 | | 42 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 55 : 7 | | 7 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 56 : 28 | | 28 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 57 : 8 | | 8 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 58 : 76 | | 76 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 59 : 15 | | 15 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 60 : 82 | | 82 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 61 : 46 | | 46 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 62 : 59 | | 59 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 63 : 78 | | 78 | | | | | | | | | | | | | | | | | | | | | | | | 
| | | | | | | | | | 64 : 75 | | 75 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 65 : 45 | | 45 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 66 : 44 | | 44 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 67 : 16 | | 16 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 68 : 51 | | 51 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 69 : 34 | | 34 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 70 : 14 | | 14 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 71 : 50 | | 50 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 72 : 65 | | 65 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 73 : 52 | | 52 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 74 : 92 | | 92 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 75 : 74 | | 74 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 76 : 23 | | 23 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 77 : 39 | | 39 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 78 : 30 | | 30 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 79 : 0 | | 0 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #
RF2AA all-atom encoding for proteins, nucleic acids, and various other elements.
- Encodes heavy atoms and hydrogens (max 36 in total)
- Includes 3 unknown tokens: UNK for proteins, DN for DNA, N for RNA
- Covers:
  - 20 amino acids (+ unknown, + mask)
  - 4 DNA bases (+ unknown)
  - 4 RNA bases (+ unknown)
  - 1 outdated histidine token, HIS_D
  - 46 atom tokens (+ unknown)
- atomworks.ml.encoding_definitions.RF2AA_STANDARDIZED_TOKENS = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL', 'UNK', '<M>', 'DA', 'DC', 'DG', 'DT', 'DN', 'A', 'C', 'G', 'U', 'N', 'HIS_D', 13, 33, 79, 5, 4, 35, 6, 20, 17, 27, 24, 29, 9, 26, 80, 53, 77, 19, 3, 12, 25, 42, 7, 28, 8, 76, 15, 82, 46, 59, 78, 75, 45, 44, 16, 51, 34, 14, 50, 65, 52, 92, 74, 23, 39, 30, 0]
List of standardized tokens in RF2AA.
- atomworks.ml.encoding_definitions.RF2AA_TOKEN_TO_STANDARD_TOKEN = {' DA': 'DA', ' DC': 'DC', ' DG': 'DG', ' DT': 'DT', ' DX': 'DN', ' RA': 'A', ' RC': 'C', ' RG': 'G', ' RU': 'U', ' RX': 'N', 'ALA': 'ALA', 'ARG': 'ARG', 'ASN': 'ASN', 'ASP': 'ASP', 'ATM': 0, 'Al': 13, 'As': 33, 'Au': 79, 'B': 5, 'Be': 4, 'Br': 35, 'C': 6, 'CYS': 'CYS', 'Ca': 20, 'Cl': 17, 'Co': 27, 'Cr': 24, 'Cu': 29, 'F': 9, 'Fe': 26, 'GLN': 'GLN', 'GLU': 'GLU', 'GLY': 'GLY', 'HIS': 'HIS', 'HIS_D': 'HIS_D', 'Hg': 80, 'I': 53, 'ILE': 'ILE', 'Ir': 77, 'K': 19, 'LEU': 'LEU', 'LYS': 'LYS', 'Li': 3, 'MAS': '<M>', 'MET': 'MET', 'Mg': 12, 'Mn': 25, 'Mo': 42, 'N': 7, 'Ni': 28, 'O': 8, 'Os': 76, 'P': 15, 'PHE': 'PHE', 'PRO': 'PRO', 'Pb': 82, 'Pd': 46, 'Pr': 59, 'Pt': 78, 'Re': 75, 'Rh': 45, 'Ru': 44, 'S': 16, 'SER': 'SER', 'Sb': 51, 'Se': 34, 'Si': 14, 'Sn': 50, 'THR': 'THR', 'TRP': 'TRP', 'TYR': 'TYR', 'Tb': 65, 'Te': 52, 'U': 92, 'UNK': 'UNK', 'V': 23, 'VAL': 'VAL', 'W': 74, 'Y': 39, 'Zn': 30}
Dictionary mapping RF2AA token names to standardized token names.
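The two RF2AA entries above can be chained to turn an RF2AA token name into an integer index. The sketch below copies a subset of the dictionary and the full standardized token list from this page. It assumes that a token's index is its position in that list, which matches the row numbers in the RF2AA_ATOM36_ENCODING table; the helper rf2aa_name_to_index is a hypothetical name, not an atomworks function.

```python
# Subset of RF2AA_TOKEN_TO_STANDARD_TOKEN, copied from the entry above.
RF2AA_TOKEN_TO_STANDARD_TOKEN = {
    " DA": "DA", " RC": "C", " RX": "N", "ALA": "ALA", "MAS": "<M>",
    "C": 6, "N": 7, "Zn": 30, "ATM": 0,
}

# RF2AA_STANDARDIZED_TOKENS, copied from the entry above.
RF2AA_STANDARDIZED_TOKENS = [
    'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE',
    'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL',
    'UNK', '<M>', 'DA', 'DC', 'DG', 'DT', 'DN', 'A', 'C', 'G', 'U', 'N',
    'HIS_D', 13, 33, 79, 5, 4, 35, 6, 20, 17, 27, 24, 29, 9, 26, 80, 53, 77,
    19, 3, 12, 25, 42, 7, 28, 8, 76, 15, 82, 46, 59, 78, 75, 45, 44, 16, 51,
    34, 14, 50, 65, 52, 92, 74, 23, 39, 30, 0,
]

# Assumption: the integer encoding of a token is its position in the list.
token_to_idx = {token: i for i, token in enumerate(RF2AA_STANDARDIZED_TOKENS)}


def rf2aa_name_to_index(name):
    """Hypothetical helper: RF2AA name -> standardized token -> integer index."""
    return token_to_idx[RF2AA_TOKEN_TO_STANDARD_TOKEN[name]]


for name in [" RC", "C", " RX", "N", "Zn", "ATM"]:
    print(repr(name), "->", rf2aa_name_to_index(name))
```

Element tokens are integers (atomic numbers) while residue tokens are strings, so the carbon element 'C' (token 6) and the RNA cytidine ' RC' (token 'C') never collide after standardization; the same holds for nitrogen 'N' (token 7) and the unknown RNA token 'N'.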
- atomworks.ml.encoding_definitions.RF2_ATOM14_ENCODING = Encoding(n_tokens=22, n_atoms_per_token=14)
Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13
---------------------------------------------------------------------------------------------
 0 : ALA | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
 1 : ARG | N   | CA  | C   | O   | CB  | CG  | CD  | NE  | CZ  | NH1 | NH2 |     |     |
 2 : ASN | N   | CA  | C   | O   | CB  | CG  | OD1 | ND2 |     |     |     |     |     |
 3 : ASP | N   | CA  | C   | O   | CB  | CG  | OD1 | OD2 |     |     |     |     |     |
 4 : CYS | N   | CA  | C   | O   | CB  | SG  |     |     |     |     |     |     |     |
 5 : GLN | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | NE2 |     |     |     |     |
 6 : GLU | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | OE2 |     |     |     |     |
 7 : GLY | N   | CA  | C   | O   |     |     |     |     |     |     |     |     |     |
 8 : HIS | N   | CA  | C   | O   | CB  | CG  | ND1 | CD2 | CE1 | NE2 |     |     |     |
 9 : ILE | N   | CA  | C   | O   | CB  | CG1 | CG2 | CD1 |     |     |     |     |     |
10 : LEU | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 |     |     |     |     |     |
11 : LYS | N   | CA  | C   | O   | CB  | CG  | CD  | CE  | NZ  |     |     |     |     |
12 : MET | N   | CA  | C   | O   | CB  | CG  | SD  | CE  |     |     |     |     |     |
13 : PHE | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  |     |     |
14 : PRO | N   | CA  | C   | O   | CB  | CG  | CD  |     |     |     |     |     |     |
15 : SER | N   | CA  | C   | O   | CB  | OG  |     |     |     |     |     |     |     |
16 : THR | N   | CA  | C   | O   | CB  | OG1 | CG2 |     |     |     |     |     |     |
17 : TRP | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE2 | CE3 | NE1 | CZ2 | CZ3 | CH2
18 : TYR | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  | OH  |     |
19 : VAL | N   | CA  | C   | O   | CB  | CG1 | CG2 |     |     |     |     |     |     |
20 : UNK | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
21 : <M> | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
RF2 atom14 encoding for proteins.
- Encodes only the heavy atoms (max 14, for TRP)
- Includes 1 unknown token: UNK
Print it out to see a visual representation of the encoding.
- atomworks.ml.encoding_definitions.RF2_ATOM23_ENCODING = Encoding(n_tokens=32, n_atoms_per_token=23)
Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22
---------------------------------------------------------------------------------------------------------------------------------------------------
 0 : ALA | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 1 : ARG | N   | CA  | C   | O   | CB  | CG  | CD  | NE  | CZ  | NH1 | NH2 |     |     |     |     |     |     |     |     |     |     |     |
 2 : ASN | N   | CA  | C   | O   | CB  | CG  | OD1 | ND2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 3 : ASP | N   | CA  | C   | O   | CB  | CG  | OD1 | OD2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 4 : CYS | N   | CA  | C   | O   | CB  | SG  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 5 : GLN | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | NE2 |     |     |     |     |     |     |     |     |     |     |     |     |     |
 6 : GLU | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | OE2 |     |     |     |     |     |     |     |     |     |     |     |     |     |
 7 : GLY | N   | CA  | C   | O   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 8 : HIS | N   | CA  | C   | O   | CB  | CG  | ND1 | CD2 | CE1 | NE2 |     |     |     |     |     |     |     |     |     |     |     |     |
 9 : ILE | N   | CA  | C   | O   | CB  | CG1 | CG2 | CD1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
10 : LEU | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
11 : LYS | N   | CA  | C   | O   | CB  | CG  | CD  | CE  | NZ  |     |     |     |     |     |     |     |     |     |     |     |     |     |
12 : MET | N   | CA  | C   | O   | CB  | CG  | SD  | CE  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
13 : PHE | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  |     |     |     |     |     |     |     |     |     |     |     |
14 : PRO | N   | CA  | C   | O   | CB  | CG  | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
15 : SER | N   | CA  | C   | O   | CB  | OG  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
16 : THR | N   | CA  | C   | O   | CB  | OG1 | CG2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
17 : TRP | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE2 | CE3 | NE1 | CZ2 | CZ3 | CH2 |     |     |     |     |     |     |     |     |
18 : TYR | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  | OH  |     |     |     |     |     |     |     |     |     |     |
19 : VAL | N   | CA  | C   | O   | CB  | CG1 | CG2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
20 : UNK | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
21 : <M> | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
22 : DA  | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9  | C4  | N3  | C2  | N1  | C6  | C5  | N7  | C8  | N6  |     |
23 : DC  | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1  | C2  | O2  | N3  | C4  | N4  | C5  | C6  |     |     |     |
24 : DG  | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9  | C4  | N3  | C2  | N1  | C6  | C5  | N7  | C8  | N2  | O6  |
25 : DT  | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1  | C2  | O2  | N3  | C4  | O4  | C5  | C7  | C6  |     |     |
26 : DN  | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' |     |     |     |     |     |     |     |     |     |     |     |
27 : A   | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1  | C2  | N3  | C4  | C5  | C6  | N6  | N7  | C8  | N9  |
28 : C   | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1  | C2  | O2  | N3  | C4  | N4  | C5  | C6  |     |     |
29 : G   | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1  | C2  | N2  | N3  | C4  | C5  | C6  | O6  | N7  | C8  | N9
30 : U   | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1  | C2  | O2  | N3  | C4  | O4  | C5  | C6  |     |     |
31 : N   | OP1 | P   | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' |     |     |     |     |     |     |     |     |     |     |
RF2 atom23 encoding for proteins and nucleic acids.
- Encodes only the heavy atoms (max 23, for G, which RF2AA names RG)
- Includes 3 unknown tokens: UNK for proteins, DN for DNA, N for RNA
Print it out to see a visual representation of the encoding.
- class atomworks.ml.encoding_definitions.TokenEncoding(token_atoms: dict[str | int, ndarray], chemcomp_type_to_unknown: dict[str, str] = None)
Bases: object
A class to represent a fixed-length token encoding.
- Parameters:
  - token_atoms – A dictionary mapping token names to atom names. The order of the tokens in the dictionary determines the integer encoding of the token. The order of the atom names in the array determines the integer encoding of the atom name within the token.
  - chemcomp_type_to_unknown – A dictionary mapping chemical component types to unknown token names. This is used to map unknown residues to the respective unknown token. Different chemical component types may map to different unknown token names. Defaults to {}, meaning that no unknown tokens are defined, leading to a KeyError if an unknown residue is encountered.
Note
We follow these conventions for tokens to make them compatible with the CCD for robust and easy tokenization. If you want to use the Transforms written for automatically tokenizing and encoding, you need to follow these conventions:
- When encoding a residue, we use the standardized (up to) 3-letter residue name from the CCD, e.g. 'ALA' for Alanine, 'DA' for Deoxyadenosine, or 'U' for Uracil.
- When encoding unknown tokens, we may define different unknown tokens for different chemical components (e.g. a different unknown for proteins vs. DNA, …). The unknown tokens can take on any arbitrary 3-letter code that we want to map to, but they should not clash with existing residue names in the CCD.
- When encoding an atom, we use the atomic number of the element as a string as the token name, e.g. '1' for Hydrogen, '6' for Carbon, '9' for Fluorine, … For unknown atoms, we use '0' as the token name. TODO: Deal with ligand names such as '100', which is also an atomic number.
- To denote masked tokens, we use a '<...>' syntax, e.g. '<M>' for a generic mask token, or '<MP>' for a mask token for proteins. The ... can be any arbitrary string. We use the angle brackets to avoid clashes with existing residue names in the CCD.
- property atom_to_idx: dict[tuple[str | int, str], int]
For encoding atoms (token, atom) to atom indices. (token, atom) -> atom_idx
- chemcomp_type_to_unknown: dict[str, str] = None
- property idx_to_atom: ndarray
For rapid decoding of token & atom indices to atom names via numpy indexing.
- property idx_to_element: ndarray
For rapid decoding of token & atom indices to elements via numpy indexing.
- property idx_to_token: ndarray
For rapid decoding of token indices to token names via numpy indexing.
- property n_atoms_per_token: int
- property n_tokens: int
- token_atoms: dict[str | int, ndarray]
- property token_to_idx: dict[str, int]
For encoding token names to token indices. (token) -> token_idx
- property tokens: ndarray
- property unknown_tokens: ndarray
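The conventions and properties documented for TokenEncoding can be sketched with a small stand-in class. MiniTokenEncoding below is hypothetical: it mirrors a few of the documented property names but is not the atomworks implementation. The resolve method, the padding of unused slots with empty strings, and the 'L-PEPTIDE LINKING' key used for the chemical component type are all assumptions made for illustration.

```python
import numpy as np


class MiniTokenEncoding:
    """Hypothetical stand-in mirroring a few documented TokenEncoding properties."""

    def __init__(self, token_atoms, chemcomp_type_to_unknown=None):
        self.token_atoms = token_atoms
        self.chemcomp_type_to_unknown = chemcomp_type_to_unknown or {}

    @property
    def n_tokens(self):
        return len(self.token_atoms)

    @property
    def n_atoms_per_token(self):
        return max(len(atoms) for atoms in self.token_atoms.values())

    @property
    def token_to_idx(self):
        # Dictionary order determines the integer encoding of each token.
        return {token: i for i, token in enumerate(self.token_atoms)}

    @property
    def idx_to_atom(self):
        # (n_tokens, n_atoms_per_token) array; "" pads unused slots (assumption).
        table = np.full((self.n_tokens, self.n_atoms_per_token), "", dtype="<U4")
        for i, atoms in enumerate(self.token_atoms.values()):
            table[i, : len(atoms)] = atoms
        return table

    @property
    def atom_to_idx(self):
        # (token, atom) -> position of the atom within the token.
        return {
            (token, atom): j
            for token, atoms in self.token_atoms.items()
            for j, atom in enumerate(atoms)
        }

    def resolve(self, res_name, chemcomp_type):
        """Return res_name if it is a token, else the unknown token for its type."""
        if res_name in self.token_atoms:
            return res_name
        return self.chemcomp_type_to_unknown[chemcomp_type]  # KeyError if undefined


enc = MiniTokenEncoding(
    {
        "ALA": ["N", "CA", "C", "O", "CB"],
        "GLY": ["N", "CA", "C", "O"],
        "UNK": ["N", "CA", "C", "O", "CB"],
    },
    chemcomp_type_to_unknown={"L-PEPTIDE LINKING": "UNK"},  # key format assumed
)

# Decode (token index, atom index) pairs in one vectorized lookup.
token_idx = np.array([0, 1])
atom_idx = np.array([4, 1])
names = enc.idx_to_atom[token_idx, atom_idx]
print(names)
```

Storing idx_to_atom as a dense 2D array is what makes decoding a single numpy indexing operation, at the cost of padding shorter tokens up to n_atoms_per_token.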
- atomworks.ml.encoding_definitions.UNIFIED_ATOM37_ENCODING = Encoding(n_tokens=33, n_atoms_per_token=37)
Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  | 32  | 33  | 34  | 35  | 36
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0 : <M> |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
 1 : ALA | N   | CA  | C   | CB  | O   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 2 : ARG | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     | NE  |     |     |     |     |     | NH1 | NH2 |     | CZ  |     |     |     | OXT
 3 : ASN | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     |     |     | ND2 | OD1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 4 : ASP | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     |     |     |     | OD1 | OD2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 5 : CYS | N   | CA  | C   | CB  | O   |     |     |     |     |     | SG  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 6 : GLN | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     | NE2 | OE1 |     |     |     |     |     |     |     |     |     | OXT
 7 : GLU | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OE1 | OE2 |     |     |     |     |     |     |     |     | OXT
 8 : GLY | N   | CA  | C   |     | O   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
 9 : HIS | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     | CD2 | ND1 |     |     |     |     |     | CE1 |     |     |     |     | NE2 |     |     |     |     |     |     |     |     |     |     | OXT
10 : ILE | N   | CA  | C   | CB  | O   |     | CG1 | CG2 |     |     |     |     | CD1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
11 : LEU | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
12 : LYS | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     | CE  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | NZ  | OXT
13 : MET | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     |     |     |     |     |     |     | SD  | CE  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
14 : PHE | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     | CE1 | CE2 |     |     |     |     |     |     |     |     |     |     | CZ  |     |     |     | OXT
15 : PRO | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     | CD  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
16 : SER | N   | CA  | C   | CB  | O   |     |     |     | OG  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
17 : THR | N   | CA  | C   | CB  | O   |     |     | CG2 |     | OG1 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
18 : TRP | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     |     | CE2 | CE3 |     | NE1 |     |     |     | CH2 |     |     |     |     | CZ2 | CZ3 |     | OXT
19 : TYR | N   | CA  | C   | CB  | O   | CG  |     |     |     |     |     |     | CD1 | CD2 |     |     |     |     |     |     | CE1 | CE2 |     |     |     |     |     |     |     |     |     | OH  | CZ  |     |     |     | OXT
20 : VAL | N   | CA  | C   | CB  | O   |     | CG1 | CG2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     | OXT
21 : UNK |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
22 : A   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  | N6  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
23 : C   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  | N4  |     |     |     |     |
24 : G   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  |     | N2  | O6  |     |     |     |     |     |     |     |     |     |     |     |     |
25 : U   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  |     | O4  |     |     |     |
26 : N   | P   | C1' | C2' | O2' | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
27 : DA  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  | N6  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
28 : DC  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  | N4  |     |     |     |     |
29 : DG  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 | N9  | C8  | N7  | C5  | C4  | N3  | C2  | N1  | C6  |     | N2  | O6  |     |     |     |     |     |     |     |     |     |     |     |     |
30 : DT  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     | N1  | C2  | O2  | N3  | C4  | C5  | C6  |     | O4  | C7  |     |     |
31 : DN  | P   | C1' | C2' |     | C3' | O3' | C4' | O4' | C5' | O5' | OP1 | OP2 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
32 : <A> |     | X   |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
Unified atom37 encoding for all token types in ConditionalResidueTypeSeqFeat.
Provides a comprehensive 37-slot encoding that encompasses:
- Class 0: MASK token, <M> (special masking token)
- Classes 1-20: Standard amino acids (ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR, VAL)
- Class 21: UNK (unknown amino acid)
- Classes 22-25: RNA nucleotides (A, C, G, U)
- Class 26: N (unknown RNA)
- Classes 27-30: DNA nucleotides (DA, DC, DG, DT)
- Class 31: DN (unknown DNA)
- Class 32: ATOMIZED, <A> (atomized small molecule token)
This encoding is compatible with the conditional residue type feature used in protein foundation models, enabling unified handling of proteins, RNA, DNA, and small molecules in a single representation space.
- Usage:
UNIFIED_ATOM37_ENCODING serves as the single source of truth for:
- Atom37 layout operations (coordinate processing):
  - atom_array_to_encoding() / atom_array_from_encoding()
  - Converting between AtomArray and atom37 coordinate tensors
- Sequence encoding operations (residue type indices):
  - Use UNIFIED_ATOM37_ENCODING.token_to_idx to encode residue names
  - Use UNIFIED_ATOM37_ENCODING.idx_to_token to decode indices
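The sequence-encoding usage above can be sketched without importing atomworks by rebuilding the two mappings from the documented class order (classes 0-32). The names UNIFIED_TOKENS and token_class below are hypothetical stand-ins; with atomworks installed, UNIFIED_ATOM37_ENCODING.token_to_idx and UNIFIED_ATOM37_ENCODING.idx_to_token would presumably play the role of the two local mappings.

```python
import numpy as np

# Token order taken from the class listing above (classes 0-32).
UNIFIED_TOKENS = (
    ["<M>"]
    + ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
       "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
    + ["UNK"]
    + ["A", "C", "G", "U", "N"]
    + ["DA", "DC", "DG", "DT", "DN"]
    + ["<A>"]
)

# Local stand-ins for the documented token_to_idx / idx_to_token properties.
token_to_idx = {token: i for i, token in enumerate(UNIFIED_TOKENS)}
idx_to_token = np.array(UNIFIED_TOKENS)


def token_class(idx):
    """Hypothetical helper: name the documented class range an index falls in."""
    if idx == 0:
        return "mask"
    if 1 <= idx <= 21:
        return "protein"
    if 22 <= idx <= 26:
        return "rna"
    if 27 <= idx <= 31:
        return "dna"
    if idx == 32:
        return "atomized"
    raise ValueError(f"index {idx} is outside the 33 documented classes")


res_names = ["GLY", "U", "DT", "<A>", "<M>"]
indices = np.array([token_to_idx[name] for name in res_names])
classes = [token_class(int(i)) for i in indices]
decoded = idx_to_token[indices]
print(indices, classes, decoded)
```

Because RNA, DNA, mask, and atomized tokens each occupy a contiguous index range, the molecule type can be recovered from the residue type index alone.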
- atomworks.ml.encoding_definitions.UNKNOWN_ELEMENT_TOKEN = 0
The token to use for an unknown element.