Data Mirrors#

AtomWorks uses local mirrors of the PDB and CCD databases for parsing structures and training models. This page explains how to set up these mirrors.

Setting Up Mirrors#

PDB Mirror (~100 GB)#

The PDB mirror contains mmCIF structure files. We use mmCIF as the recommended format.

# Download the entire PDB
atomworks pdb sync /path/to/pdb_mirror

# Or download specific PDB IDs only
atomworks pdb sync /path/to/pdb_mirror --pdb-id 1A0I --pdb-id 7XYZ

# Or from a file of IDs (one per line)
atomworks pdb sync /path/to/pdb_mirror --pdb-ids-file /path/to/ids.txt

This creates a carbon-copy of the PDB with the same sharding pattern as RCSB (e.g., 1a2b/path/to/pdb_mirror/a2/1a2b.cif.gz).

CCD Mirror (~2 GB)#

The CCD (Chemical Component Dictionary) mirror contains ligand definitions used for parsing non-polymer entities.

atomworks ccd sync /path/to/ccd_mirror

If no CCD mirror is provided, AtomWorks falls back to Biotite’s internal CCD. You can also add custom ligand definitions by placing CIF files in the mirror following the CCD pattern (e.g., /path/to/ccd_mirror/M/MYLIGAND/MYLIGAND.cif).

Configuring Environment Variables#

Once mirrors are created, configure the paths in a .env file in your repository root:

PDB_MIRROR_PATH=/path/to/pdb_mirror
CCD_MIRROR_PATH=/path/to/ccd_mirror

You can copy .env.sample as a starting point. See Installation for more details on environment setup.


Training on the PDB#

Step 1 — Mirror the PDB (mmCIFs)#

To train on the PDB, you need access to the structure files. Follow the instructions above to set up your PDB mirror.

Step 2 — Get PDB metadata (PN units and interfaces)#

To calculate sampling probabilities and filter examples for splits, we pre-process the PDB with metadata for each PDB entry. To save you the work, we provide pre-computed metadata (dated July 15/2025) for downloading:

atomworks setup metadata /path/to/metadata  # This will download the metadata (as .tar.gz) and extract it to the specified directory.

This produces parquet files at:

  • /path/to/metadata/pn_units_df.parquet — Contains metadata for each PN unit in the PDB. The term pn unit is shorthand for polymer XOR non-polymer unit and behaves for almost all purposes like the chain in a PDB file. The only difference is that a ligand composed of multiple covalently bonded ligands is considered a single PN unit (whilst it would be multiple chains in a PDB file). Effectively this .parquet is a large table of all individual chains, ligands, etc (to be precise, it has one entry per pn unit) in the PDB that includes helpful metadata for filtering and sampling.

  • /path/to/metadata/interfaces_df.parquet — Contains metadata for each interface in the PDB. This .parquet is a large table of all binary interfaces in the PDB. It lists each interface as (pn_unit_1, pn_unit_2) pairs and includes helpful metadata for filtering and sampling.

    Alternatively, you can generate fresher metadata yourself (scripts will be uploaded in the coming weeks).

Step 3 — Configure an AF3-style dataset (example: train only on D-polypeptides)#

Next we need to use the metadata to configure a dataset that we would like to sample from. This includes e.g. training cut-off, filters, transforms to apply, etc. Here’s a simple example that:

  • Filters to D-polypeptide and L-polypeptide chains only (POLYPEPTIDE_D and POLYPEPTIDE_L – to include additional chain types, replace the lists with the appropriate IDs (see [mapping](./src/atomworks/enums.py#L31-L45) in comments).

  • Excludes ligands in the AF3 list of excluded ligands, available at [atomworks.io.constants.AF3_EXCLUDED_LIGANDS_REGEX](./src/atomworks/constants.py#L350).

# NOTE: The below is a hydra config and the _target_ fields are the hydra syntax for instantiating a class.
#  You can use this without hyrda, but will then instead need to provide the corresponding arguments for the
#  _target_ objects directly.

 Chain type ids used below (from atomworks.enums.ChainType):
# 0=CyclicPseudoPeptide, 1=OtherPolymer, 2=PeptideNucleicAcid,
# 3=DNA, 4=DNA_RNA_HYBRID, 5=POLYPEPTIDE_D, 6=POLYPEPTIDE_L, 7=RNA,
# 8=NON_POLYMER, 9=WATER, 10=BRANCHED, 11=MACROLIDE

af3_pdb_dataset:
  _target_: atomworks.ml.datasets.datasets.ConcatDatasetWithID
  datasets:
    # Single PN units
    - _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
      dataset_parser:
        _target_: atomworks.ml.datasets.parsers.PNUnitsDFParser
      transform:
        _target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
        is_inference: false
        n_recycles: 5  # This means that we will subsample 5 random sets from the MSA for each example.
        crop_size: 256
        crop_contiguous_probability: 0.3333333333333333
        crop_spatial_probability: 0.6666666666666666
        diffusion_batch_size: 32
        # Optional templates (if available)
        template_lookup_path: ${paths.shared}/template_lookup.csv
        template_base_dir: ${paths.shared}/template
        # Optional MSAs (see Step 4)
        # protein_msa_dirs:
        #   - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
        # rna_msa_dirs:
        #   - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
      dataset:
        _target_: atomworks.ml.datasets.datasets.PandasDataset
        name: pn_units
        id_column: example_id
        data: /path/to/metadata/pn_units_df.parquet
        filters:
          - "deposition_date < '2022-01-01'"
          - "resolution < 5.0 and ~method.str.contains('NMR')"
          - "num_polymer_pn_units <= 20"
          - "cluster.notnull()"
          - "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
          # Train only on D-polypeptides:
          - "q_pn_unit_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
          # Exclude ligands from AF3 excluded set:
          - "~(q_pn_unit_non_polymer_res_names.notnull() and q_pn_unit_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
        columns_to_load: null
      save_failed_examples_to_dir: null

    # Binary interfaces
    - _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
      dataset_parser:
        _target_: atomworks.ml.datasets.parsers.InterfacesDFParser
      transform:
        _target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
        is_inference: false
        n_recycles: 5
        crop_size: 256
        crop_spatial_probability: 1.0
        crop_contiguous_probability: 0.0
        diffusion_batch_size: 32
        template_lookup_path: ${paths.shared}/template_lookup.csv
        template_base_dir: ${paths.shared}/template
        # Optional MSAs (see Step 4)
        # protein_msa_dirs:
        #   - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
        # rna_msa_dirs:
        #   - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
      dataset:
        _target_: atomworks.ml.datasets.datasets.PandasDataset
        name: interfaces
        id_column: example_id
        data: /path/to/metadata/interfaces_df.parquet
        filters:
          - "deposition_date < '2022-01-01'"
          - "resolution < 5.0 and ~method.str.contains('NMR')"
          - "num_polymer_pn_units <= 20"
          - "cluster.notnull()"
          - "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
          # Train only on D-polypeptide interfaces:
          - "pn_unit_1_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
          - "pn_unit_2_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
          - "~(pn_unit_1_non_polymer_res_names.notnull() and pn_unit_1_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
          - "~(pn_unit_2_non_polymer_res_names.notnull() and pn_unit_2_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
        columns_to_load: null
      cif_parser_args:
        cache_dir: null
      save_failed_examples_to_dir: null

Step 4 — Train a model#

You now have a full fledged dataset that you can use to train models on! If you want to just try this out without having to download the whole PDB and the metdatada, you can instead run our tests which have a mini-mockup of the pipeline with real pdb files, metadata, distillation data, templates and MSAs for the example of AF3. You can download all this relevant metadata via the atomworks CLI:

Note

Make sure you are in the AtomWorks root directory when you run the following command, otherwise a new tests/data folder will be created in your current working directory.

You will now have a mini PDB at tests/data/pdb and a mini custom CCD at tests/data/ccd. The distillation and metadata are in data/ml/af2_distillation, data/ml/pdb_pn_units and data/ml/pdb_interfaces. A dataset that uses all of these is [for example here](./tests/ml/conftest.py#L300).

To run the tests for the various datasets, you can run the following command:

# Make sure you have the correct environment activated, and set your paths correctly in the .env file / shell environment variables (see points above)
pytest tests/ml/pipelines/test_data_loading_pipelines.py