Dataset Parsers#
- atomworks.ml.datasets.parsers.base.DEFAULT_PARSER_ARGS = {'add_bond_types_from_struct_conn': ['covale'], 'add_id_and_entity_annotations': True, 'add_missing_atoms': True, 'convert_mse_to_met': True, 'fix_arginines': True, 'fix_bond_types': True, 'fix_formal_charges': True, 'fix_ligands_at_symmetry_centers': True, 'hydrogen_policy': 'remove', 'model': None, 'remove_ccds': ['SO4', 'GOL', 'EDO', 'PO4', 'ACT', 'PEG', 'DMS', 'TRS', 'PGE', 'PG4', 'FMT', 'EPE', 'MPD', 'MES', 'CD', 'IOD'], 'remove_waters': True}#
Default CIF parser arguments for atomworks.io.parse. This dictionary exists to provide a convenient import for the default parameters.
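For example, a minimal sketch of reusing these defaults while overriding a couple of keys (which overrides are appropriate depends on your data; all keys below are taken from the dictionary above):

from atomworks.ml.datasets.parsers.base import DEFAULT_PARSER_ARGS

# Start from the library defaults and override selected keys
# (all keys come from DEFAULT_PARSER_ARGS above).
cif_parser_args = {
    **DEFAULT_PARSER_ARGS,
    "remove_waters": False,  # keep waters in the parsed structure
    "remove_ccds": [],       # do not strip any of the default small-molecule CCDs
}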
- class atomworks.ml.datasets.parsers.base.MetadataRowParser[source]#
Bases:
ABC
Abstract base class for MetadataRowParsers.
A MetadataRowParser is a class that parses a row from a DataFrame on disk into a format digestible by the load_example_from_metadata_row function.
In the common case that a model is trained on multiple datasets, each with its own dataframe and base data format, we must ensure that the data pipeline receives a consistent input format. For example, when training an AF-3-style model, we might have a “PDB Chains” dataset of mmCIF files, a “PDB Interfaces” dataset of mmCIF files, a distillation dataset of computationally-generated PDB files, and many others.
- We enforce the following common schema for all datasets:
“example_id”: A unique identifier for the example within the dataset.
“path”: The path to the data file (which we will load with CIFUtils).
- WARNING: For many transforms, additional keys are required. For example:
For cropping, the query_pn_unit_iids field is used to center the crop on the interface or pn_unit. If not provided, the AF-3-style crop transforms will crop randomly.
For loading templates, the “pdb_id” is required to load the correct template from disk (at least with the legacy code).
- required_schema: ClassVar[dict[str, type]] = {'example_id': <class 'str'>, 'extra_info': <class 'dict'>, 'path': <class 'pathlib.Path'>}#
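As a rough sketch, a custom subclass maps a dataset-specific row onto the required_schema keys above. Note that the method name (parse below) and the input column names are assumptions for illustration; check the MetadataRowParser source for the actual abstract hook to override.

from pathlib import Path

import pandas as pd

from atomworks.ml.datasets.parsers.base import MetadataRowParser


class MyDatasetParser(MetadataRowParser):
    # NOTE: `parse` is an assumed method name, used here only for illustration.
    def parse(self, row: pd.Series) -> dict:
        return {
            "example_id": str(row["my_id_column"]),         # unique across ALL datasets
            "path": Path(row["my_structure_file_column"]),  # loaded downstream by the CIF parser
            "extra_info": {"source": "my-dataset"},         # free-form metadata for the pipeline
        }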
- atomworks.ml.datasets.parsers.base.load_example_from_metadata_row(metadata_row: Series, metadata_row_parser: MetadataRowParser, *, cif_parser_args: dict | None = None) → dict [source]#
Load training/validation example from a DataFrame row into a common format using the given metadata row parsing function and CIF parser arguments.
- Performs the following steps:
Parse the row into a common dictionary format using the provided row parsing function and metadata row.
Load the CIF file from the information in the common dictionary format (i.e., the “path” key).
Combine the parsed row data and the loaded CIF data into a single dictionary.
- Parameters:
metadata_row (pd.Series) – The DataFrame row to parse.
metadata_row_parser (MetadataRowParser) – The parser to use for converting the row into a dictionary format.
cif_parser_args (dict, optional) – Additional arguments for the CIF parser. Defaults to None.
- Returns:
A dictionary containing the parsed row data and additional loaded CIF data.
- Return type:
dict
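A usage sketch, assuming a metadata dataframe that already follows the common schema (“example_id” and “path” columns); the parquet filename is a placeholder:

import pandas as pd

from atomworks.ml.datasets.parsers.base import (
    DEFAULT_PARSER_ARGS,
    load_example_from_metadata_row,
)
from atomworks.ml.datasets.parsers.default_metadata_row_parsers import GenericDFParser

metadata = pd.read_parquet("my_dataset_metadata.parquet")  # placeholder path
parser = GenericDFParser()  # default column names: "example_id" and "path"

example = load_example_from_metadata_row(
    metadata.iloc[0],                     # a single row (pd.Series)
    parser,
    cif_parser_args=DEFAULT_PARSER_ARGS,  # or None / a customized dict
)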
Row parser for non-standard metadata dataframes
- class atomworks.ml.datasets.parsers.custom_metadata_row_parsers.AF2FB_DistillationParser(base_dir: str, file_extension: str = '.cif')[source]#
Bases:
MetadataRowParser
DEPRECATION WARNING: This parser is deprecated and will be removed in a future release. Use GenericDFParser instead, providing path and example_id columns.
Parser for AF2FB distillation metadata.
The AF2FB distillation dataset is provided courtesy of Meta/Facebook. It contains ~7.6 million AF2-predicted structures from UniRef50.
Metadata (e.g., the sequences, cluster identities at 30% sequence identity, whether a sequence has an MSA and template, the sequence_hash, etc.) are stored in the af2_distillation_facebook.parquet dataframe; a filtering sketch follows the column list below.
- The parquet has the following columns:
example_id
n_atoms
n_res
mean_plddt
min_plddt
median_plddt
sequence_hash
has_msa
msa_depth
has_template
cluster_id
seq (!WARNING: this is a relatively data-heavy column)
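A filtering sketch over this parquet using only the columns listed above; the thresholds are illustrative, not recommended values:

import pandas as pd

df = pd.read_parquet("af2_distillation_facebook.parquet")

# Keep confidently predicted structures that have an MSA.
filtered = df[(df["mean_plddt"] >= 80.0) & (df["has_msa"])]

# One representative per 30% sequence-identity cluster, preferring higher pLDDT.
representatives = (
    filtered.sort_values("mean_plddt", ascending=False).drop_duplicates("cluster_id")
)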
- class atomworks.ml.datasets.parsers.custom_metadata_row_parsers.ValidationDFParserLikeAF3(base_dir: Path = None, file_extension: str = '.cif.gz')[source]#
Bases:
MetadataRowParser
Parser for AF-3-style validation DataFrame rows.
- As output, we give:
pdb_id: The PDB ID of the structure.
assembly_id: The assembly ID of the structure, required to load the correct assembly from the CIF file.
path: The path to the CIF file.
example_id: An identifier that combines the pdb_id and assembly_id.
- ground_truth: A dictionary containing non-feature information for loss and validation. For validation, we initialize with the following:
interfaces_to_score: A list of tuples like (pn_unit_iid_1, pn_unit_iid_2, interface_type), which represent low-homology interfaces to score.
pn_units_to_score: A list of tuples like (pn_unit_iid, pn_unit_type), which represent low-homology pn_units to score.
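Schematically, the parsed output described above has roughly the following shape; the concrete values (and the exact way example_id combines pdb_id and assembly_id) are placeholders, not guaranteed formats:

parsed_row = {
    "pdb_id": "3usg",                    # placeholder PDB ID
    "assembly_id": "1",
    "path": "/data/pdb/us/3usg.cif.gz",  # placeholder path
    "example_id": "3usg-1",              # placeholder; combines pdb_id and assembly_id
    "ground_truth": {
        # (pn_unit_iid_1, pn_unit_iid_2, interface_type)
        "interfaces_to_score": [("A_1", "B_1", "protein-protein")],
        # (pn_unit_iid, pn_unit_type)
        "pn_units_to_score": [("A_1", "protein")],
    },
}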
MetadataRowParser implementations for chain- and interface-based datasets.
- class atomworks.ml.datasets.parsers.default_metadata_row_parsers.GenericDFParser(example_id_colname: str = 'example_id', path_colname: str = 'path', pn_unit_iid_colnames: str | list[str] | None = None, assembly_id_colname: str | None = None, base_path: str = '', extension: str = '', attrs: dict | None = None)[source]#
Bases:
MetadataRowParser
Generic dataframe parser for training or validation dataframes.
We parse an input row (e.g., a Pandas Series) and return a dictionary containing pertinent information for the Transform pipeline.
- Parameters:
example_id_colname – Name of the column containing a unique identifier for each example (across ALL datasets, not just this dataset). By convention, the column's values should be generated with atomworks.ml.common.generate_example_id. Default: “example_id”
path_colname – Name of the column containing paths (relative or absolute) to the relevant structure files. Default: “path”
pn_unit_iid_colnames – The name(s) of the column(s) containing the CIFUtils pn_unit_iid(s); used for cropping. If given as a list, should contain one element for a monomers dataset and two for an interfaces dataset. Default: None (crop randomly)
assembly_id_colname – Optional parameter giving the name of the column containing the assembly ID. If None, the assembly ID will be set to “1” for all examples. Default: None
base_path – The base path to the files, if not included in the path.
extension – The file extension of the structure files, if not included in the path.
attrs – Additional attributes to be merged with the dataframe-level attributes stored in the DF (if present). Attributes in this dictionary will take precedence over those in the dataset-level attributes and will be returned in the “extra_info” key.
- Returns:
A dictionary containing:
- example_id
The unique identifier for the example. Must be unique across all datasets.
- path
The composed path to the structure file, including the base path and extension if specified.
- query_pn_unit_iids
The pn_unit_iid(s) that inform where to crop the structure. During TRAINING, we typically want to specify the chain(s) or interface at which to center our crop. If not given (i.e., None), we will crop the structure at a random location, if a crop is required. During VALIDATION, we do not crop, and query_pn_unit_iids should be None.
- assembly_id
The assembly ID. Used to load the correct assembly from the CIF file. If not given, the assembly ID will be set to “1”.
- extra_info
A dictionary containing all additional information that should be passed to the Transform pipeline. Contains, in order of precedence:
Any additional key-value pairs specified by the attrs parameter
All unused dataframe columns (i.e., those not used for example_id, path, query_pn_unit_iids, or assembly_id)
Dataset-level attributes (if present), found in the attrs attribute of the DataFrame (or Series)
For example, the “extra_info” key could contain information about which chain(s) to score during validation, metadata for specific metrics, etc.
- Return type:
dict
Note
We must avoid duplication of interfaces due to order inversion. If not using the preprocessing scripts in atomworks.ml, ensure that the interfaces dataframe has been checked for duplicates. For example, [A, B] and [B, A] should be considered the same interface.
Example
Example dataframe:

example_id                            path                      pn_unit_1_iid  pn_unit_2_iid
{['my-dataset']}{ex_1}{1}{[A_1,B_1]}  /path/to/structure_1.cif  A_1            B_1
{['my-dataset']}{ex_2}{2}{[C_1,B_1]}  /path/to/structure_2.cif  C_1            B_1
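A construction sketch for the example dataframe above, wiring the two interface columns into the cropping logic (passing the parser to load_example_from_metadata_row is shown earlier on this page):

from atomworks.ml.datasets.parsers.default_metadata_row_parsers import GenericDFParser

# Two pn_unit_iid columns -> an interfaces-style dataset; a single column name
# (or None, for random cropping) would correspond to a monomers dataset.
parser = GenericDFParser(
    example_id_colname="example_id",
    path_colname="path",
    pn_unit_iid_colnames=["pn_unit_1_iid", "pn_unit_2_iid"],
    assembly_id_colname=None,              # every example defaults to assembly "1"
    attrs={"dataset_name": "my-dataset"},  # illustrative; surfaced under "extra_info"
)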
- class atomworks.ml.datasets.parsers.default_metadata_row_parsers.InterfacesDFParser(base_dir: Path | str | list[Path | str] | tuple[Path | str, ...] = None, file_extension: str | list[str] | tuple[str, ...] = '.cif.gz', path_template: str | list[str] | tuple[str, ...] = '{base_dir}/{pdb_id[1:3]}/{pdb_id}{file_extension}')[source]#
Bases:
MetadataRowParser
Parser for interfaces DataFrame rows.
- In addition to standard fields (example_id, path), this parser also includes:
The two query pn_unit instance IDs, as a list, which are used to sample the interface during cropping.
The assembly ID, which is used to load the correct assembly from the CIF file.
Any extra information from the DataFrame, which is stored in the extra_info field.
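A construction sketch using the constructor arguments from the signature above; the base directory is a placeholder for a local PDB mirror:

from atomworks.ml.datasets.parsers.default_metadata_row_parsers import InterfacesDFParser

parser = InterfacesDFParser(
    base_dir="/data/pdb",  # placeholder mirror location
    file_extension=".cif.gz",
    path_template="{base_dir}/{pdb_id[1:3]}/{pdb_id}{file_extension}",
)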
- class atomworks.ml.datasets.parsers.default_metadata_row_parsers.PNUnitsDFParser(base_dir: Path | str | list[Path | str] | tuple[Path | str, ...] = None, file_extension: str | list[str] | tuple[str, ...] = '.cif.gz', path_template: str | list[str] | tuple[str, ...] = '{base_dir}/{pdb_id[1:3]}/{pdb_id}{file_extension}')[source]#
Bases:
MetadataRowParser
Parser for pn_units DataFrame rows.
- In addition to standard fields (example_id, path), this parser also includes:
The query pn_unit instance ID, which is used to center the crop.
The assembly ID, which is used to load the correct assembly from the CIF file.
Any extra information from the DataFrame, which is stored in the extra_info field.
- atomworks.ml.datasets.parsers.default_metadata_row_parsers.build_path_from_template(path_template: str, **kwargs) → Path [source]#
Build a path using a template string with variable substitution and slicing.
- Parameters:
path_template – Template string with {var} and {var[start:end]} patterns
**kwargs – Variables to substitute
Examples
>>> build_path_from_template("{base_dir}/{pdb_id[1:3]}/{pdb_id}", base_dir="/data", pdb_id="3usg")
Path("/data/us/3usg")
- atomworks.ml.datasets.parsers.default_metadata_row_parsers.find_existing_file_path(base_dirs: Sequence[Path | str], file_extensions: Sequence[str], path_templates: Sequence[str], pdb_id: str) → Path [source]#
Find the first existing file path by trying corresponding base_dirs, file_extensions, and path_templates in order.
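A usage sketch, assuming the three sequences are parallel (tried element-wise, in order); the directories are placeholders:

from atomworks.ml.datasets.parsers.default_metadata_row_parsers import find_existing_file_path

path = find_existing_file_path(
    base_dirs=["/data/pdb_mirror", "/scratch/pdb_cache"],  # placeholder locations, tried in order
    file_extensions=[".cif.gz", ".cif"],
    path_templates=["{base_dir}/{pdb_id[1:3]}/{pdb_id}{file_extension}"] * 2,
    pdb_id="3usg",
)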