## Requirements
The following Python packages are the minimum required to run this notebook.
Installing `ase` is optional; it enables the conversion to `ase` `Atoms` objects shown later.

In [None]:
!pip install pandas pyarrow pymatgen fsspec

In the following, we'll retrieve all MatPES-r<sup>2</sup>SCAN entries in the Li-Ni-O chemical system. The `parquet` format lets us retrieve only the subset of data we need and load it into memory, without downloading the full dataset.

To load the full dataset, which includes PBE entries as well, remove the `filters` kwarg.

In [None]:
import pandas as pd

matpes_subset = pd.read_parquet(
    "s3://materialsproject-contribs/MatPES_2025_1/MatPES-2025.1.parquet",
    filters = [
        ("functional", "==", "r2SCAN"),
        ("chemsys", "==", "Li-Ni-O")
    ]
)

Some interatomic potentials require one-body energies, which may be the energies of the isolated atoms. Additionally, to create the `cohesive_energy_per_atom` tag in the dataset, the energies of isolated atoms are needed.

To load these energies, run the following block. The same `filters` mechanism can be used when reading this dataset.

In [None]:
matpes_all_atom_refs = pd.read_parquet(
    "s3://materialsproject-contribs/MatPES_2025_1/MatPES-atoms-2025.1.parquet",
)
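
As a rough illustration of why the isolated-atom energies are needed, the sketch below computes a cohesive energy per atom from a total energy and a table of atomic reference energies. The function name, the dictionary layout, and all numbers are hypothetical, and the sign convention (negative for a bound system) is an assumption; check the dataset documentation for the convention actually used in `cohesive_energy_per_atom`.

```python
from collections.abc import Mapping, Sequence


def cohesive_energy_per_atom(
    total_energy: float,
    atomic_numbers: Sequence[int],
    atom_ref_energies: Mapping[int, float],
) -> float:
    """Return (E_total - sum of isolated-atom energies) / number of atoms.

    Assumed convention: the result is negative when the solid is more
    stable than the separated atoms.
    """
    n_atoms = len(atomic_numbers)
    reference = sum(atom_ref_energies[z] for z in atomic_numbers)
    return (total_energy - reference) / n_atoms


# Hypothetical reference energies keyed by atomic number (not MatPES data).
toy_atom_refs = {3: -0.5, 8: -2.0}

# Hypothetical Li2O-like cell: two Li (Z=3) and one O (Z=8).
e_coh = cohesive_energy_per_atom(-9.0, [3, 3, 8], toy_atom_refs)
print(e_coh)
```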

The following block converts the structure data in the parquet file to `pymatgen` `Structure` objects and appends them to the `DataFrame`. The parquet file stores structural information in a code-agnostic, unambiguous format: the 3D `cell` vectors <b>a</b>, <b>b</b>, <b>c</b>, the atomic (proton or <i>Z</i>) numbers at each basis site, and the Cartesian coordinates of the atomic basis.

In [None]:
from pymatgen.core import Element, Structure

def create_pymatgen_structures(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Add a `structure` column holding one pymatgen Structure per row."""

    def _create_pymatgen_structure(row: pd.Series) -> Structure:
        return Structure(
            row.cell.tolist(),
            [Element.from_Z(z) for z in row.atomic_numbers],
            row.cart_coords,
            coords_are_cartesian=True,
            # Attach magnetic moments as a site property only when present.
            site_properties={"magmom": row.magmoms} if row.magmoms is not None else None,
        )

    data_frame["structure"] = data_frame.apply(_create_pymatgen_structure, axis=1)
    return data_frame

matpes_subset = create_pymatgen_structures(matpes_subset)

In [None]:
matpes_subset.structure[0]
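
Because the dataset stores Cartesian coordinates, `coords_are_cartesian=True` is needed so that `pymatgen` converts them to fractional coordinates internally. A minimal sketch of that conversion with plain `numpy`, using a made-up cell whose rows are the lattice vectors (the row-vector convention `pymatgen` uses):

```python
import numpy as np

# Hypothetical cell: each row is one lattice vector (a, b, c), in angstrom.
toy_cell = np.array(
    [
        [4.0, 0.0, 0.0],
        [2.0, 4.0, 0.0],
        [0.0, 0.0, 5.0],
    ]
)

# Two hypothetical sites given in Cartesian coordinates.
toy_cart_coords = np.array(
    [
        [0.0, 0.0, 0.0],
        [3.0, 2.0, 2.5],
    ]
)

# With row-vector lattices, cart = frac @ cell, so frac = cart @ inv(cell).
toy_frac_coords = toy_cart_coords @ np.linalg.inv(toy_cell)
print(toy_frac_coords)
```

The second site sits at the cell center, so its fractional coordinates come out as (0.5, 0.5, 0.5).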

The next block appends `ase` `Atoms` objects to the `DataFrame`. You will need `ase` installed to use this functionality.

In [None]:
!pip install ase

In [None]:
from ase.calculators.singlepoint import SinglePointCalculator
from ase import Atoms
import numpy as np

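def create_ase_atoms(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Add an `atoms` column holding one ase Atoms object per row."""

    def _create_ase_atoms(row: pd.Series) -> Atoms:
        atoms = Atoms(
            positions=np.array(row.cart_coords.tolist()),
            numbers=row.atomic_numbers,
            cell=np.array(row.cell.tolist()),
            # Treat the cell as periodic, matching the pymatgen Structure representation.
            pbc=True,
        )
        # Only pass properties that are present for this row; calling
        # .tolist() on a missing (None) value would raise an AttributeError.
        results = {
            k: np.array(getattr(row, k).tolist())
            for k in ("forces", "stress", "magmoms")
            if getattr(row, k, None) is not None
        }
        atoms.calc = SinglePointCalculator(atoms, energy=row.energy, **results)
        return atoms

    data_frame["atoms"] = data_frame.apply(_create_ase_atoms, axis=1)
    return data_frame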

matpes_subset = create_ase_atoms(matpes_subset)

In [None]:
matpes_subset.atoms[0]