
DataFrames And Results

FRUST uses pandas dataframes as the handoff between workflow stages.

That means most FRUST outputs are ordinary tables. Each row is a structure or conformer, and each calculation stage adds new columns to the same table. This is useful because you can inspect results with normal pandas commands instead of learning a special result format.

The short version is:

rows = structures or conformers
columns = metadata, coordinates, energies, status flags, and saved outputs

A Tiny Example Table

Imagine FRUST has embedded two conformers for the same molecule. Before any calculation, the dataframe might look conceptually like this:

substrate_name  structure_type  rpos  cid  atoms            coords_embedded
anisole         MOL             2     0    ["C", "H", ...]  [(x, y, z), ...]
anisole         MOL             2     1    ["C", "H", ...]  [(x, y, z), ...]

After an xTB optimization called xtb_opt, FRUST adds stage-prefixed columns:

substrate_name  cid  xtb_opt-NT  xtb_opt-EE  xtb_opt-oc
anisole         0    True        -42.1       optimized coordinates
anisole         1    True        -42.4       optimized coordinates

The row is still the same conformer. The new columns tell you what happened during the calculation stage.
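Because stage columns share a common prefix, plain pandas is enough to pull them out. A minimal sketch, using a toy dataframe with made-up values shaped like the table above:

```python
import pandas as pd

# Toy dataframe mimicking the table above (values are illustrative).
df = pd.DataFrame({
    "substrate_name": ["anisole", "anisole"],
    "cid": [0, 1],
    "xtb_opt-NT": [True, True],
    "xtb_opt-EE": [-42.1, -42.4],
})

# All columns produced by the xtb_opt stage share the "xtb_opt-" prefix.
stage_cols = df.filter(like="xtb_opt-")
print(stage_cols.columns.tolist())  # ['xtb_opt-NT', 'xtb_opt-EE']
```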

What One Row Means

Most rows represent one structure or conformer at a particular point in the workflow.

One ligand can generate several reactive positions. Each reactive position can generate several conformers. Each conformer can pass through several calculation stages. FRUST keeps those possibilities visible as rows and columns.

Common identity columns include:

  • substrate_name: the ligand or substrate identity;
  • structure_type: for example MOL, TS1, TS2, or INT3;
  • molecule_role: for example ligand, transition state, or intermediate role;
  • rpos: reactive position;
  • cid: conformer id.

These columns are not just labels. FRUST uses them for grouping, especially when keeping only the lowest-energy conformers.
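As a sketch of what that grouping looks like in plain pandas (the rows and energies here are made up, and the column names follow the conventions above):

```python
import pandas as pd

# Made-up rows: two reactive positions with two conformers each.
df = pd.DataFrame({
    "substrate_name": ["anisole"] * 4,
    "structure_type": ["MOL"] * 4,
    "rpos": [2, 2, 4, 4],
    "cid": [0, 1, 0, 1],
    "xtb_opt-EE": [-42.1, -42.4, -41.9, -42.0],
})

# Lowest-energy conformer per (substrate, structure, reactive position).
best = df.loc[
    df.groupby(["substrate_name", "structure_type", "rpos"])["xtb_opt-EE"].idxmin()
]
print(best[["rpos", "cid", "xtb_opt-EE"]])
```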

Coordinates

Most Stepper stages need:

  • atoms: element symbols;
  • one coordinate column.

The first coordinate column is often:

coords_embedded

After an optimization, the optimized coordinates are stored in a column ending with:

-oc

For example:

xtb_opt-oc
gxtb_preopt-oc
orca_opt-oc

When you run the next calculation stage, Stepper automatically uses the most recent coordinate column. This lets a workflow move naturally from embedded coordinates to xTB optimization, then ORCA refinement.
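Stepper does this selection internally, but the idea can be sketched in a few lines: because columns are appended in stage order, the most recent coordinates are simply the last "-oc" column, falling back to the embedded coordinates.

```python
import pandas as pd

# Only the column layout matters for this sketch.
df = pd.DataFrame(columns=[
    "substrate_name", "coords_embedded", "xtb_opt-oc", "orca_opt-oc"
])

# Columns are appended in stage order, so the last "-oc" column holds
# the most recent optimized coordinates.
oc_cols = [c for c in df.columns if c.endswith("-oc")]
latest_coords = oc_cols[-1] if oc_cols else "coords_embedded"
print(latest_coords)  # orca_opt-oc
```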

Stage Names And Suffixes

Every calculation stage has a prefix. You usually choose it with name=.

df = step.xtb(
    df,
    name="xtb_opt",
    options={"gfn": 2, "opt": None},
)

This produces columns such as:

xtb_opt-EE
xtb_opt-NT
xtb_opt-oc

The common suffixes are:

Suffix  Meaning                   First thing to do with it
-NT     Normal termination        Filter failed rows
-EE     Electronic energy         Rank conformers or structures
-GE     Gibbs energy              Compare thermochemistry when available
-oc     Optimized coordinates     Use as input to the next stage
-vibs   Vibrations                Inspect frequency jobs
-error  Row-level exception text  Debug failed rows

Mini-Tutorial: Inspect A Result DataFrame

Start by loading a parquet file:

import pandas as pd

df = pd.read_parquet("runs/example.parquet")

Look at the columns:

print(df.columns.tolist())

Find calculation status columns:

nt_cols = [col for col in df.columns if col.endswith("-NT")]
print(nt_cols)

Keep only rows where the final stage succeeded:

final_nt = nt_cols[-1]
df_ok = df[df[final_nt]]

Find energy columns:

energy_cols = [col for col in df.columns if col.endswith("-EE")]
print(energy_cols)

Sort by the latest energy:

final_energy = energy_cols[-1]
df_ranked = df_ok.sort_values(final_energy)

Inspect the best few rows:

df_ranked[
    ["substrate_name", "structure_type", "rpos", "cid", final_energy]
].head()

This is often the first useful analysis after a workflow finishes.

Mini-Tutorial: Keep The Lowest Conformers

Many Stepper methods accept lowest=....

df = step.xtb(
    df,
    name="xtb_opt",
    options={"gfn": 2, "opt": None},
    lowest=5,
)

This tells FRUST to group rows by available structure identity columns, then keep up to five low-energy conformers per group after the stage finishes.
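Conceptually (this is a sketch of the behavior, not FRUST's actual implementation), lowest=5 acts like a sort-then-group-then-head pattern in pandas:

```python
import pandas as pd

# Toy data: one identity group with seven conformers and made-up energies.
df = pd.DataFrame({
    "substrate_name": ["anisole"] * 7,
    "rpos": [2] * 7,
    "cid": range(7),
    "xtb_opt-EE": [-42.1, -42.4, -41.9, -42.6, -42.0, -42.3, -42.2],
})

# Keep up to five low-energy conformers per identity group.
kept = (
    df.sort_values("xtb_opt-EE")
      .groupby(["substrate_name", "rpos"])
      .head(5)
)
print(sorted(kept["cid"].tolist()))
```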

In practical terms, this means:

many conformers -> cheap optimization -> keep the best few -> expensive stage

That is the normal screening pattern. Run cheap calculations broadly, then spend expensive ORCA time only on the most relevant rows.

Failed Rows

FRUST tries not to abort an entire dataframe because one row fails. Instead, it stores failure information in stage-specific columns.

For a stage named gxtb_opt, look for:

gxtb_opt-NT
gxtb_opt-error

Example:

failed = df[~df["gxtb_opt-NT"]]
failed[["substrate_name", "cid", "gxtb_opt-error"]].head()

Use -error first. Only dig into saved calculation files if the error message does not explain the problem.

Step Metadata

Stepper stores a record of the stages in:

df.attrs["frust_steps"]

Example:

for name, meta in df.attrs.get("frust_steps", {}).items():
    print(name, meta)

This can tell you which engine was used, what options were passed, and whether special routes such as UMA or g-xTB were active.

This metadata is useful when you come back to an old parquet file and need to remember how it was produced.

Parquet Outputs

FRUST workflows commonly write parquet files because parquet preserves dataframe column types and stores tables compactly.

A typical analysis loop is:

  1. submit or run a workflow;
  2. collect parquet outputs;
  3. load them with pandas;
  4. filter on -NT;
  5. sort or group by -EE;
  6. inspect coordinates or saved files for the most interesting rows.

If a submitit run produces many parquet files, use the packaged command:

merge_parquet --input-dir runs/example --output merged.parquet --recursive

Then load the merged result:

df = pd.read_parquet("merged.parquet")

Schema Helpers

For quick scripts, plain pandas is often enough. For reusable analysis code, FRUST also provides helpers:

from frust.schema import energy_columns, normal_termination_columns, normalize_dataframe

df = normalize_dataframe(df)
energies = energy_columns(df)
nt_cols = normal_termination_columns(df)

These helpers are useful when comparing older parquet files with newer results, because they normalize legacy column names and locate common output columns.