Algorithm Overview
This document describes the pipeline that turns raw molecular geometries into low-dimensional maps grouped by connectivity family.
Pipeline Summary
XYZ files → Connectivity Analysis → Family Grouping → Alignment → Feature Extraction → Filtering (optional) → Dim. Reduction → Visualization
Step 1: Parse Molecular Geometries
INPUT: XYZ files (one per geometry)
FOR each xyz_file:
    n_atoms ← integer on first line
    skip second line (comment)
    elements ← parse atom symbols (n_atoms entries)
    coords ← parse atom positions (n_atoms × 3 matrix)
    STORE (elements, coords)
Each XYZ file contains one molecular snapshot with atomic coordinates.
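As a minimal sketch of the parsing step, the function below reads one XYZ snapshot from a string using only the standard library. The name `parse_xyz` and the sample `WATER_XYZ` geometry are illustrative assumptions, not part of the pipeline's actual code.

```python
def parse_xyz(text):
    """Parse one XYZ snapshot into (elements, coords)."""
    lines = text.splitlines()
    n_atoms = int(lines[0].split()[0])  # line 1: atom count
    elements, coords = [], []
    # Line 2 is a free-form comment; atom records start on line 3.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(elements) != n_atoms:
        raise ValueError("file ends before all atom records were read")
    return elements, coords


# Hypothetical sample input: an approximate water geometry in Å.
WATER_XYZ = """3
water molecule
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

elements, coords = parse_xyz(WATER_XYZ)
```

Reading from a string rather than a path keeps the sketch easy to test; wrapping it with `open(path).read()` would cover the one-file-per-geometry case.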
Step 2: Connectivity Analysis
FOR each geometry (elements, coords):
    mol ← build_molecule(elements, coords)
    # Infer bonds from distances
    FOR each atom pair (i, j) where i < j:
        IF distance(i, j) < covalent_threshold:
            add_bond(mol, i, j)
    smiles ← canonical_smiles(mol)
    family[smiles].append(geometry)
The SMILES string encodes the molecular connectivity (which atoms are bonded). Geometries with the same SMILES belong to the same family.
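The distance-based bond inference can be sketched as follows. Several things here are assumptions for illustration: the cutoff is taken as a scaled sum of covalent radii, the radii are approximate, and the `tolerance` of 1.3 is a hypothetical choice. The grouping key is a plain bond list rather than a canonical SMILES string, so unlike SMILES it depends on atom ordering; a cheminformatics toolkit would be needed for the canonical form.

```python
import math
from collections import defaultdict
from itertools import combinations

# Approximate single-bond covalent radii in Å (assumed values for illustration).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def infer_bonds(elements, coords, tolerance=1.3):
    """Return sorted (i, j) index pairs whose distance is below the cutoff."""
    bonds = []
    for i, j in combinations(range(len(elements)), 2):
        cutoff = tolerance * (COVALENT_RADII[elements[i]] + COVALENT_RADII[elements[j]])
        if math.dist(coords[i], coords[j]) < cutoff:
            bonds.append((i, j))
    return bonds


def group_by_connectivity(geometries):
    """Group (elements, coords) pairs by an order-dependent connectivity key.

    The key stands in for the canonical SMILES used in the pipeline.
    """
    families = defaultdict(list)
    for elements, coords in geometries:
        key = (tuple(elements), tuple(infer_bonds(elements, coords)))
        families[key].append((elements, coords))
    return dict(families)
```

With this key, an intact water molecule and one with a detached hydrogen land in different families, which mirrors what the SMILES comparison is meant to achieve.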
Step 3: Alignment (Kabsch with Permutation)
For each family, align all geometries to a reference centroid structure.
FOR each family:
    centroid ← load_reference_structure(family)
    centroid_mean ← mean(centroid, axis=0)
    FOR each geometry in family:
        best_rmsd ← infinity
        best_aligned ← None
        # Try all permutations of equivalent atoms
        FOR each permutation P of equivalent atoms:
            permuted_coords ← apply_permutation(coords, P)
            permuted_mean ← mean(permuted_coords, axis=0)
            # Kabsch algorithm
            centered_ref ← centroid - centroid_mean
            centered_geo ← permuted_coords - permuted_mean
            H ← centered_geo.T @ centered_ref
            U, S, Vt ← SVD(H)
            d ← sign(det(Vt.T @ U.T)) # guard against reflections
            R ← Vt.T @ diag(1, 1, d) @ U.T # optimal proper rotation
            aligned ← centered_geo @ R.T + centroid_mean # rows are atoms, so apply R.T
            rmsd ← sqrt(mean(||aligned - centroid||²)) # mean over atoms
            IF rmsd < best_rmsd:
                best_rmsd ← rmsd
                best_aligned ← aligned
        STORE (best_aligned, best_rmsd)
The permutation search handles atom equivalence (e.g., swapping two hydrogens on the same carbon).
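The alignment loop can be sketched with NumPy as below. The function names and the `equivalent_groups` argument (a list of index tuples whose atoms may be interchanged) are assumptions made for this example; how the pipeline actually identifies equivalent atoms is not specified here. The sketch includes the determinant check that keeps the result a proper rotation.

```python
import itertools

import numpy as np


def kabsch_align(coords, ref):
    """Rigidly align coords (N×3) onto ref (N×3); return (aligned, rmsd)."""
    ref_mean = ref.mean(axis=0)
    P = coords - coords.mean(axis=0)  # centered geometry
    Q = ref - ref_mean                # centered reference
    H = P.T @ Q                       # 3×3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if the unconstrained optimum would be a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Atoms are stored as rows, so the rotation is applied as P @ R.T.
    aligned = P @ R.T + ref_mean
    rmsd = np.sqrt(np.mean(np.sum((aligned - ref) ** 2, axis=1)))
    return aligned, rmsd


def best_alignment(coords, ref, equivalent_groups):
    """Try every permutation within each group and keep the lowest RMSD."""
    best_aligned, best_rmsd = None, np.inf
    per_group = [itertools.permutations(group) for group in equivalent_groups]
    for combo in itertools.product(*per_group):
        order = np.arange(len(coords))
        for group, perm in zip(equivalent_groups, combo):
            order[list(group)] = perm  # reassign atoms within the group
        aligned, rmsd = kabsch_align(coords[order], ref)
        if rmsd < best_rmsd:
            best_aligned, best_rmsd = aligned, rmsd
    return best_aligned, best_rmsd
```

The number of candidates grows as the product of the factorials of the group sizes, so exhaustive search is only practical for small groups such as the hydrogens on one carbon.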
Step 4: Feature Extraction
Two feature representations are available:
Option A: Aligned Cartesian Coordinates
FOR each aligned geometry (n_atoms × 3):
    features ← flatten(coords) # → vector of length 3×n_atoms
For ethylene (6 atoms): 18-dimensional feature vector.
Option B: Inverse Distance Matrix
FOR each aligned geometry:
    features ← []
    FOR each atom pair (i, j) where i < j:
        r ← distance(atom_i, atom_j)
        features.append(1/r)
    features ← clip(features, 0, 100) # handle near-zero distances
For ethylene (6 atoms): 15-dimensional feature vector (6×5/2 pairs).
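A minimal sketch of Option B using only the standard library follows. The function name, the `max_value` parameter, and the rough ethylene coordinates are assumptions for illustration. Because the features depend only on interatomic distances, they do not change under rotation or translation, though they still depend on a consistent atom ordering.

```python
import math
from itertools import combinations


def inverse_distance_features(coords, max_value=100.0):
    """Return 1/r for every atom pair i < j, capped at max_value."""
    features = []
    for a, b in combinations(coords, 2):
        r = math.dist(a, b)
        # Cap 1/r so coincident or nearly coincident atoms stay finite.
        features.append(max_value if r == 0.0 else min(1.0 / r, max_value))
    return features


# Rough planar ethylene geometry in Å (illustrative values only).
ETHYLENE = [
    (0.665, 0.0, 0.0), (-0.665, 0.0, 0.0),      # C, C
    (1.235, 0.925, 0.0), (1.235, -0.925, 0.0),  # H, H
    (-1.235, 0.925, 0.0), (-1.235, -0.925, 0.0),
]

features = inverse_distance_features(ETHYLENE)
```

Pairs are emitted in the order (0,1), (0,2), ..., (4,5), so the first entry is the inverse of the C–C distance.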
Step 5: Filtering (Optional)
Remove "exploded" geometries where atoms have separated too far.
threshold ← 5.0 Å # max allowed pairwise distance
FOR each geometry:
    max_dist ← max(pairwise_distances(coords))
    IF max_dist > threshold:
        DISCARD geometry
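The filter above can be sketched in a few lines of standard-library Python. The function names are hypothetical, and geometries are assumed to be `(elements, coords)` pairs as stored in Step 1.

```python
import math
from itertools import combinations


def is_exploded(coords, threshold=5.0):
    """True if any two atoms are farther apart than threshold (in Å)."""
    max_dist = max(
        (math.dist(a, b) for a, b in combinations(coords, 2)),
        default=0.0,  # a single atom has no pairs
    )
    return max_dist > threshold


def filter_geometries(geometries, threshold=5.0):
    """Keep only the (elements, coords) pairs that are not exploded."""
    return [(e, c) for e, c in geometries if not is_exploded(c, threshold)]
```

A single fixed threshold suits small molecules; for larger systems the largest legitimate intramolecular distance may exceed 5.0 Å, so the value would need to be chosen per system.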
Step 6: Dimensionality Reduction
Reduce high-dimensional features to 2D for visualization.
# Preprocessing
scaler ← fit(all_features) # zero mean, unit variance per feature
scaled_features ← scaler.transform(all_features)
# Choose one method:
embedding ← PCA(scaled_features, n_components=2)
    OR TSNE(scaled_features, n_components=2)
    OR UMAP(scaled_features, n_components=2)
    OR DiffusionMap(scaled_features, n_components=3)[:, 1:] # drop the trivial first component, keep 2
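To make the preprocessing and the PCA option concrete, here is a NumPy-only sketch that standardizes the features and projects them onto the leading principal axes via SVD. The function names are assumptions for this example; the pipeline itself may rely on a library implementation instead.

```python
import numpy as np


def fit_standardized_pca(features, n_components=2):
    """Fit per-feature scaling and principal axes; return (mean, std, components)."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # constant features: avoid division by zero
    Z = (X - mean) / std
    # Z is centered, so the rows of Vt are the principal axes,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]


def project(features, mean, std, components):
    """Map feature vectors into the fitted low-dimensional space."""
    Z = (np.asarray(features, dtype=float) - mean) / std
    return Z @ components.T
```

Keeping the fitted `mean`, `std`, and `components` allows additional points, such as reference centroid structures, to be projected with the same transformation as the data. Methods without an out-of-sample transform (scikit-learn's `TSNE`, for instance) would instead need those points embedded together with the data.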
Method Notes
| Method | Preserves | Best For |
|---|---|---|
| PCA | Global variance | Linear relationships |
| t-SNE | Local neighborhoods | Cluster separation |
| UMAP | Local + some global | General purpose |
| Diffusion Map | Intrinsic geometry | Reaction pathways |
Step 7: Visualization
FOR each family:
    color ← family_color_map[family]
    rows ← indices of geometries in family
    PLOT points (embedding[rows, 0], embedding[rows, 1]) with color
    IF centroid exists:
        centroid_embedding ← reduce(centroid_features) # same fitted scaler and reducer as the data
        PLOT star marker at centroid_embedding
ENABLE hover tooltips showing xyz filename
Output
Interactive HTML dashboard where:
- Each point = one molecular geometry
- Color = molecular family (connectivity pattern)
- Star markers = reference centroid structures
- Hover = xyz filename for inspection
- Dropdowns = switch between threshold/features/method