Dataset Open Access

proteinNet3D

Li, Rui; Yushkevich, Artsemi; Kudryashev, Misha; Yakimovich, Artur

ProteinNet3D is a curated large-scale dataset of 3D macromolecular density volumes designed to support representation learning and benchmarking in structural biology. The dataset is derived from the publicly available Electron Microscopy Data Bank (EMDB), a comprehensive repository of experimentally determined cryo-electron microscopy (cryo-EM) maps spanning diverse macromolecules, molecular assemblies, and subcellular structures.

ProteinNet3D focuses specifically on individual macromolecules resolved by single-particle analysis (SPA) or subtomogram averaging (STA), ensuring methodological consistency across samples. To emphasize biologically meaningful structures while avoiding extreme cases, entries were restricted to a molecular weight range of 100–1500 kDa. This criterion excludes small domains and excessively large complexes, resulting in a dataset well-suited for learning size-robust structural representations.
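The selection rule described above can be sketched as a simple predicate. This is only an illustration: the record layout, the field names `method` and `molecular_weight_kda`, and the choice of inclusive bounds are assumptions, not the authors' actual filtering code or the EMDB metadata schema.

```python
# Hypothetical sketch of the entry selection rule; field names are assumed.
ALLOWED_METHODS = {"SPA", "STA"}   # single-particle analysis, subtomogram averaging
MIN_KDA, MAX_KDA = 100.0, 1500.0   # molecular weight range stated in the description


def keep_entry(entry):
    """Return True if a metadata record passes the method and weight filters.

    `entry` is assumed to be a dict with the hypothetical keys
    'method' and 'molecular_weight_kda'. Bounds are treated as inclusive,
    which is an assumption.
    """
    return (
        entry["method"] in ALLOWED_METHODS
        and MIN_KDA <= entry["molecular_weight_kda"] <= MAX_KDA
    )
```

Applied to a list of such records, `[e for e in entries if keep_entry(e)]` would retain only the entries inside the stated range.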

All volumes are standardized through isotropic resampling, spatial normalization to a fixed grid (64³ voxels), and intensity normalization to zero mean and unit variance. Background regions are masked using annotated contour levels to reduce noise contributions. To enhance diversity and rotational invariance, each structure is augmented with multiple random 3D rotations.
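A minimal NumPy sketch of the masking, intensity normalization, and rotation steps is given below. It is not the authors' pipeline: the order of operations (mask first, then standardize), the nearest-neighbour resampling, and all function names are assumptions made for illustration, and isotropic resampling to the 64³ grid is taken as already done.

```python
import numpy as np


def mask_and_normalize(volume, contour_level):
    """Zero out voxels below the contour level, then standardize intensities.

    Masking before normalization is an assumption; the description does not
    state the order of these two steps.
    """
    masked = np.where(volume >= contour_level, volume, 0.0)
    centered = masked - masked.mean()
    std = masked.std()
    # Guard against a constant volume, which has zero variance.
    return centered if std == 0 else centered / std


def random_rotation_matrix(rng):
    """Draw a random proper 3x3 rotation via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))    # normalize column signs
    if np.linalg.det(q) < 0:       # flip one axis to avoid a reflection
        q[:, 0] = -q[:, 0]
    return q


def rotate_volume(volume, rotation):
    """Rotate a cubic volume about its center with nearest-neighbour sampling."""
    n = volume.shape[0]
    center = (n - 1) / 2.0
    # Coordinates of every output voxel, relative to the volume center.
    grid = np.indices(volume.shape).reshape(3, -1) - center
    # Inverse mapping: find the source voxel for each output voxel.
    src = np.rint(rotation.T @ grid + center).astype(int)
    inside = np.all((src >= 0) & (src < n), axis=0)
    out = np.zeros(volume.size, dtype=volume.dtype)
    out[inside] = volume[src[0, inside], src[1, inside], src[2, inside]]
    return out.reshape(volume.shape)
```

For a volume `vol` with an annotated contour level `c`, one augmented sample could be produced with `rotate_volume(mask_and_normalize(vol, c), random_rotation_matrix(np.random.default_rng()))`. A production pipeline would more likely use an interpolating resampler than nearest-neighbour lookup.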

Overall, ProteinNet3D comprises 26,110 processed samples and captures substantial structural heterogeneity, experimental variability, and realistic noise characteristics, making it a rigorous benchmark for 3D deep learning in cryo-EM.

Files (25.2 GB)

Name              Size     MD5
test_202212.npz   2.5 GB   c2ca2945e73a5eaa970940806fc542af
train_202212.npz  20.1 GB  adb8edf49732c0f661a146589b53552b
val_202212.npz    2.5 GB   45de8755b7ddbb4ccbaffd5cda4c6615
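After downloading, the files can be checked against the listed MD5 checksums and inspected with NumPy. The sketch below assumes the files sit in the current working directory; the helper names are illustrative, and because the array names stored inside each archive are not documented here, the code only lists them rather than assuming any particular key.

```python
import hashlib

import numpy as np

# MD5 checksums as listed for this record.
EXPECTED_MD5 = {
    "test_202212.npz": "c2ca2945e73a5eaa970940806fc542af",
    "train_202212.npz": "adb8edf49732c0f661a146589b53552b",
    "val_202212.npz": "45de8755b7ddbb4ccbaffd5cda4c6615",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_npz_keys(path):
    """Return the names of the arrays stored in an .npz archive.

    Arrays in an .npz archive are only read when accessed by key,
    so listing the names does not load the data into memory.
    """
    with np.load(path) as archive:
        return list(archive.files)
```

For example, `md5_of_file("val_202212.npz") == EXPECTED_MD5["val_202212.npz"]` should hold for an intact download, and `list_npz_keys("val_202212.npz")` shows which arrays the archive contains.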