Getting Started#
This page walks a new user from a fresh checkout to a working Hyena forward pass. For the full installation matrix (dev container, Docker, Apptainer, conda, venv) see the project README.
Requirements#
CUDA-compatible NVIDIA GPU (Ampere or newer)
CUDA Toolkit 12.0 or higher
Python 3.11 or higher
The optional fused RMSNorm kernel (quack-kernels) requires Hopper or
Blackwell (H100, B200, B300); on Ampere the library falls back to a
pure-PyTorch path automatically.
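As a quick way to check the GPU requirement above, the sketch below maps a CUDA compute capability to an architecture family. The helper names `arch_family` and `meets_minimum` are hypothetical and not part of nvsubquadratic. The mapping follows NVIDIA's published compute capabilities (8.x is Ampere, except 8.9 which is Ada; 9.x is Hopper; 10.x and above is Blackwell) and may need updating for future hardware.

```python
def arch_family(major: int, minor: int) -> str:
    """Map a CUDA compute capability to an architecture family (illustrative)."""
    if major == 8:
        return "Ada" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    if major >= 10:
        return "Blackwell"
    return "pre-Ampere"


def meets_minimum(major: int, minor: int) -> bool:
    """True when the device is Ampere (compute capability 8.0) or newer."""
    return (major, minor) >= (8, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:  # PyTorch not installed yet
        torch = None

    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"GPU 0: sm_{major}{minor} ({arch_family(major, minor)})")
        print("Meets the Ampere-or-newer requirement:", meets_minimum(major, minor))
    else:
        print("No CUDA device is visible to PyTorch.")
```

On a machine without PyTorch or without a visible GPU, the script only reports that no device was found.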
Install#
The recommended developer setup is conda:
bash setup_conda_env.sh
conda activate nvsubquadratic
This creates an environment with Python 3.12 and PyTorch 2.10 (CUDA
12.9), installs the dev dependencies, builds NVIDIA Apex from source,
and installs quack-kernels.
For an alternative venv-based install:
python3 -m venv venv
source venv/bin/activate
pip install torch==2.10.0 torchvision==0.25.0 \
    --index-url https://download.pytorch.org/whl/cu129
pip install -r requirements-dev.txt
pip install --no-build-isolation -e .
Docker, Apptainer, enroot/SLURM, and dev-container instructions live in the project README.
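After either install path, a short sanity check can confirm that the pieces named above are visible to the interpreter. This is a minimal sketch using only the standard library; the helpers `module_available` and `dist_version` are hypothetical. The import names `torch`, `apex`, and `nvsubquadratic` and the distribution name `quack-kernels` come from the steps above, and the assumption that each is discoverable under exactly that name should be checked against your environment.

```python
from importlib import metadata
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(name) is not None


def dist_version(name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Import names (assumed): checked without triggering a heavy import.
    for module in ("torch", "apex", "nvsubquadratic"):
        status = "found" if module_available(module) else "MISSING"
        print(f"module {module:<16} {status}")

    # Distribution names (assumed): reported with their installed version.
    for dist in ("torch", "quack-kernels"):
        print(f"dist   {dist:<16} {dist_version(dist) or 'not installed'}")
```

A missing `quack-kernels` entry is not necessarily an error, since the fused kernel is optional.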
Hello, Hyena#
A minimal forward pass through a 2D Hyena mixer:
import torch
from nvsubquadratic.lazy_config import LazyConfig, instantiate
from nvsubquadratic.modules.hyena_nd import Hyena
from nvsubquadratic.modules.kernels_nd import (
    SIRENKernelND,
    SIRENPositionalEmbeddingND,
)
from nvsubquadratic.ops.fftconv import fftconv2d_fp32_bhl
device = torch.device("cuda")
B, H, X, Y = 2, 64, 32, 32
x = torch.randn(B, H, X, Y, device=device)
# A SIREN-parameterised long-range 2D kernel.
kernel_cfg = LazyConfig(SIRENKernelND)(
    out_dim=H,
    data_dim=2,
    mlp_hidden_dim=64,
    num_layers=3,
    embedding_dim=32,
    omega_0=10.0,
    L_cache=max(X, Y),
    use_bias=True,
)
# Wire a Hyena mixer that consumes the kernel via a global FFT conv.
mixer_cfg = LazyConfig(Hyena)(
    global_conv_cfg=LazyConfig(lambda: None)(),  # placeholder; never instantiated in this snippet
    short_conv_cfg=LazyConfig(torch.nn.Identity)(),
    gate_nonlinear_cfg=LazyConfig(torch.nn.SiLU)(),
    pixelhyena_norm_cfg=LazyConfig(torch.nn.Identity)(),
    qk_norm_cfg=None,
)
# The configs above only illustrate the wiring and are not instantiated here.
# For a self-contained example, skip the LazyConfig dance and call the
# op directly:
kernel = torch.randn(1, H, X, Y, device=device)
y = fftconv2d_fp32_bhl(x, kernel)
print(y.shape) # torch.Size([2, 64, 32, 32])
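To make the shape contract of the direct call concrete, here is a minimal pure-PyTorch sketch of a depthwise 2D convolution computed through the FFT. It is not the library's implementation: `fftconv2d_reference` is a hypothetical name, and the choice to zero-pad (so the result is a linear convolution cropped to the input size, rather than a circular one) is an assumption, since this page does not say which convention `fftconv2d_fp32_bhl` follows. It runs on CPU.

```python
import torch


def fftconv2d_reference(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Depthwise 2D convolution via FFT (illustrative, zero-padded).

    x:      (B, H, X, Y) input
    kernel: (1 or B, H, Kx, Ky) one spatial filter per channel
    Returns a tensor with the same shape as x.
    """
    X, Y = x.shape[-2:]
    Kx, Ky = kernel.shape[-2:]
    # Pad to the full linear-convolution size so nothing wraps around.
    fft_size = (X + Kx - 1, Y + Ky - 1)

    # Compute in fp32: transform, multiply pointwise, transform back.
    x_f = torch.fft.rfft2(x.float(), s=fft_size)
    k_f = torch.fft.rfft2(kernel.float(), s=fft_size)
    y = torch.fft.irfft2(x_f * k_f, s=fft_size)

    # Keep the leading X-by-Y window and restore the input dtype.
    return y[..., :X, :Y].to(x.dtype)


B, H, X, Y = 2, 4, 8, 8
x = torch.randn(B, H, X, Y)
kernel = torch.randn(1, H, X, Y)
y = fftconv2d_reference(x, kernel)
print(y.shape)  # torch.Size([2, 4, 8, 8])
```

The kernel's leading dimension of 1 broadcasts over the batch, which mirrors the `(1, H, X, Y)` kernel used in the example above.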
The lower-level FFT ops in
nvsubquadratic.ops are deliberately
function-only so higher-level mixers can compose them freely. The
nvsubquadratic.modules package wraps them
in nn.Module-shaped mixers (Hyena, Mamba, Attention, CKConv), and
experiments wires those mixers into
Lightning-driven training pipelines.
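The split between function-only ops and module-level mixers can be illustrated with a toy example. Everything in the sketch below is hypothetical: `circular_fftconv2d` stands in for a function-only op, and `ToyGatedMixer` is a deliberately small `nn.Module` that owns the learnable state and composes that op with a gate. It is not how the library's Hyena mixer is built.

```python
import torch
from torch import nn


def circular_fftconv2d(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Function-only op: depthwise circular 2D convolution via FFT."""
    size = tuple(x.shape[-2:])
    x_f = torch.fft.rfft2(x, s=size)
    k_f = torch.fft.rfft2(kernel, s=size)
    return torch.fft.irfft2(x_f * k_f, s=size)


class ToyGatedMixer(nn.Module):
    """Module layer: holds parameters and composes the stateless op."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # One learnable spatial filter per channel, shared across the batch.
        self.kernel = nn.Parameter(0.02 * torch.randn(1, channels, height, width))
        self.gate = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the globally mixed signal with a pointwise nonlinearity of x.
        return self.gate(x) * circular_fftconv2d(x, self.kernel)


mixer = ToyGatedMixer(channels=8, height=16, width=16)
x = torch.randn(2, 8, 16, 16)
y = mixer(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```

Keeping the op stateless means the same function can be reused by different mixers, while parameters, initialisation, and gating live in the module.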
Next steps#
Architecture — the three-layer nvSubquadratic / subquadratic-ops / megatron-core story and the naming conventions used throughout the library.
Examples — end-to-end training recipes per dataset.
API Reference — the full curated API surface.