Now in Private Beta

The Vercel for Model Training

Ship ML faster. LorientAI abstracts away distributed compute, checkpointing, and orchestration so you can focus on what matters: your models.

train.py
import lorient

# Initialize training run
run = lorient.Run(
    model="llama-3-8b",
    dataset="my-dataset",
    compute="auto"  # We handle the infra
)

# Start distributed training
run.train(
    epochs=3,
    checkpoint_every="1h",
    eval_on=["hellaswag", "mmlu"]
)

# Deploy when ready
run.deploy(endpoint="production")
The Problem

ML infrastructure is broken

Distributed training requires PhD-level expertise

Setting up multi-GPU and multi-node training means weeks of DevOps work before you write a single line of model code.

Checkpointing is an afterthought

Until a preemption costs you 72 hours of compute. Then it becomes a fire drill.

Data pipelines are duct tape

Terabytes of training data scattered across S3, local disks, and that one researcher's laptop.

Dev doesn't match production

Code that works on your M2 Mac mysteriously fails on your A100 cluster.

The Solution

Infrastructure that disappears

Focus on your models. We handle everything else.

Training APIs

Distributed training without the distributed systems expertise.

  • Multi-GPU & multi-node out of the box
  • Automatic checkpointing & recovery
  • Built-in orchestration & scheduling

Data Infrastructure

Your training data, organized and accessible at any scale.

  • Git-like versioning for datasets
  • Streaming at terabyte scale
  • Visual data exploration tools

Dev Environments

Research iteration at the speed of thought.

  • Production-identical environments
  • Spin up in seconds, not hours
  • On-demand GPU access
How It Works

From zero to training in minutes

01

Connect

Point to your model and data. We support HuggingFace, custom PyTorch, and JAX.

config.py
lorient.connect(
  model="./my-llm",
  data="s3://bucket/training-data"
)
02

Configure

Define your training parameters. We auto-optimize for your hardware.

config.py
config = lorient.Config(
  batch_size="auto",
  learning_rate=2e-5,
  precision="bf16",
  checkpoint_interval="30m"
)
03

Launch

One command to start distributed training. We handle scaling, failures, and optimization.

terminal
lorient run train \
  --config ./config.py \
  --compute 8xA100 \
  --watch
Technical Deep Dive

The Mathematics Under the Hood

LorientAI abstracts complexity—but we believe in transparency. Here's the rigorous foundation powering your training runs.

§ Gradient Optimization & Numerical Methods

Distributed Gradient Accumulation

∇L(θ) = (1/N) Σᵢ₌₁ᴺ ∇Lᵢ(θ) ≈ (1/|B|) Σⱼ∈B ∇Lⱼ(θ)

LorientAI implements ring-allreduce with gradient compression using Top-K sparsification where K = ⌈ρ·d⌉ for sparsity ratio ρ ∈ (0,1]. Error feedback accumulation ensures convergence: eₜ₊₁ = eₜ + gₜ - Compress(eₜ + gₜ)
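The compression rule above fits in a few lines of NumPy. This is an illustrative sketch of Top-K sparsification with error feedback, not LorientAI's actual kernel; the function names are hypothetical:

```python
import numpy as np

def topk_compress(g, rho):
    """Keep the K = ceil(rho * d) largest-magnitude entries of g; zero the rest."""
    k = int(np.ceil(rho * g.size))
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out = np.zeros_like(g)
    out[idx] = g[idx]
    return out

def step_with_error_feedback(g, e, rho):
    """One step of e_{t+1} = e_t + g_t - Compress(e_t + g_t):
    compress the error-corrected gradient and carry the residual forward."""
    corrected = e + g
    compressed = topk_compress(corrected, rho)
    return compressed, corrected - compressed
```

Because the residual is re-added on the next step, no gradient mass is permanently discarded, which is what preserves convergence.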

Adaptive Learning Rate Scheduling

ηₜ = η₀ · min(t⁻⁰·⁵, t · warmup⁻¹·⁵) · √(1 - β₂ᵗ)/(1 - β₁ᵗ)

Our scheduler combines linear warmup with inverse square root decay, modulated by Adam's bias correction terms. For large batch training, we apply LARS: η̂ₗ = η · ‖wₗ‖/(‖∇L(wₗ)‖ + β‖wₗ‖) per layer l.
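Setting aside the per-layer LARS correction, the schedule itself is a one-liner. A minimal sketch (function name and defaults are illustrative):

```python
import math

def lr_schedule(t, eta0=1.0, warmup=1000, beta1=0.9, beta2=0.999):
    """eta_t = eta0 * min(t^-0.5, t * warmup^-1.5) * sqrt(1 - beta2^t) / (1 - beta1^t).
    Linear warmup for t < warmup, inverse-sqrt decay afterward, modulated by
    Adam's bias-correction ratio (valid for steps t >= 1)."""
    base = eta0 * min(t ** -0.5, t * warmup ** -1.5)
    bias = math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    return base * bias
```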

Mixed-Precision Numerics

x̃ = clamp(⌊x/s⌉, -2ᵇ⁻¹, 2ᵇ⁻¹-1) · s, where s = max(|x|)/(2ᵇ⁻¹-1)

BF16 master weights with FP8 forward pass (E4M3) and backward pass (E5M2). Loss scaling with dynamic exponent: scale = 2ᵏ where k = max{k : 2ᵏ·‖∇L‖∞ < FP16_MAX}.
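The quantization rule above works for any bit width. Here is a small NumPy sketch of symmetric absmax quantization on an integer grid (the FP8 formats mentioned use a different, non-uniform grid; this illustrates only the clamp-round-scale structure):

```python
import numpy as np

def absmax_quantize(x, bits=8):
    """x~ = clamp(round(x / s), -2^(b-1), 2^(b-1) - 1) * s,
    with scale s = max|x| / (2^(b-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1
    s = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / s), -(qmax + 1), qmax)
    return q * s, s
```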

Tensor Parallelism Sharding

Y = GeLU(XA₁)A₂ → Y = AllReduce(GeLU(X[A₁]ᶜᵒˡ)·[A₂]ʳᵒʷ)

Column-parallel linear layers partition A ∈ ℝᵈˣᵏ into [A₁|A₂|...|Aₚ] across P devices. For attention: Q,K,V heads are distributed, with softmax(QᵢKᵢᵀ/√dₖ)Vᵢ computed locally per head before the output projection's all-reduce.
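The sharding identity is easy to check numerically: split A₁ by columns and A₂ by rows, let each "device" compute a partial product, and treat the AllReduce as a plain sum. A toy NumPy simulation (not the real communication path):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
A1 = rng.standard_normal((8, 16))   # first linear: column-parallel
A2 = rng.standard_normal((16, 8))   # second linear: row-parallel
P = 4                               # simulated devices

partials = [gelu(X @ a1) @ a2
            for a1, a2 in zip(np.split(A1, P, axis=1), np.split(A2, P, axis=0))]
Y_parallel = np.sum(partials, axis=0)   # the "AllReduce"
Y_single = gelu(X @ A1) @ A2            # single-device reference
```

Because GeLU is elementwise, it commutes with the column split, so no communication is needed between the two linears.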

§ Convergence Theory & Complexity Analysis

Convergence Guarantees

For L-smooth, μ-strongly convex objectives with SGD:

𝔼[‖θₜ - θ*‖²] ≤ (1 - μη)ᵗ‖θ₀ - θ*‖² + ησ²/μ

Where σ² bounds gradient variance. Our adaptive batching maintains:
Var(ĝ) = σ²/|B| ≤ ε² ⟹ |B| ≥ σ²/ε²

Critical batch size Bₖᵣᵢₜ ≈ tr(Σ)/‖∇L‖² (the gradient noise scale) determines scaling efficiency.

Second-Order Approximations

Shampoo preconditioner for matrix parameter W ∈ ℝᵐˣⁿ:

Lₜ = (Σₛ₌₁ᵗ GₛGₛᵀ + εI)^(1/4) ∈ ℝᵐˣᵐ
Rₜ = (Σₛ₌₁ᵗ GₛᵀGₛ + εI)^(1/4) ∈ ℝⁿˣⁿ

Update: Wₜ₊₁ = Wₜ - η·Lₜ⁻¹GₜRₜ⁻¹

Matrix roots computed via coupled Newton iteration:
Xₖ₊₁ = ½(Xₖ + Yₖ⁻¹), Yₖ₊₁ = ½(Yₖ + Xₖ⁻¹)
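The coupled iteration above is the Denman-Beavers scheme: starting from X₀ = A, Y₀ = I, it converges to X → A^(1/2) and Y → A^(−1/2) (apply it twice to reach the fourth roots Shampoo needs). A NumPy sketch for small SPD matrices:

```python
import numpy as np

def denman_beavers_sqrt(A, iters=30):
    """Coupled Newton iteration: X_{k+1} = (X_k + Y_k^-1)/2,
    Y_{k+1} = (Y_k + X_k^-1)/2, with X -> A^(1/2) and Y -> A^(-1/2)."""
    X = A.copy()
    Y = np.eye(A.shape[0])
    for _ in range(iters):
        # both updates use the previous X, Y (tuple RHS is evaluated first)
        X, Y = 0.5 * (X + np.linalg.inv(Y)), 0.5 * (Y + np.linalg.inv(X))
    return X, Y
```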

Attention Complexity Reduction

Standard attention: O(n²d) time, O(n²) memory

Flash Attention tiling with block sizes Bᵣ, Bᶜ:
- Load Qᵢ ∈ ℝ^(Bᵣ×d), Kⱼ,Vⱼ ∈ ℝ^(Bᶜ×d) to SRAM
- Compute Sᵢⱼ = QᵢKⱼᵀ ∈ ℝ^(Bᵣ×Bᶜ)
- Online softmax: mᵢⱼ = max(mᵢ,ⱼ₋₁, rowmax(Sᵢⱼ))
- Rescale: ℓᵢⱼ = e^(mᵢ,ⱼ₋₁-mᵢⱼ)ℓᵢ,ⱼ₋₁ + rowsum(e^(Sᵢⱼ-mᵢⱼ))

IO complexity: O(n²d²/M) for SRAM size M.
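The core trick is the online softmax: a running max m and normalizer ℓ are updated block by block, so the full n×n score matrix never has to be materialized at once. A 1-D NumPy sketch of just that rescaling rule:

```python
import numpy as np

def online_softmax(scores, block=4):
    """Streaming softmax: keep running max m and normalizer l; each new block
    rescales l by exp(m_old - m_new), matching the update rule above."""
    m, l = -np.inf, 0.0
    for j in range(0, scores.size, block):
        s = scores[j:j + block]
        m_new = max(m, s.max())
        l = l * np.exp(m - m_new) + np.exp(s - m_new).sum()
        m = m_new
    return np.exp(scores - m) / l  # materialized here only to check the result
```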

Gradient Checkpointing Trade-offs

Memory-compute Pareto frontier for transformer with L layers:

Standard: O(L·n·d) memory, O(1) recomputation
√L checkpointing: O(√L·n·d) memory, O(√L) recomputation
Selective: checkpoint at layers {⌊iL/k⌋ : i ∈ [k]}

Optimal k minimizes: T(k) = T_fwd(1 + (L-k)/k) + T_bwd
Subject to: M(k) = M_act·k + M_param ≤ M_available

Our solver uses dynamic programming over checkpoint placements: V(l, m) = min over c ∈ [l] of recompute(c) + V(l − c, m − c·M_act).

§ Distributed Systems & Communication Primitives

Pipeline Parallelism Scheduling

1F1B (One Forward One Backward) steady state:

Total time for m microbatches: T = (m + p - 1)·(t_f + t_b)
Pipeline bubble ratio: β = (p-1)/(m + p - 1)

For p stages, m microbatches, minimize β subject to:
m·(memory per microbatch) ≤ available memory

Interleaved schedule with v virtual stages:
β_interleaved = (p-1)/(m·v + p - 1)

Zero Bubble scheduling eliminates β via:
- W (weight gradient) decoupled from B (activation gradient)
- Schedule: F₁F₂...FₚB₁W₁B₂W₂...BₚWₚ
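The bubble formulas make the scheduling trade-off easy to explore; a minimal helper (illustrative, not the LorientAI scheduler):

```python
def bubble_ratio(p, m, v=1):
    """Pipeline bubble fraction for p stages and m microbatches;
    v > 1 models an interleaved schedule with v virtual stages per device."""
    return (p - 1) / (m * v + p - 1)
```

Raising m or v shrinks the bubble, at the cost of activation memory (more in-flight microbatches) or extra communication (more stage boundaries).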

Communication Topology Optimization

Ring AllReduce for N workers, message size M:

T_ring = 2(N-1) · (α + M·β/N)

Where α = latency, β = inverse bandwidth.

For hierarchical networks (intra-node NVLink, inter-node IB):
T_hier = T_intra_reduce + T_inter_allreduce + T_intra_broadcast

Bandwidth-optimal 2D torus AllReduce:
T_2D = 2·(√N - 1)·(α + M·β/√N)

LorientAI auto-selects: argmin_{topo} T_topo(N, M, α, β)
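That selection can be sketched in the standard α-β cost model; the function names and the two-topology menu here are illustrative:

```python
def t_ring(N, M, alpha, beta):
    """Ring AllReduce: 2(N-1) steps, each moving an M/N-sized chunk."""
    return 2 * (N - 1) * (alpha + M * beta / N)

def t_torus_2d(N, M, alpha, beta):
    """2D-torus AllReduce over a sqrt(N) x sqrt(N) grid (N a perfect square)."""
    s = N ** 0.5
    return 2 * (s - 1) * (alpha + M * beta / s)

def pick_topology(N, M, alpha, beta):
    """argmin over candidate topologies of the modeled AllReduce time."""
    costs = {"ring": t_ring(N, M, alpha, beta),
             "torus_2d": t_torus_2d(N, M, alpha, beta)}
    return min(costs, key=costs.get)
```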

ZeRO Memory Optimization

Memory per GPU for model with Ψ parameters:

Stage 1 (Optimizer States): 4Ψ + 12Ψ/N
Stage 2 (+ Gradients): 2Ψ + (2 + 12)Ψ/N
Stage 3 (+ Parameters): 16Ψ/N

Communication overhead per step:
ZeRO-1: 0 extra (optimizer step local)
ZeRO-2: Ψ · sizeof(grad) AllReduce → Reduce-Scatter
ZeRO-3: 2 · AllGather(Ψ/N) per layer (fwd + bwd)

Partition granularity g trades memory vs communication:
Memory: Ψ·sizeof(param)/g per partition
Comm: O(g) AllGather calls per layer
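The per-GPU figures above assume mixed-precision Adam (2 bytes each for FP16 parameters and gradients, plus 12 bytes/param of FP32 optimizer state). A small calculator following that table:

```python
def zero_bytes_per_gpu(psi, N, stage):
    """Memory per GPU (bytes) for psi parameters sharded over N GPUs:
    stage 1 -> 4*psi + 12*psi/N, stage 2 -> 2*psi + 14*psi/N, stage 3 -> 16*psi/N."""
    params, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage == 1:
        return params + grads + opt / N
    if stage == 2:
        return params + (grads + opt) / N
    if stage == 3:
        return (params + grads + opt) / N
    raise ValueError("stage must be 1, 2, or 3")
```

For a 7.5B-parameter model on 64 GPUs, Stage 3 drops the footprint from 120 GB total to under 2 GB per GPU.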

§ Transformer Architecture Primitives

Rotary Position Embeddings

R_Θ,m = block-diag(R(mθ₁), R(mθ₂), …, R(mθ_d/2)), where R(φ) = [[cos φ, −sin φ], [sin φ, cos φ]]

RoPE encodes position via rotation: (Rθ,mqₘ)ᵀ(Rθ,nkₙ) = qₘᵀRθ,n-mkₙ. Base frequency θᵢ = 10000^(-2i/d). NTK-aware scaling: θ'ᵢ = θᵢ · α^(d/(d-2i)) for context extension factor α.
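The relative-position property is easy to verify numerically. A NumPy sketch of RoPE applied to a single vector, using the interleaved-pair layout (one of several common conventions):

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate consecutive pairs (x_{2i}, x_{2i+1}) by angle m * theta_i,
    with theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    ang = m * theta
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

Shifting both positions by the same offset leaves the dot product unchanged, which is exactly the identity (R_Θ,m·qₘ)ᵀ(R_Θ,n·kₙ) = qₘᵀ·R_Θ,n−m·kₙ.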

Grouped Query Attention

Attention(Q, K, V) = softmax(QKᵀ/√dₖ + M)V where K,V ∈ ℝ^(n×dₖ/g)

GQA with g groups: nₕ query heads share nₖᵥ = nₕ/g key-value heads. Memory: O(2·n·d/g) vs O(2·n·d) for MHA. Conversion from MHA via mean pooling: K_gqa[i] = mean(K_mha[i·g:(i+1)·g]) per group i.

SwiGLU Activation

SwiGLU(x, W, V, b, c) = Swish(xW + b) ⊗ (xV + c)

Where Swish(x) = x·σ(βx) and σ is the sigmoid. Hidden dim d_ff = ⌊(2/3)·4·d_model⌋ = ⌊(8/3)·d_model⌋, rounded to a multiple of 256 for tensor-core alignment. Gradient: ∂SwiGLU/∂x = Swish′(xW)⊗(xV)·Wᵀ + Swish(xW)·Vᵀ

RMSNorm Numerical Stability

RMSNorm(x) = x/RMS(x) · γ, where RMS(x) = √(Σxᵢ²/d + ε)

Backward pass: ∂L/∂x = γ/RMS(x)·(∂L/∂y − x̂·mean(∂L/∂y ⊙ x̂)), where x̂ = x/RMS(x). For mixed precision: compute RMS in FP32, cast to BF16 for the multiply. Fused kernel: a single pass accumulating the sum of squares.
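A reference forward pass is tiny. This NumPy sketch performs the reduction in float64 as a stand-in for the FP32 accumulation mentioned above:

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """y = x / sqrt(mean(x^2) + eps) * gamma, with the mean-square
    reduction done in float64 over the last axis."""
    ms = np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True)
    return (x / np.sqrt(ms + eps)) * gamma
```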

Loss Landscape & Scaling Laws

Chinchilla Optimal Compute

Given compute budget C (in FLOPs):
N_opt ∝ C^0.5  (optimal parameters)
D_opt ∝ C^0.5  (optimal tokens)

Loss scaling: L(N,D) = E + A/N^α + B/D^β
Where α ≈ 0.34, β ≈ 0.28, E = irreducible loss

Forward pass: C_fwd ≈ 2·N·D
Including backward: C ≈ 2·N·D·(1 + 2) = 6·N·D
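Combining the C ≈ 6·N·D total-compute approximation with the Chinchilla finding of roughly 20 training tokens per parameter (an assumption imported from the paper, not stated above) gives a closed-form allocation:

```python
def chinchilla_allocation(C, tokens_per_param=20.0):
    """Split FLOP budget C into parameters N and tokens D under C = 6*N*D
    with D = tokens_per_param * N; both then scale as C^0.5."""
    N = (C / (6.0 * tokens_per_param)) ** 0.5
    return N, tokens_per_param * N
```

Quadrupling the budget only doubles both N and D, which is why compute-optimal models grow far more slowly than the compute itself.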

μP Hyperparameter Transfer

Maximal Update Parametrization for width scaling:

Input weights:  W_in ~ N(0, 1/d_in)
Output weights: W_out ~ N(0, 1/d_model)
Attention logits: QKᵀ/d_head (not √d_head)

Learning rate scaling:
  η_embed = η_base · m_width
  η_hidden = η_base
  η_output = η_base / m_width

Where m_width = d_model / d_base.
Enables HP transfer from 10M → 10B params.

Emergent Capabilities Phase Transitions

Capability emergence as function of compute C:

P(capability | C) = σ((log C - log C_crit) / τ)

Where C_crit = critical compute threshold, τ = sharpness.

Sharp transitions occur when:
∂²L/∂C² changes sign (loss curvature inflection)

Grokking dynamics: generalization after memorization
t_grok ∝ 1/(λ_reg · |S_train|) for regularization λ_reg

LorientAI monitors: ‖∇L_train‖/‖∇L_val‖ ratio for emergence detection.

All optimizations are applied automatically. You write lorient.train()—we handle the Hessian-vector products.

Results that matter

10x faster iteration

Go from idea to trained model in hours, not weeks.

94% less infra code

Stop writing YAML. Start shipping models.

Zero cold starts

Environments ready when you are.

$0 wasted on failures

Automatic checkpointing means no lost progress.

Meet the team

Engineers and researchers building the future of ML infrastructure.

Nimil Patel

Co-Founder, CEO

Built an agent copilot at a funded startup; deployed for a $1M+/yr client. BS Applied Math, UNC.

Naresh Dhiman

Co-Founder, CTO

Serial entrepreneur and Techstars alum; previous startup acquired by Groupon. 15+ years of full-stack development experience.

Vivek Srinivas

Co-Founder, ML Lead

Data Engineer @ Acorns. MS CS, ASU — 4.0 GPA. US Patent (LSTM).

Tata Avirneni

DevOps & Operations

AWS, Kubernetes, cloud infra. MS IT, Arkansas Tech.

Ready to ship ML faster?

Join the private beta and get early access to infrastructure that lets you focus on what matters.

Join 500+ ML engineers on the waitlist. No spam, ever.