$0.15 / $1.00 / $0.05 per million tokens (input / output / cache). OpenAI-compatible. Drop $10 in, get an API key emailed, start sending requests in under two minutes.
Pay with card, Apple Pay, or Google Pay. Your API key is emailed instantly. Launch bonus auto-applied at checkout.
Test the API on us. ~666K input tokens, enough to verify it works for your workload. One per email.
Paste your emailed API key into the snippet below. Works with the official OpenAI SDK too.
```bash
# After you receive your key by email:
curl https://api.lighterhub.app/v1/chat/completions \
  -H "Authorization: Bearer lh_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen/qwen3.6-35b-a3b","messages":[{"role":"user","content":"hello"}]}'
```
Stress-tested on Qwen3.6 35B-A3B FP8; benchmark run 2026-05-08. All figures were measured end to end through the public API endpoint, including Cloudflare Tunnel overhead. Full benchmark report available on request.
OpenAI-compatible — change one line and you're done.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Qwen3.6 35B-A3B is a fast MoE model with 3B active parameters. These are the workloads it handles best — and where LighterHub's pricing gives you the most leverage.
Low active parameter count means fast decode on code completions and edits. Tool calling supported. Used in production by Cline, Roo Code, and similar agentic coding tools.
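A minimal tool-call round trip with the OpenAI SDK might look like the sketch below; the `get_weather` function and its schema are illustrative, not part of the API:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

# Illustrative tool definition -- the schema follows the standard OpenAI format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```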
Agents repeat the same large system prompt on every step. Prefix caching turns that repeated context into a 67% discount — automatically, no code changes needed.
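In practice that means sending the same system prompt on every step and checking the usage block. Whether cached tokens are reported under the OpenAI-style `prompt_tokens_details.cached_tokens` field is an assumption in this sketch; the discount itself applies server-side either way:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

system_prompt = "You are a coding agent. <several thousand tokens of instructions>"

for step in ["List the files.", "Open main.py."]:
    response = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=[
            {"role": "system", "content": system_prompt},  # identical prefix every step
            {"role": "user", "content": step},
        ],
    )
    # If cache telemetry is exposed OpenAI-style, cached tokens show up here
    # (assumption -- the field name follows the OpenAI convention).
    details = getattr(response.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None) if details else None
    print(response.usage.prompt_tokens, "prompt tokens,", cached, "cached")
```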
65K context window with prefix caching on document chunks. Run large document retrieval pipelines without paying full input price on every query that shares the same context.
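One cache-friendly prompt layout, assuming the cache keys on identical leading tokens as prefix caches generally do: keep the large shared document first and the per-query text last.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

# Load the shared context once; every query that reuses this exact prefix
# should hit the cache and be billed at the cache-read rate.
document = Path("handbook.txt").read_text()  # illustrative document

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=[
            # Keep the large, stable content first...
            {"role": "system", "content": f"Answer from this document:\n\n{document}"},
            # ...and the part that changes per request last.
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("What is the vacation policy?"))
print(ask("Who approves expenses?"))  # same prefix -> cached input
```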
Conversation history grows with each turn. The MoE architecture keeps decode latency flat as context lengthens, without the slowdown at 20K or 40K tokens that dense models show.
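A minimal multi-turn loop with the SDK shown above: the history list grows by two messages per turn, and the earlier turns form exactly the stable prefix the cache can reuse.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["Summarize MoE models in one line.", "Now compare them to dense models."]:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=messages,  # full history, resent each turn
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(reply)
```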
Per-token pricing, no minimums, no seat fees.
| Model | Hardware | Context | Input | Output | Cache read |
|---|---|---|---|---|---|
| Qwen3.6 35B-A3B | A100 80GB, dedicated | 65K (enforced) | $0.15/M | $1.00/M | $0.05/M (67% off input) |
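To make the rates concrete, here is a worked cost example at the table prices; the request sizes are illustrative:

```python
# An agent step with a 10,000-token cached prefix, 2,000 fresh input tokens,
# and 500 output tokens, at the table rates.
CACHE_READ = 0.05 / 1_000_000  # $/token
INPUT = 0.15 / 1_000_000
OUTPUT = 1.00 / 1_000_000

cost = 10_000 * CACHE_READ + 2_000 * INPUT + 500 * OUTPUT
print(f"${cost:.6f} per step")  # $0.001300
```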
We focus on high-demand open-weight models where dedicated GPU capacity improves price, latency, or availability. New models added as demand warrants.
We wrote a custom provider wrapper — not raw vLLM — specifically to meet production requirements.
No cold starts, no shared queues. A100 80GB reserved exclusively for inference — predictable latency at any hour.
Drop-in replacement. Supports tools, logprobs, reasoning, streaming, and structured outputs.
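For instance, structured outputs can be requested through the standard `response_format` parameter; the schema below is illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

# Illustrative schema -- the request shape follows the standard OpenAI
# structured-outputs format.
response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Extract: 'Ada Lovelace, born 1815'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "birth_year": {"type": "integer"},
                },
                "required": ["name", "birth_year"],
            },
        },
    },
)
print(response.choices[0].message.content)  # a JSON string matching the schema
```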
Automatic KV cache reuse at $0.05/M cache read — 67% off input price on repeated context for RAG, agents, and long system prompts.
Payload logging is off by default. Operational logs contain only metadata: token counts, latency, and status codes — never prompt content.
Token usage is returned on every response; both the streaming path (via stream_options) and the non-streaming path are verified.
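To see this on the streaming path, request the usage block via `stream_options` as in the standard OpenAI API; the final chunk carries the totals:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

stream = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
    stream_options={"include_usage": True},  # ask for usage on the final chunk
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:  # populated only on the last chunk
        print(f"\n{chunk.usage.prompt_tokens} in / {chunk.usage.completion_tokens} out")
```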
HTTP 429 under overload — no unbounded queues, no silent timeouts. Provider failures return 5xx and are never charged to the client.
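A minimal client-side sketch of handling 429s with exponential backoff; note the OpenAI SDK already retries these by default, so `max_retries=0` here just makes the loop explicit:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="<your-api-key>",
    max_retries=0,  # disable the SDK's built-in retries for this demo
)

# A 429 means "try again later" and is never charged.
for attempt in range(5):
    try:
        response = client.chat.completions.create(
            model="qwen/qwen3.6-35b-a3b",
            messages=[{"role": "user", "content": "hello"}],
        )
        print(response.choices[0].message.content)
        break
    except RateLimitError:
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```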
Every request passes through a custom provider wrapper — not raw vLLM — with full usage accounting, rate limiting, and health monitoring.
Questions about integration, capacity, or pricing? We respond within one business day.
founder@lighterhub.app