All systems operational — api.lighterhub.app/health
Qwen3.6 35B · 65K context · FP8 · A100

Cheap, fast Qwen3.6 inference.
Pay per token, no signup.

$0.15 / $1.00 / $0.05 per million tokens (input / output / cache). OpenAI-compatible. Drop $10 in, get an API key emailed, start sending requests in under two minutes.

Launch Week · +50% bonus credits on every deposit

Pay with card, Apple Pay, or Google Pay. Your API key is emailed instantly. Launch bonus auto-applied at checkout.

Or try $0.10 in free credits — no card

Test the API on us. ~666K input tokens, enough to verify it works for your workload. One per email.

Already have an account? Recover key →
OpenAI-compatible API
65K context window
Streaming & reasoning supported

One curl command after you buy

Paste your emailed API key into the snippet below. Works with the official OpenAI SDK too.

# After you receive your key by email:
curl https://api.lighterhub.app/v1/chat/completions \
  -H "Authorization: Bearer lh_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen/qwen3.6-35b-a3b","messages":[{"role":"user","content":"hello"}]}'

Performance

Stress-tested on Qwen3.6 35B-A3B FP8. Measured end-to-end through Cloudflare Tunnel, May 2026.

663 tok/s · Sustained throughput
10-min load test, concurrency 32, aggregate across all streams

0% · Error rate
3,343 consecutive requests with 0 failures

4 ms p50 · Health check latency
p95: 5 ms — consistent response under load

1,384 tok/s · Peak aggregate TPS
Measured at concurrency 32, burst condition

FP8 · Quantization
Official Qwen pre-quantized checkpoint — not runtime quantization

1.79M tokens · KV cache capacity
5.3× larger than BF16 — enough for ~27 concurrent 65K-context requests

$0.05/M · Cache read price
Prefix caching enabled — 67% off input price on cache hits
32 · Concurrent requests
Semaphore-enforced cap; 429 returned cleanly above the limit

Benchmark source: stress test run 2026-05-08. Full benchmark report available on request. All figures measured through the public API endpoint including Cloudflare Tunnel overhead.

Drop-in replacement

OpenAI-compatible — change one line and you're done.

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Direct access also available at api.lighterhub.app/v1/chat/completions with a LighterHub API key.
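
Streaming works through the same client. A minimal sketch, reusing the client from the snippet above and assuming the official OpenAI Python SDK; stream_options.include_usage asks for the usage object on the final chunk, which is how token counts are verified (see "Accurate usage accounting" below).

# Stream tokens as they arrive; the final chunk carries the usage object.
stream = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # present only on the final chunk
        print(f"\n[{chunk.usage.prompt_tokens} in, {chunk.usage.completion_tokens} out]")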

Built for agentic workloads

Qwen3.6 35B-A3B is a fast MoE model with 3B active parameters. These are the workloads it handles best — and where LighterHub's pricing gives you the most leverage.

🤖 Coding assistants & IDE agents

Low active parameter count means fast decode on code completions and edits. Tool calling supported (see the sketch below). Used in production by Cline, Roo Code, and similar agentic coding tools.

$0.15/M input · tool calling · fast decode
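
A minimal tool-calling sketch, reusing the client from the Python snippet above. The tools payload is the standard OpenAI format; the read_file tool and its schema are hypothetical, for illustration only.

# One hypothetical tool in the standard OpenAI tools format.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool, for illustration
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "What's in src/main.py?"}],
    tools=tools,
)

# If the model decided to call the tool, the call arrives as structured JSON.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)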
🔁 Multi-step agents

Agents repeat the same large system prompt on every step. Prefix caching turns that repeated context into a 67% discount — automatically, no code changes needed; see the sketch below.

$0.05/M on cached tokens · automatic KV reuse
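
A minimal sketch of that pattern, assuming the official OpenAI SDK: keep the system prompt byte-identical and append to the history rather than rebuilding it, so every step after the first reads the shared prefix from cache at $0.05/M. The prompt, task, and observations here are placeholders.

from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

# Byte-identical across steps: this is the prefix the cache reuses.
SYSTEM = "You are a coding agent. <tool definitions, ~8K tokens>"  # placeholder

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Refactor the auth module."},  # placeholder task
]

for step in range(5):
    resp = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=messages,
    )
    # Append instead of rebuilding, so earlier turns stay cache hits too.
    messages.append({"role": "assistant", "content": resp.choices[0].message.content})
    messages.append({"role": "user", "content": f"observation for step {step}"})  # placeholder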
📚 Long-context RAG

65K context window with prefix caching on document chunks. Run large document retrieval pipelines without paying full input price on every query that shares the same context.

65K context · cache read at $0.05/M
💬 Multi-turn chat applications

Conversation history grows with each turn. The MoE architecture keeps latency flat as context lengthens, without the degradation dense models show at 20K or 40K tokens.

flat latency · no cold starts · dedicated GPU

Common questions

Anything we missed? Email us and we'll add it.

Is there a free trial?
Yes. Submit your email at the top of the page and we'll send you an API key with $0.10 in credits — about 666K input tokens. No credit card required, one trial per email. Use it to make sure the API works for your workload before you deposit anything.
How do refunds work?
Unused credits are refundable for 14 days from purchase. Consumed credits (successful API requests) are not refundable. Charges from 5xx errors or timeouts are refunded on request. Full details in our Refund Policy.
What happens if I run out of credits mid-request?
Each request requires a small reservation upfront. If your balance can't cover the reservation, the request returns a 402 error before any tokens are generated, so you're never billed for a partially completed request. Top up at any of the buy buttons and your existing API key resumes immediately.
Are my prompts and responses logged?
No. Request and response payloads are never written to disk. We log only timestamps, token counts, request IDs, and your API key prefix — strictly for billing and abuse detection. See the Privacy Policy for specifics.
What are the rate limits?
12 concurrent requests and 1.5M reserved in-flight tokens per provider, shared across all users. When the GPU is saturated, you get a clean 429 response so your client can back off — we never silently queue requests. If you need higher dedicated capacity, email us.
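
A minimal backoff sketch for those two status codes, assuming the official OpenAI Python SDK (which also retries 429s a couple of times on its own): retry 429 with exponential backoff, and surface 402 immediately, since it needs a top-up rather than a retry.

import time
from openai import OpenAI, APIStatusError

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="<your-api-key>")

def complete_with_backoff(messages, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        try:
            return client.chat.completions.create(
                model="qwen/qwen3.6-35b-a3b", messages=messages
            )
        except APIStatusError as e:
            if e.status_code == 429:   # GPU saturated: back off and retry
                time.sleep(delay)
                delay = min(delay * 2, 30.0)
            else:                      # 402 (balance) or anything else: don't retry
                raise
    raise RuntimeError("still rate-limited after retries")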
Can I get an invoice for accounting?
Yes. Stripe sends a receipt automatically after every payment. If you need a formal invoice with VAT, business name, or PO number, email founder@lighterhub.app with the receipt and we'll send a proper invoice within one business day.
What happens after launch week ends?
The 50% bonus turns off and deposits credit at face value. The base pricing ($0.15 / $1.00 / $0.05 per million tokens for input / output / cache) doesn't change. Any credits you bought during launch week — including bonus credits — stay in your account.
I lost my API key. Can I recover it?
Yes. Click "Already have an account? Recover key" near the buy buttons, enter the same email you used at checkout, and we'll send a single-use magic link valid for 30 minutes. The key itself doesn't change, just your access to it.
Is there an SLA?
Not yet. The service runs on a single dedicated A100 80GB GPU — there's no redundancy, and we don't promise 99.9% uptime. We're transparent about it. Live status is at api.lighterhub.app/health. If reliability matters more than price, larger providers exist; if you want indie pricing on real hardware, you're in the right place.
Where is inference run?
United States, on dedicated GPU hardware. Traffic enters via Cloudflare's global edge so latency from outside the US is low, but the actual compute happens on a US-based A100 80GB.

Models

Per-token pricing, no minimums, no seat fees.

Model: Qwen3.6 35B-A3B · FP8 MoE, 3B active · prefix caching
Hardware: A100 80GB, dedicated
Context: 65K (enforced)
Input: $0.15/M · Output: $1.00/M · Cache read: $0.05/M (67% off input)

We focus on high-demand open-weight models where dedicated GPU capacity improves price, latency, or availability. New models added as demand warrants.

Built for reliability

We wrote a custom provider wrapper — not raw vLLM — specifically to meet production requirements.

Dedicated GPU capacity

No cold starts, no shared queues. A100 80GB reserved exclusively for inference — predictable latency at any hour.

🔌 OpenAI-compatible API

Drop-in replacement. Supports tools, logprobs, reasoning, streaming, and structured outputs.

💾 Prefix caching

Automatic KV cache reuse at $0.05/M cache read — 67% off input price on repeated context for RAG, agents, and long system prompts.

🔒 No prompt logging

Payload logging is off by default. Operational logs contain only metadata: token counts, latency, and status codes — never prompt content.

📊 Accurate usage accounting

Usage tokens returned on every response — both streaming (stream_options) and non-streaming paths verified.

🛡️ Predictable failure behavior

HTTP 429 under overload — no unbounded queues, no silent timeouts. Provider failures return 5xx and are never charged to the client.

Built for production

Every request passes through a custom provider wrapper — not raw vLLM — with full usage accounting, rate limiting, and health monitoring.

Request path, top to bottom:

Inference marketplace · routes to api.lighterhub.app
HTTPS · Cloudflare Tunnel · TLS 1.3
FastAPI provider wrapper · port 8080, localhost-only uvicorn · /health, /readiness, /v1/models · 429 on overload
vLLM 0.20.1 · Qwen3.6-35B-A3B-FP8 · localhost:8000, not externally reachable · FP8 Marlin, prefix cache, chunked prefill
NVIDIA A100 80GB PCIe · direct GPU access · 34.7 GiB model · 1.79M-token KV cache · US
Reliability
Uptime monitor pings /health every 30s with persistent JSONL log
Auto-restart loops for wrapper (5s) and vLLM (30s backoff)
/readiness returns structured JSON: uptime_24h, error_rate, latency_p50 (see the probe sketch at the end of this section)
Billing accuracy
Usage object enforced on every non-streaming response
Streaming usage via stream_options.include_usage
5xx failures and cancelled streams tracked as non-billable
Security
Auth brute-force rate limiting per IP (20 failures / 5 min)
Concurrency semaphore at 32 · HTTP 429 on overload, no queue
2MB request body cap · timing-safe token comparisons
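
A minimal monitoring sketch against those endpoints, assuming the requests library; the /readiness field names are the three listed above, and the 1% error-rate threshold is an arbitrary choice for illustration.

import requests

BASE = "https://api.lighterhub.app"

# /health is the uptime probe (4 ms p50 per the benchmarks above).
health = requests.get(f"{BASE}/health", timeout=5)
print("up" if health.ok else f"down ({health.status_code})")

# /readiness returns structured JSON: uptime_24h, error_rate, latency_p50.
ready = requests.get(f"{BASE}/readiness", timeout=5).json()
if ready.get("error_rate", 0.0) > 0.01:  # arbitrary alert threshold
    print(f"degraded: error_rate={ready['error_rate']}")
else:
    print(f"healthy: uptime_24h={ready.get('uptime_24h')}, p50={ready.get('latency_p50')}")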

Get in touch

Questions about integration, capacity, or pricing? We respond within one business day.

founder@lighterhub.app