API online — live status
Qwen3.6 35B · 262K context · FP8 · A100

Try Qwen3.6 inference in your app in 2 minutes.

OpenAI-compatible API with 1,000,000 input tokens free. No card. Add prepaid credits only after it works for your workload.

$0.15/M input tokens
$0.05/M cached reads
262K model context
No prompt payload logging
Free trial

Get a free API key

We email an API key with 1,000,000 input tokens. One trial per email; the key expires after 48 hours.

No card · No subscription · Works with OpenAI SDKs
Launch Week · +50% bonus credits · 7 days left

Already validated the API? Top up prepaid credits. No auto-renewal.

Already have an account? Recover key →

Your first request after the key arrives

Use these settings in any OpenAI-compatible tool, or paste the cURL request to verify the key immediately.

App settings
Provider: OpenAI-compatible
Base URL: https://api.lighterhub.app/v1
Model: qwen/qwen3.6-35b-a3b
API key: lh_YOUR_KEY
Works in OpenWebUI, LiteLLM, Hermes Agent, Cursor/custom agents, and official OpenAI SDK clients.
cURL test
# After you receive your key by email:
curl https://api.lighterhub.app/v1/chat/completions \
  -H "Authorization: Bearer lh_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "hello"}]
  }'
SDK examples
Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="lh_YOUR_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
Node.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.lighterhub.app/v1",
  apiKey: "lh_YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "qwen/qwen3.6-35b-a3b",
  messages: [{ role: "user", content: "hello" }],
});
console.log(response.choices[0].message.content);

Enough detail to trust the trial

Stress-tested on Qwen3.6 35B-A3B FP8 through the public API endpoint, including Cloudflare Tunnel overhead.

663 output tok/s
Sustained aggregate throughput
10-min load test at concurrency 32; aggregate output tokens/sec across all streams
0%
Error rate
3,343 consecutive requests with 0 failures
4ms p50
Health endpoint latency
p95: 5ms for /health; not TTFT or generation latency
1,384 output tok/s
Peak aggregate throughput
Burst measurement at concurrency 32; aggregate output tokens/sec, not per-request speed
More benchmark and infrastructure detail
FP8 · Official Qwen pre-quantized checkpoint, not runtime quantization.
1.79M tokens · KV cache capacity for repeated context and long-running agents.
$0.05/M · Cache read price, 67% lower than fresh input tokens.
429 behavior · Clean overload response instead of silent queues or mystery timeouts.
Here, tok/s means output tokens per second, often called output TPS in LLM serving. We spell it out because TPS can also mean transactions per second. Standard apples-to-apples serving comparisons should include prompt/output profile, concurrency, TTFT p50/p95, end-to-end latency p50/p95, output tokens/sec, requests/sec, and error rate. Benchmark source: stress test run 2026-05-08. Full benchmark report available on request.
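If you want to sanity-check these figures from your own connection, a single streamed request is enough to measure TTFT and per-request output tok/s. A minimal Python sketch, assuming the endpoint honors OpenAI's stream_options flag (and remember the headline numbers above are aggregate across 32 concurrent streams, so one request will report far less):

import time
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

start = time.perf_counter()
ttft = None
usage = None
stream = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
    stream_options={"include_usage": True},  # ask for usage on the final chunk
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time to first token
    if chunk.usage:  # final chunk carries the usage object
        usage = chunk.usage
elapsed = time.perf_counter() - start

if ttft is not None:
    print(f"TTFT: {ttft * 1000:.0f}ms")
if usage and ttft is not None and elapsed > ttft:
    print(f"Output tok/s: {usage.completion_tokens / (elapsed - ttft):.1f}")

Run it a few times; single-stream numbers vary with prompt length and current load.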

Common questions

Anything we missed? Email us and we'll add it.

Is there a free trial?
Yes. Submit your email at the top of the page and we'll send you an API key with 1,000,000 input tokens. No credit card required, one trial per email, expires after 48 hours. Use it to make sure the API works for your workload before you deposit anything.
Are paid credits refundable?
No. Paid credit purchases are final and non-refundable once completed, including unused credits. Use the free trial to confirm the API works for your workload before buying. We only correct duplicate charges, billing errors, or refunds required by law. Full details are in our Refund Policy.
What happens if I run out of credits mid-request?
Each request requires a small reservation upfront. If your balance can't cover the reservation, the request returns a 402 error before any tokens are generated, so you never get partially-completed bills. Top up at any of the buy buttons and your existing API key resumes immediately.
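In code, that rejection surfaces as a generic status error. A minimal sketch with the OpenAI Python SDK, which has no dedicated exception class for 402, so you check the status code yourself:

import openai
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

try:
    response = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=[{"role": "user", "content": "hello"}],
    )
except openai.APIStatusError as e:
    if e.status_code == 402:
        print("Balance too low for the upfront reservation; top up and retry.")
    else:
        raise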
Are my prompts and responses logged?
No. Request and response payloads are never written to disk. We log only timestamps, token counts, request IDs, and your API key prefix — strictly for billing and abuse detection. See the Privacy Policy for specifics.
What are the rate limits?
12 concurrent requests and 1.5M reserved in-flight tokens per provider, shared across all users. When the GPU is saturated, you get a clean 429 response so your client can back off — we never silently queue requests. If you need higher dedicated capacity, email us.
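Because 429 is the documented overload signal, client-side backoff is straightforward. A minimal sketch using the SDK's RateLimitError (note the OpenAI SDK already retries 429s internally via its max_retries option, so you only need this if you manage retries yourself):

import time
import openai
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

def chat_with_backoff(messages, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="qwen/qwen3.6-35b-a3b",
                messages=messages,
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, 8s, ...
    raise RuntimeError("still overloaded after backoff")

response = chat_with_backoff([{"role": "user", "content": "hello"}])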
Is there an SLA?
Not yet. The service runs on a single dedicated A100 80GB GPU — there's no redundancy, and we don't promise 99.9% uptime. We're transparent about it. Live status is at api.lighterhub.app/health. If reliability matters more than price, larger providers exist; if you want transparent prepaid pricing on dedicated hardware, you're in the right place.
Where is inference run?
United States, on dedicated GPU hardware. Traffic enters via Cloudflare's global edge so latency from outside the US is low, but the actual compute happens on a US-based A100 80GB.

One model built for useful app work

Qwen3.6 35B-A3B powers every request. Use it when you need a practical text model for code, documents, agents, and chat without seat fees or subscriptions.

Qwen3.6 35B-A3B

A strong everyday model for building AI features.

It is a good fit when your product needs to read instructions, reason over text, write code, summarize documents, or respond to users in a chat flow. The API is OpenAI-compatible, so most apps can switch by changing the base URL and model name.

FP8 MoE · 3B active · Prefix caching · 262K context
Why people care: You get a capable text model at usage-based pricing, without committing to seats, contracts, or monthly minimums.
What long context helps with: Send long support threads, docs, code files, or retrieved knowledge chunks in one request instead of chopping every task into tiny pieces.
Why caching matters: If your app sends the same system prompt or document chunks repeatedly, cached reads are priced lower than fresh input; the sketch below shows how to verify a cache hit.
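One way to watch caching work: send the same long system prompt twice and compare usage. A sketch assuming the wrapper reports cache hits via OpenAI's prompt_tokens_details.cached_tokens field; how long a prefix must be, and how quickly it becomes cacheable, are implementation details worth verifying against your own traffic:

from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

system = "You are a support assistant. " + "Policy text... " * 200  # long shared prefix

for question in ["How do refunds work?", "What are the rate limits?"]:
    r = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=[
            {"role": "system", "content": system},  # identical across requests
            {"role": "user", "content": question},
        ],
    )
    details = r.usage.prompt_tokens_details
    cached = details.cached_tokens if details else None
    print(f"prompt={r.usage.prompt_tokens} cached={cached}")

On the second request the cached count should cover most of the system prompt, and those tokens bill at the $0.05/M cache-read rate instead of $0.15/M.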
Coding assistants

Explain code, draft changes, review diffs, and help users work through implementation details.

Document Q&A

Answer questions from policies, help docs, manuals, tickets, or internal knowledge bases.

Multi-step agents

Plan a task, call tools, keep context, and produce a useful final answer for workflows.

Customer chat

Power support bots, onboarding helpers, and product assistants that need context-aware replies.

Hardware A100 80GB

Dedicated GPU capacity for this endpoint.

Fresh input $0.15/M

Text you send to the model.

Cached read $0.05/M

Repeated prefix context, 67% below fresh input.

Output $1.00/M

Text the model generates back.

Minimums $0

No seat fees, subscriptions, or contracts.

Best fit: coding assistants, multi-step agents, long-context RAG with cached chunks, customer support chat, and internal document workflows.

What stays predictable

The trial stays low-friction while the runtime keeps the production controls buyers ask about first.

Production behavior

Dedicated capacity with clean limits.

Requests run through a controlled OpenAI-compatible wrapper on reserved A100 80GB capacity, with explicit overload behavior and usage accounting.

A100 80GB · Dedicated
262K · Model context
HTTP 429 · Overload
No prompt payload logging

Operational logs keep metadata such as token counts, latency, and status codes, not prompt content.

Usage accounting

Streaming and non-streaming responses include usage objects for predictable prepaid billing.
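For prepaid reconciliation on streams, read the usage off the final chunk. A minimal sketch; passing stream_options is the standard OpenAI way to request it, though this endpoint states usage is enforced on streams regardless:

from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

stream = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:  # final chunk: token counts for billing
        print(f"\nin={chunk.usage.prompt_tokens} out={chunk.usage.completion_tokens}")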

Health monitored

/health and /readiness track uptime, latency, and backend availability.
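A minimal poller for those endpoints, assuming /readiness returns a non-200 status when the backend is unavailable (standard probe semantics, not confirmed by this page):

import urllib.request

for path in ("/health", "/readiness"):
    try:
        with urllib.request.urlopen(f"https://api.lighterhub.app{path}", timeout=5) as r:
            print(path, r.status)
    except Exception as e:  # non-200 raises HTTPError and lands here
        print(path, "unreachable:", e)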

Abuse controls

Rate limits, body-size caps, and timing-safe token checks protect the public endpoint.

Request path

Simple public entrypoint, controlled wrapper, dedicated inference backend.

https://api.lighterhub.app/v1/chat/completions
01 Cloudflare edge

Public HTTPS endpoint and tunnel routing.

02 FastAPI wrapper

OpenAI-compatible validation, auth, and billing logic.

03 vLLM backend

Streaming completions with usage objects enforced.

04 A100 node

Reserved GPU capacity for Qwen3.6 serving.

OpenAI SDKs · Drop-in endpoint for chat completions.
Prefix caching · Lower cached-read price for repeated context.
Clean backoff · Overload returns 429 instead of silent queueing.
No charge on 5xx · Provider failures are returned cleanly and not billed.