API online — live status
Qwen3.6 35B · 262K context · FP8 · A100

Try Qwen3.6 inference in your app in 2 minutes.

OpenAI-compatible API with 1,000,000 input tokens free. No card. Add prepaid credits only after it works for your workload.

$0.15/M input tokens
$0.05/M cached reads
262K model context
No prompt payload logging
Free trial

Get a free API key

We email an API key with 1,000,000 input tokens. One trial per email; the key expires after 48 hours.

No card · No subscription · Works with OpenAI SDKs
Self-serve marketplace

Prefer RapidAPI?

Get marketplace billing, app keys, and hard monthly limits for the same Qwen3.6 35B long-context endpoint.

Free plan available · OpenAI-compatible · 262K context

Add credits after the trial works.

No subscription and no auto-renewal. Use direct credits when you want LighterHub billing instead of RapidAPI marketplace billing.

Launch Week · +50% bonus credits · 7 days left

Already validated the API? Top up prepaid credits. No auto-renewal.

Already have an account? Recover key →
Enterprise inference

Private AI inference without model lock-in.

Cut inference costs while keeping sensitive AI workloads on dedicated capacity. LighterHub helps enterprise teams select, benchmark, and run the right Hugging Face model behind a private OpenAI-compatible API.

Lower inference cost: Right-size the model and GPU, tune serving, and benchmark cost per million tokens before committing to a deployment.
Private by design: Run through a dedicated endpoint with no prompt payload logging, clear usage reporting, and infrastructure scoped to your workload.
Flexible model path: Start with the best-fit Hugging Face model, then switch as your use case, latency target, or budget changes.
Model selection

We compare candidate open models for quality, latency, serving fit, license constraints, and expected token economics.

Dedicated private cluster

A100 is the default starting point. H100 and B200 capacity can be scoped when workload, availability, and review line up.

Fast pilot

Send your traffic profile and latency target. We return a deployment plan before you make a long-term commitment.

Custom deployments are reviewed case by case for model license, GPU fit, provider availability, restricted jurisdictions, and production risk before capacity is promised.

Quantified against comparable Qwen providers.

Current public provider data shows LighterHub at the low end of input pricing, tied for the largest context and max output, and ahead on exposed API controls. Our live sweep adds latency, overload, and usage-accounting evidence.

See benchmark detail
$0.150/M input · Tied lowest vs the public provider range of $0.150-$0.230/M.
262K max output · Tied highest; 4x the 64K provider limit in the current set.
23 API params · Comparable providers expose 13-16 supported parameters.
200-request sweep · Short, 8K, 32K, and 64K profiles; successful responses had 100% usage coverage.
Where LighterHub stands today
Compared with the three public Qwen3.6 35B-A3B providers currently visible through OpenRouter endpoint data.
Public data + live tests
Metric | LighterHub | Provider range | Position
Input price | $0.150/M input tokens | $0.150-$0.230/M | Tied lowest among the current comparable set
Output price | $1.000/M output tokens | $0.965-$1.800/M | Within 3.6% of the lowest, 44% below the highest
Cache-read price | $0.050/M cached input tokens | $0.050-$0.161/M where exposed | Tied lowest; 69% lower than the $0.161/M cache route
Context window | 262,144 tokens | 262,144 tokens | Parity with the largest public routes
Max output | 262,144 tokens | 65,536-262,144 tokens | Tied highest; 4x the smallest current provider limit
Supported API params | 23 supported parameters | 13-16 supported parameters | +7 over the nearest comparable route
64K cached-prefix profile | 1.07s p95 latency at 4-way concurrency; 0% errors | Not measured by us with identical prompts | Published as LighterHub benchmark evidence, not a cross-provider speed claim
Overload behavior | Clean 429 at saturated 64K concurrency | Not externally visible from public specs | Clients can back off instead of waiting on hidden queues
Usage accounting | 100% usage coverage across successful benchmark requests | Not externally visible from public specs | Every successful response includes billing-grade usage data
Provider range is from OpenRouter endpoint data checked on May 9, 2026. LighterHub benchmark data is from a 200-request local provider sweep on May 9, 2026. Competitor names are intentionally anonymized; raw speed still requires identical cross-provider prompts, concurrency, cache state, and route conditions.
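
The sweep figures above reduce to simple percentile math over per-request timings. A minimal sketch, with made-up latencies standing in for real request measurements:

```python
def percentile(values, p):
    """Nearest-rank percentile of a sample (p in 0..100)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

# Latencies in seconds from one concurrency level of a sweep; in a real
# run these come from timing each HTTP request against the endpoint.
latencies = [0.41, 0.44, 0.47, 0.52, 0.58, 0.63, 0.71, 0.88, 0.95, 1.07]
errors = 0  # count of non-2xx responses in the same run

print(f"p50={percentile(latencies, 50):.2f}s "
      f"p95={percentile(latencies, 95):.2f}s "
      f"errors={errors / len(latencies):.0%}")
```

Reporting p50 and p95 together, as the table does, keeps tail behavior visible instead of hiding it behind an average.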

Your first request after the key arrives

Use these settings in any OpenAI-compatible tool, or paste the cURL request to verify the key immediately.

App settings
Provider: OpenAI-compatible
Base URL: https://api.lighterhub.app/v1
Model: qwen/qwen3.6-35b-a3b
API key: lh_YOUR_KEY
Works in OpenWebUI, LiteLLM, Hermes Agent, Cursor/custom agents, and official OpenAI SDK clients.
cURL test
# After you receive your key by email:
curl https://api.lighterhub.app/v1/chat/completions \
  -H "Authorization: Bearer lh_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "hello"}]
  }'
SDK examples
Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="lh_YOUR_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "hello"}],
)
Node.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.lighterhub.app/v1",
  apiKey: "lh_YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "qwen/qwen3.6-35b-a3b",
  messages: [{ role: "user", content: "hello" }],
});

Enough detail to trust the trial

Stress-tested on Qwen3.6 35B-A3B FP8 through the public API endpoint, including Cloudflare Tunnel overhead.

663 output tok/s
Sustained aggregate throughput
10-min load test at concurrency 32; aggregate output tokens/sec across all streams
0%
Error rate
3,343 consecutive requests with 0 failures
4ms p50
Health endpoint latency
p95: 5ms for /health; not TTFT or generation latency
1,384 output tok/s
Peak aggregate throughput
Burst measurement at concurrency 32; aggregate output tokens/sec, not per-request speed
More benchmark and infrastructure detail
FP8 · Official Qwen pre-quantized checkpoint, not runtime quantization.
1.94M tokens · KV cache capacity for repeated context and long-running agents.
$0.05/M · Cache read price, 67% lower than fresh input tokens.
429 behavior · Clean overload response instead of silent queues or mystery timeouts.
Here, tok/s means output tokens per second, often called output TPS in LLM serving. We spell it out because TPS can also mean transactions per second. Standard apples-to-apples serving comparisons should include prompt/output profile, concurrency, TTFT p50/p95, end-to-end latency p50/p95, output tokens/sec, requests/sec, and error rate. Benchmark sources: 10-minute load test run 2026-05-08 and operational routing sweep run 2026-05-09. Full benchmark reports available on request.
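
The aggregate tok/s figure defined above is just total output tokens divided by wall-clock time. A sketch with hypothetical totals (not the actual benchmark data):

```python
def aggregate_output_tps(total_output_tokens: int, wall_seconds: float) -> float:
    """Output tokens/sec summed across every concurrent stream.

    This is aggregate throughput, not the speed any single request sees.
    """
    return total_output_tokens / wall_seconds

# Hypothetical totals: a 10-minute run emitting 397,800 output tokens
# across all streams works out to 663 tok/s aggregate.
print(round(aggregate_output_tps(397_800, 600)))  # 663
```

Per-request speed at a given concurrency is always lower than the aggregate number, which is why the two should never be compared directly.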

Common questions

Anything we missed? Email us and we'll add it.

Is there a free trial?
Yes. Submit your email at the top of the page and we'll send you an API key with 1,000,000 input tokens. No credit card required, one trial per email, expires after 48 hours. Use it to make sure the API works for your workload before you deposit anything.
Are paid credits refundable?
No. Paid credit purchases are final and non-refundable once completed, including unused credits. Use the free trial to confirm the API works for your workload before buying. We only correct duplicate charges, billing errors, or refunds required by law. Full details are in our Refund Policy.
What happens if I run out of credits mid-request?
Each request requires a small reservation upfront. If your balance can't cover the reservation, the request returns a 402 error before any tokens are generated, so you are never billed for a partially completed request. Top up at any of the buy buttons and your existing API key resumes immediately.
Are my prompts and responses logged?
No. Request and response payloads are never written to disk. We log only timestamps, token counts, request IDs, and your API key prefix — strictly for billing and abuse detection. See the Privacy Policy for specifics.
What are the rate limits?
12 concurrent requests and 1.5M reserved in-flight tokens per provider, shared across all users. When the GPU is saturated, you get a clean 429 response so your client can back off — we never silently queue requests. If you need higher dedicated capacity, email us.
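
The 429 and 402 behaviors above suggest a simple client-side pattern: retry on 429 with exponential backoff, stop immediately on 402. A minimal sketch, assuming your client surfaces a `status_code` attribute on errors (as the OpenAI SDKs do):

```python
import random
import time

def with_backoff(call, max_tries: int = 5):
    """Retry `call` on 429 with exponential backoff plus jitter (sketch).

    Assumes `call` raises exceptions carrying a `.status_code` attribute.
    402 (out of credits) is never retried: topping up, not waiting,
    is what fixes it.
    """
    for attempt in range(max_tries):
        try:
            return call()
        except Exception as err:
            status = getattr(err, "status_code", None)
            if status == 402:
                raise RuntimeError("out of credits: top up first") from err
            if status != 429 or attempt == max_tries - 1:
                raise
            # cap the wait and add jitter so concurrent clients desynchronize
            time.sleep(min(30, 2 ** attempt) + random.random())
```

Because overload is a clean 429 rather than a hidden queue, this loop is all a client needs to ride out saturation.
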
Can you host a custom Hugging Face model for us?
Yes, if the model is available on Hugging Face, the license permits your use case, and the workload fits available GPU capacity. We can help choose the right model, benchmark alternatives, and move you to a different model as requirements change. A100 deployments are the default starting point; H100 and B200 capacity is reviewed case by case.
Is there an SLA?
Not yet. The service runs on a single dedicated A100 80GB GPU — there's no redundancy, and we don't promise 99.9% uptime. We're transparent about it. Live status is at api.lighterhub.app/health. If reliability matters more than price, larger providers exist; if you want transparent prepaid pricing on dedicated hardware, you're in the right place.
Where is inference run?
United States, on dedicated GPU hardware. Traffic enters via Cloudflare's global edge so latency from outside the US is low, but the actual compute happens on a US-based A100 80GB.
Self-serve access

Available on RapidAPI

Prefer marketplace billing and managed API keys? Subscribe through RapidAPI for OpenAI-compatible Qwen3.6 35B long-context inference with hard monthly request limits and backend-enforced input caps.

Basic $0/mo

50K tokens, 25 requests, 8K max input.

Pro $9.99/mo

5M tokens, 500 requests, 64K max input.

Ultra $29.99/mo

25M tokens, 2,000 requests, 128K max input.

Mega $99/mo

100M tokens, 5,000 requests, 255K max input.

One model built for useful app work

Qwen3.6 35B-A3B powers every request. Use it when you need a practical text model for code, documents, agents, and chat without seat fees or subscriptions.

Qwen3.6 35B-A3B

A strong everyday model for building AI features.

It is a good fit when your product needs to read instructions, reason over text, write code, summarize documents, or respond to users in a chat flow. The API is OpenAI-compatible, so most apps can switch by changing the base URL and model name.

FP8 MoE · 3B active · Prefix caching · 262K context
Why people care: You get a capable text model at usage-based pricing, without committing to seats, contracts, or monthly minimums.
What long context helps with: Send long support threads, docs, code files, or retrieved knowledge chunks in one request instead of chopping every task into tiny pieces.
Why caching matters: If your app sends the same system prompt or document chunks repeatedly, cached reads are priced lower than fresh input.
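
The cached-read discount compounds when a long prefix is reused. A back-of-the-envelope sketch with hypothetical numbers (whether a given call hits the cache depends on cache state, so treat this as an upper bound on savings):

```python
PREFIX_TOKENS = 30_000   # shared system prompt + document chunks
CALLS = 100
FRESH_PER_M, CACHED_PER_M = 0.15, 0.05  # USD per million input tokens

# Every call bills the prefix at the fresh rate:
all_fresh = PREFIX_TOKENS * CALLS * FRESH_PER_M / 1e6

# First call warms the cache at the fresh rate; later calls read it:
warm_then_cached = (PREFIX_TOKENS * FRESH_PER_M
                    + PREFIX_TOKENS * (CALLS - 1) * CACHED_PER_M) / 1e6

print(f"all fresh: ${all_fresh:.3f}  with cache: ${warm_then_cached:.3f}")
```

Keeping the shared prefix byte-identical across requests (same system prompt, same chunk order) is what makes the cached rate reachable.
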
Coding assistants

Explain code, draft changes, review diffs, and help users work through implementation details.

Document Q&A

Answer questions from policies, help docs, manuals, tickets, or internal knowledge bases.

Multi-step agents

Plan a task, call tools, keep context, and produce a useful final answer for workflows.

Customer chat

Power support bots, onboarding helpers, and product assistants that need context-aware replies.

Hardware A100 80GB

Dedicated GPU capacity for this endpoint.

Fresh input $0.15/M

Text you send to the model.

Cached read $0.05/M

Repeated prefix context, 67% below fresh input.

Output $1.00/M

Text the model generates back.

Minimums $0

No seat fees, subscriptions, or contracts.
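
With the prices above, per-request cost is straightforward arithmetic. A small estimator sketch (the token counts are hypothetical; in production, read them from the usage object each response returns):

```python
PRICES_USD_PER_M = {  # from the pricing above
    "input": 0.15,    # fresh input tokens
    "cached": 0.05,   # cached prefix reads
    "output": 1.00,   # generated tokens
}

def estimate_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Prepaid USD cost for one request's token counts."""
    return (fresh_in * PRICES_USD_PER_M["input"]
            + cached_in * PRICES_USD_PER_M["cached"]
            + out * PRICES_USD_PER_M["output"]) / 1_000_000

# 2K fresh input, a 30K cached prefix, and 1K of output:
print(f"${estimate_cost(2_000, 30_000, 1_000):.4f}")  # $0.0028
```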

Best fit: coding assistants, multi-step agents, long-context RAG with cached chunks, customer support chat, and internal document workflows.

What stays predictable

The trial stays low-friction while the runtime keeps the production controls buyers ask about first.

Production behavior

Dedicated capacity with clean limits.

Requests run through a controlled OpenAI-compatible wrapper on reserved A100 80GB capacity, with explicit overload behavior and usage accounting.

A100 80GB · Dedicated
262K · Model context
HTTP 429 · Overload
No prompt payload logging

Operational logs keep metadata such as token counts, latency, and status codes, not prompt content.

Usage accounting

Streaming and non-streaming responses include usage objects for predictable prepaid billing.

Health monitored

/health and /readiness track uptime, latency, and backend availability.

Abuse controls

Rate limits, body-size caps, and timing-safe token checks protect the public endpoint.

Request path

Simple public entrypoint, controlled wrapper, dedicated inference backend.

https://api.lighterhub.app/v1/chat/completions
01 Cloudflare edge

Public HTTPS endpoint and tunnel routing.

02 FastAPI wrapper

OpenAI-compatible validation, auth, and billing logic.

03 vLLM backend

Streaming completions with usage objects enforced.

04 A100 node

Reserved GPU capacity for Qwen3.6 serving.

OpenAI SDKs · Drop-in endpoint for chat completions.
Prefix caching · Lower cached-read price for repeated context.
Clean backoff · Overload returns 429 instead of silent queueing.
No charge on 5xx · Provider failures are returned cleanly and not billed.