OpenAI-compatible API with 1,000,000 input tokens free. No card. Add prepaid credits only after it works for your workload.
We email an API key loaded with 1,000,000 input tokens. One trial per email; keys expire after 48 hours.
Get marketplace billing, app keys, and hard monthly limits for the same Qwen3.6 35B long-context endpoint.
No subscription and no auto-renewal. Use direct credits when you want LighterHub billing instead of RapidAPI marketplace billing.
Already validated the API? Top up prepaid credits. No auto-renewal.
Cut inference costs while keeping sensitive AI workloads on dedicated capacity. LighterHub helps enterprise teams select, benchmark, and run the right Hugging Face model behind a private OpenAI-compatible API.
We compare candidate open models for quality, latency, serving fit, license constraints, and expected token economics.
A100 is the default starting point. H100 and B200 capacity can be scoped when workload, availability, and review line up.
Send your traffic profile and latency target. We return a deployment plan before you make a long-term commitment.
Custom deployments are reviewed case by case for model license, GPU fit, provider availability, restricted jurisdictions, and production risk before capacity is promised.
Current public provider data shows LighterHub at the low end of input pricing, tied for the largest context and max output, and ahead on exposed API controls. Our live sweep adds latency, overload, and usage-accounting evidence.
| Metric | LighterHub | Provider range | Position |
|---|---|---|---|
| Input price | $0.150/M input tokens. | $0.150-$0.230/M. | Tied lowest among the current comparable set. |
| Output price | $1.000/M output tokens. | $0.965-$1.800/M. | Within 3.6% of the lowest, 44% below the highest. |
| Cache-read price | $0.050/M cached input tokens. | $0.050-$0.161/M where exposed. | Tied lowest; 69% lower than the $0.161/M cache route. |
| Context window | 262,144 tokens. | 262,144 tokens. | Parity with the largest public routes. |
| Max output | 262,144 tokens. | 65,536-262,144 tokens. | Tied highest; 4x the smallest current provider limit. |
| Supported API params | 23 supported parameters. | 13-16 supported parameters. | +7 over the nearest comparable route. |
| 64K cached-prefix profile | 1.07s p95 latency at 4-way concurrency; 0% errors. | Not measured by us with identical prompts. | Published as LighterHub benchmark evidence, not a cross-provider speed claim. |
| Overload behavior | Clean 429 at saturated 64K concurrency. | Not externally visible from public specs. | Clients can back off instead of waiting on hidden queues. |
| Usage accounting | 100% usage coverage across successful benchmark requests. | Not externally visible from public specs. | Every successful response includes billing-grade usage data. |
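At these rates, per-request cost is easy to budget. A minimal sketch in Python, using the prices from the table above; the request shape and token counts are hypothetical examples:

# Per-million-token prices from the comparison table above.
INPUT_PRICE = 0.150    # $/M fresh input tokens
CACHED_PRICE = 0.050   # $/M cached (repeated-prefix) input tokens
OUTPUT_PRICE = 1.000   # $/M output tokens

def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Estimated dollar cost of a single request from its token counts."""
    return (
        fresh_in * INPUT_PRICE
        + cached_in * CACHED_PRICE
        + out * OUTPUT_PRICE
    ) / 1_000_000

# Example: a 64K prompt where 60K tokens are a reused prefix,
# plus a 2K-token completion.
print(f"${request_cost(4_000, 60_000, 2_000):.4f}")  # ≈ $0.0056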
Use these settings in any OpenAI-compatible tool, or paste the cURL request to verify the key immediately.
API type: OpenAI-compatible
Base URL: https://api.lighterhub.app/v1
Model: qwen/qwen3.6-35b-a3b
API key: lh_YOUR_KEY

# After you receive your key by email:
curl https://api.lighterhub.app/v1/chat/completions \
  -H "Authorization: Bearer lh_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "hello"}]
  }'
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="lh_YOUR_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "hello"}],
)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.lighterhub.app/v1",
apiKey: "lh_YOUR_KEY",
});
const response = await client.chat.completions.create({
model: "qwen/qwen3.6-35b-a3b",
messages: [{ role: "user", content: "hello" }],
});
Stress-tested on Qwen3.6 35B-A3B FP8 through the public API endpoint, including Cloudflare Tunnel overhead.
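Because saturation surfaces as a clean 429 instead of a hidden queue, clients can retry with plain exponential backoff. A minimal sketch, reusing the Python setup above; the retry count and sleep schedule are arbitrary choices, not endpoint requirements:

import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

def chat_with_backoff(messages, max_retries=5):
    """Retry on 429 with exponential backoff instead of waiting on a queue."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="qwen/qwen3.6-35b-a3b",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("Still rate-limited after retries.")

response = chat_with_backoff([{"role": "user", "content": "hello"}])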
Prefer marketplace billing and managed API keys? Subscribe through RapidAPI for OpenAI-compatible Qwen3.6 35B long-context inference with hard monthly request limits and backend-enforced input caps.
50K tokens, 25 requests, 8K max input.
5M tokens, 500 requests, 64K max input.
25M tokens, 2,000 requests, 128K max input.
100M tokens, 5,000 requests, 255K max input.
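Since input caps are enforced by the backend, trimming oversized prompts client-side avoids burning a request against the monthly quota. A minimal sketch; the 4-characters-per-token ratio is a rough heuristic assumed for illustration, not the tokenizer the endpoint uses:

MAX_INPUT_TOKENS = 8_000  # e.g. the entry tier's 8K input cap

def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return len(text) // 4

def fits_cap(prompt: str, cap: int = MAX_INPUT_TOKENS) -> bool:
    return rough_token_count(prompt) <= cap

prompt = "hello " * 10_000
if not fits_cap(prompt):
    prompt = prompt[: MAX_INPUT_TOKENS * 4]  # trim to the estimated cap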
Use the live Qwen endpoint today, or request a private benchmark for state-of-the-art Hugging Face models that fit your workload, latency target, and model license.
The live API is built for apps that need long context, practical reasoning, coding support, document workflows, and OpenAI-compatible integration without seat fees or subscriptions.
Dedicated GPU capacity for this endpoint.
Input tokens: text you send to the model.
Cached input tokens: repeated prefix context, billed 67% below fresh input.
Output tokens: text the model generates back.
No seat fees, subscriptions, or contracts.
Public pricing applies to the live Qwen endpoint. Private model deployments use benchmarked pricing because throughput, context length, and memory pressure vary by model.
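The cache-read discount rewards prompt structure: keep a long stable prefix (policies, documents, few-shot examples) identical across requests and put the variable part last. A minimal sketch; the file name is hypothetical, and whether cached tokens are broken out separately in the usage object is an assumption rather than a documented guarantee:

from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

# Hypothetical document; any long, stable context works.
STABLE_PREFIX = open("policy_manual.txt").read()

def ask(question: str):
    response = client.chat.completions.create(
        model="qwen/qwen3.6-35b-a3b",
        messages=[
            {"role": "system", "content": STABLE_PREFIX},  # cacheable prefix
            {"role": "user", "content": question},         # variable suffix
        ],
    )
    print(response.usage)  # usage ships with every successful response
    return response

ask("Summarize section 3.")
ask("What does section 7 say about refunds?")  # second call reuses the prefix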
Deploy Mistral's Magistral Small 2509 for multimodal analysis and structured reasoning workflows.
Run Qwen3-Coder 30B-A3B for code review, implementation help, and tool-aware developer workflows.
Use QwQ-32B when a private text-only reasoning route fits the task better than the live model.
Benchmark Gemma 4 31B IT when teams want a Google open model alongside Qwen and Mistral options.
State-of-the-art Mistral reasoning model for analysis, multimodal workflows, and enterprise pilots that need an Apache 2.0 model path.
Coding-focused MoE model for code review, repository Q&A, implementation support, and private developer tools.
Private text-reasoning route for math, planning, logic-heavy analysis, and workloads where a dedicated Qwen reasoning model is useful.
General-purpose Google open model for enterprise comparison tests, multimodal prototypes, and private app evaluation.
Send a Hugging Face model URL and traffic profile. We check license, GPU fit, context target, latency, and token economics before recommending a deployment plan.
The trial stays low-friction while the runtime keeps the production controls buyers ask about first.
Requests run through a controlled OpenAI-compatible wrapper on reserved A100 80GB capacity, with explicit overload behavior and usage accounting.
Operational logs keep metadata such as token counts, latency, and status codes, not prompt content.
Streaming and non-streaming responses include usage objects for predictable prepaid billing.
/health and /readiness track uptime, latency, and backend availability.
Rate limits, body-size caps, and timing-safe token checks protect the public endpoint.
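Both the usage guarantee and the probes are easy to verify from a client. A minimal sketch, assuming the wrapper honors the OpenAI-style stream_options include_usage flag and serves the probes at the API host root; both are assumptions, not published behavior:

import requests
from openai import OpenAI

client = OpenAI(base_url="https://api.lighterhub.app/v1", api_key="lh_YOUR_KEY")

# Streaming: with include_usage, the final chunk carries the usage object.
stream = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:
        print("\n", chunk.usage)  # billing-grade usage on the last chunk

# Probe the health endpoints named above (paths assumed relative to the host).
print(requests.get("https://api.lighterhub.app/health").status_code)
print(requests.get("https://api.lighterhub.app/readiness").status_code)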
Simple public entrypoint, controlled wrapper, dedicated inference backend.
https://api.lighterhub.app/v1/chat/completions
Entrypoint: public HTTPS endpoint and tunnel routing.
Wrapper: OpenAI-compatible validation, auth, and billing logic.
Responses: streaming completions with usage objects enforced.
Backend: reserved GPU capacity for Qwen3.6 serving.