OpenAI-compatible API with 1,000,000 input tokens free. No card. Add prepaid credits only after it works for your workload.
We email an API key loaded with 1,000,000 input tokens. One trial per email; the key expires 48 hours after issue.
Already validated the API? Top up prepaid credits. No auto-renewal.
Cut inference costs while keeping sensitive AI workloads on dedicated capacity. LighterHub helps enterprise teams select, benchmark, and run the right Hugging Face model behind a private OpenAI-compatible API.
We compare candidate open models for quality, latency, serving fit, license constraints, and expected token economics.
A100 is the default starting point; H100 and B200 capacity can be scoped when workload, availability, and review align.
Send your traffic profile and latency target. We return a deployment plan before you make a long-term commitment.
Custom deployments are reviewed case by case for model license, GPU fit, provider availability, restricted jurisdictions, and production risk before capacity is promised.
Current public provider data shows LighterHub at the low end of input pricing, tied for the largest context and max output, and ahead on exposed API controls. Our live sweep adds latency, overload, and usage-accounting evidence.
| Metric | LighterHub | Provider range | Position |
|---|---|---|---|
| Input price | $0.150/M input tokens. | $0.150-$0.230/M. | Tied lowest among the current comparable set. |
| Output price | $1.000/M output tokens. | $0.965-$1.800/M. | Within 3.6% of the lowest, 44% below the highest. |
| Cache-read price | $0.050/M cached input tokens. | $0.050-$0.161/M where exposed. | Tied lowest; 69% lower than the $0.161/M cache route. |
| Context window | 262,144 tokens. | 262,144 tokens. | Parity with the largest public routes. |
| Max output | 262,144 tokens. | 65,536-262,144 tokens. | Tied highest; 4x the smallest current provider limit. |
| Supported API params | 23 supported parameters. | 13-16 supported parameters. | +7 over the nearest comparable route. |
| 64K cached-prefix profile | 1.07s p95 latency at 4-way concurrency; 0% errors. | Not measured by us with identical prompts. | Published as LighterHub benchmark evidence, not a cross-provider speed claim. |
| Overload behavior | Clean 429 at saturated 64K concurrency. | Not externally visible from public specs. | Clients can back off instead of waiting on hidden queues. |
| Usage accounting | 100% usage coverage across successful benchmark requests. | Not externally visible from public specs. | Every successful response includes billing-grade usage data. |
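Because saturation surfaces as a clean 429 rather than a hidden queue, clients can retry with exponential backoff. A minimal sketch; the `RateLimited` exception here is a stand-in for whatever 429 error your HTTP or OpenAI client actually raises:

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for the 429 error your HTTP or OpenAI client raises."""

def with_backoff(call, max_retries=5, base_delay=0.5):
    """Run `call`, retrying on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the 429 to the caller
            # 0.5s, 1s, 2s, ... plus jitter so concurrent clients desynchronize
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Jitter matters under shared load: without it, every backed-off client retries at the same instant and re-saturates the endpoint.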
Use these settings in any OpenAI-compatible tool, or paste the cURL request to verify the key immediately.
API type: OpenAI-compatible
Base URL: `https://api.lighterhub.app/v1`
Model: `qwen/qwen3.6-35b-a3b`
API key: `lh_YOUR_KEY`

```bash
# After you receive your key by email:
curl https://api.lighterhub.app/v1/chat/completions \
  -H "Authorization: Bearer lh_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "hello"}]
  }'
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lighterhub.app/v1",
    api_key="lh_YOUR_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "hello"}],
)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.lighterhub.app/v1",
  apiKey: "lh_YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "qwen/qwen3.6-35b-a3b",
  messages: [{ role: "user", content: "hello" }],
});
```
Stress-tested on Qwen3.6 35B-A3B FP8 through the public API endpoint, including Cloudflare Tunnel overhead.
Qwen3.6 35B-A3B powers every request. Use it when you need a practical text model for code, documents, agents, and chat without seat fees or subscriptions.
It is a good fit when your product needs to read instructions, reason over text, write code, summarize documents, or respond to users in a chat flow. The API is OpenAI-compatible, so most apps can switch by changing the base URL and model name.
Explain code, draft changes, review diffs, and help users work through implementation details.
Answer questions from policies, help docs, manuals, tickets, or internal knowledge bases.
Plan a task, call tools, keep context, and produce a useful final answer for workflows.
Power support bots, onboarding helpers, and product assistants that need context-aware replies.
Dedicated GPU capacity for this endpoint.
Text you send to the model.
Repeated prefix context, 67% below fresh input.
Text the model generates back.
No seat fees, subscriptions, or contracts.
Best fit: coding assistants, multi-step agents, long-context RAG with cached chunks, customer support chat, and internal document workflows.
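Under the per-million-token prices listed above, a month's bill is straightforward arithmetic. A sketch with a hypothetical traffic mix; the volumes are illustrative, not a quote:

```python
# Published per-million-token prices (USD) from the pricing above.
INPUT_PER_TOKEN = 0.150 / 1_000_000
CACHED_PER_TOKEN = 0.050 / 1_000_000   # repeated prefix reads, 67% below fresh input
OUTPUT_PER_TOKEN = 1.000 / 1_000_000

def monthly_cost(fresh_in, cached_in, out):
    """Total USD for a month of traffic, given token counts per class."""
    return (fresh_in * INPUT_PER_TOKEN
            + cached_in * CACHED_PER_TOKEN
            + out * OUTPUT_PER_TOKEN)

# Hypothetical cached-RAG workload: 100M fresh input tokens,
# 400M cached-prefix reads, 50M output tokens in a month.
cost = monthly_cost(100e6, 400e6, 50e6)  # 15 + 20 + 50 = 85 USD
```

Note how the cache-read rate shifts the economics: the 400M cached tokens cost $20 here, versus $60 if billed as fresh input.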
The trial stays low-friction while the runtime keeps the production controls buyers ask about first.
Requests run through a controlled OpenAI-compatible wrapper on reserved A100 80GB capacity, with explicit overload behavior and usage accounting.
Operational logs keep metadata such as token counts, latency, and status codes, not prompt content.
Streaming and non-streaming responses include usage objects for predictable prepaid billing.
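For streaming, OpenAI-style APIs deliver the usage object in the final data chunk of the SSE stream (the `stream_options={"include_usage": true}` convention). A minimal parser sketch, assuming standard OpenAI-style chunk framing; not LighterHub's internal code:

```python
import json

def usage_from_sse(lines):
    """Return the usage object from an OpenAI-style SSE stream, if present.

    The final data chunk before [DONE] carries prompt/completion token
    counts; earlier delta chunks have usage set to null.
    """
    usage = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and blank keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

Summing these objects client-side lets you reconcile your own meter against prepaid-credit drawdown.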
/health and /readiness track uptime, latency, and backend availability.
Rate limits, body-size caps, and timing-safe token checks protect the public endpoint.
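A timing-safe token check is the standard constant-time comparison; a minimal sketch of the idea using Python's stdlib, not LighterHub's actual implementation:

```python
import hmac

def token_matches(presented: str, expected: str) -> bool:
    """Compare API keys in constant time, so response timing cannot
    leak how many leading characters of a guessed key were correct."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

A naive `presented == expected` can short-circuit at the first mismatched byte, which is the timing side channel this guards against.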
Simple public entrypoint, controlled wrapper, dedicated inference backend.
https://api.lighterhub.app/v1/chat/completions
Public HTTPS endpoint and tunnel routing.
OpenAI-compatible validation, auth, and billing logic.
Streaming completions with usage objects enforced.
Reserved GPU capacity for Qwen3.6 serving.