FPGA Inference API
Run open-source LLMs on FPGA hardware through an OpenAI-compatible API. Deterministic latency, lower energy per token, no GPU queuing delays.
What Is FPGA Inference?
FPGA (Field-Programmable Gate Array) inference replaces the GPU in the AI inference pipeline with reconfigurable hardware. Unlike GPUs, which execute a fixed instruction set across thousands of identical cores, FPGAs are configured as custom hardware pipelines tailored to the specific model being served.
This means the hardware architecture itself is optimized for each model — custom memory hierarchies, arbitrary bit-width arithmetic, and bare-metal operation without OS or driver overhead. The result is a purpose-built inference engine that processes tokens through dedicated silicon pathways.
Why FPGAs Excel at LLM Inference
LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens one at a time). The decode phase is memory-bandwidth-bound at batch size 1, which is the regime where FPGAs have a structural advantage over GPUs.
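The bandwidth-bound claim follows from simple arithmetic: at batch size 1, every generated token must stream (roughly) the full set of model weights through memory. A back-of-envelope sketch, using illustrative numbers rather than measured figures:

```python
def decode_tokens_per_sec(params_billions: float, bytes_per_param: float,
                          mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput (tokens/s).

    tokens/s ~= memory bandwidth / bytes moved per token, where the weight
    traffic dominates (KV-cache reads and overheads are ignored here).
    """
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

# An 8B-parameter model in FP16 (2 bytes/param) on hardware with ~800 GB/s
# of usable bandwidth tops out around 50 tokens/s per stream:
print(round(decode_tokens_per_sec(8, 2, 800)))  # → 50
```

Because the ceiling is set by bandwidth and bytes-per-parameter, hardware that sustains a higher fraction of its peak bandwidth, or supports narrower weight encodings, raises this bound directly.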
| Metric | GPU (H100/H200) | FPGA (Intel Agilex) |
|---|---|---|
| Tokens/s per watt | Baseline | 5-20x higher |
| Latency consistency | Variable (queuing) | Deterministic |
| Idle power draw | High | Low |
| Bit-width flexibility | FP16/INT8/INT4 | Any bit-width |
| Memory architecture | Fixed HBM hierarchy | Custom per model |
Performance claims are based on internal benchmarks and published research, including FlightLLM (arXiv:2401.03868) and GLITCHES (Tsinghua University); see the research section below.

The OpenFPGA API
OpenFPGA implements the standard OpenAI chat completions API. Switch from any OpenAI-compatible provider by changing your base URL:
```bash
curl https://app.openfpga.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENFPGA_API_KEY" \
  -d '{
    "model": "llama-3.1-8b-fpga",
    "messages": [
      {"role": "user", "content": "Explain FPGA inference in one sentence."}
    ]
  }'
```
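The same request can be built in plain Python with only the standard library; the endpoint and model ID below come from this page, and the actual send is left commented out so the sketch runs without a key:

```python
import json
import os
import urllib.request

BASE_URL = "https://app.openfpga.ai/api/v1"

# Identical payload to the curl example above.
payload = {
    "model": "llama-3.1-8b-fpga",
    "messages": [
        {"role": "user", "content": "Explain FPGA inference in one sentence."}
    ],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENFPGA_API_KEY', '')}",
    },
    method="POST",
)

# with urllib.request.urlopen(req) as resp:              # sends the request
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library works the same way; only the base URL and key differ.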
Available Models
| Model | Model ID | Hardware |
|---|---|---|
| Llama 3.1 8B Instruct | llama-3.1-8b-fpga | Intel Agilex |
Additional models are being optimized for FPGA deployment; query GET /api/v1/models for the current list.
API Features
- OpenAI-compatible chat completions (POST /api/v1/chat/completions)
- Streaming responses (SSE)
- Function calling and tool use
- Structured JSON output
- Standard error codes with actionable messages
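Streaming responses follow the OpenAI server-sent-events convention: each event line carries a `data:` JSON chunk whose `choices[0].delta` holds a fragment of the reply, terminated by a `data: [DONE]` sentinel. A minimal consumer, sketched here against synthetic lines (field names assume the OpenAI chunk format this API advertises compatibility with):

```python
import json

def collect_stream(lines):
    """Accumulate the assistant message from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                        # skip blank keep-alives / comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                           # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")  # first chunk may be role-only
    return "".join(parts)

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # → Hello world
```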
Hardware: Intel Agilex on Bittware IA-860M
OpenFPGA runs on Bittware IA-860M accelerator cards featuring Intel Agilex FPGAs. These cards provide:
- High-bandwidth memory (HBM2E) for model weights and KV cache
- PCIe Gen5 connectivity to host CPUs (Intel Xeon / AMD EPYC)
- Custom dataflow pipelines synthesized per model architecture
- Bare-metal operation — no GPU driver stack, no CUDA, no scheduling overhead
Published Research
FPGA-based LLM inference is supported by peer-reviewed research demonstrating competitive or superior performance to GPU inference in specific regimes:
- FlightLLM (arXiv:2401.03868) — Configurable sparse acceleration for LLMs on FPGA, demonstrating that custom dataflow and sparsity patterns on FPGA can match or exceed GPU throughput for specific model configurations.
- GLITCHES (Tsinghua University) — Heterogeneous FPGA acceleration framework showing energy-efficient LLM serving with custom memory management.
- Positron Atlas (Altera Agilex-7M) — 70% more tokens/sec than NVIDIA Hopper at 3.5x performance per watt for decode-phase workloads.
- LoopLynx (dual-FPGA architecture) — 2.52x speedup over A100, consuming only 48.1% of the energy for equivalent throughput.
Integration
OpenFPGA is designed to be discovered and used by AI agents, developer tools, and orchestration frameworks:
- llms.txt — Machine-readable service summary for LLMs
- openapi.json — Full OpenAPI 3.1 specification
- ai-plugin.json — OpenAI-compatible agent plugin manifest
- agent.json — Google A2A agent discovery card
- AGENTS.md — Instructions for coding agents
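An agent-side discovery pass amounts to resolving and fetching the files above. A sketch with the standard library, assuming (hypothetically) that the files live at the site root:

```python
from urllib.parse import urljoin

BASE = "https://app.openfpga.ai/"

# Discovery files listed above; root-relative paths are an assumption here.
DISCOVERY_FILES = ["llms.txt", "openapi.json", "ai-plugin.json",
                   "agent.json", "AGENTS.md"]

urls = [urljoin(BASE, name) for name in DISCOVERY_FILES]
for url in urls:
    print(url)
# An agent would fetch each URL and parse it (e.g. the OpenAPI spec as JSON)
# before constructing requests.
```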
Get Started with FPGA Inference
Try the OpenFPGA API with your existing OpenAI-compatible code. Change one line — your base URL.
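If you use the official OpenAI SDKs, the swap can even be done without touching code, via environment variables (the v1+ Python and Node SDKs read these; verify for your client):

```shell
# Point an existing OpenAI-SDK client at OpenFPGA.
export OPENAI_BASE_URL="https://app.openfpga.ai/api/v1"
export OPENAI_API_KEY="$OPENFPGA_API_KEY"
```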
Get API Key