FPGA Inference API

Run open-source LLMs on FPGA hardware through an OpenAI-compatible API. Deterministic latency, lower energy per token, no GPU queuing delays.

OpenFPGA is the first cloud FPGA inference gateway. It provides an OpenAI-compatible API that runs large language models on Intel Agilex FPGA accelerators instead of GPUs. The result: deterministic latency, 5-20x higher tokens/s per watt, and 40-60% lower cost compared to equivalent GPU inference.

What Is FPGA Inference?

FPGA (Field-Programmable Gate Array) inference replaces the GPU in the AI inference pipeline with reconfigurable hardware. Unlike GPUs, which execute a fixed instruction set across thousands of identical cores, FPGAs are configured as custom hardware pipelines tailored to the specific model being served.

This means the hardware architecture itself is optimized for each model — custom memory hierarchies, arbitrary bit-width arithmetic, and bare-metal operation without OS or driver overhead. The result is a purpose-built inference engine that processes tokens through dedicated silicon pathways.

Why FPGAs Excel at LLM Inference

LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens one at a time). The decode phase is memory-bandwidth-bound at batch size 1, which is the regime where FPGAs have a structural advantage over GPUs.
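A back-of-envelope roofline makes the decode bound concrete: at batch size 1, every generated token streams the full weight set from memory, so throughput is roughly memory bandwidth divided by weight bytes. The bandwidth and precision figures below are illustrative assumptions, not measured numbers for any specific card.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    """Roofline estimate of batch-1 decode throughput.

    Each token requires reading all weights once, so:
    tokens/s ~= memory bandwidth / total weight bytes.
    """
    weight_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

# Example (illustrative): an 8B-parameter model quantized to 4 bits
# (0.5 bytes/param) on a 512 GB/s memory system:
print(decode_tokens_per_s(512, 8, 0.5))  # 128.0 tokens/s
```

This is why arbitrary bit-width arithmetic matters: halving bytes per parameter roughly doubles the batch-1 decode ceiling on the same memory system.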

| Metric | GPU (H100/H200) | FPGA (Intel Agilex) |
| --- | --- | --- |
| Tokens/s per watt | Baseline | 5-20x higher |
| Latency consistency | Variable (queuing) | Deterministic |
| Idle power draw | High | Low |
| Bit-width flexibility | FP16/INT8/INT4 | Any bit-width |
| Memory architecture | Fixed HBM hierarchy | Custom per model |

Performance claims based on internal benchmarks and published research including FlightLLM (arXiv:2401.03868) and GLITCHES (Tsinghua University). See research section below.

The OpenFPGA API

OpenFPGA implements the standard OpenAI chat completions API. Switch from any OpenAI-compatible provider by changing your base URL:

curl https://app.openfpga.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENFPGA_API_KEY" \
  -d '{
    "model": "llama-3.1-8b-fpga",
    "messages": [
      {"role": "user", "content": "Explain FPGA inference in one sentence."}
    ]
  }'
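The same request can be issued from Python using only the standard library. The snippet mirrors the curl call above; actually sending it requires a valid OPENFPGA_API_KEY in the environment.

```python
import json
import os
import urllib.request

# Same request body as the curl example.
payload = {
    "model": "llama-3.1-8b-fpga",
    "messages": [
        {"role": "user", "content": "Explain FPGA inference in one sentence."}
    ],
}

req = urllib.request.Request(
    "https://app.openfpga.ai/api/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENFPGA_API_KEY', '')}",
    },
    method="POST",
)

# With a valid key, send the request and read the completion:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```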

Available Models

| Model | Model ID | Hardware |
| --- | --- | --- |
| Llama 3.1 8B Instruct | llama-3.1-8b-fpga | Intel Agilex |

Additional models are being optimized for FPGA deployment. The API returns available models at GET /api/v1/models.
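A sketch of querying the models endpoint, assuming it returns the standard OpenAI list shape ({"object": "list", "data": [{"id": ...}, ...]}):

```python
import json
import urllib.request

def list_model_ids(base_url: str, api_key: str) -> list[str]:
    """Fetch GET {base_url}/models and return the model IDs.

    Assumes the standard OpenAI list response shape.
    """
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [m["id"] for m in body["data"]]

# Offline, the same parsing applies to a sample response:
sample = {"object": "list", "data": [{"id": "llama-3.1-8b-fpga", "object": "model"}]}
print([m["id"] for m in sample["data"]])  # ['llama-3.1-8b-fpga']
```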

API Features

Hardware: Intel Agilex on Bittware IA-860M

OpenFPGA runs on Bittware IA-860M accelerator cards built around Intel Agilex FPGAs.

Published Research

FPGA-based LLM inference is supported by peer-reviewed research, including FlightLLM (arXiv:2401.03868) and work from Tsinghua University, demonstrating competitive or superior performance to GPU inference in specific regimes.

Integration

OpenFPGA is designed to be discovered and used by AI agents, developer tools, and orchestration frameworks.

Get Started with FPGA Inference

Try the OpenFPGA API with your existing OpenAI-compatible code. Change one line — your base URL.
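A minimal sketch of that one-line switch, assuming the official openai Python package (any OpenAI-compatible client works the same way):

```python
# Only the base URL changes relative to stock OpenAI usage.
BASE_URL = "https://app.openfpga.ai/api/v1"

# Sketch, assuming the `openai` package is installed:
# import os
# from openai import OpenAI
# client = OpenAI(base_url=BASE_URL, api_key=os.environ["OPENFPGA_API_KEY"])
# resp = client.chat.completions.create(
#     model="llama-3.1-8b-fpga",
#     messages=[{"role": "user", "content": "Hello from an FPGA."}],
# )
```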

Get API Key