FPGA Inference API
Run open-source LLMs on FPGA hardware through an OpenAI-compatible API. Deterministic latency, lower energy per token, no GPU queuing delays.
What Is FPGA Inference?
FPGA (Field-Programmable Gate Array) inference replaces the GPU in the AI inference pipeline with reconfigurable hardware. Unlike GPUs, which execute a fixed instruction set across thousands of identical cores, FPGAs are configured as custom hardware pipelines tailored to the specific model being served.
This means the hardware architecture itself is optimized for each model — custom memory hierarchies, arbitrary bit-width arithmetic, and bare-metal operation without OS or driver overhead. The result is a purpose-built inference engine that processes tokens through dedicated silicon pathways.
Why FPGAs Excel at LLM Inference
LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens one at a time). The decode phase is memory-bandwidth-bound at batch size 1, which is the regime where FPGAs have a structural advantage over GPUs.
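The bandwidth-bound claim follows from simple arithmetic: at batch size 1, every generated token must stream (roughly) the full set of model weights through memory. A back-of-envelope sketch, using illustrative numbers rather than measured figures:

```python
def decode_tokens_per_sec(params_billions: float, bytes_per_param: float,
                          mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput (tokens/s).

    tokens/s ~= memory bandwidth / bytes moved per token, where the weight
    traffic dominates (KV-cache reads and overheads are ignored here).
    """
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

# An 8B-parameter model in FP16 (2 bytes/param) on hardware with ~800 GB/s
# of usable bandwidth tops out around 50 tokens/s per stream:
print(round(decode_tokens_per_sec(8, 2, 800)))  # → 50
```

Because the ceiling is set by bandwidth and bytes-per-parameter, hardware that sustains a higher fraction of its peak bandwidth, or supports narrower weight encodings, raises this bound directly.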
| Metric | GPU (H100/H200) | FPGA (Intel Agilex) |
|---|---|---|
| Tokens/s per watt | Baseline | 5-20x higher |
| Latency consistency | Variable (queuing) | Deterministic |
| Idle power draw | High | Low |
| Bit-width flexibility | FP16/INT8/INT4 | Any bit-width |
| Memory architecture | Fixed HBM hierarchy | Custom per model |
Performance claims are based on internal benchmarks and published research, including FlightLLM (arXiv:2401.03868) and GLITCHES (Tsinghua University); see the research section below.

The OpenFPGA API
OpenFPGA implements the standard OpenAI chat completions API. Switch from any OpenAI-compatible provider by changing your base URL:
```bash
curl https://app.openfpga.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENFPGA_API_KEY" \
  -d '{
    "model": "llama-3.1-8b-fpga",
    "messages": [
      {"role": "user", "content": "Explain FPGA inference in one sentence."}
    ]
  }'
```
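The same request can be built in plain Python with only the standard library; the endpoint and model ID below come from this page, and the actual send is left commented out so the sketch runs without a key:

```python
import json
import os
import urllib.request

BASE_URL = "https://app.openfpga.ai/api/v1"

# Identical payload to the curl example above.
payload = {
    "model": "llama-3.1-8b-fpga",
    "messages": [
        {"role": "user", "content": "Explain FPGA inference in one sentence."}
    ],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENFPGA_API_KEY', '')}",
    },
    method="POST",
)

# with urllib.request.urlopen(req) as resp:              # sends the request
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library works the same way; only the base URL and key differ.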
Available Models
| Model | Model ID | Hardware |
|---|---|---|
| Llama 3.1 8B Instruct | llama-3.1-8b-fpga | Intel Agilex |
Additional models are being optimized for FPGA deployment; query GET /api/v1/models for the current list.
API Features
- OpenAI-compatible chat completions (POST /api/v1/chat/completions)
- Streaming responses (SSE)
- Function calling and tool use
- Structured JSON output
- Standard error codes with actionable messages
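Streaming responses follow the OpenAI server-sent-events convention: each event line carries a `data:` JSON chunk whose `choices[0].delta` holds a fragment of the reply, terminated by a `data: [DONE]` sentinel. A minimal consumer, sketched here against synthetic lines (field names assume the OpenAI chunk format this API advertises compatibility with):

```python
import json

def collect_stream(lines):
    """Accumulate the assistant message from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                        # skip blank keep-alives / comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                           # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")  # first chunk may be role-only
    return "".join(parts)

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # → Hello world
```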
Hardware: Intel Agilex on Bittware IA-860M
OpenFPGA runs on Bittware IA-860M accelerator cards featuring Intel Agilex FPGAs. These cards provide:
- High-bandwidth memory (HBM2E) for model weights and KV cache
- PCIe Gen5 connectivity to host CPUs (Intel Xeon / AMD EPYC)
- Custom dataflow pipelines synthesized per model architecture
- Bare-metal operation — no GPU driver stack, no CUDA, no scheduling overhead
Published Research
FPGA-based LLM inference is supported by peer-reviewed research demonstrating competitive or superior performance to GPU inference in specific regimes:
- FlightLLM (arXiv:2401.03868) — Configurable sparse acceleration for LLMs on FPGA, demonstrating that custom dataflow and sparsity patterns on FPGA can match or exceed GPU throughput for specific model configurations.
- GLITCHES (Tsinghua University) — Heterogeneous FPGA acceleration framework showing energy-efficient LLM serving with custom memory management.
- Positron Atlas (Altera Agilex-7M) — 70% more tokens/sec than NVIDIA Hopper at 3.5x performance per watt for decode-phase workloads.
- LoopLynx (dual-FPGA architecture) — 2.52x speedup over A100, consuming only 48.1% of the energy for equivalent throughput.
Integration
OpenFPGA is designed to be discovered and used by AI agents, developer tools, and orchestration frameworks:
- llms.txt — Machine-readable service summary for LLMs
- openapi.json — Full OpenAPI 3.1 specification
- ai-plugin.json — OpenAI-compatible agent plugin manifest
- agent.json — Google A2A agent discovery card
- AGENTS.md — Instructions for coding agents
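An agent-side discovery pass amounts to resolving and fetching the files above. A sketch with the standard library, assuming (hypothetically) that the files live at the site root:

```python
from urllib.parse import urljoin

BASE = "https://app.openfpga.ai/"

# Discovery files listed above; root-relative paths are an assumption here.
DISCOVERY_FILES = ["llms.txt", "openapi.json", "ai-plugin.json",
                   "agent.json", "AGENTS.md"]

urls = [urljoin(BASE, name) for name in DISCOVERY_FILES]
for url in urls:
    print(url)
# An agent would fetch each URL and parse it (e.g. the OpenAPI spec as JSON)
# before constructing requests.
```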
Get Started with FPGA Inference
Try the OpenFPGA API with your existing OpenAI-compatible code. Change one line — your base URL.
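If you use the official OpenAI SDKs, the swap can even be done without touching code, via environment variables (the v1+ Python and Node SDKs read these; verify for your client):

```shell
# Point an existing OpenAI-SDK client at OpenFPGA.
export OPENAI_BASE_URL="https://app.openfpga.ai/api/v1"
export OPENAI_API_KEY="$OPENFPGA_API_KEY"
```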
Get API Key