Air-Gapped LLM Support
TGI & vLLM Support for Air-Gapped Environments
Forjinn provides full support for Text Generation Inference (TGI) and vLLM inference servers, enabling completely offline deployment in air-gapped environments with no external network access.
Overview
Air-gapped environments require complete network isolation while maintaining full LLM capabilities. Forjinn supports:
- Text Generation Inference (TGI): Hugging Face's high-performance inference server
- vLLM: High-throughput LLM serving with advanced features
- Complete Offline Operation: No external internet access required
- Model Management: Local model storage and loading
- Security: Enterprise-grade network isolation and access control
Text Generation Inference (TGI) Support
Features
- Air-Gapped Deployment: Complete offline operation
- OpenAI-Compatible API: Standard OpenAI API format
- Model Flexibility: Support for any Hugging Face model
- Streaming Support: Real-time response streaming
- Health Monitoring: Built-in health check endpoints
- GPU Acceleration: CUDA and ROCm support
TGI Server Deployment
Docker Deployment
# Basic TGI deployment
docker run --gpus all \
-p 8080:80 \
-v $PWD/models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id microsoft/DialoGPT-medium \
--num-shard 1 \
--port 80 \
--max-concurrent-requests 128 \
--max-best-of 1 \
--max-stop-sequences 6
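Note that with a Hub model ID like the one above, the first start still downloads weights from the Hugging Face Hub, so for a truly air-gapped host point --model-id at a model already present in the mounted /data volume (as in the advanced example below). Once the container reports ready, a quick request confirms the server is answering; this sketch assumes the 8080:80 port mapping shown above.
# Smoke test against TGI's native generate endpoint
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 20}}'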
Advanced Configuration
# Production TGI deployment with custom settings
docker run --gpus all \
-p 8080:80 \
-v $PWD/models:/data \
-v $PWD/cache:/tmp \
-e HUGGING_FACE_HUB_TOKEN="" \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id /data/my-custom-model \
--num-shard 2 \
--max-concurrent-requests 256 \
--max-batch-prefill-tokens 4096 \
--max-input-length 2048 \
--max-total-tokens 4096 \
--waiting-served-ratio 1.2 \
--max-waiting-tokens 20
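To further guarantee that the container never tries to reach the Hugging Face Hub, you can also set the standard Hugging Face offline environment variables. This is an optional hardening sketch; whether they are honored depends on your TGI version, so verify against your release.
# Optional hardening: force offline mode in addition to using a local model path
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/models:/data \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/my-custom-model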
TGI Configuration in Forjinn
// TGI Chat Model Configuration
{
label: 'ChatTGI',
name: 'chatTGI',
description: 'Connect to Text Generation Inference (TGI) server',
baseUrl: 'http://localhost:8080', // TGI server URL
model: 'tgi', // Always 'tgi' for TGI servers
temperature: 0.7,
maxTokens: 2048,
streaming: true
}
TGI API Integration
const obj: ChatOpenAIFields = {
temperature: parseFloat(temperature) || 0.7,
model: 'tgi',
openAIApiKey: 'EMPTY', // TGI doesn't require an API key
streaming: streaming ?? true,
configuration: {
baseURL: `${tgiBaseUrl}/v1`,
defaultHeaders: {
'Content-Type': 'application/json'
}
}
}
vLLM Support
Features
- High Performance: Optimized inference engine with PagedAttention
- Multi-Model Support: Dynamic model loading and switching
- Vision Models: Support for multimodal models (LLaVA, etc.)
- Batching: Efficient continuous batching
- Air-Gapped Operation: Complete offline deployment
- Advanced Sampling: Multiple sampling algorithms
vLLM Server Deployment
Basic Deployment
# Basic vLLM deployment
python -m vllm.entrypoints.openai.api_server \
--model microsoft/DialoGPT-medium \
--host 0.0.0.0 \
--port 8000 \
--served-model-name my-model
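Because vLLM exposes an OpenAI-compatible API, the deployment can be verified with a plain completion request. This assumes the server is reachable on localhost:8000 and uses the name passed via --served-model-name above.
# Verify the vLLM server responds
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 20}'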
Advanced Configuration
# Production vLLM deployment
python -m vllm.entrypoints.openai.api_server \
--model /models/my-custom-model \
--host 0.0.0.0 \
--port 8000 \
--served-model-name my-model \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--block-size 16 \
--max-num-seqs 256 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--swap-space 4 \
--disable-log-requests
Docker Deployment
# vLLM Docker deployment
docker run --gpus all \
-v $PWD/models:/models \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/my-model \
--host 0.0.0.0 \
--port 8000
vLLM Configuration in Forjinn
// vLLM Chat Model Configuration
{
label: 'ChatVLLM',
name: 'chatVLLM',
description: 'Connect to vLLM inference server',
baseUrl: 'http://localhost:8000',
modelName: 'my-model', // Actual model name from vLLM
temperature: 0.7,
maxTokens: 2048,
streaming: true,
allowImageUploads: false, // Enable for vision models
imageResolution: 'low'
}
Model Discovery
// Automatic model discovery from vLLM server
async listModels(nodeData: INodeData): Promise<INodeOptionsValue[]> {
// The vLLM base URL is read from the node's credential configuration
const baseUrl = getCredentialParam('baseUrl', credentialData, nodeData)
const modelsUrl = `${baseUrl}/v1/models`
const response = await fetch(modelsUrl, {
headers: { 'Content-Type': 'application/json' }
})
if (response.ok) {
const data = await response.json()
return data.data.map((model: any) => ({
label: model.id,
name: model.id
}))
}
return []
}
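The same endpoint can be queried by hand when debugging discovery problems; the node maps over the data array of the standard OpenAI model list response.
# Manual equivalent of the discovery call
curl -s http://localhost:8000/v1/models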
Air-Gapped Environment Setup
Complete Docker Compose Configuration
version: '3.8'
services:
forjinn:
image: forjinn:latest
ports:
- "3000:3000"
environment:
- DATABASE_PATH=/data/database.sqlite
- PYTHON_RUNTIME=http://python-server:8000
- TGI_ENDPOINT=http://tgi-server:80
- VLLM_ENDPOINT=http://vllm-server:8000
volumes:
- ./data:/data
- ./uploads:/app/uploads
networks:
- isolated
depends_on:
- python-server
- tgi-server
- vllm-server
python-server:
image: forjinn-python:latest
ports:
- "8000:8000"
volumes:
- ./python_venvs:/app/python_venvs
- ./artifacts:/app/artifacts
networks:
- isolated
tgi-server:
image: ghcr.io/huggingface/text-generation-inference:latest
ports:
- "8080:80"
volumes:
- ./models:/data
- ./tgi-cache:/tmp
environment:
- MODEL_ID=/data/my-model
- NUM_SHARD=1
- MAX_CONCURRENT_REQUESTS=128
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- isolated
vllm-server:
image: vllm/vllm-openai:latest
ports:
- "8081:8000"
volumes:
- ./models:/models
- ./vllm-cache:/root/.cache
command: >
--model /models/my-model
--host 0.0.0.0
--port 8000
--tensor-parallel-size 1
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- isolated
# Optional: Local model registry
model-registry:
image: nginx:alpine
ports:
- "8082:80"
volumes:
- ./models:/usr/share/nginx/html/models:ro
- ./nginx.conf:/etc/nginx/nginx.conf:ro
networks:
- isolated
networks:
isolated:
driver: bridge
internal: true # No external internet access
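After the images and models have been copied onto the air-gapped host, the stack can be brought up and checked from inside the isolated network. The commands below are a sketch that reuses the service names defined above and assumes curl is available inside the forjinn image; otherwise run the checks from any container attached to the isolated network.
# Bring the stack up and verify the inference servers over the internal network
docker compose up -d
docker compose exec forjinn curl -s http://tgi-server:80/health
docker compose exec forjinn curl -s http://vllm-server:8000/health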
Model Management for Air-Gapped Deployment
Pre-download Models
# Download models before air-gapped deployment
huggingface-cli download microsoft/DialoGPT-medium --local-dir ./models/DialoGPT-medium
huggingface-cli download microsoft/DialoGPT-large --local-dir ./models/DialoGPT-large
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./models/Llama-2-7b-chat
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.1 --local-dir ./models/Mistral-7B-Instruct
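These downloads happen on a connected staging machine (gated models such as meta-llama additionally require accepting the license and authenticating there); the resulting directories are then transferred into the air-gapped network via approved media. A minimal packaging sketch:
# Package the downloaded models for transfer
tar czf models.tar.gz models/
sha256sum models.tar.gz > models.tar.gz.sha256
# On the air-gapped host, verify and unpack
sha256sum -c models.tar.gz.sha256 && tar xzf models.tar.gz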
Model Directory Structure
models/
├── DialoGPT-medium/
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── tokenizer.json
│ └── tokenizer_config.json
├── Llama-2-7b-chat/
│ ├── config.json
│ ├── pytorch_model-00001-of-00002.bin
│ ├── pytorch_model-00002-of-00002.bin
│ └── tokenizer.model
└── Mistral-7B-Instruct/
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
└── tokenizer.json
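Before starting the servers, a quick sanity check that each model directory contains a config and at least one weight file helps catch incomplete transfers. A minimal sketch:
# Flag model directories that are missing a config or weight files
for d in models/*/; do
  [ -f "$d/config.json" ] || echo "missing config.json in $d"
  ls "$d" | grep -Eq '\.(bin|safetensors)$' || echo "no weight files in $d"
done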
Security Features
Network Isolation
# Complete network isolation
networks:
isolated:
driver: bridge
internal: true
ipam:
config:
- subnet: 172.20.0.0/16
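Isolation can be confirmed from inside any container on the network: outbound requests should fail while service-to-service traffic succeeds. A sketch, again assuming curl is available in the forjinn image:
# External access should fail on an internal-only network
docker compose exec forjinn curl -s --max-time 5 https://huggingface.co || echo "external access blocked (expected)"
# Service-to-service traffic should still work
docker compose exec forjinn curl -s http://vllm-server:8000/health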
Authentication & Authorization
// API Key Authentication for vLLM
const obj: ChatOpenAIFields = {
openAIApiKey: vllmApiKey || 'EMPTY',
configuration: {
baseURL: vllmBaseUrl,
defaultHeaders: {
'Authorization': `Bearer ${vllmApiKey}`,
'Content-Type': 'application/json'
}
}
}
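For the Authorization header above to be enforced, the vLLM server itself must be started with an API key. Recent vLLM releases accept an --api-key option on the OpenAI-compatible server; verify the flag against your installed version.
# Start vLLM with API key enforcement
python -m vllm.entrypoints.openai.api_server \
  --model /models/my-model \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "$VLLM_API_KEY"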
Resource Management
# Resource limits and reservations
deploy:
resources:
limits:
memory: 16G
cpus: '8'
reservations:
memory: 8G
cpus: '4'
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Performance Optimization
TGI Optimization
# Optimized TGI configuration
--max-concurrent-requests 256
--max-batch-prefill-tokens 4096
--max-input-length 2048
--max-total-tokens 4096
--waiting-served-ratio 1.2
--max-waiting-tokens 20
vLLM Optimization
# Optimized vLLM configuration
--tensor-parallel-size 2
--pipeline-parallel-size 1
--block-size 16
--max-num-seqs 256
--max-model-len 4096
--gpu-memory-utilization 0.9
--swap-space 4
Hardware Requirements
Minimum Requirements
- CPU: 8 cores, 2.4GHz
- RAM: 32GB
- GPU: NVIDIA RTX 3080 (10GB VRAM) or equivalent
- Storage: 500GB SSD
Recommended Requirements
- CPU: 16 cores, 3.0GHz
- RAM: 64GB
- GPU: NVIDIA A100 (40GB VRAM) or equivalent
- Storage: 1TB NVMe SSD
Monitoring & Maintenance
Health Checks
# TGI health check
curl http://localhost:8080/health
# vLLM health check
curl http://localhost:8000/health
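In startup scripts it is useful to wait until both servers report healthy before routing traffic to them; a minimal polling sketch (roughly a 60-second timeout per server):
# Wait for both inference servers to become healthy
for url in http://localhost:8080/health http://localhost:8000/health; do
  for i in $(seq 1 30); do
    curl -sf "$url" >/dev/null && break
    sleep 2
  done
done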
Metrics Collection
# Prometheus metrics (optional)
services:
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
networks:
- isolated
Log Management
# Centralized logging
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "3"
Troubleshooting
Common Issues
- Model loading failures: Check model path and permissions
- GPU memory errors: Adjust memory utilization settings
- Connection timeouts: Increase timeout values
- Performance issues: Optimize batch sizes and concurrency
Debug Commands
# Check GPU usage
nvidia-smi
# Monitor service logs (service names from the Compose file above)
docker compose logs -f tgi-server
docker compose logs -f vllm-server
# Test API endpoints
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]}'
This comprehensive setup enables complete air-gapped LLM deployment with enterprise-grade security and performance.