Air-Gapped LLM Support

Learn about air-gapped LLM support and how to implement it effectively.

Last updated: 12/9/2025

TGI & vLLM Support for Air-Gapped Environments

Forjinn provides comprehensive support for Text Generation Inference (TGI) and vLLM servers, enabling deployment in 100% air-gapped environments with complete offline functionality.

Overview

Air-gapped environments require complete network isolation while maintaining full LLM capabilities. Forjinn supports:

  • Text Generation Inference (TGI): HuggingFace's high-performance inference server
  • vLLM: High-throughput LLM serving with advanced features
  • Complete Offline Operation: No external internet access required
  • Model Management: Local model storage and loading
  • Security: Enterprise-grade isolation and access control

Text Generation Inference (TGI) Support

Features

  • Air-Gapped Deployment: Complete offline operation
  • OpenAI-Compatible API: Standard OpenAI API format
  • Model Flexibility: Support for any HuggingFace model
  • Streaming Support: Real-time response streaming
  • Health Monitoring: Built-in health check endpoints
  • GPU Acceleration: CUDA and ROCm support

TGI Server Deployment

Docker Deployment

# Basic TGI deployment
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id microsoft/DialoGPT-medium \
  --num-shard 1 \
  --port 80 \
  --max-concurrent-requests 128 \
  --max-best-of 1 \
  --max-stop-sequences 6

Advanced Configuration

# Production TGI deployment with custom settings
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/models:/data \
  -v $PWD/cache:/tmp \
  -e HUGGING_FACE_HUB_TOKEN="" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/my-custom-model \
  --num-shard 2 \
  --max-concurrent-requests 256 \
  --max-batch-prefill-tokens 4096 \
  --max-input-length 2048 \
  --max-total-tokens 4096 \
  --waiting-served-ratio 1.2 \
  --max-waiting-tokens 20

TGI Configuration in Forjinn

// TGI Chat Model Configuration
{
  label: 'ChatTGI',
  name: 'chatTGI',
  description: 'Connect to Text Generation Inference (TGI) server',
  baseUrl: 'http://localhost:8080', // TGI server URL
  model: 'tgi', // Always 'tgi' for TGI servers
  temperature: 0.7,
  maxTokens: 2048,
  streaming: true
}

TGI API Integration

const obj: ChatOpenAIFields = {
  temperature: parseFloat(temperature) || 0.7,
  model: 'tgi',
  openAIApiKey: 'EMPTY', // TGI doesn't require API key
  streaming: streaming ?? true,
  configuration: {
    baseURL: `${tgiBaseUrl}/v1`,
    defaultHeaders: {
      'Content-Type': 'application/json'
    }
  }
}
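
For reference, a configuration like the one above can be exercised outside of Forjinn with LangChain's ChatOpenAI client. The sketch below is illustrative only; the base URL and prompts are placeholders for your own TGI server.

// Illustrative only: exercising a TGI endpoint through LangChain's
// OpenAI-compatible ChatOpenAI client. The base URL is a placeholder.
import { ChatOpenAI } from '@langchain/openai'

const chat = new ChatOpenAI({
  model: 'tgi',            // TGI serves the single model it was started with
  temperature: 0.7,
  openAIApiKey: 'EMPTY',   // TGI does not validate API keys
  configuration: { baseURL: 'http://localhost:8080/v1' }
})

// Non-streaming request
const reply = await chat.invoke('Hello from an air-gapped network')
console.log(reply.content)

// Streaming request
for await (const chunk of await chat.stream('Stream a short answer')) {
  process.stdout.write(String(chunk.content))
}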

vLLM Support

Features

  • High Performance: Optimized inference engine with PagedAttention
  • Multi-Model Support: Dynamic model loading and switching
  • Vision Models: Support for multimodal models (LLaVA, etc.)
  • Batching: Efficient continuous batching
  • Air-Gapped Operation: Complete offline deployment
  • Advanced Sampling: Multiple sampling algorithms

vLLM Server Deployment

Basic Deployment

# Basic vLLM deployment
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/DialoGPT-medium \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name my-model

Advanced Configuration

# Production vLLM deployment
python -m vllm.entrypoints.openai.api_server \
  --model /models/my-custom-model \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name my-model \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --block-size 16 \
  --max-num-seqs 256 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --swap-space 4 \
  --disable-log-requests

Docker Deployment

# vLLM Docker deployment
docker run --gpus all \
  -v $PWD/models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/my-model \
  --host 0.0.0.0 \
  --port 8000

vLLM Configuration in Forjinn

// vLLM Chat Model Configuration
{
  label: 'ChatVLLM',
  name: 'chatVLLM',
  description: 'Connect to vLLM inference server',
  baseUrl: 'http://localhost:8000',
  modelName: 'my-model', // Actual model name from vLLM
  temperature: 0.7,
  maxTokens: 2048,
  streaming: true,
  allowImageUploads: false, // Enable for vision models
  imageResolution: 'low'
}
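
For vision models, requests follow the standard OpenAI multimodal message format, which vLLM accepts on its chat completions endpoint. The sketch below is illustrative; the URL, model name, and base64 payload are placeholders.

// Sketch: sending an image to a vision model (e.g. LLaVA) served by vLLM
// through its OpenAI-compatible endpoint. URL, model name, and the base64
// payload are placeholders.
const response = await fetch('http://localhost:8000/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'my-model',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image.' },
          { type: 'image_url', image_url: { url: 'data:image/png;base64,<BASE64_IMAGE>' } }
        ]
      }
    ],
    max_tokens: 256
  })
})
const result = await response.json()
console.log(result.choices[0].message.content)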

Model Discovery

// Automatic model discovery from vLLM server
async listModels(nodeData: INodeData): Promise<INodeOptionsValue[]> {
  const baseUrl = getCredentialParam('baseUrl', credentialData, nodeData)
  const modelsUrl = `${baseUrl}/v1/models`
  
  const response = await fetch(modelsUrl, {
    headers: { 'Content-Type': 'application/json' }
  })
  
  if (response.ok) {
    const data = await response.json()
    return data.data.map((model: any) => ({
      label: model.id,
      name: model.id
    }))
  }
  
  return []
}

Air-Gapped Environment Setup

Complete Docker Compose Configuration

version: '3.8'
services:
  forjinn:
    image: forjinn:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_PATH=/data/database.sqlite
      - PYTHON_RUNTIME=http://python-server:8000
      - TGI_ENDPOINT=http://tgi-server:80
      - VLLM_ENDPOINT=http://vllm-server:8000
    volumes:
      - ./data:/data
      - ./uploads:/app/uploads
    networks:
      - isolated
    depends_on:
      - python-server
      - tgi-server
      - vllm-server

  python-server:
    image: forjinn-python:latest
    ports:
      - "8000:8000"
    volumes:
      - ./python_venvs:/app/python_venvs
      - ./artifacts:/app/artifacts
    networks:
      - isolated

  tgi-server:
    image: ghcr.io/huggingface/text-generation-inference:latest
    ports:
      - "8080:80"
    volumes:
      - ./models:/data
      - ./tgi-cache:/tmp
    environment:
      - MODEL_ID=/data/my-model
      - NUM_SHARD=1
      - MAX_CONCURRENT_REQUESTS=128
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - isolated

  vllm-server:
    image: vllm/vllm-openai:latest
    ports:
      - "8081:8000"
    volumes:
      - ./models:/models
      - ./vllm-cache:/root/.cache
    command: >
      --model /models/my-model
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - isolated

  # Optional: Local model registry
  model-registry:
    image: nginx:alpine
    ports:
      - "8082:80"
    volumes:
      - ./models:/usr/share/nginx/html/models:ro
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - isolated

networks:
  isolated:
    driver: bridge
    internal: true  # No external internet access

Model Management for Air-Gapped Deployment

Pre-download Models

# Download models before air-gapped deployment
huggingface-cli download microsoft/DialoGPT-medium --local-dir ./models/DialoGPT-medium
huggingface-cli download microsoft/DialoGPT-large --local-dir ./models/DialoGPT-large
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./models/Llama-2-7b-chat
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.1 --local-dir ./models/Mistral-7B-Instruct

Model Directory Structure

models/
├── DialoGPT-medium/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── Llama-2-7b-chat/
│   ├── config.json
│   ├── pytorch_model-00001-of-00002.bin
│   ├── pytorch_model-00002-of-00002.bin
│   └── tokenizer.model
└── Mistral-7B-Instruct/
    ├── config.json
    ├── model-00001-of-00002.safetensors
    ├── model-00002-of-00002.safetensors
    └── tokenizer.json
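
Before starting the air-gapped stack, it can help to verify that each model directory was transferred completely. The Node.js sketch below is a minimal check based on the layout above; the expected file names are assumptions and may differ per model.

// Sketch: sanity-check local model directories before starting the
// air-gapped stack. Required files are assumptions based on the layout
// above; adjust for your models.
import { readdirSync, existsSync } from 'node:fs'
import { join } from 'node:path'

const MODELS_DIR = './models'

for (const model of readdirSync(MODELS_DIR)) {
  const dir = join(MODELS_DIR, model)
  const hasConfig = existsSync(join(dir, 'config.json'))
  const hasWeights = readdirSync(dir).some(
    (f) => f.endsWith('.safetensors') || f.endsWith('.bin')
  )
  const hasTokenizer =
    existsSync(join(dir, 'tokenizer.json')) || existsSync(join(dir, 'tokenizer.model'))
  console.log(`${model}: config=${hasConfig} weights=${hasWeights} tokenizer=${hasTokenizer}`)
}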

Security Features

Network Isolation

# Complete network isolation
networks:
  isolated:
    driver: bridge
    internal: true
    ipam:
      config:
        - subnet: 172.20.0.0/16

Authentication & Authorization

// API Key Authentication for vLLM
const obj: ChatOpenAIFields = {
  openAIApiKey: vllmApiKey || 'EMPTY',
  configuration: {
    baseURL: vllmBaseUrl,
    defaultHeaders: {
      'Authorization': `Bearer ${vllmApiKey}`,
      'Content-Type': 'application/json'
    }
  }
}

Resource Management

# Resource limits and reservations
deploy:
  resources:
    limits:
      memory: 16G
      cpus: '8'
    reservations:
      memory: 8G
      cpus: '4'
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

Performance Optimization

TGI Optimization

# Optimized TGI configuration
--max-concurrent-requests 256
--max-batch-prefill-tokens 4096
--max-input-length 2048
--max-total-tokens 4096
--waiting-served-ratio 1.2
--max-waiting-tokens 20

vLLM Optimization

# Optimized vLLM configuration
--tensor-parallel-size 2
--pipeline-parallel-size 1
--block-size 16
--max-num-seqs 256
--max-model-len 4096
--gpu-memory-utilization 0.9
--swap-space 4

Hardware Requirements

Minimum Requirements

  • CPU: 8 cores, 2.4GHz
  • RAM: 32GB
  • GPU: NVIDIA RTX 3080 (10GB VRAM) or equivalent
  • Storage: 500GB SSD

Recommended Requirements

  • CPU: 16 cores, 3.0GHz
  • RAM: 64GB
  • GPU: NVIDIA A100 (40GB VRAM) or equivalent
  • Storage: 1TB NVMe SSD

Monitoring & Maintenance

Health Checks

# TGI health check
curl http://localhost:8080/health

# vLLM health check
curl http://localhost:8000/health
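
In automated deployments you may want to wait for both servers to report healthy before routing traffic to them. The sketch below polls the same endpoints as the curl checks above; the retry count and interval are arbitrary choices.

// Sketch: poll both inference servers until they report healthy before
// enabling traffic. Endpoints mirror the curl checks above.
const endpoints = ['http://localhost:8080/health', 'http://localhost:8000/health']

async function waitForHealthy(url: string, retries = 30): Promise<void> {
  for (let i = 0; i < retries; i++) {
    try {
      const res = await fetch(url)
      if (res.ok) {
        console.log(`${url} is healthy`)
        return
      }
    } catch {
      // server not reachable yet
    }
    await new Promise((r) => setTimeout(r, 5000)) // wait 5s between attempts
  }
  throw new Error(`${url} did not become healthy in time`)
}

await Promise.all(endpoints.map((url) => waitForHealthy(url)))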

Metrics Collection

# Prometheus metrics (optional)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - isolated

Log Management

# Centralized logging
logging:
  driver: "json-file"
  options:
    max-size: "100m"
    max-file: "3"

Troubleshooting

Common Issues

  1. Model loading failures: Check model path and permissions
  2. GPU memory errors: Adjust memory utilization settings
  3. Connection timeouts: Increase client-side timeout values (see the sketch after this list)
  4. Performance issues: Optimize batch sizes and concurrency
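
For the connection-timeout case, the client-side timeout can be raised where the chat model is configured. The sketch below assumes the LangChain OpenAI-compatible client shown earlier; all values are illustrative.

// Sketch: raising client-side timeouts for slow first-token responses.
// Values are illustrative; the timeout is passed to the underlying
// OpenAI-compatible client in milliseconds.
import { ChatOpenAI } from '@langchain/openai'

const chat = new ChatOpenAI({
  model: 'my-model',
  openAIApiKey: 'EMPTY',
  maxRetries: 2,
  configuration: {
    baseURL: 'http://localhost:8000/v1',
    timeout: 120_000 // 2 minutes, in milliseconds
  }
})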

Debug Commands

# Check GPU usage
nvidia-smi

# Monitor container logs
docker logs -f tgi-server
docker logs -f vllm-server

# Test API endpoints
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]}'

This comprehensive setup enables complete air-gapped LLM deployment with enterprise-grade security and performance.