Air-Gapped LLM Support
TGI & vLLM Support for Air-Gapped Environments
Forjinn provides full support for Text Generation Inference (TGI) and vLLM inference servers, enabling completely offline deployment in air-gapped environments with no external network access.
Overview
Air-gapped environments require complete network isolation while maintaining full LLM capabilities. Forjinn supports:
- Text Generation Inference (TGI): Hugging Face's high-performance inference server
- vLLM: High-throughput LLM serving with advanced features
- Complete Offline Operation: No external internet access required
- Model Management: Local model storage and loading
- Security: Enterprise-grade network isolation and access control
Text Generation Inference (TGI) Support
Features
- Air-Gapped Deployment: Complete offline operation
- OpenAI-Compatible API: Standard OpenAI API format
- Model Flexibility: Support for any Hugging Face model
- Streaming Support: Real-time response streaming
- Health Monitoring: Built-in health check endpoints
- GPU Acceleration: CUDA and ROCm support
TGI Server Deployment
Docker Deployment
# Basic TGI deployment
docker run --gpus all \
-p 8080:80 \
-v $PWD/models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id microsoft/DialoGPT-medium \
--num-shard 1 \
--port 80 \
--max-concurrent-requests 128 \
--max-best-of 1 \
--max-stop-sequences 6
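Note that with a Hub model ID like the one above, the first start still downloads weights from the Hugging Face Hub, so for a truly air-gapped host point --model-id at a model already present in the mounted /data volume (as in the advanced example below). Once the container reports ready, a quick request confirms the server is answering; this sketch assumes the 8080:80 port mapping shown above.
# Smoke test against TGI's native generate endpoint
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 20}}'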
Advanced Configuration
# Production TGI deployment with custom settings
docker run --gpus all \
-p 8080:80 \
-v $PWD/models:/data \
-v $PWD/cache:/tmp \
-e HUGGING_FACE_HUB_TOKEN="" \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id /data/my-custom-model \
--num-shard 2 \
--max-concurrent-requests 256 \
--max-batch-prefill-tokens 4096 \
--max-input-length 2048 \
--max-total-tokens 4096 \
--waiting-served-ratio 1.2 \
--max-waiting-tokens 20
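To further guarantee that the container never tries to reach the Hugging Face Hub, you can also set the standard Hugging Face offline environment variables. This is an optional hardening sketch; whether they are honored depends on your TGI version, so verify against your release.
# Optional hardening: force offline mode in addition to using a local model path
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/models:/data \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/my-custom-model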
TGI Configuration in Forjinn
// TGI Chat Model Configuration
{
label: 'ChatTGI',
name: 'chatTGI',
description: 'Connect to Text Generation Inference (TGI) server',
baseUrl: 'http://localhost:8080', // TGI server URL
model: 'tgi', // Always 'tgi' for TGI servers
temperature: 0.7,
maxTokens: 2048,
streaming: true
}
TGI API Integration
const obj: ChatOpenAIFields = {
temperature: parseFloat(temperature) || 0.7,
model: 'tgi',
openAIApiKey: 'EMPTY', // TGI doesn't require an API key
streaming: streaming ?? true,
configuration: {
baseURL: `${tgiBaseUrl}/v1`,
defaultHeaders: {
'Content-Type': 'application/json'
}
}
}
vLLM Support
Features
- High Performance: Optimized inference engine with PagedAttention
- Multi-Model Support: Dynamic model loading and switching
- Vision Models: Support for multimodal models (LLaVA, etc.)
- Batching: Efficient continuous batching
- Air-Gapped Operation: Complete offline deployment
- Advanced Sampling: Multiple sampling algorithms
vLLM Server Deployment
Basic Deployment
# Basic vLLM deployment
python -m vllm.entrypoints.openai.api_server \
--model microsoft/DialoGPT-medium \
--host 0.0.0.0 \
--port 8000 \
--served-model-name my-model
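Because vLLM exposes an OpenAI-compatible API, the deployment can be verified with a plain completion request. This assumes the server is reachable on localhost:8000 and uses the name passed via --served-model-name above.
# Verify the vLLM server responds
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 20}'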
Advanced Configuration
# Production vLLM deployment
python -m vllm.entrypoints.openai.api_server \
--model /models/my-custom-model \
--host 0.0.0.0 \
--port 8000 \
--served-model-name my-model \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--block-size 16 \
--max-num-seqs 256 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--swap-space 4 \
--disable-log-requests
Docker Deployment
# vLLM Docker deployment
docker run --gpus all \
-v $PWD/models:/models \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/my-model \
--host 0.0.0.0 \
--port 8000
vLLM Configuration in Forjinn
// vLLM Chat Model Configuration
{
label: 'ChatVLLM',
name: 'chatVLLM',
description: 'Connect to vLLM inference server',
baseUrl: 'http://localhost:8000',
modelName: 'my-model', // Actual model name from vLLM
temperature: 0.7,
maxTokens: 2048,
streaming: true,
allowImageUploads: false, // Enable for vision models
imageResolution: 'low'
}
Model Discovery
// Automatic model discovery from vLLM server
async listModels(nodeData: INodeData): Promise<INodeOptionsValue[]> {
// The vLLM base URL is read from the node's credential configuration
const baseUrl = getCredentialParam('baseUrl', credentialData, nodeData)
const modelsUrl = `${baseUrl}/v1/models`
const response = await fetch(modelsUrl, {
headers: { 'Content-Type': 'application/json' }
})
if (response.ok) {
const data = await response.json()
return data.data.map((model: any) => ({
label: model.id,
name: model.id
}))
}
return []
}
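The same endpoint can be queried by hand when debugging discovery problems; the node maps over the data array of the standard OpenAI model list response.
# Manual equivalent of the discovery call
curl -s http://localhost:8000/v1/models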
Air-Gapped Environment Setup
Complete Docker Compose Configuration
version: '3.8'
services:
forjinn:
image: forjinn:latest
ports:
- "3000:3000"
environment:
- DATABASE_PATH=/data/database.sqlite
- PYTHON_RUNTIME=http://python-server:8000
- TGI_ENDPOINT=http://tgi-server:80
- VLLM_ENDPOINT=http://vllm-server:8000
volumes:
- ./data:/data
- ./uploads:/app/uploads
networks:
- isolated
depends_on:
- python-server
- tgi-server
- vllm-server
python-server:
image: forjinn-python:latest
ports:
- "8000:8000"
volumes:
- ./python_venvs:/app/python_venvs
- ./artifacts:/app/artifacts
networks:
- isolated
tgi-server:
image: ghcr.io/huggingface/text-generation-inference:latest
ports:
- "8080:80"
volumes:
- ./models:/data
- ./tgi-cache:/tmp
environment:
- MODEL_ID=/data/my-model
- NUM_SHARD=1
- MAX_CONCURRENT_REQUESTS=128
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- isolated
vllm-server:
image: vllm/vllm-openai:latest
ports:
- "8081:8000"
volumes:
- ./models:/models
- ./vllm-cache:/root/.cache
command: >
--model /models/my-model
--host 0.0.0.0
--port 8000
--tensor-parallel-size 1
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- isolated
# Optional: Local model registry
model-registry:
image: nginx:alpine
ports:
- "8082:80"
volumes:
- ./models:/usr/share/nginx/html/models:ro
- ./nginx.conf:/etc/nginx/nginx.conf:ro
networks:
- isolated
networks:
isolated:
driver: bridge
internal: true # No external internet access
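After the images and models have been copied onto the air-gapped host, the stack can be brought up and checked from inside the isolated network. The commands below are a sketch that reuses the service names defined above and assumes curl is available inside the forjinn image; otherwise run the checks from any container attached to the isolated network.
# Bring the stack up and verify the inference servers over the internal network
docker compose up -d
docker compose exec forjinn curl -s http://tgi-server:80/health
docker compose exec forjinn curl -s http://vllm-server:8000/health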
Model Management for Air-Gapped Deployment
Pre-download Models
# Download models before air-gapped deployment
huggingface-cli download microsoft/DialoGPT-medium --local-dir ./models/DialoGPT-medium
huggingface-cli download microsoft/DialoGPT-large --local-dir ./models/DialoGPT-large
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./models/Llama-2-7b-chat
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.1 --local-dir ./models/Mistral-7B-Instruct
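These downloads happen on a connected staging machine (gated models such as meta-llama additionally require accepting the license and authenticating there); the resulting directories are then transferred into the air-gapped network via approved media. A minimal packaging sketch:
# Package the downloaded models for transfer
tar czf models.tar.gz models/
sha256sum models.tar.gz > models.tar.gz.sha256
# On the air-gapped host, verify and unpack
sha256sum -c models.tar.gz.sha256 && tar xzf models.tar.gz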
Model Directory Structure
models/
├── DialoGPT-medium/
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── tokenizer.json
│ └── tokenizer_config.json
├── Llama-2-7b-chat/
│ ├── config.json
│ ├── pytorch_model-00001-of-00002.bin
│ ├── pytorch_model-00002-of-00002.bin
│ └── tokenizer.model
└── Mistral-7B-Instruct/
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
└── tokenizer.json
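Before starting the servers, a quick sanity check that each model directory contains a config and at least one weight file helps catch incomplete transfers. A minimal sketch:
# Flag model directories that are missing a config or weight files
for d in models/*/; do
  [ -f "$d/config.json" ] || echo "missing config.json in $d"
  ls "$d" | grep -Eq '\.(bin|safetensors)$' || echo "no weight files in $d"
done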
Security Features
Network Isolation
# Complete network isolation
networks:
isolated:
driver: bridge
internal: true
ipam:
config:
- subnet: 172.20.0.0/16
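Isolation can be confirmed from inside any container on the network: outbound requests should fail while service-to-service traffic succeeds. A sketch, again assuming curl is available in the forjinn image:
# External access should fail on an internal-only network
docker compose exec forjinn curl -s --max-time 5 https://huggingface.co || echo "external access blocked (expected)"
# Service-to-service traffic should still work
docker compose exec forjinn curl -s http://vllm-server:8000/health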
Authentication & Authorization
// API Key Authentication for vLLM
const obj: ChatOpenAIFields = {
openAIApiKey: vllmApiKey || 'EMPTY',
configuration: {
baseURL: vllmBaseUrl,
defaultHeaders: {
'Authorization': `Bearer ${vllmApiKey}`,
'Content-Type': 'application/json'
}
}
}
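For the Authorization header above to be enforced, the vLLM server itself must be started with an API key. Recent vLLM releases accept an --api-key option on the OpenAI-compatible server; verify the flag against your installed version.
# Start vLLM with API key enforcement
python -m vllm.entrypoints.openai.api_server \
  --model /models/my-model \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "$VLLM_API_KEY"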
Resource Management
# Resource limits and reservations
deploy:
resources:
limits:
memory: 16G
cpus: '8'
reservations:
memory: 8G
cpus: '4'
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Performance Optimization
TGI Optimization
# Optimized TGI configuration
--max-concurrent-requests 256
--max-batch-prefill-tokens 4096
--max-input-length 2048
--max-total-tokens 4096
--waiting-served-ratio 1.2
--max-waiting-tokens 20
vLLM Optimization
# Optimized vLLM configuration
--tensor-parallel-size 2
--pipeline-parallel-size 1
--block-size 16
--max-num-seqs 256
--max-model-len 4096
--gpu-memory-utilization 0.9
--swap-space 4
Hardware Requirements
Minimum Requirements
- CPU: 8 cores, 2.4GHz
- RAM: 32GB
- GPU: NVIDIA RTX 3080 (10GB VRAM) or equivalent
- Storage: 500GB SSD
Recommended Requirements
- CPU: 16 cores, 3.0GHz
- RAM: 64GB
- GPU: NVIDIA A100 (40GB VRAM) or equivalent
- Storage: 1TB NVMe SSD
Monitoring & Maintenance
Health Checks
# TGI health check
curl http://localhost:8080/health
# vLLM health check
curl http://localhost:8000/health
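In startup scripts it is useful to wait until both servers report healthy before routing traffic to them; a minimal polling sketch (roughly a 60-second timeout per server):
# Wait for both inference servers to become healthy
for url in http://localhost:8080/health http://localhost:8000/health; do
  for i in $(seq 1 30); do
    curl -sf "$url" >/dev/null && break
    sleep 2
  done
done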
Metrics Collection
# Prometheus metrics (optional)
services:
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
networks:
- isolated
Log Management
# Centralized logging
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "3"
Troubleshooting
Common Issues
- Model loading failures: Check model path and permissions
- GPU memory errors: Adjust memory utilization settings
- Connection timeouts: Increase timeout values
- Performance issues: Optimize batch sizes and concurrency
Debug Commands
# Check GPU usage
nvidia-smi
# Monitor service logs (service names from the Compose file above)
docker compose logs -f tgi-server
docker compose logs -f vllm-server
# Test API endpoints
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]}'
This comprehensive setup enables complete air-gapped LLM deployment with enterprise-grade security and performance.