AI Setup Details (Planned)
Note: This AI infrastructure is currently planned and awaiting hardware acquisition. The specifications below represent the intended setup.
Hardware Configuration
Dedicated AI Hardware
- GPU: 2x NVIDIA RTX 3090 (24 GB VRAM each, 48 GB total)
- CPU: AMD Ryzen 9 5900X (12 cores, 24 threads)
- Storage: 2TB NVMe SSD dedicated to AI models and datasets
- Cooling: Enhanced cooling solution for sustained workloads
- Power Management: Power limits tuned for the best performance-per-watt under sustained load
Software Stack
Model Serving Infrastructure
- Kubernetes: Single-node cluster for orchestration
- OpenWebUI: Web interface with SSO integration (GitHub, Google, Microsoft)
- Ollama: For serving large language models (see the API sketch after this list)
- A1111 (AUTOMATIC1111 Stable Diffusion WebUI): For image generation
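As a first taste of how other services would talk to this stack, here is a minimal Python sketch against Ollama's REST API. It assumes Ollama's default port 11434 and that a model tagged `phi4` has already been pulled; both details are placeholders until the hardware exists.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; adjust for the cluster service

def generate(prompt: str, model: str = "phi4") -> str:
    """Send a single non-streaming completion request to Ollama."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(generate("Summarize why GGUF quantization matters in one sentence."))
```

OpenWebUI would sit in front of the same endpoint for interactive use; the raw API is what automations would call.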
Self-Hosted Models
Large Language Models
- Mistral 14B
- Llama 4 17B
- Phi-4
- Qwen2.5-Coder 32B
- Qwen2.5 72B
- DeepSeek-R1 70B
- Gemma 3 27B
- QwQ 32B
- Qwen2.5-VL 72B
Image Generation Models
- Stable Diffusion: For high-quality image generation
Specialized Models
- Whisper: For speech-to-text transcription
- Embedding models: For semantic search and RAG applications (see the similarity sketch after this list)
- Classification models: For various automation tasks
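The embedding bullet above is the building block for the Knowledge Base section further down. As a sketch, the snippet below calls Ollama's embeddings endpoint and compares two texts with plain cosine similarity; the model name `nomic-embed-text` is an assumption, not a committed choice.

```python
import math
import requests

OLLAMA_URL = "http://localhost:11434"

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding vector from Ollama's embeddings endpoint."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; a vector database would do this at scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

if __name__ == "__main__":
    q = embed("How do I restart the k3s node?")
    doc = embed("Runbook: restarting the single-node Kubernetes cluster")
    print(f"similarity: {cosine_similarity(q, doc):.3f}")
```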
Quantization Techniques
Model Optimization
- GGUF Format: Conversion from original checkpoint formats to GGUF for llama.cpp-based runtimes
- Quantization Levels: Experimentation with precisions from 4-bit up to FP16 (worked example after this list)
- Pruning: Selective neuron removal for model compression
- KV Cache Optimization: For improved inference speed
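To make the quantization trade-off concrete, here is the back-of-the-envelope arithmetic that motivates 4-bit weights on this hardware. The bits-per-weight figures are rough GGUF averages, and the estimate covers weights only, ignoring KV cache and runtime overhead:

```python
# Approximate effective bits per weight for common GGUF quantization types.
QUANT_BITS = {"fp16": 16.0, "q8_0": 8.5, "q5_k_m": 5.5, "q4_k_m": 4.5}

def weight_gib(params_billions: float, quant: str) -> float:
    """Weight memory only; KV cache and overhead add several GiB more."""
    return params_billions * 1e9 * QUANT_BITS[quant] / 8 / 2**30

for quant in QUANT_BITS:
    print(f"70B @ {quant:7s} ~ {weight_gib(70, quant):6.1f} GiB")
# fp16 (~130 GiB) and q8_0 (~69 GiB) cannot fit in 48 GB of VRAM;
# only ~4-bit weights (~37 GiB) leave headroom for the KV cache.
```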
Performance Benchmarks
- Tokens per Second: Measurements across different models and quantization levels (see the benchmark sketch after this list)
- Memory Usage: Tracking of VRAM and system RAM requirements
- Quality Assessment: Evaluation of output quality vs. model size
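A benchmark harness for the tokens-per-second numbers could be as small as the sketch below, which derives decode throughput from the timing fields Ollama reports in its own response (`eval_count` and `eval_duration`, per its API docs at time of writing); the model tags are placeholders:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def tokens_per_second(model: str, prompt: str) -> float:
    """Derive decode throughput from Ollama's own timing fields."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    ).json()
    # eval_count = generated tokens, eval_duration = decode time in nanoseconds
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

for model in ["phi4", "qwen2.5:72b-instruct-q4_K_M"]:
    tps = tokens_per_second(model, "Explain KV caching in two sentences.")
    print(f"{model:35s} {tps:6.1f} tok/s")
```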
Integration with Services
Web Interface
- OpenWebUI: Kubernetes-hosted UI for interacting with models
- Traefik Ingress: Secure HTTPS access with Let's Encrypt
- API Access: REST API for programmatic access (example after this list)
- SSO Authentication: Secure access control with multiple providers
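For programmatic access through the ingress, something like the following should work, assuming OpenWebUI's OpenAI-compatible chat route and a per-user API key issued in its settings; the hostname `ai.example.lab` is a placeholder:

```python
import os
import requests

# Hypothetical ingress hostname; Traefik would terminate TLS in front of it.
BASE_URL = "https://ai.example.lab"
API_KEY = os.environ["OPENWEBUI_API_KEY"]  # per-user key from OpenWebUI

resp = requests.post(
    f"{BASE_URL}/api/chat/completions",  # OpenWebUI's OpenAI-compatible route
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "phi4",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```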
Automation
- Document Processing: Automated analysis and summarization of incoming documents
- Content Generation: Assistance for writing and creative tasks
- Code Assistance: Integration with development environment
Knowledge Base
- Vector Database: For storing and retrieving embeddings
- Document Indexing: Processing of personal knowledge base
- Retrieval Augmented Generation: Enhanced responses with personal data
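Tying these pieces together, a minimal retrieval-augmented loop could look like the sketch below. It reuses `embed()`, `cosine_similarity()`, and `generate()` from the earlier sketches, and an in-memory dict stands in for what a real vector database would do at scale:

```python
# Assumes embed(), cosine_similarity(), and generate() from the sketches above.
# corpus maps document text -> its precomputed embedding vector.

def retrieve(query: str, corpus: dict[str, list[float]], k: int = 3) -> list[str]:
    """Rank pre-embedded documents by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine_similarity(q, corpus[doc]),
                    reverse=True)
    return ranked[:k]

def answer(query: str, corpus: dict[str, list[float]]) -> str:
    """Prepend the top-k retrieved documents to the prompt before generating."""
    context = "\n---\n".join(retrieve(query, corpus))
    prompt = f"Use only this context to answer.\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```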
Resource Management
Scheduling
- Priority System: Resource allocation based on task importance
- Queue Management: Handling of concurrent inference requests (see the queue sketch after this list)
- Batch Processing: Optimized for throughput when appropriate
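One simple way to realize the priority and queueing bullets is a thread-safe priority heap in front of the inference workers. The sketch below is illustrative only; the priority values are arbitrary:

```python
import heapq
import itertools
import threading

class InferenceQueue:
    """Priority queue for inference jobs; lower number = more important."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order
        self._lock = threading.Lock()

    def submit(self, priority: int, prompt: str) -> None:
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), prompt))

    def next_job(self) -> str | None:
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]

q = InferenceQueue()
q.submit(2, "batch: summarize RSS feeds")      # background work
q.submit(0, "interactive: user chat message")  # served first
print(q.next_job())
```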
Monitoring
- Prometheus & Grafana: Comprehensive monitoring stack
- Service Monitors: Custom monitoring for AI services
- GPU Utilization: Real-time tracking of GPU usage (exporter sketch after this list)
- Inference Metrics: Response times and throughput
- Error Tracking: Logging and alerting for model issues
- AlertManager: Notifications for critical issues
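For the GPU metrics, a small Prometheus exporter built on `prometheus-client` and NVIDIA's NVML bindings (the `nvidia-ml-py` package) would be enough to feed Grafana dashboards and AlertManager rules; port 9400 is an arbitrary choice:

```python
import time

import pynvml  # from the nvidia-ml-py package
from prometheus_client import Gauge, start_http_server

gpu_util = Gauge("gpu_utilization_percent", "GPU core utilization", ["gpu"])
vram_used = Gauge("gpu_vram_used_bytes", "VRAM currently in use", ["gpu"])

def main() -> None:
    pynvml.nvmlInit()
    start_http_server(9400)  # Prometheus scrapes this port
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            gpu_util.labels(gpu=str(i)).set(
                pynvml.nvmlDeviceGetUtilizationRates(h).gpu)
            vram_used.labels(gpu=str(i)).set(
                pynvml.nvmlDeviceGetMemoryInfo(h).used)
        time.sleep(5)

if __name__ == "__main__":
    main()
```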
Future Enhancements
Software Improvements
- Model Distillation: Creating smaller, specialized models
- Fine-tuning: Custom adaptations for specific use cases
- Distributed Inference: Splitting workloads across multiple nodes
Integration Expansion
- Voice Interface: Real-time speech interaction
- Smart Home Integration: AI-powered automation
- Personal Assistant: Comprehensive AI assistant with home lab awareness