# ML Pipeline Orchestrator on Kubernetes

A production-ready MLOps platform for orchestrating machine learning workflows on Kubernetes, featuring automated training, versioning, and deployment.
## Motivation
Managing ML workloads in production requires:
- Reproducible experiments
- Efficient resource utilization
- Automated deployment pipelines
- Model monitoring and rollback
This platform addresses these challenges with a cloud-native approach.
## System Components

### 1. Airflow for Orchestration

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG(
    'ml_training_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,  # skip backfill runs for dates before deployment
)

train = PythonOperator(
    task_id='train_model',
    python_callable=train_model,  # training entrypoint defined in the project
    dag=dag,
)
```
### 2. Kubernetes Jobs for Training

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      restartPolicy: Never  # required for Jobs; the Pod default (Always) is invalid here
      containers:
        - name: trainer
          image: ml-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
```
### 3. MLflow for Experiment Tracking

```python
import mlflow
import mlflow.pytorch

# `model` is the trained torch.nn.Module produced by the training step
with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 32})
    mlflow.log_metrics({"accuracy": 0.95, "loss": 0.23})
    mlflow.pytorch.log_model(model, "model")
```
## Architecture

```
┌─────────────────────────────────────┐
│          Airflow Scheduler          │
│      (Workflow Orchestration)       │
└──────────────────┬──────────────────┘
                   │
          ┌────────┴────────┐
          │                 │
  ┌───────▼──────┐  ┌───────▼───────┐
  │  Data Prep   │  │  Model Train  │
  │   K8s Job    │  │    K8s Job    │
  └───────┬──────┘  └───────┬───────┘
          │                 │
          └────────┬────────┘
                   │
          ┌────────▼────────┐
          │  MLflow Server  │
          │   (Tracking)    │
          └────────┬────────┘
                   │
          ┌────────▼────────┐
          │ Model Registry  │
          │  (Versioning)   │
          └────────┬────────┘
                   │
          ┌────────▼────────┐
          │   Seldon Core   │
          │    (Serving)    │
          └─────────────────┘
```
## Features

### Auto-scaling
- Horizontal Pod Autoscaler: Scale training jobs based on CPU/GPU usage
- Cluster Autoscaler: Add nodes dynamically for large experiments
- Vertical Pod Autoscaler: Optimize resource requests
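As a sketch, a Horizontal Pod Autoscaler targeting the serving layer might look like the following (resource names and thresholds are illustrative, not taken from this repository):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server   # illustrative Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```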
### Experiment Management
- Centralized tracking of hyperparameters and metrics
- Version control for datasets and models
- Comparison of experiment results
- Artifact storage (models, plots, logs)
### Deployment Pipeline

1. Model trained and logged to MLflow
2. Automated testing (accuracy, latency, memory)
3. Promotion to staging environment
4. A/B testing against the current production model
5. Gradual rollout to production
6. Monitoring and alerting
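The gradual-rollout step boils down to a probabilistic traffic split. A minimal sketch of that rule (illustrative only; in this platform Seldon Core performs the actual routing):

```python
import random


def route_request(canary_fraction: float, rng: random.Random) -> str:
    """Route one request: 'canary' with probability canary_fraction, else 'stable'."""
    return "canary" if rng.random() < canary_fraction else "stable"


# With a 10% canary, roughly one in ten requests hits the new model.
rng = random.Random(0)
routes = [route_request(0.10, rng) for _ in range(10_000)]
print(routes.count("canary"))
```

Raising `canary_fraction` in stages (10% → 50% → 100%) while watching error rates gives the gradual rollout described above.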
### Resource Optimization
- GPU sharing for small models
- Spot instances for non-critical training
- Dynamic resource allocation
- Cost tracking per experiment
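Per-experiment cost tracking can be as simple as multiplying resource-hours by a rate card. A sketch (the hourly rates below are placeholders, not real provider prices):

```python
from dataclasses import dataclass

# Placeholder hourly rates; substitute your provider's actual pricing.
RATES = {"gpu": 2.50, "cpu": 0.05, "memory_gib": 0.01}


@dataclass
class ExperimentUsage:
    gpus: int
    cpus: int
    memory_gib: int
    hours: float


def experiment_cost(u: ExperimentUsage) -> float:
    """Estimated dollar cost for one training run."""
    hourly = (u.gpus * RATES["gpu"]
              + u.cpus * RATES["cpu"]
              + u.memory_gib * RATES["memory_gib"])
    return round(hourly * u.hours, 2)


# A 4-GPU, 8-CPU, 32 GiB run for 6 hours:
print(experiment_cost(ExperimentUsage(gpus=4, cpus=8, memory_gib=32, hours=6.0)))
```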
## Setup

### Prerequisites
- Kubernetes cluster (v1.24+)
- kubectl configured
- Helm 3.x
- Docker registry access
### Installation

1. **Deploy infrastructure**

   ```shell
   # Add Helm repos
   helm repo add apache-airflow https://airflow.apache.org
   helm repo add mlflow https://community-charts.github.io/helm-charts

   # Install Airflow
   helm install airflow apache-airflow/airflow -f airflow-values.yaml

   # Install MLflow
   helm install mlflow mlflow/mlflow -f mlflow-values.yaml
   ```
2. **Deploy the application**

   ```shell
   # Build images
   docker build -t ml-pipeline:latest .

   # Deploy to cluster
   kubectl apply -f k8s/
   ```
3. **Configure Airflow DAGs**

   ```shell
   kubectl cp dags/ airflow-scheduler-0:/opt/airflow/dags/
   ```
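Copying DAGs into the scheduler pod works for quick iteration, but the copy does not survive a pod restart. The apache-airflow Helm chart can instead sync DAGs from a Git repository via its `dags.gitSync` values; a sketch (the repository URL is a placeholder):

```yaml
# airflow-values.yaml (fragment)
dags:
  gitSync:
    enabled: true
    repo: https://github.com/example/ml-pipeline-dags.git
    branch: main
    subPath: dags
```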
## Usage Example

### Train a Model

```python
from ml_pipeline import PipelineClient

# Submit a training job
client = PipelineClient(api_url="http://airflow.example.com")

run = client.submit_training(
    model_type="resnet50",
    dataset="imagenet",
    hyperparams={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100,
    },
    resources={
        "gpus": 4,
        "memory": "32Gi",
    },
)

print(f"Training job submitted: {run.id}")
```
### Monitor Progress

```shell
# View job status
kubectl get jobs -l pipeline=ml-training

# Stream logs
kubectl logs -f job/model-training-abc123

# Open the MLflow UI
open http://mlflow.example.com
```
### Deploy Model

```python
# Promote to production with a canary rollout
client.promote_model(
    model_id="abc123",
    environment="production",
    traffic_percentage=10,  # route 10% of traffic to the new model initially
)
```
## Performance Metrics
| Metric | Value |
|---|---|
| Pipeline Setup Time | < 5 min |
| Training Job Startup | < 30 sec |
| Model Deployment | < 2 min |
| Cost Savings (vs cloud ML platforms) | ~40% |
## Monitoring

### Prometheus Metrics
- Training job duration
- Resource utilization (CPU, GPU, Memory)
- Model serving latency
- Prediction error rates
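Example PromQL queries over such metrics (the metric names are illustrative; actual names depend on your exporters, except `DCGM_FI_DEV_GPU_UTIL`, which is the standard DCGM exporter GPU-utilization metric):

```
# p95 model-serving latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(model_serving_latency_seconds_bucket[5m])) by (le))

# Average GPU utilization per node (DCGM exporter)
avg by (node) (DCGM_FI_DEV_GPU_UTIL)
```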
### Grafana Dashboards
- Training pipeline overview
- Cost analysis
- Model performance comparison
- Infrastructure health
## Best Practices

- **Version everything**: code, data, models, and configs
- **Test pipelines**: unit tests for data processing and model logic
- **Resource limits**: prevent jobs from consuming the entire cluster
- **Secrets management**: use Kubernetes Secrets for credentials
- **Backup models**: take regular snapshots of the model registry
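The resource-limit and secrets points can both be applied in the trainer container spec. A sketch with illustrative names (the Secret `mlflow-credentials` and env var are assumptions, not part of this repository):

```yaml
containers:
  - name: trainer
    image: ml-trainer:latest
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        nvidia.com/gpu: 1
        memory: 16Gi
    env:
      - name: MLFLOW_TRACKING_TOKEN
        valueFrom:
          secretKeyRef:
            name: mlflow-credentials  # created via `kubectl create secret generic ...`
            key: token
```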
## Troubleshooting

### Common Issues

**Job fails with OOM**

```shell
# A Job's pod template is immutable once created, so raise the memory
# limit in the manifest and resubmit instead of editing in place
kubectl delete job model-training-abc123
kubectl apply -f k8s/
```

**GPU not allocated**

```shell
# Check node labels
kubectl get nodes --show-labels | grep nvidia

# Confirm the device plugin is advertising GPUs
kubectl describe nodes | grep -A2 'nvidia.com/gpu'
```

**Airflow DAG not appearing**

```shell
# Verify the scheduler can see the DAG
kubectl exec -it airflow-scheduler-0 -- airflow dags list
```
## Roadmap
- [ ] Support for distributed training (Horovod, DeepSpeed)
- [ ] Integration with Feast for feature store
- [ ] Automated model drift detection
- [ ] Multi-cloud support (AWS, GCP, Azure)
- [ ] Cost optimization recommendations
## Contributing

See CONTRIBUTING.md for contribution guidelines.

## License

Apache 2.0. See LICENSE.