A Software Engineer’s Journey into ML Deployment: Discovering Kubeflow

Ahmad Hassan
6 min read · Dec 10, 2024


Introduction

As a software engineer, I’ve always been fascinated by the ways cloud technology has evolved over the past decade to solve complex infrastructure issues. Recently, while working on a personal project with an ML module, I was exploring options for deploying my ML model. The first problem I faced was understanding why we even need ML-specific solutions in the first place.

Why not just use Kubernetes for ML model deployment? Isn’t Kubernetes the all-in-one solution? Initially, I thought, “Why do we even need this abstraction when Kubernetes already exists?”

This scenario challenged my current cloud knowledge and pushed me to dive deeper into MLOps, a field I knew little about before. While exploring MLOps, I came across several cloud solutions such as TensorFlow Extended (TFX), AWS Neuron, and Google’s Vertex AI. However, the one whose architecture intrigued me the most was Kubeflow.

Different solutions for MLOps

I chose Kubeflow because it resonated the most with my idea of an ML deployment solution. From what I understood, Kubeflow builds on top of existing Kubernetes components, optimizing them specifically for the ML lifecycle.

As I delved deeper into the ML lifecycle and experimented with deploying ML models, I realized how vastly different ML applications are from web applications. The gap Kubeflow bridges isn’t just another layer of abstraction: it makes deploying ML applications significantly easier than working with Kubernetes alone.

This blog shares my journey of discovering Kubeflow, learning its purpose, and deploying my first ML pipeline, complete with diagrams and a beginner-friendly example.

How ML Deployment solutions build upon existing solutions

Understanding the ML Lifecycle: A Paradigm Shift for Software Engineers

How Web Applications Work

Web applications typically follow a simple, linear lifecycle:

  1. Develop: Write code and create the application.
  2. Deploy: Push the application to a server or cloud environment.
  3. Maintain: Monitor logs and fix bugs as they arise.

This works well because web apps don’t usually require constant re-training or re-building once deployed.

The lifecycle of a web application

The Complexity of ML Applications

ML applications are fundamentally different. They follow an iterative, cyclical lifecycle:

  1. Data Collection: Gather and preprocess massive amounts of data.
  2. Model Training: Train ML models, often requiring distributed computing.
  3. Hyperparameter Tuning: Optimize model performance through multiple runs.
  4. Deployment: Push the model into production.
  5. Monitoring & Retraining: Continuously monitor performance and retrain with new data.

A high-level lifecycle of an ML application

Each stage involves heavy compute, automation, and scaling, which is where general-purpose DevOps tooling like Kubernetes falls short of ML-specific needs.

Where Kubernetes Falls Short for Machine Learning

While Kubernetes excels in container orchestration, it doesn’t natively address the unique challenges of the ML lifecycle:

  1. Distributed Training: ML models often require distributed training using frameworks like TensorFlow or PyTorch. Kubernetes doesn’t provide built-in support for this (see the sketch after this list).
  2. Pipeline Orchestration: Defining, automating, and tracking ML workflows is tedious and error-prone in Kubernetes.
  3. Hyperparameter Tuning: Kubernetes doesn’t have tools for automating this critical part of the ML lifecycle.
  4. Model Serving: Serving ML models for inference requires additional tools and configurations, which Kubernetes doesn’t simplify.
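
To make the first point concrete, here is a minimal sketch of what a single training run looks like in plain Kubernetes: one hand-written Job manifest per run, with no built-in notion of pipelines, experiments, or tuning. The image name and training script are hypothetical.

apiVersion: batch/v1
kind: Job
metadata:
  name: train-iris-model
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/iris-trainer:latest   # hypothetical training image
          command: ["python", "train.py", "--data", "/data/iris.csv"]
          resources:
            limits:
              cpu: "2"
              memory: 4Gi

Everything beyond this single run (chaining steps, retrying failures, tracking artifacts) is left for you to script by hand.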

How Kubeflow Bridges this Gap

Kubeflow builds on top of Kubernetes to address these challenges with ML-optimized tools:

  • Pipelines: Define, automate, and monitor ML workflows visually or programmatically.
  • TFJob and PyTorchJob: Run distributed training jobs for TensorFlow and PyTorch with ease (a TFJob sketch follows below).
  • Katib: Automate hyperparameter tuning using advanced search algorithms.
  • KFServing (KServe): Simplify model deployment and scaling for production-grade inference.
  • Notebooks: Spin up Jupyter Notebooks directly within the Kubeflow dashboard for experiments.

In short, Kubeflow transforms Kubernetes into an ML-first platform.
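
To show how thin this layer feels in practice, here is a hedged sketch of a TFJob manifest for a two-worker distributed training run; the image is hypothetical, but the apiVersion, kind, and replica structure follow the Kubeflow Training Operator’s TFJob API:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: iris-distributed-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                  # two workers; no manual wiring of each replica
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # the Training Operator expects this container name
              image: my-registry/tf-iris-trainer:latest   # hypothetical image
              command: ["python", "train.py"]

The operator injects the TF_CONFIG environment variable into each replica, which is exactly the plumbing you would otherwise write yourself with the plain Job above.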

Kubeflow builds upon the existing Kubernetes architecture

Setting Up Kubeflow

To start experimenting with Kubeflow, you need a Kubernetes cluster and some basic familiarity with its CLI.

Prerequisites

  • A Kubernetes cluster (local: K3s/Minikube/Docker Desktop, or cloud: GKE/AKS/EKS).
  • kubectl CLI installed.

Steps to Install Kubeflow

1. Deploy a Kubernetes cluster. For this walkthrough, I’m using Docker Desktop, which can run a cluster for you: just enable Kubernetes in Docker Desktop’s settings. Check the status of your cluster using this command:
kubectl cluster-info
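
If the cluster is healthy, you should see output along these lines (the exact URL depends on your setup):

Kubernetes control plane is running at https://kubernetes.docker.internal:6443
CoreDNS is running at https://kubernetes.docker.internal:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy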

2. To deploy Kubeflow Pipelines, run the following commands:

export PIPELINE_VERSION=2.3.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
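
The deployment pulls quite a few container images and can take several minutes. You can watch the pods come up with:

kubectl get pods -n kubeflow

Once every pod reports Running, move on to the next step.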

3. Verify that the Kubeflow Pipelines UI is accessible by port-forwarding:

kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

Then, you can open the Kubeflow Pipelines UI at http://localhost:8080/

Building a Simple ML Pipeline

The Problem: Iris Dataset Classification

We’ll create an ML pipeline that preprocesses the classic Iris dataset (150 flower samples, four features, three species), trains a classification model, and evaluates its performance, using Kubeflow to keep the lifecycle portable and scalable.

Step 1: Clone the Repository

Start by cloning the repository to your local machine:

git clone https://github.com/AhmadHassan71/Scaling-AI-Workflows-with-Kubeflow-on-Kubernetes.git
cd Scaling-AI-Workflows-with-Kubeflow-on-Kubernetes

Step 2: Explore the Pipeline Components

The code in the notebook includes several components (a sketch of one follows the list):

  1. Data Preprocessing: Reads and preprocesses the Iris dataset.
  2. Model Training: Trains a classification model using the processed data.
  3. Model Evaluation: Evaluates the trained model’s accuracy.
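
For a flavor of what these look like, here is a minimal sketch of a preprocessing component in the KFP v2 SDK style; the base image and package list are illustrative, and the repository’s actual code may differ:

from kfp import dsl

@dsl.component(base_image='python:3.11', packages_to_install=['pandas'])
def preprocess_data(data_path: str, output_csv: dsl.Output[dsl.Dataset]):
    # Read the raw Iris dataset, drop incomplete rows, and write a clean copy
    import pandas as pd

    df = pd.read_csv(data_path)
    df = df.dropna()
    df.to_csv(output_csv.path, index=False)

Each component runs in its own container, which is what gives the pipeline its portability.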

Step 3: Set Up Your Environment

Ensure your Kubeflow environment is ready. If you haven’t already installed Kubeflow, refer to the earlier section for setup instructions.

Step 4: Run the Cluster

Start your Kubernetes cluster; in my case, that means launching Docker Desktop as mentioned earlier. Now, verify that the Kubeflow Pipelines UI is accessible:

kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

Step 5: Run the Kubeflow Pipeline

With the Kubeflow Pipelines SDK (kfp) installed in my Python environment, running the notebook compiles the pipeline definition into a YAML file:

# This part of the code defines the pipeline
@dsl.pipeline(
    name='IRIS classifier Kubeflow Pipeline',
    description='IRIS classifier'
)
def iris_classifier_pipeline(data_path: str):
    # ...
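
The elided body chains the components together. As a hedged sketch, it might look like the following; the train_model and evaluate_model component names and their output keys are hypothetical, not the repository’s actual identifiers:

# Hypothetical wiring; component and output names are illustrative
def iris_classifier_pipeline(data_path: str):
    preprocess_task = preprocess_data(data_path=data_path)
    train_task = train_model(train_data=preprocess_task.outputs['output_csv'])
    evaluate_model(model=train_task.outputs['model'])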

To run the pipeline, use the following code:

import kfp

# Connect to the Kubeflow Pipelines API (port-forwarded earlier)
client = kfp.Client(host='http://localhost:8080')
pipeline_func = iris_classifier_pipeline

# Create an experiment and run the pipeline
experiment_name = 'iris_classifier_exp'
run_name = 'iris_classifier_run'
namespace = "kubeflow"
arguments = {"data_path": DATA_PATH}  # DATA_PATH is defined earlier in the notebook

kfp.compiler.Compiler().compile(pipeline_func, 'KubeFlow_Pipeline_IRIS_Classifier.yaml')
run_result = client.create_run_from_pipeline_func(pipeline_func, experiment_name=experiment_name, run_name=run_name, arguments=arguments)
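
create_run_from_pipeline_func returns a result object; assuming the standard kfp client, you can read the run ID from it to locate the run in the UI:

print(run_result.run_id)  # ID of the run just submitted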

Step 6: Results

You can now access the Kubeflow UI at http://localhost:8080/ to view and explore your pipelines.

The pipeline we created, as rendered in the Kubeflow Pipelines UI

What I Learned

Building and deploying the Iris classification pipeline helped me understand:

  1. How Kubeflow simplifies orchestrating complex ML workflows compared to Kubernetes.
  2. The power of visualizing and monitoring ML pipelines through the Kubeflow dashboard.
  3. The practical steps involved in setting up and running a Kubeflow pipeline.

Kubeflow takes the heavy lifting out of deploying and managing ML applications, making it easier for its core audience of AI engineers and researchers to focus on experimentation and optimization rather than infrastructure setup.

Conclusion

My journey into Kubeflow started with curiosity about why we need an ML-specific platform when Kubernetes exists. Through hands-on experience with the Iris pipeline, I realized the value Kubeflow brings to the table by extending Kubernetes for the ML lifecycle.

For software engineers venturing into MLOps, Kubeflow is more than a tool — it’s a gateway to understanding how ML applications are built, deployed, and scaled in the age of AI.

Resources

If you’re interested in diving deeper into Kubeflow and MLOps, here are some valuable resources to guide your journey:

  • GitHub repository used in this blog.
  • Official Kubeflow documentation.
  • Code labs and hands-on projects for learning.
