A Software Engineer’s Journey into ML Deployment: Discovering Kubeflow
Introduction
As a software engineer, I’ve always been fascinated by the ways cloud technology has evolved over the past decade to solve complex infrastructure issues. Recently, while working on a personal project with an ML module, I was exploring options for deploying my ML model. The first problem I faced was understanding why we even need ML-specific solutions in the first place.
Why not just use Kubernetes for ML model deployment? Isn’t Kubernetes the all-in-one solution? Initially, I thought, “Why do we even need another abstraction when Kubernetes already exists?”
This scenario challenged my current cloud knowledge and pushed me to dive deeper into MLOps, a field I knew little about before. While exploring MLOps, I came across several platforms and frameworks such as TensorFlow Extended (TFX), AWS Neuron, and Google’s Vertex AI. However, the one whose architecture intrigued me the most was Kubeflow.
I chose Kubeflow because it resonated the most with my idea of an ML deployment solution. From what I understood, Kubeflow builds on top of existing Kubernetes components, optimizing them specifically for the ML lifecycle.
As I delved deeper into the ML lifecycle and experimented with deploying ML models, I realized how vastly different ML applications are from web applications. The gap Kubeflow bridges isn’t just an abstraction; it makes deploying ML applications significantly easier than using Kubernetes alone.
This blog shares my journey of discovering Kubeflow, learning its purpose, and deploying my first ML pipeline, complete with diagrams and a beginner-friendly example.
Understanding the ML Lifecycle: A Paradigm Shift for Software Engineers
How Web Applications Work
Web applications typically follow a simple, linear lifecycle:
- Develop: Write code and create the application.
- Deploy: Push the application to a server or cloud environment.
- Maintain: Monitor logs and fix bugs as they arise.
This works well because web apps don’t usually require constant retraining or rebuilding once deployed.
The Complexity of ML Applications
ML applications are fundamentally different. They follow an iterative, cyclical lifecycle:
- Data Collection: Gather and preprocess massive amounts of data.
- Model Training: Train ML models, often requiring distributed computing.
- Hyperparameter Tuning: Optimize model performance through multiple runs.
- Deployment: Push the model into production.
- Monitoring & Retraining: Continuously monitor performance and retrain with new data.
Each stage involves heavy compute, automation, and scaling, which is where traditional DevOps tools like plain Kubernetes fall short of ML-specific needs.
Where Kubernetes Falls Short for Machine Learning
While Kubernetes excels in container orchestration, it doesn’t natively address the unique challenges of the ML lifecycle:
- Distributed Training: ML models often require distributed training using frameworks like TensorFlow or PyTorch. Kubernetes doesn’t provide built-in support for this.
- Pipeline Orchestration: Defining, automating, and tracking ML workflows is tedious and error-prone in Kubernetes.
- Hyperparameter Tuning: Kubernetes doesn’t have tools for automating this critical part of the ML lifecycle.
- Model Serving: Serving ML models for inference requires additional tools and configurations, which Kubernetes doesn’t simplify.
How Kubeflow Bridges This Gap
Kubeflow builds on top of Kubernetes to address these challenges with ML-optimized tools:
- Pipelines: Define, automate, and monitor ML workflows visually or programmatically (see the sketch after this list).
- TFJob and PyTorchJob: Run distributed training jobs for TensorFlow and PyTorch with ease.
- Katib: Automate hyperparameter tuning using advanced search algorithms.
- KFServing (KServe): Simplify model deployment and scaling for production-grade inference.
- Notebooks: Spin up Jupyter Notebooks directly within the Kubeflow dashboard for experiments.
In short, Kubeflow transforms Kubernetes into an ML-first platform.
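To make “programmatically” concrete, here is a minimal sketch of what a Kubeflow pipeline looks like with the kfp SDK. This assumes kfp v2 syntax, and the component, pipeline, and parameter names are illustrative, not taken from any real project:
from kfp import dsl

# Each step is a self-contained component that Kubeflow runs in its own container.
@dsl.component(base_image='python:3.11')
def say_hello(name: str) -> str:
    return f'Hello, {name}!'

# The pipeline wires components together into a workflow Kubeflow can schedule.
@dsl.pipeline(name='hello-pipeline')
def hello_pipeline(name: str = 'Kubeflow'):
    say_hello(name=name)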
Setting Up Kubeflow
To start experimenting with Kubeflow, you need a Kubernetes cluster and some basic familiarity with its CLI.
Prerequisites
- A Kubernetes cluster (local: K3s/Minikube/Docker Desktop, or cloud: GKE/AKS/EKS).
- The kubectl CLI installed.
Steps to Install Kubeflow
1. Deploy a Kubernetes cluster. I will be using Docker Desktop, where you can create a cluster by enabling Kubernetes in Docker Desktop’s settings. Check the status of your cluster using this command:
kubectl cluster-info
2. To deploy Kubeflow Pipelines, run the following commands:
export PIPELINE_VERSION=2.3.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
3. Verify that the Kubeflow Pipelines UI is accessible by port-forwarding:
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
Then, you can open the Kubeflow Pipelines UI at http://localhost:8080/
Building a Simple ML Pipeline
The Problem: Iris Dataset Classification
We’ll create an ML pipeline that preprocesses data, trains a model, and evaluates its performance, using Kubeflow to make the lifecycle portable and scalable.
Step 1: Clone the Repository
Start by cloning the repository to your local machine:
git clone https://github.com/AhmadHassan71/Scaling-AI-Workflows-with-Kubeflow-on-Kubernetes.git
cd Scaling-AI-Workflows-with-Kubeflow-on-Kubernetes
Step 2: Explore the Pipeline Components
The code in the notebook includes several components (a simplified sketch of the first one follows this list):
- Data Preprocessing: Reads and preprocesses the Iris dataset.
- Model Training: Trains a classification model using the processed data.
- Model Evaluation: Evaluates the trained model’s accuracy.
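As a concrete example, here is a simplified, hypothetical version of the preprocessing component, assuming the kfp v2 SDK. The repo’s actual code differs, and prepare_data and its parameter are illustrative names:
from kfp import dsl

# Hypothetical preprocessing component: loads the Iris dataset and
# writes it out as a CSV for the downstream training step.
@dsl.component(base_image='python:3.11',
               packages_to_install=['pandas', 'scikit-learn'])
def prepare_data(data_path: str):
    from sklearn.datasets import load_iris

    # load_iris(as_frame=True) returns the dataset as a pandas DataFrame.
    iris = load_iris(as_frame=True)
    iris.frame.to_csv(f'{data_path}/iris.csv', index=False)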
Step 3: Set Up Your Environment
Ensure your Kubeflow environment is ready. If you haven’t already installed Kubeflow, refer to the earlier section for setup instructions.
Step 4: Run the Cluster
Start your Kubernetes cluster; in my case, that’s Docker Desktop, as mentioned earlier. Then verify that the Kubeflow Pipelines UI is accessible:
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
Step 5: Run the Kubeflow Pipeline
I have the Kubeflow Pipelines SDK (kfp) installed in my Python environment. Running the code compiles the pipeline into a YAML file:
# This part of the code defines the pipeline
@dsl.pipeline(
    name='IRIS classifier Kubeflow Pipeline',
    description='IRIS classifier'
)
def iris_classifier_pipeline(data_path: str):
    #...
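The body elided above chains the pipeline’s steps together. A hypothetical version might look like the following, reusing prepare_data from the earlier sketch and assuming train_model and evaluate_model components defined the same way (the repo’s actual names may differ):
# Hypothetical pipeline body; decorated with @dsl.pipeline as shown above.
def iris_classifier_pipeline(data_path: str):
    # Preprocess first, then train, then evaluate, in strict order.
    prepare_task = prepare_data(data_path=data_path)
    train_task = train_model(data_path=data_path).after(prepare_task)
    evaluate_model(data_path=data_path).after(train_task)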
Now, to run the Kubeflow pipeline, you can use the following code (the client connection assumes the port-forward from the setup step):
import kfp
import kfp.compiler
# Connect to the Kubeflow Pipelines API exposed by the earlier port-forward
client = kfp.Client(host='http://localhost:8080')
pipeline_func = iris_classifier_pipeline
# Create an experiment and run the pipeline
experiment_name = 'iris_classifier_exp'
run_name = 'iris_classifier_run'
namespace = "kubeflow"
arguments = {"data_path": DATA_PATH}  # DATA_PATH is set earlier in the notebook
# Compile the pipeline to YAML, then submit it as a run
kfp.compiler.Compiler().compile(pipeline_func, 'KubeFlow_Pipeline_IRIS_Classifier.yaml')
run_result = client.create_run_from_pipeline_func(pipeline_func, experiment_name=experiment_name, run_name=run_name, arguments=arguments)
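If you want to confirm the submission from your notebook, the result object returned by create_run_from_pipeline_func exposes the run’s ID; the print below is just an illustration:
# Print the submitted run's ID so it can be located in the UI.
print(f'Submitted run: {run_result.run_id}')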
Step 6: Results
You can now access the Kubeflow UI at http://localhost:8080/ to view and explore your pipelines.
What I Learned
Building and deploying the Iris classification pipeline helped me understand:
- How Kubeflow simplifies orchestrating complex ML workflows compared to Kubernetes.
- The power of visualizing and monitoring ML pipelines through the Kubeflow dashboard.
- The practical steps involved in setting up and running a Kubeflow pipeline.
Kubeflow takes the heavy lifting out of deploying and managing ML applications, making it easier for its main audience, AI engineers and researchers, to focus on experimentation and optimization rather than infrastructure setup.
Conclusion
My journey into Kubeflow started with curiosity about why we need an ML-specific platform when Kubernetes exists. Through hands-on experience with the Iris pipeline, I realized the value Kubeflow brings to the table by extending Kubernetes for the ML lifecycle.
For software engineers venturing into MLOps, Kubeflow is more than a tool; it’s a gateway to understanding how ML applications are built, deployed, and scaled in the age of AI.
Resources
If you’re interested in diving deeper into Kubeflow and MLOps, here are some valuable resources to guide your journey: