The Evolution of Software Development: MLOps, CI/CD, and Modern Practices
Overview
- Introduction
- Traditional Software Development process
- AI/ML System process
- Challenges in AI/ML System
- Generic Steps in AI/ML System
- MLOps Maturity Model Levels
- Deployment Types
1. Introduction
MLOps is an approach to operationalizing AI/ML systems. The operations in this process include automation, performance monitoring, and event correlation.
What is the difference between AIOps and MLOps?
AIOps automates system operations with the help of ML and big data, while MLOps standardizes the process of deploying ML systems and fills the gaps between teams, giving all project stakeholders more clarity.
What is the difference between Traditional software development and Machine Learning software development?
Traditional Software Development: Traditional software development is the process used to design and develop relatively simple software.
It is typically used when security and similar concerns are not critical, and it is often how newcomers start building software.
It consists only of code, so code is all we have to manage: if the software needs an update, we simply look at the code and change it.
It consists of five phases:
1. Requirements analysis
2. Design
3. Implementation (coding)
4. Testing
5. Maintenance
Machine Learning Software Development: In this process, code is still present, but alongside the code we also have to manage the data and the model.
It is no longer only about code: the data and the model have a large impact, and the data is dynamic, changing all the time.
The model is trained on the data available at the time. When new features are added to the data and the updated data is passed to the old model, it will be unable to make predictions. Managing exactly this kind of change is what AI/MLOps is for.
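The failure mode described above can be sketched in a few lines of plain Python. The "model" here is a hypothetical stand-in (a mean-per-feature toy), not a real learner:

```python
# Illustrative sketch: a model trained on a fixed feature schema cannot
# score rows that carry a new, unseen feature.

def train(rows):
    """'Train' a trivial mean-per-feature model on fixed-length rows."""
    n_features = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(n_features)]
    return {"n_features": n_features, "means": means}

def predict(model, row):
    """Score a row; reject rows whose schema differs from training."""
    if len(row) != model["n_features"]:
        raise ValueError(
            f"expected {model['n_features']} features, got {len(row)}"
        )
    # Score: weighted sum against the training means (illustrative only).
    return sum(m * x for m, x in zip(model["means"], row))

model = train([[1.0, 2.0], [3.0, 4.0]])
print(predict(model, [2.0, 3.0]))      # works: schema matches training
try:
    predict(model, [2.0, 3.0, 5.0])    # a new feature was added -> fails
except ValueError as e:
    print("prediction failed:", e)
```

Retraining the model on the updated data, which is what MLOps automates, is what resolves this mismatch.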
2. Traditional Software Development Process
There are mainly three methodologies in this process
- Waterfall Methodology
- Agile Methodology
- DevOps Methodology
Waterfall Methodology
The waterfall methodology breaks a project down into sequential phases, where each phase depends on the one before it.
Characteristics:
- It is unidirectional; no feedback loop is present.
- It requires clear-cut requirements (no changes later).
- It is a rigid approach (no change in approach mid-project).
- It is not user-centric (built from the developer's perspective).
Agile(Adaptive) Methodology
Agile software development is the process used to design complicated software.
It is used when the software is sensitive and complex, and when security matters a great deal. It is typically practised by experienced professionals.
Characteristics:
- It focuses on adaptability, modularity, and reusability.
- It is bidirectional (client + development team).
- It is practised by professionals.
DevOps Methodology
DevOps is a methodology meant to improve work throughout the software development lifecycle. You can visualize a DevOps process as an infinite loop, comprising these steps: plan, code, build, test, release, deploy, operate, monitor, and — through feedback — plan, which resets the loop.
It consists of two main components:
- CI -> Continuous Integration
- CD -> Continuous Delivery and Deployment.
Evolution of Methodology
Waterfall -> Agile -> DevOps
3. Machine Learning/Artificial Intelligence Operations Software Development
What is MLOps?
MLOps is an extension of DevOps principles to make AI/ML system development reliable.
To understand MLOps, we first need to understand what it unifies.
MLOps is an ML engineering culture and practice that aims at unifying:
- ML system development (Dev)
- ML system operation (Ops)
MLOps provides techniques for implementing and automating:
- Continuous Integration (CI) extends the testing and validation of code and components by adding the testing and validation of data and models.
- Continuous Delivery (CD) concerns the delivery of an ML training pipeline that automatically deploys an ML model prediction service.
- Continuous Training (CT) is a property unique to ML systems: it automatically retrains ML models for re-deployment.
- Continuous Monitoring (CM) concerns monitoring production data and model performance metrics, which are tied to business metrics.
MLOps advocates automation and monitoring at all stages of the ML system development process, including integration, testing, releasing, deployment, and infrastructure management.
4. Challenges in AI/ML System
Challenge 1
Building an integrated ML system and continuously operating it in production, with a vast array of surrounding infrastructure.
Challenge 2
To automate the process end to end while managing:
- Different teams (ML engineers, DevOps engineers, MLOps engineers, the data science team).
- Different technologies.
- Different routines, while keeping everything auditable and reproducible.
Challenge 3
Various dependencies like
- Data Dependency
- Model Complexity
- Reproducibility
- Testing
- Monitoring
5. Generic Steps in AI/ML
1. Data Extraction (the data engineering team extracts raw data and turns it into cleaned data).
2. Data Analysis
- Exploratory Data Analysis
- Statistical Analysis
3. Data Preparation
- Prepare the data.
- Handle missing data.
- Drop columns.
- Turn categorical values into numerical ones.
4. Model Training
5. Model Evaluation
6. Model Validation
7. Model Serving
8. Model Monitoring (a crucial step)
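The data-preparation step can be sketched in plain Python; the column names, fill value, and category list below are assumptions chosen for illustration:

```python
# Illustrative data-preparation sketch: handle missing values, drop a
# column, and encode a categorical field into a numerical index.

def prepare(rows, drop=("id",), fill=0, categories=("red", "green", "blue")):
    index = {c: i for i, c in enumerate(categories)}
    out = []
    for row in rows:
        r = {k: v for k, v in row.items() if k not in drop}        # drop columns
        r = {k: (fill if v is None else v) for k, v in r.items()}  # fill missing
        if "colour" in r:                                          # encode category
            r["colour"] = index[r["colour"]]
        out.append(r)
    return out

raw = [{"id": 1, "size": None, "colour": "green"},
       {"id": 2, "size": 7, "colour": "red"}]
print(prepare(raw))
# -> [{'size': 0, 'colour': 1}, {'size': 7, 'colour': 0}]
```

In practice a library such as pandas or scikit-learn would do this; the sketch only shows the shape of the transformations listed above.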
6. MLOps Maturity Model Levels
Level 0: Manual Process
Here there is no MLOps; everything is manual.
Most of the work, including building, training, tuning, and deploying, is done manually, typically inside a Jupyter notebook.
Characteristics
- It is a script-driven, interactive process.
- There is a disconnect between the ML and DevOps teams.
- No Continuous Delivery (CD) of the model (infrequent model releases).
- No Continuous Integration (CI).
- Only the trained model is deployed as a prediction service, not the entire ML system.
- Lack of performance monitoring.
Challenges
- Time consuming (no CD and no CI).
- Very high maintenance cost.
- New ML ideas cannot be pushed easily (everything is built from scratch with no automation, so it takes a lot of time).
- More bugs will be found.
Possible Solution
- Actively monitor the model in production.
- Retrain the model frequently (how frequently depends on the situation).
- Frequently experiment with new, optimized implementations to produce better models that are more adaptable to business needs and more robust.
e.g.: the latest state-of-the-art models
Level 1: ML Pipeline Automation
Aim:
- Perform Continuous Integration (CI) of the model in production by automating the ML pipeline, together with Continuous Delivery of the model in a prediction service, including the prediction pipeline.
- Automatically deliver an updated model, based on the latest trends, whenever new data comes in.
- Automate data validation and model validation.
- Automate pipeline triggers and metadata management.
- Pipeline orchestration (ordered execution of stages).
Characteristics
1. Rapid experimentation is possible thanks to orchestration.
2. Continuous Training (CT) of the model in production.
3. Experimental-operational symmetry.
4. Modularized code components and pipelines are reusable and shareable.
5. Continuous Delivery (CD) of the model.
6. Pipeline deployment.
Frequently Used Terms in Level 1 Workflow
1. Glue code: code that stitches the many modules together, one after the other. This is not a standard approach followed across the industry, and it leads to the failure of most ML projects.
2. Pipeline orchestration: ordered execution of components (pieces of code / modules of shell scripts).
Available Tools
a. Apache Beam
b. Apache Airflow
c. Kubeflow (Kubernetes-based; popular)
Advantages
i. Standard orchestration and abstraction.
ii. Supported by many cloud platforms.
iii. Easy to monitor and debug.
3. DAGs: Directed Acyclic Graphs
In mathematics, particularly graph theory, and computer science, a directed acyclic graph is a directed graph with no directed cycles.
That is, it consists of vertices and edges (also called arcs), with each edge directed from one vertex to another, such that following those directions will never form a closed loop.
A directed graph is a DAG if and only if it can be topologically ordered, by arranging the vertices as a linear ordering that is consistent with all edge directions. DAGs have numerous scientific and computational applications, ranging from biology (evolution, family trees, epidemiology) to information science (citation networks) to computation (scheduling).
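Pipeline orchestration amounts to executing stages in an order consistent with such a DAG. A minimal sketch using Python's standard-library graphlib follows; the stage names are assumptions, not from a specific tool:

```python
# Sketch: an orchestrator runs pipeline stages in a topological order of
# the dependency DAG. Stage names are illustrative.

from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
pipeline = {
    "validate_data": {"extract_data"},
    "prepare_data": {"validate_data"},
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}

# static_order() yields a linear ordering consistent with all edges.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Tools like Apache Airflow and Kubeflow Pipelines express their workflows as exactly this kind of DAG, with richer scheduling and monitoring on top.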
Level 1: Important Components:
Pipeline expectation:
New/live data -> new model version trained on the new data.
When we receive continuous live data from users, a new model version has to be developed based on that new data.
Requirements to fulfill the above expectation:
1. Automated data validation: decide whether to start retraining in production or to stop execution of the pipeline and go for manual investigation (involving the data science team), on the following basis:
1.1. Data schema skews: the data does not comply with the expected schema or with the data sharing agreement (DSA).
Solution: stop the pipeline and let the data science team investigate.
1.2. Data value skews: when the data patterns or statistical properties change, trigger retraining and run the entire pipeline.
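A minimal sketch of this decision logic; the schema, the baseline statistic, and the drift tolerance are all illustrative assumptions:

```python
# Hedged sketch of automated data validation: stop the pipeline on a
# schema skew, retrain on a value skew, otherwise continue.

EXPECTED_SCHEMA = {"age", "income"}   # assumed schema / DSA
BASELINE_MEAN_INCOME = 50_000.0       # assumed training-time statistic
DRIFT_TOLERANCE = 0.20                # assumed: 20% shift = value skew

def validate(batch):
    for row in batch:
        if set(row) != EXPECTED_SCHEMA:
            return "stop_pipeline"        # schema skew -> manual investigation
    mean_income = sum(r["income"] for r in batch) / len(batch)
    drift = abs(mean_income - BASELINE_MEAN_INCOME) / BASELINE_MEAN_INCOME
    if drift > DRIFT_TOLERANCE:
        return "trigger_retraining"       # value skew -> retrain on new data
    return "continue"

print(validate([{"age": 30.0, "income": 49_000.0}]))   # continue
print(validate([{"age": 30.0, "income": 90_000.0}]))   # trigger_retraining
print(validate([{"age": 30.0, "city": "Pune"}]))       # stop_pipeline
```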
2. Automated model validation: after successful model training on new data, evaluate and validate the model before pushing it into production.
There are two types of steps involved:
2.1. Offline: taken before serving the model in production.
- Get the prediction quality of the trained model. How? Calculate the evaluation metric of the trained model on a test dataset.
- Compare the metrics gathered in the step above with the metrics of the current model in production.
- The model's performance must be consistent across regional/cluster samples.
- Test for infrastructure compatibility before model deployment (model file extension: pickle, h5, pb, pth).
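The offline gate can be sketched as two small checks; the metric values and thresholds below are assumptions for illustration:

```python
# Sketch of offline model validation: promote the candidate only if it
# beats the production model, and require consistent performance across
# data segments. Thresholds are illustrative.

def should_promote(candidate_metric, production_metric, min_gain=0.0):
    """Promote only when the candidate outperforms production."""
    return candidate_metric > production_metric + min_gain

def consistent_across_segments(metrics_by_segment, max_spread=0.05):
    """Performance must be consistent on regional/cluster samples."""
    values = list(metrics_by_segment.values())
    return max(values) - min(values) <= max_spread

print(should_promote(0.91, 0.88))                            # True
print(consistent_across_segments({"eu": 0.90, "us": 0.92}))  # True
```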
2.2. Online: taken after serving the model in production.
- Canary deployment: roll out to a subset of users or servers. E.g.: Colab.
- A/B testing setup (a manual process).
- Multi-armed bandit deployment: the algorithm dynamically allocates traffic in favour of the better-performing variation.
3. Feature Store
It is a central repository for the standardized definition, storage, and access of features for both the training and serving phases.
How does it work?
- It provides an API for high-throughput batch serving and low-latency real-time serving of feature values, supporting both training and serving workloads.
How is it helpful?
- It avoids having similar features under different definitions.
- It maintains features and their relevant metadata.
- It serves up-to-date feature values.
- It avoids training/serving skew by acting as the central source of features for experiments, continuous training, and online serving.
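A toy sketch of the two access paths a feature store exposes; the function names and data are assumptions, not a real feature-store API:

```python
# Sketch: one central feature definition serves both training (batch) and
# serving (online), which is what prevents training/serving skew.

FEATURES = {  # entity_id -> latest feature values
    "user_1": {"avg_basket": 42.0, "visits_7d": 3},
    "user_2": {"avg_basket": 17.5, "visits_7d": 9},
}

def get_online_features(entity_id):
    """Low-latency, single-entity lookup for real-time serving."""
    return FEATURES[entity_id]

def get_batch_features(entity_ids):
    """High-throughput, multi-entity fetch for training."""
    return [FEATURES[e] for e in entity_ids]

print(get_online_features("user_1"))
print(get_batch_features(["user_1", "user_2"]))
```

Real systems such as Feast follow this shape, with persistent online and offline stores behind the two paths.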
4. Metadata Storage
- It records information about each execution of the machine learning pipeline.
Why?
To help in achieving the below:
a. Reproducibility.
b. Comparisons.
c. Artifact lineage.
d. Debugging errors and anomalies.
What does it record for every run?
i. The pipeline components and component versions executed.
e.g.: Data Version Control (DVC)
ii. Timing records:
- Start and end time.
- Date and time.
- Time duration of each step.
iii. Execution details of the pipeline.
iv. The parameters passed to the pipeline.
v. Pointers to the artifacts produced in each step.
E.g.: paths to the prepared data, validated data, and computed statistics.
vi. A pointer to the previous stable model, in case a rollback is required.
vii. Model evaluation metrics for every training and testing set.
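A sketch of the record such a store might keep per run; the field names are illustrative, mirroring the list above:

```python
# Sketch: one metadata record per pipeline execution, enabling
# reproducibility, comparisons, lineage, and debugging.

def record_run(pipeline_version, params, artifacts, metrics, start, end):
    return {
        "pipeline_version": pipeline_version,   # components + versions
        "start": start,                         # start timestamp (epoch s)
        "end": end,                             # end timestamp (epoch s)
        "duration_s": end - start,              # time taken by the run
        "params": params,                       # parameters passed in
        "artifact_paths": artifacts,            # e.g. prepared data path
        "metrics": metrics,                     # evaluation on test set
    }

run = record_run(
    pipeline_version="v1.3",
    params={"learning_rate": 0.01},
    artifacts={"prepared_data": "/data/prepared/2024-01-01"},
    metrics={"test_auc": 0.91},
    start=1_700_000_000.0,
    end=1_700_000_360.0,
)
print(run["duration_s"])   # 360.0
```

Tools like MLflow or ML Metadata (MLMD) persist essentially this structure per run.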
5. Pipeline Triggers
These define when to trigger the machine learning pipeline to train on new data under production conditions:
i. On demand: manual execution.
ii. Scheduled: when new data arrives at regular intervals.
iii. When new data becomes available.
iv. Model performance degradation: when you observe a noticeable drop in performance.
v. A significant change in the data distribution at the experimentation stage.
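Trigger iv can be sketched as a simple threshold check; the 0.05 tolerance is an assumed value:

```python
# Sketch of a performance-degradation trigger: fire the pipeline when the
# live metric drops noticeably below the baseline. Threshold is assumed.

def should_retrain(live_metric, baseline_metric, max_drop=0.05):
    """Trigger retraining when the performance drop exceeds max_drop."""
    return (baseline_metric - live_metric) > max_drop

print(should_retrain(0.80, 0.90))   # True: the 0.10 drop exceeds 0.05
print(should_retrain(0.88, 0.90))   # False: still within tolerance
```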
When to use the Level 1 workflow?
- When there is no frequent deployment of new pipeline implementations.
- When manual testing of a few pipelines is sufficient.
- When manual deployment of a new pipeline is acceptable: submit the tested code base for the pipeline to the IT team so it can deploy your ML pipeline into the target environment.
- Whenever a new model is created based on new data (not on new ML ideas).
Challenges
- Time consuming (more manual processes are involved).
- No frequent updates.
- Maintenance issues.
- Prone to bugs.
Solution: go for Continuous Integration and Continuous Delivery to automate the build/test/deployment of the ML pipeline.
Level 2 Workflow: ML Pipeline Automation With CI/CD
Aim:
- Rapid and reliable updates of the ML pipeline.
- Automating the build/test/deployment of the ML pipeline (this benefits the data science team, which can focus on research and development of new ideas).
- New ideas can be brought into production easily.
CI/CD Workflow
1. Dev or Experiment Stage:
Input: sample data from the feature store
Process: experimenting with new ML ideas
Output: source code of the ML pipeline
2. Continuous Integration Pipeline:
Input: output of the previous step
Process: testing
Output: packages / containers / executables
3. Continuous Delivery Pipeline:
Input: output of the previous step
Process: deployment of the packages in the target environment (production / pre-production)
Output: deployed ML pipeline with new ML ideas
4. Continuous Training (Automated Training):
Input: output of the previous step
Process: triggering training in the (live) production environment based on a triggering mechanism
Output: trained model and model registry
5. Model Continuous Delivery:
Input: output of the previous step
Process: pick the suitable model from the model registry and integrate it with the prediction service in the prediction pipeline
Output: working prediction service
6. Monitoring Stage:
Process: collect statistics on model performance on new data
Output:
- Trigger execution of the pipeline in production
- Trigger a new experiment cycle
Summary of Level 2 Workflow
- Adaptive in nature.
- Less involvement of the data science team in maintenance, thanks to automation.
- Less time consuming.
- More frequent updates.
- Scalability.
- Maintenance is easy.
7. Deployment Types
7.1. Automated:
You push the code to the development branch of your GitHub repository, after which CI/CD starts automatically. For this we will use GitHub Actions.
Github Actions Workflow
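A hypothetical workflow file such a setup might use; the file path, branch name, and script names are assumptions, not part of this post:

```yaml
# Hypothetical sketch of .github/workflows/ml-ci-cd.yml: run tests on
# every push to the development branch, then deploy the ML pipeline.
name: ml-ci-cd
on:
  push:
    branches: [development]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/               # CI: test code, data and model steps
      - run: python deploy_pipeline.py   # CD: deploy the ML pipeline (assumed script)
```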
7.2. Manual
Used after several successful iterations, once deployment to the pre-production environment is done.
8. Conclusion
In this blog post, we went through a thorough exploration of software development processes, with a specific emphasis on AI/ML system development, MLOps (Machine Learning Operations), and the role of CI/CD (Continuous Integration and Continuous Deployment). We also distinguished AIOps from MLOps and highlighted the differences between traditional software development and machine learning software development. We discussed the challenges associated with building and operating integrated ML systems, addressing the need for automation and team collaboration. Additionally, we outlined the generic steps involved in AI/ML, from data extraction to model monitoring. Finally, we presented a clear framework for understanding MLOps maturity levels, showcasing the progression from manual processes to advanced automation with CI/CD integration, and addressed deployment processes, comparing manual and automated approaches, to enrich readers' understanding of the end-to-end development lifecycle.