
Apache DolphinScheduler in MLOps: Create machine learning workflows quickly


Key takeaways:

  1. MLOps is a concept that enables data scientists and IT teams to collaborate and speed up model development and deployment by monitoring, validating, and managing machine learning models. 
  2. In practice, ML code may make up only a small part of the entire system; the other elements it requires are large and complex.
  3. Apache DolphinScheduler is adding a variety of machine learning-related task plugins to help data analysts and data scientists easily use DolphinScheduler.
  4. Although there are various types of MLOps systems, their core concepts are similar and can be roughly divided into four categories.
  5. In the future, Apache DolphinScheduler will divide the supported MLOps components into three modules, namely data management, modeling, and deployment.

MLOps, the operation of machine learning models, is a thoroughly studied concept among computer scientists. Think of it as DevOps for machine learning, a concept that enables data scientists and IT teams to collaborate and speed up model development and deployment by monitoring, validating, and managing machine learning models. MLOps expedites the whole process, from experimentation and development to deploying models in production and performing quality control for users.

In this article, I’ll discuss the following topics:

  • New functions of MLOps introduced in Apache DolphinScheduler
  • Machine learning tasks supported by Apache DolphinScheduler
  • The usage of Jupyter components and MLflow components
  • The Apache DolphinScheduler and MLOps integration plan

What is MLOps?


Figure 1. MLOps is the set of practices at the intersection of Machine Learning, DevOps, and Data Engineering.

MLOps is the DevOps of the machine learning era. Its main function is to connect the model construction team with the business, operation, and maintenance team, and establish a standardized model development, deployment, and operation process so that corporations can grow businesses using machine learning capabilities.
(refs: https://en.wikipedia.org/wiki/MLOps#cite_note-tds1-1)

In practice, ML code may make up only a small part of the entire system; the other elements it requires are large and complex.

Figure 2. MLOps and ML tools landscape (v.1 January 2021) 

Although there are various types of MLOps systems, their core concepts are similar and can be roughly divided into the following four categories:

  • data management
  • modeling
  • deployment
  • monitoring


DolphinScheduler is adding a variety of machine learning-related task plugins to help data analysts and data scientists use DolphinScheduler easily. The goals are as follows:

  • Supports scheduling and running ML tasks
  • Supports user’s training tasks using various frameworks
  • Supports scheduling and running mainstream MLOps
  • Provides out-of-the-box mainstream MLOps projects for users
  • Supports orchestrating various modules in building ML platforms
  • Applies different projects in different modules according to how the MLOps is matched with the task

ML Tasks Supported by Apache DolphinScheduler

Here is a current list of the supported task plugins:


Figure 3. The Current ML tasks supported by Apache DolphinScheduler

Jupyter Task Plugin

Jupyter Notebook is a web-based application for interactive computing. It can be applied to the whole process of computing: development, documentation, running code, and displaying results.

Papermill is a tool for parameterizing and executing Jupyter Notebooks.


Figure 4. Jupyter Task Plugin

MLflow Task Plugin

MLflow is an excellent open source MLOps project for managing the machine learning life cycle, including experimentation, reproducibility, deployment, and a central model registry.


Figure 5. MLflow Task Plugin

OpenMLDB Task Plugin

OpenMLDB is an excellent open source machine learning database, providing a full-stack FeatureOps solution for production.

The OpenMLDB task plugin is used to execute tasks on an OpenMLDB cluster.


Figure 7. OpenMLDB task plugin

Usage of Jupyter Components and MLflow Components

You can configure conda environment variables in common.properties and create a conda environment for executing Jupyter Notebook, as shown here:


Figure 8. Create a conda environment


To run a notebook with a Jupyter task:

1. Prepare a Jupyter Notebook

2. Use DolphinScheduler to create a Jupyter task

3. Run the workflow

The following is a notebook that trains a classification model using SVM on the iris dataset.

The notebook receives the following four parameters:

1. experiment_name: The experiment name recorded in the MLflow Service Center

2. C: SVM parameter

3. kernel: SVM parameter

4. model_name: The name of the model registered to the MLflow Model Center


Figure 9. Train model notebook
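Papermill injects runtime values by looking for a notebook cell tagged "parameters" and adding a new cell with the overrides right after it. The following is a minimal sketch, using only the Python standard library, of what such a notebook could look like on disk. The default value for experiment_name and the placeholder training cell are assumptions for illustration; the article's actual notebook is only shown as a figure.

```python
import json


def code_cell(source, tags=None):
    """Build one nbformat-4 code cell as a plain dictionary."""
    metadata = {"tags": tags} if tags else {}
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": metadata,
        "outputs": [],
        "source": source,
    }


def build_parameterized_notebook():
    """Return a minimal notebook whose first cell is tagged 'parameters'."""
    defaults = (
        'experiment_name = "iris_svm"  # hypothetical default\n'
        "C = 1.0\n"
        'kernel = "linear"\n'
        'model_name = "iris_model"\n'
    )
    return {
        "cells": [
            # Papermill overrides the values defined in this tagged cell.
            code_cell(defaults, tags=["parameters"]),
            # Placeholder for the real training code (not shown in the article).
            code_cell("# train the SVM here using C and kernel\n"),
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = build_parameterized_notebook()
notebook_json = json.dumps(notebook, indent=1)
```

Writing notebook_json to a file with the .ipynb extension gives a notebook whose four parameters can be overridden at run time.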

Drag the Jupyter component onto the canvas and create a task, as shown below.

The task will run the notebook /home/lucky/WhaleOps/jupyter/MLOps/training_iris_svm.ipynb and save the result to the following path:

/home/lucky/WhaleOps/jupyter/MLOps/training_iris_svm/03.ipynb

Set the runtime parameter C to “1.0” and the kernel parameter to “linear”. The conda environment used to run the notebook is the kernel “jupyter_test”.

Figure 10. The running conda environment is the kernel: “jupyter_test”
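To make the task settings concrete, here is a rough sketch of the equivalent Papermill command line, built in Python. The helper papermill_command is hypothetical, and the article does not show the exact command the Jupyter task plugin generates, so treat this only as an approximation. It also assumes that “jupyter_test” is registered as a Jupyter kernel name. The -p and -k options are Papermill's standard flags for a parameter and a kernel.

```python
import shlex


def papermill_command(input_path, output_path, parameters, kernel_name=None):
    """Build the argv list for a papermill run (hypothetical helper)."""
    argv = ["papermill", input_path, output_path]
    for name, value in parameters.items():
        # Each runtime parameter becomes one "-p NAME VALUE" triple.
        argv += ["-p", name, str(value)]
    if kernel_name is not None:
        # "-k" selects the Jupyter kernel used to execute the notebook.
        argv += ["-k", kernel_name]
    return argv


argv = papermill_command(
    "/home/lucky/WhaleOps/jupyter/MLOps/training_iris_svm.ipynb",
    "/home/lucky/WhaleOps/jupyter/MLOps/training_iris_svm/03.ipynb",
    {"C": 1.0, "kernel": "linear"},
    kernel_name="jupyter_test",
)

# Shell-quoted form, suitable for pasting into a terminal.
command_line = shlex.join(argv)
```

Changing only the parameters dictionary yields a different run of the same notebook, which is the idea behind replicating a task with different parameters.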


We can replicate two more identical tasks with different parameters. So we get three Jupyter tasks with different parameters, as follows:

Figure 11. 3 Jupyter tasks with different parameters


After the creation is complete, we can see our new workflow in the workflow definition (this workflow contains 3 Jupyter tasks). Once the workflow is running, you can click the task instance to check how each task is executed, and view the log of each task.

Figure 12. View the log of each task


Usage scenarios

  • Data exploration and analysis
  • Training models
  • Regular online data monitoring

MLflow Component Usage

We can use an MLflow task to train a model by following these steps:

1. Prepare a dataset

2. Create MLflow training tasks with DolphinScheduler

3. Run the workflow

An example of creating a workflow is as follows, including two MLflow tasks:

Figure 13. An example of creating a workflow


Task 1: Use SVM to train the iris classification model and set the following parameters. The hyperparameter search space is used for hyperparameter tuning; if it is left empty, no hyperparameter search is performed.

Figure 14. Set parameters
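As an illustration of what a hyperparameter search space means, the sketch below expands a space into the individual candidate settings a search would try. The search values are hypothetical, and exhaustive grid expansion is an assumption: the article does not say which search strategy or input format the MLflow task uses.

```python
from itertools import product


def expand_search_space(space):
    """Expand {name: [values]} into a list of {name: value} candidates."""
    names = list(space)
    value_lists = [space[name] for name in names]
    # product() yields every combination; an empty space yields one empty
    # candidate, which corresponds to a single run with default settings.
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical search space for the SVM parameters named in the article.
search_space = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
candidates = expand_search_space(search_space)

# No search space: exactly one candidate with no overrides.
no_search = expand_search_space({})
```

Here candidates holds six settings (three values of C times two kernels), while no_search holds a single empty setting, mirroring the case where the field is left blank.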


Task 2: Use the AutoML method to train the model. Use flaml as the AutoML tool, set the search time to 60 seconds, and allow only lgbm and xgboost as estimators.

Figure 15. Details of the executed task instances
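For readers who want to see what the Task 2 settings correspond to in flaml itself, here is a minimal sketch. The helper names are hypothetical, and this is not necessarily how the MLflow task plugin calls flaml internally. The keyword arguments time_budget and estimator_list are parameters of flaml's AutoML.fit; the "classification" task type is an assumption based on the iris example.

```python
def automl_settings(time_budget_s=60, estimators=("lgbm", "xgboost")):
    """Collect the AutoML settings described for Task 2 (hypothetical helper)."""
    return {
        "task": "classification",            # assumed from the iris example
        "time_budget": time_budget_s,        # total search time in seconds
        "estimator_list": list(estimators),  # restrict the learners tried
    }


def train_with_flaml(X_train, y_train, settings):
    """Run the AutoML search; requires the third-party flaml package."""
    from flaml import AutoML  # imported lazily so the sketch loads without flaml

    automl = AutoML()
    automl.fit(X_train=X_train, y_train=y_train, **settings)
    return automl


settings = automl_settings()
```

Calling train_with_flaml with the iris features and labels would then search for up to 60 seconds over the two allowed estimators.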

Deploy tasks with MLflow

1. Select the model version to deploy

2. Use DolphinScheduler to create an MLflow deployment task

3. Test the interface with a simple request

As mentioned above, we have registered some models in the MLflow Model Center. We can open 127.0.0.1:5000 to see the model versions.

Create a task for MLflow Models. Specify that the model is iris_model (production version), and set the listening port to 7000.

Figure 16. Determine the model URI and listening port

Figure 17. Specific running mechanism

Figure 18. Test the Customizable running results
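A simple interface test can be sketched with the Python standard library, assuming the model is served by MLflow's built-in model server, which exposes a POST /invocations endpoint. The helper name and the sample iris row are hypothetical. The JSON layout is also an assumption: the "inputs" key follows MLflow 2.x conventions, and older MLflow versions expect a different payload. The model URI uses MLflow's models:/name/stage form for the production version.

```python
import json
from urllib.request import Request

# Registry URI for the production version of iris_model.
MODEL_URI = "models:/iris_model/Production"


def build_invocation_request(rows, host="127.0.0.1", port=7000):
    """Build (but do not send) a scoring request for the deployed model."""
    body = json.dumps({"inputs": rows}).encode("utf-8")
    return Request(
        f"http://{host}:{port}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One hypothetical iris sample: sepal length/width, petal length/width.
request = build_invocation_request([[5.1, 3.5, 1.4, 0.2]])

# To send it against a running deployment:
#   from urllib.request import urlopen
#   print(urlopen(request).read().decode("utf-8"))
```

The request is only constructed here, so the sketch runs without a live server; uncomment the last lines once the deployment task is up.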



To deploy a model automatically after training, use the workflows we created above (model training with Jupyter and model deployment with MLflow) as sub-workflows, and connect them to form a new workflow.
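Conceptually, connecting the two sub-workflows just declares that deployment depends on training. The sketch below illustrates that ordering with Python's standard graphlib module. The node names are hypothetical, and this is not DolphinScheduler's API, only an illustration of the dependency.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on.
# "deploy_model" must wait until "train_model" has finished.
dependencies = {"deploy_model": {"train_model"}}

# static_order() returns the nodes in an order that respects dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

With this single edge, execution_order is train_model followed by deploy_model.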

Apache DolphinScheduler and MLOps Integration Plan




Figure 19. MLOps landscape of Apache DolphinScheduler


The figure above shows the landscape of machine learning-related tools and platforms. Apache DolphinScheduler will selectively support those that are widely used and offer high value.

In the future, Apache DolphinScheduler will divide the supported MLOps components into three modules, namely data management, modeling, and deployment. The main components involved are DVC (Data Version Control) for data management, Kubeflow integration for modeling, and deployment tools such as Seldon Core, BentoML, and Kubeflow to suit the needs of different scenarios.

How to integrate more tools so that Apache DolphinScheduler can better serve users is a question we will keep working on over the long run. We welcome more partners who are interested in MLOps or open source to join us in this effort.

About the Author

Name: Zhou Jieguang

Bio: Senior Algorithm Engineer at WhaleOps, Apache DolphinScheduler Committer, with 5+ years of working experience in NLP.


GitHub: https://github.com/apache/dolphinscheduler

Official Website: https://dolphinscheduler.apache.org/

Mailing List: dev@dolphinscheduler.apache.org

Slack: https://s.apache.org/dolphinscheduler-slack
