Deep Learning With Databricks
Srijith Rajamohan, Ph.D.
John O'Dwyer
Databricks
Open
▪ Unify your data ecosystem with open source, standards and formats
▪ 30+ million monthly downloads
[Slide graphic labels: Notebooks, Datasets, Data Engineers]
Questions for Scalable ML
▪ Track the provenance and reason for model creation
▪ What training data was used, if any?
▪ Proprietary data, sensitive data, storage, data retention period?
▪ Real-time or batch?
▪ How are the models being used, and who is using them?
▪ Are they for exploratory analysis or a production environment?
▪ Is model performance being measured regularly and is the model being updated?
▪ Is the model well documented to ensure reuse?
▪ Is the model deployment process being automated?
▪ Institutional adoption and support
Best Practices for ML
▪ Software engineering practices
▪ Code quality best practices
▪ Validate your data
▪ Ensure proper data types and formats are fed to your model (schema validation; see the sketch after this list)
▪ Check for data drift, which can render a supervised model ineffective
▪ Version and track your experiments like code!
▪ Changes to hyperparameters, inputs, code, etc.
▪ Monitor predictive performance over time
▪ Ensure model performance does not degrade over time
▪ Ensure model fairness across different classes of data (bias)
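A minimal sketch of the schema-validation point above, assuming a Databricks notebook where spark is predefined; the Delta path and feature columns are hypothetical:

from pyspark.sql.types import StructType, StructField, DoubleType

# Expected input schema for the model (hypothetical feature columns)
expected_schema = StructType([
    StructField("alcohol", DoubleType(), nullable=False),
    StructField("chlorides", DoubleType(), nullable=False),
    StructField("citric_acid", DoubleType(), nullable=False),
])

df = spark.read.format("delta").load("/mnt/data/wine_features")

# Fail fast if the incoming data does not match what the model expects
assert df.schema == expected_schema, f"Schema mismatch: {df.schema}"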
What is MLOps?
Build -> Test -> Deploy -> Monitor -> Feedback -> Build
Model management
Databricks Ecosystem for ML/DL
▪ Integrated Environment
▪ Use compute instances from AWS, Azure or GCP
▪ Centered around a notebook environment
▪ Version control notebooks with GitHub
▪ Integrated DBFS filesystem that can mount cloud object stores like S3 (see the mount sketch after this list)
▪ Mix SQL, Python, R and Bash in the same notebook
▪ Schedule jobs to run anytime
▪ Databricks Runtimes (DBRs)
▪ Preinstalled with packages for ML/DL
▪ Additional packages can be installed per cluster or per notebook
▪ MLflow integrated into the Databricks platform
▪ Model tracking for experiment management/reproducibility
▪ MLflow projects for packaging an experiment
▪ Model serving with MLflow
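A minimal sketch of the DBFS mount mentioned in the list above, assuming a Databricks notebook where dbutils is predefined; the bucket and mount point are hypothetical:

# Mount an S3 bucket onto DBFS so it reads like any other DBFS path
dbutils.fs.mount(
    source="s3a://my-example-bucket",        # hypothetical bucket
    mount_point="/mnt/my-example-bucket",
)

# List the mounted bucket's contents
display(dbutils.fs.ls("/mnt/my-example-bucket"))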
Workspace
Notebooks
Job scheduling
Job page
Experiments
Registered models
The Data Preparation
The Delta Lake Architecture
Data Store and Versioning
Delta Lake
▪ Scalable metadata
▪ Time travel
▪ Open format
▪ Unified batch and streaming
▪ Schema enforcement

Feature Store
▪ Data stored needs to be transformed into features to be useful
▪ Feature tables are Delta tables
▪ Feature Stores can save these features (see the sketch after this list)
▪ Discoverable and reusable across an organization
▪ Ensures consistency for Data Engineers, Data Scientists and ML Engineers
▪ Track feature lineage in a model
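A minimal sketch of saving features with the Databricks Feature Store client, assuming a Databricks ML runtime; the table name, key column and path are hypothetical:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Features computed during ETL, stored as a Delta table
features_df = spark.read.format("delta").load("/mnt/data/wine_features")

fs.create_table(
    name="ml.wine_features",         # hypothetical feature table name
    primary_keys=["wine_id"],        # hypothetical primary key column
    df=features_df,
    description="Wine quality features",
)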
ETL and EDA
▪ Delta Lake
▪ Save data in scalable file formats like Parquet
▪ The Delta file format lets you version control your data (see the time-travel sketch after this list)
▪ ETL
▪ Read data
▪ PySpark - Ideal for large data
▪ TensorFlow (tf.data) and PyTorch (DataLoader)
▪ Clean and process data
▪ PySpark/Pandas API on Spark can work with large datasets across clusters
▪ Extract features and save them using Feature Stores
▪ EDA
▪ Preliminary data analysis such as inspecting records, summary statistics
▪ Visualize the data and its distribution
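A minimal sketch of Delta Lake time travel and preliminary EDA in PySpark, assuming a Databricks notebook where spark is predefined; the table path is hypothetical:

path = "/mnt/data/wine_features"

# Read the current version of the Delta table
df = spark.read.format("delta").load(path)

# Time travel: read the table as it was at an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Preliminary EDA: inspect records and summary statistics
df.show(5)
df.describe().show()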
The Model Build
Model training
▪ DBRs provide your favorite DL frameworks such as TensorFlow, PyTorch, Keras, etc.
▪ Integration with MLflow for model tracking
▪ Hyperparameter tuning with Hyperopt/Optuna (see the Hyperopt sketch after this list)
▪ Seamlessly run single-node but multi-CPU/multi-GPU jobs
▪ Distributed training on multiple nodes with Horovod
▪ NVLink/NCCL-enabled instances available for accelerating DL workloads
▪ Tightly coupled: train directly on Spark DataFrames with the Horovod Estimator
▪ Train on distributed Spark clusters with HorovodRunner
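A minimal sketch of Hyperopt tuning with SparkTrials, which distributes trials across the cluster; the objective below is a toy stand-in for real model training:

from hyperopt import fmin, tpe, hp, SparkTrials

def objective(lr):
    # Train a model with learning rate `lr` and return the validation loss;
    # replaced here by a toy quadratic minimized at lr = 0.01
    return (lr - 0.01) ** 2

best = fmin(
    fn=objective,
    space=hp.loguniform("lr", -7, 0),   # learning rates from ~1e-3 to 1
    algo=tpe.suggest,
    max_evals=20,
    trials=SparkTrials(parallelism=4),
)
print(best)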
Distributed Training with Spark/Horovod
Distributed Training with Spark/Horovod contd...
Invoke training across multiple nodes
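A minimal sketch of that invocation with HorovodRunner, assuming a Databricks ML runtime; train is a hypothetical per-worker training function:

import horovod.torch as hvd
from sparkdl import HorovodRunner

def train():
    hvd.init()
    # Each worker trains on its data shard; rank 0 typically logs/checkpoints
    print(f"worker {hvd.rank()} of {hvd.size()}")
    # ... build the model, wrap the optimizer with hvd.DistributedOptimizer, train ...

# np=2 requests two workers; np=-1 would run locally on the driver for debugging
hr = HorovodRunner(np=2)
hr.run(train)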
▪ Quantization-aware training
▪ Lower-precision training to minimize memory/compute requirements
▪ Federated learning
▪ Decentralized learning with the Federated Averaging algorithm (Google)
▪ Keep data on device
▪ The model is updated with on-device data and the updates are sent back to a central server
▪ Updates from all devices are averaged (see the FedAvg sketch after this list)
▪ Privacy-preserving learning
▪ Learn from data that is encrypted or with minimal exposure to the data
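A minimal sketch of the Federated Averaging aggregation step described above, assuming each client reports its updated weights and local sample count; all names are illustrative:

import numpy as np

def federated_average(client_weights, client_sizes):
    # Weighted average of client model weights by local dataset size
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients with different amounts of local data
weights = [np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.2, 0.4])]
sizes = [100, 50, 150]
print(federated_average(weights, sizes))   # new global model weights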
Model tracking with MLflow
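A minimal sketch of the tracking workflow this slide shows, assuming it runs in a Databricks notebook; parameter and metric names are illustrative:

import mlflow

with mlflow.start_run(run_name="wine-quality-dl"):
    mlflow.log_param("lr", 0.01)
    mlflow.log_param("epochs", 10)
    # ... train the model ...
    mlflow.log_metric("val_loss", 0.42)
    # For TensorFlow/Keras models, mlflow.tensorflow.autolog() captures
    # parameters, metrics and the model automatically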
Send a request
curl -X POST -H "Content-Type: application/json; format=pandas-split" \
  --data '{"columns": ["alcohol", "chlorides", "citric acid"], "data": [[12.8, 0.029, 0.48]]}' \
  http://127.0.0.1:1234/invocations
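The same request from Python, as a sketch assuming the model is served locally on port 1234 (for example via mlflow models serve):

import requests

# Score one record against the locally served MLflow model
response = requests.post(
    "http://127.0.0.1:1234/invocations",
    headers={"Content-Type": "application/json; format=pandas-split"},
    json={"columns": ["alcohol", "chlorides", "citric acid"],
          "data": [[12.8, 0.029, 0.48]]},
)
print(response.json())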
Thank you!