Microsoft Machine Learning for Apache Spark

Microsoft Machine Learning for Apache Spark (MMLSpark) provides deep learning and data science tools for Apache Spark. It integrates Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, so you can quickly build powerful, highly scalable predictive and analytical models for large image and text datasets.

MMLSpark requires Scala 2.11+, Spark 2.1+, and either Python 2.7 or Python 3.5+.
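The Python part of this requirement can be checked before starting a session. The snippet below is a minimal sketch of that check; the helper name `python_version_supported` is hypothetical and not part of MMLSpark, and the Scala and Spark versions still need to be verified separately.

```python
import sys


def python_version_supported(version_info=None):
    """Return True for Python 2.7.x or Python 3.5 and later.

    Hypothetical helper, not part of MMLSpark. It only encodes the
    Python requirement stated above (2.7 or 3.5+).
    """
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if major == 2:
        # On the 2.x line, only 2.7 is listed as supported.
        return minor == 7
    # Otherwise require at least 3.5.
    return (major, minor) >= (3, 5)


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
```

Passing a tuple such as `(3, 4, 0)` instead of relying on `sys.version_info` makes the helper easy to exercise for versions other than the one currently running.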


  • Easily ingest images from HDFS into Spark (example 301)
  • Pre-process image data using transforms from OpenCV (example 302)
  • Featurize images with pre-trained deep neural nets using CNTK (example 301)
  • Train DNN-based image classification models on N-Series GPU VMs on Azure (example 301)
  • Featurize free-form text data with a single transformer that wraps SparkML primitives in a convenient API (example 201)
  • Train classification and regression models easily via implicit featurization of data (example 101)
  • Compute a rich set of evaluation metrics, including per-instance metrics (example 102)

See our notebooks for all examples.
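As a rough illustration of the training and evaluation features listed above, the sketch below trains a classifier on a raw DataFrame and computes aggregate metrics. It rests on assumptions: that the `mmlspark` Python package exposes `TrainClassifier` and `ComputeModelStatistics` at the top level (the import path may differ between releases), and that `train_df`, `test_df`, and `label_col` are supplied by the caller. The wrapper function `train_and_evaluate` is hypothetical; the notebooks remain the authoritative reference.

```python
def train_and_evaluate(train_df, test_df, label_col):
    """Train a classifier with implicit featurization and return metrics.

    Hypothetical wrapper for illustration only. train_df and test_df are
    Spark DataFrames; label_col names the label column.
    """
    # Imports are deferred so this file loads even without Spark installed.
    from pyspark.ml.classification import LogisticRegression
    # Assumption: top-level exports; the module path may vary by release.
    from mmlspark import TrainClassifier, ComputeModelStatistics

    # TrainClassifier featurizes the non-label columns before fitting
    # the wrapped SparkML learner.
    model = TrainClassifier(
        model=LogisticRegression(), labelCol=label_col
    ).fit(train_df)

    # Score the held-out data, then compute evaluation metrics on it.
    scored = model.transform(test_df)
    return ComputeModelStatistics().transform(scored)
```

The wrapped learner is passed in as an ordinary SparkML estimator, so swapping `LogisticRegression` for another SparkML classifier should not require changes to the data preparation.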