mlflow.pyspark.ml

mlflow.pyspark.ml.autolog(log_models=True, log_datasets=True, disable=False, exclusive=False, disable_for_unsupported_versions=False, silent=False, log_post_training_metrics=True, registered_model_name=None, log_input_examples=False, log_model_signatures=True, log_model_allowlist=None, extra_tags=None)[source]

Note

Autologging is known to be compatible with the following package versions: 3.1.2 <= pyspark <= 3.5.1. Autologging may not succeed when used with package versions outside of this range.

Enables (or disables) and configures autologging for pyspark ml estimators. This method is not threadsafe. This API requires Spark 3.0 or above.

When is autologging performed?

Autologging is performed when you call Estimator.fit except for estimators (featurizers) under pyspark.ml.feature.
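
A minimal sketch of enabling autologging before training (the toy DataFrame and column names below are illustrative):

    import mlflow.pyspark.ml
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # Enable autologging before calling fit().
    mlflow.pyspark.ml.autolog()

    # Toy training data; any DataFrame with "features" and "label" columns works.
    train_df = spark.createDataFrame(
        [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)],
        ["features", "label"],
    )

    # fit() on a non-featurizer estimator triggers autologging: params and tags are
    # recorded, and the fitted model is logged if its class is in the allowlist.
    lr = LogisticRegression(maxIter=5)
    model = lr.fit(train_df)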

Logged information
Parameters
  • Parameters obtained via estimator.params. If a param value is itself an Estimator, the params of the wrapped estimator are also logged; the nested param key is {estimator_uid}.{param_name}

Tags
  • An estimator class name (e.g. “LinearRegression”).

  • A fully qualified estimator class name (e.g. “pyspark.ml.regression.LinearRegression”).

Post training metrics

When users call evaluator APIs after model training, MLflow tries to capture the Evaluator.evaluate results and log them as MLflow metrics to the Run associated with the model. All pyspark ML evaluators are supported.

For post training metrics autologging, the metric key format is: “{metric_name}[-{call_index}]_{dataset_name}” (see the sketch after the list below)

  • The metric name is the name returned by Evaluator.getMetricName()

  • If multiple calls are made to the same pyspark ML evaluator metric, each subsequent call adds a “call_index” (starting from 2) to the metric key.

  • MLflow uses the prediction input dataset variable name as the “dataset_name” in the metric key. The “prediction input dataset variable” refers to the variable which was used as the dataset argument of model.transform call. Note: MLflow captures the “prediction input dataset” instance in the outermost call frame and fetches the variable name in the outermost call frame. If the “prediction input dataset” instance is an intermediate expression without a defined variable name, the dataset name is set to “unknown_dataset”. If multiple “prediction input dataset” instances have the same variable name, then subsequent ones will append an index (starting from 2) to the inspected dataset name.
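
Continuing the earlier sketch, the following shows how an evaluate() call maps to a metric key (variable names are illustrative):

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Reusing `model` and `train_df` from the sketch above; normally a held-out set.
    eval_df = train_df

    pred = model.transform(eval_df)

    # MLflow captures this evaluate() call and logs the result to the run that logged
    # `model`. With getMetricName() == "areaUnderROC" and the prediction input
    # variable named `eval_df`, the resulting metric key is "areaUnderROC_eval_df".
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    auc = evaluator.evaluate(pred)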

Limitations
  • MLflow cannot find run information for other objects derived from a given prediction result (e.g. by doing some transformation on the prediction result dataset).

Artifacts
  • An MLflow Model with the mlflow.spark flavor containing a fitted estimator (logged by mlflow.spark.log_model()). Note that large models may not be autologged for performance and storage space considerations, and autologging for Pipelines and hyperparameter tuning meta-estimators (e.g. CrossValidator) is not yet supported. See log_models param below for details.

  • For post training metrics API calls, a “metric_info.json” artifact is logged. This is a JSON object whose keys are MLflow post training metric names (see “Post training metrics” section for the key format) and whose values are the corresponding evaluator information, including evaluator class name and evaluator params.
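
As a sketch of inspecting this artifact (assuming an MLflow 2.x client and the autologged run from the earlier example):

    import mlflow

    # The run that autologging created for the most recent fit() call.
    run_id = mlflow.last_active_run().info.run_id

    # Download and inspect the autologged evaluator metadata.
    local_path = mlflow.artifacts.download_artifacts(
        run_id=run_id, artifact_path="metric_info.json"
    )
    print(open(local_path).read())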

How does autologging work for meta estimators?

When a meta estimator (e.g. Pipeline, CrossValidator, TrainValidationSplit, OneVsRest) calls fit(), it internally calls fit() on its child estimators. Autologging does NOT perform logging on these constituent fit() calls.

An “estimator_info.json” artifact is logged, containing a hierarchy entry that describes the structure of the meta estimator. The hierarchy includes expanded entries for all nested stages, such as nested pipeline stages.
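
For example, fitting a Pipeline produces a single autologged run whose “estimator_info.json” lists both stages (a sketch reusing the train_df DataFrame from the first example):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StandardScaler

    scaler = StandardScaler(inputCol="features", outputCol="scaled")
    lr = LogisticRegression(featuresCol="scaled", maxIter=5)
    pipeline = Pipeline(stages=[scaler, lr])

    # One run is created for Pipeline.fit(); the inner StandardScaler.fit() and
    # LogisticRegression.fit() calls are not logged as separate runs. The run's
    # "estimator_info.json" artifact describes the stage hierarchy.
    pipeline_model = pipeline.fit(train_df)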

Parameter search

In addition to recording the information discussed above, autologging for parameter search meta estimators (CrossValidator and TrainValidationSplit) records child runs with metrics for each set of explored parameters, as well as artifacts and parameters for the best model and the best parameters (if available). For better readability, the “estimatorParamMaps” param of the parameter search estimator is recorded inside the “estimator_info.json” artifact, as described next. In addition to the “hierarchy” entry, that artifact records two more items: “tuning_parameter_map_list”, a list of all parameter maps used in tuning, and “tuned_estimator_parameter_map”, the parameter map of the tuned estimator. A “best_parameters.json” artifact is also logged, containing the best parameters found by the search, and a “search_results.csv” artifact records the search results as a table with two columns: “params” and “metric”.
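
A sketch of a parameter search run (assuming train_df is a real dataset with enough rows per fold; the toy two-row frame from the first example is too small for cross-validation):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(maxIter=5)
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(),
        numFolds=2,
    )

    # The parent run records the best parameters plus the "best_parameters.json" and
    # "search_results.csv" artifacts; a child run with metrics is created for each
    # explored parameter map.
    cv_model = cv.fit(train_df)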

Parameters
  • log_models – If True, trained models that are in the allowlist are logged as MLflow model artifacts. If False, trained models are not logged. Note: the built-in allowlist excludes some models (e.g. ALS models) which can be large. To specify a custom allowlist, create a file containing a newline-delimited list of fully-qualified estimator classnames, and set the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark config to the path of your allowlist file.

  • log_datasets – If True, dataset information is logged to MLflow Tracking. If False, dataset information is not logged.

  • disable – If True, disables the pyspark ML autologging integration. If False, enables the pyspark ML autologging integration.

  • exclusive – If True, autologged content is not logged to user-created fluent runs. If False, autologged content is logged to the active fluent run, which may be user-created.

  • disable_for_unsupported_versions – If True, disable autologging for versions of pyspark that have not been tested against this version of the MLflow client or are incompatible.

  • silent – If True, suppress all event logs and warnings from MLflow during pyspark ML autologging. If False, show all events and warnings during pyspark ML autologging.

  • log_post_training_metrics – If True, post training metrics are logged. Defaults to True. See the post training metrics section for more details.

  • registered_model_name – If given, each time a model is trained, it is registered as a new model version of the registered model with this name. The registered model is created if it does not already exist.

  • log_input_examples – If True, input examples from training datasets are collected and logged along with pyspark ml model artifacts during training. If False, input examples are not logged.

  • log_model_signatures

    If True, ModelSignatures describing model inputs and outputs are collected and logged along with spark ml pipeline/estimator artifacts during training. If False, signatures are not logged.

    Warning

    Currently, only scalar Spark data types are supported. If model inputs/outputs contain non-scalar Spark data types such as pyspark.ml.linalg.Vector, signatures are not logged.

  • log_model_allowlist

    If given, it overrides the default log model allowlist in mlflow. This takes precedence over the spark configuration of “spark.mlflow.pysparkml.autolog.logModelAllowlistFile”.

    The default log model allowlist in mlflow
    # classification
    pyspark.ml.classification.LinearSVCModel
    pyspark.ml.classification.DecisionTreeClassificationModel
    pyspark.ml.classification.GBTClassificationModel
    pyspark.ml.classification.LogisticRegressionModel
    pyspark.ml.classification.RandomForestClassificationModel
    pyspark.ml.classification.NaiveBayesModel
    
    # clustering
    pyspark.ml.clustering.BisectingKMeansModel
    pyspark.ml.clustering.KMeansModel
    pyspark.ml.clustering.GaussianMixtureModel
    
    # Regression
    pyspark.ml.regression.AFTSurvivalRegressionModel
    pyspark.ml.regression.DecisionTreeRegressionModel
    pyspark.ml.regression.GBTRegressionModel
    pyspark.ml.regression.GeneralizedLinearRegressionModel
    pyspark.ml.regression.LinearRegressionModel
    pyspark.ml.regression.RandomForestRegressionModel
    
    # Featurizer model
    pyspark.ml.feature.BucketedRandomProjectionLSHModel
    pyspark.ml.feature.ChiSqSelectorModel
    pyspark.ml.feature.CountVectorizerModel
    pyspark.ml.feature.IDFModel
    pyspark.ml.feature.ImputerModel
    pyspark.ml.feature.MaxAbsScalerModel
    pyspark.ml.feature.MinHashLSHModel
    pyspark.ml.feature.MinMaxScalerModel
    pyspark.ml.feature.OneHotEncoderModel
    pyspark.ml.feature.RobustScalerModel
    pyspark.ml.feature.RFormulaModel
    pyspark.ml.feature.StandardScalerModel
    pyspark.ml.feature.StringIndexerModel
    pyspark.ml.feature.VarianceThresholdSelectorModel
    pyspark.ml.feature.VectorIndexerModel
    pyspark.ml.feature.UnivariateFeatureSelectorModel
    
    # composite model
    pyspark.ml.classification.OneVsRestModel
    
    # pipeline model
    pyspark.ml.pipeline.PipelineModel
    
    # Hyper-parameter tuning
    pyspark.ml.tuning.CrossValidatorModel
    pyspark.ml.tuning.TrainValidationSplitModel
    
    # SynapseML models
    synapse.ml.cognitive.*
    synapse.ml.exploratory.*
    synapse.ml.featurize.*
    synapse.ml.geospatial.*
    synapse.ml.image.*
    synapse.ml.io.*
    synapse.ml.isolationforest.*
    synapse.ml.lightgbm.*
    synapse.ml.nn.*
    synapse.ml.opencv.*
    synapse.ml.stages.*
    synapse.ml.vw.*
    

  • extra_tags – A dictionary of extra tags to set on each managed run created by autologging.
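
Putting several of these options together, a configuration sketch (the values are illustrative, not defaults; log_model_allowlist is assumed to accept a list of fully-qualified model class names mirroring the allowlist file format above):

    import mlflow.pyspark.ml

    mlflow.pyspark.ml.autolog(
        log_models=True,
        log_input_examples=True,
        log_model_signatures=True,
        # Overrides the built-in allowlist and the
        # "spark.mlflow.pysparkml.autolog.logModelAllowlistFile" Spark config.
        log_model_allowlist=["pyspark.ml.regression.LinearRegressionModel"],
        registered_model_name="my_spark_model",  # hypothetical registry name
        extra_tags={"team": "forecasting"},      # hypothetical tag
    )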