mlflow.johnsnowlabs

The mlflow.johnsnowlabs module provides an API for logging and loading Spark NLP and NLU models. This module exports the following flavors:

Johnsnowlabs (native) format

Allows models to be loaded as Spark Transformers for scoring in a Spark session. Models with this flavor can be loaded as NLUPipeline objects, with an underlying Spark MLlib PipelineModel. This is the main flavor and is always produced.

mlflow.pyfunc

Supports deployment outside of Spark by instantiating a SparkContext and reading input data as a Spark DataFrame prior to scoring. Also supports deployment in Spark as a Spark UDF. Models with this flavor can be loaded as Python functions for performing inference. This flavor is always produced.

This flavor gives you access to 20,000+ state-of-the-art enterprise NLP models in 200+ languages for medical, finance, legal and many more domains. Features include LLMs, Text Summarization, Question Answering, Named Entity Recognition, Relation Extraction, Sentiment Analysis, Spell Checking, Image Classification, Automatic Speech Recognition and much more, powered by the latest Transformer architectures. The models are provided by John Snow Labs and require a John Snow Labs Enterprise NLP License. You can reach out to us for a research or industry license.

These keys must be present in your license json:

  1. SECRET: The secret for the John Snow Labs Enterprise NLP Library

  2. SPARK_NLP_LICENSE: Your John Snow Labs Enterprise NLP License

  3. AWS_ACCESS_KEY_ID: Your AWS access key ID for accessing John Snow Labs Enterprise Models

  4. AWS_SECRET_ACCESS_KEY: Your AWS Secret key for accessing John Snow Labs Enterprise Models

You can set them using the following code:

import os
import json

# Write your raw license.json string into the 'JOHNSNOWLABS_LICENSE_JSON' env variable
creds = {
    "AWS_ACCESS_KEY_ID": "...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "SPARK_NLP_LICENSE": "...",
    "SECRET": "...",
}
os.environ["JOHNSNOWLABS_LICENSE_JSON"] = json.dumps(creds)
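It can help to check that the license JSON contains all four keys listed above before exporting it. The following is a minimal sketch and not part of the mlflow.johnsnowlabs API: missing_license_keys and set_license_env are hypothetical helper names, and treating an empty value as missing is an assumption.

```python
import json
import os

# The four keys this section lists as required in the license JSON.
REQUIRED_LICENSE_KEYS = (
    "SECRET",
    "SPARK_NLP_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_license_keys(license_json: str) -> list:
    """Return the required keys that are absent or empty in a license JSON string."""
    creds = json.loads(license_json)
    return [key for key in REQUIRED_LICENSE_KEYS if not creds.get(key)]


def set_license_env(license_json: str) -> None:
    """Hypothetical helper: export the license only if every required key is present."""
    missing = missing_license_keys(license_json)
    if missing:
        raise ValueError(f"License JSON is missing keys: {', '.join(missing)}")
    os.environ["JOHNSNOWLABS_LICENSE_JSON"] = license_json
```

Calling set_license_env(json.dumps(creds)) in place of the direct os.environ assignment would then fail early on an incomplete license.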
mlflow.johnsnowlabs.get_default_conda_env()[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Returns

The default Conda environment for MLflow Models produced by calls to save_model() and log_model().

mlflow.johnsnowlabs.get_default_pip_requirements()[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Returns

A list of default pip requirements for MLflow Models produced by this flavor. Calls to save_model() and log_model() produce a pip environment that, at minimum, contains these requirements.

mlflow.johnsnowlabs.load_model(model_uri, dfs_tmpdir=None, dst_path=None, **kwargs)[source]

Load the Johnsnowlabs MLflow model from the path.

Parameters
  • model_uri

    The location, in URI format, of the MLflow model. For example:

    • /Users/me/path/to/local/model

    • relative/path/to/local/model

    • s3://my_bucket/path/to/model

    • runs:/<mlflow_run_id>/run-relative/path/to/model

    • models:/<model_name>/<model_version>

    • models:/<model_name>/<stage>

    For more information about supported URI schemes, see Referencing Artifacts.

  • dfs_tmpdir – Temporary directory path on Distributed (Hadoop) File System (DFS) or local filesystem if running in local mode. The model is loaded from this destination. Defaults to /tmp/mlflow.

  • dst_path – The local filesystem path to which to download the model artifact. This directory must already exist. If unspecified, a local output path will be created.

Returns

An nlu.NLUPipeline.

Example
import os
import json
import mlflow
from johnsnowlabs import nlp

# Write your raw license.json string into the 'JOHNSNOWLABS_LICENSE_JSON' env variable
creds = {
    "AWS_ACCESS_KEY_ID": "...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "SPARK_NLP_LICENSE": "...",
    "SECRET": "...",
}
os.environ["JOHNSNOWLABS_LICENSE_JSON"] = json.dumps(creds)

# Start a Spark session
nlp.start()
# Load your MLflow model
model = mlflow.johnsnowlabs.load_model("johnsnowlabs_model")

# Make predictions on test documents
# supports datatypes defined in https://nlp.johnsnowlabs.com/docs/en/jsl/predict_api#supported-data-types
prediction = model.predict(["I love Covid", "I hate Covid"])
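The runs:/ and models:/ URI forms listed under model_uri can be assembled from their parts before calling load_model(). The following is a small sketch using plain string formatting; run_model_uri and registry_model_uri are hypothetical helper names, not MLflow functions.

```python
def run_model_uri(run_id: str, artifact_path: str) -> str:
    """Build a runs:/<mlflow_run_id>/<run-relative path> URI."""
    # Strip a leading slash so the path stays relative to the run.
    return f"runs:/{run_id}/{artifact_path.lstrip('/')}"


def registry_model_uri(model_name: str, version_or_stage) -> str:
    """Build a models:/<model_name>/<model_version or stage> URI."""
    return f"models:/{model_name}/{version_or_stage}"
```

The resulting string would be passed as the model_uri argument, for example load_model(run_model_uri(run_id, "my_trained_model")).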
mlflow.johnsnowlabs.log_model(spark_model, artifact_path, conda_env=None, code_paths=None, dfs_tmpdir=None, sample_input=None, registered_model_name=None, signature: mlflow.models.signature.ModelSignature = None, input_example: Union[pandas.core.frame.DataFrame, numpy.ndarray, dict, list, csr_matrix, csc_matrix, str, bytes, tuple] = None, await_registration_for=300, pip_requirements=None, extra_pip_requirements=None, metadata=None, store_license=False)[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Log a Johnsnowlabs NLUPipeline created via nlp.load() as an MLflow artifact for the current run. This uses the MLlib persistence format and produces an MLflow Model with the johnsnowlabs flavor.

Note: If no run is active, it will instantiate a run to obtain a run_id.

Parameters
  • spark_model

    NLUPipeline obtained via nlp.load()

  • store_license – If True, the license will be stored with the model and used when re-loading it.

  • artifact_path – Run relative artifact path.

  • conda_env

    Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file. If provided, this describes the environment this model should be run in. At minimum, it should specify the dependencies contained in get_default_conda_env(). If None, the default get_default_conda_env() environment is added to the model. The following is an example dictionary representation of a Conda environment:

    {
        'name': 'mlflow-env',
        'channels': ['defaults'],
        'dependencies': [
            'python=3.8.15',
            'johnsnowlabs'
        ]
    }
    

  • code_paths

    A list of local filesystem paths to Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded. Files declared as dependencies for a given model should have relative imports declared from a common root path if multiple files are defined with import dependencies between them to avoid import errors when loading the model.

    You can leave code_paths argument unset but set infer_code_paths to True to let MLflow infer the model code paths. See infer_code_paths argument doc for details.

    For a detailed explanation of code_paths functionality, recommended usage patterns and limitations, see the code_paths usage guide.

  • dfs_tmpdir – Temporary directory path on Distributed (Hadoop) File System (DFS) or local filesystem if running in local mode. The model is written in this destination and then copied into the model’s artifact directory. This is necessary as Spark ML models read from and write to DFS if running on a cluster. If this operation completes successfully, all temporary files created on the DFS are removed. Defaults to /tmp/mlflow.

  • sample_input – A sample input used to add the MLeap flavor to the model. This must be a PySpark DataFrame that the model can evaluate. If sample_input is None, the MLeap flavor is not added.

  • registered_model_name – If given, create a model version under registered_model_name, also creating a registered model if one with the given name does not exist.

  • signature

    ModelSignature describes model input and output Schema. The model signature can be inferred from datasets with valid model input (e.g. the training dataset with target column omitted) and valid model output (e.g. model predictions generated on the training dataset), for example:

    from mlflow.models.signature import infer_signature
    
    train = df.drop(columns=["target_label"])  # assumes df is a pandas DataFrame
    predictions = ...  # compute model predictions
    signature = infer_signature(train, predictions)
    

  • input_example – One or several instances of valid model input. The input example is used as a hint of what data to feed the model. It is either converted to a Pandas DataFrame and serialized to JSON using the Pandas split-oriented format, or, for a numpy array, serialized to JSON by converting it to a list. Bytes are base64-encoded. When the signature parameter is None, the input example is used to infer a model signature.

  • await_registration_for – Number of seconds to wait for the model version to finish being created and reach READY status. By default, the function waits for five minutes. Specify 0 or None to skip waiting.

  • pip_requirements – Either an iterable of pip requirement strings (e.g. ["johnsnowlabs", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes the environment this model should be run in. If None, a default list of requirements is inferred by mlflow.models.infer_pip_requirements() from the current software environment. If the requirement inference fails, it falls back to using get_default_pip_requirements(). Both requirements and constraints are automatically parsed and written to requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model’s conda environment (conda.yaml) file.

  • extra_pip_requirements

    Either an iterable of pip requirement strings (e.g. ["pandas", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes additional pip requirements that are appended to a default set of pip requirements generated automatically based on the user’s current software environment. Both requirements and constraints are automatically parsed and written to requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model’s conda environment (conda.yaml) file.

    Warning

    The following arguments can’t be specified at the same time:

    • conda_env

    • pip_requirements

    • extra_pip_requirements

    This example demonstrates how to specify pip requirements using pip_requirements and extra_pip_requirements.

  • metadata – Custom metadata dictionary passed to the model and stored in the MLmodel file.

Returns

A ModelInfo instance that contains the metadata of the logged model.

Example
import os
import json
import pandas as pd
import mlflow
from johnsnowlabs import nlp

# Write your raw license.json string into the 'JOHNSNOWLABS_LICENSE_JSON' env variable
creds = {
    "AWS_ACCESS_KEY_ID": "...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "SPARK_NLP_LICENSE": "...",
    "SECRET": "...",
}
os.environ["JOHNSNOWLABS_LICENSE_JSON"] = json.dumps(creds)

# Download & Install Jars/Wheels if missing and Start a spark Session
nlp.start()

# For more details on trainable models and parameterization like embedding choice see
# https://nlp.johnsnowlabs.com/docs/en/jsl/training
trainable_classifier = nlp.load("train.classifier")

# Create a sample training dataset
data = pd.DataFrame(
    {"text": ["I hate covid ", "I love covid"], "y": ["negative", "positive"]}
)

# Fit and get a trained classifier
trained_classifier = trainable_classifier.fit(data)
trained_classifier.predict("He hates covid")

# Log it
mlflow.johnsnowlabs.log_model(trained_classifier, "my_trained_model")
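The warning under extra_pip_requirements says that conda_env, pip_requirements and extra_pip_requirements cannot be specified at the same time. The sketch below reads that as "at most one of the three may be passed", which is an assumption about the rule; check_env_args is a hypothetical helper and not MLflow's own validation code.

```python
def check_env_args(conda_env=None, pip_requirements=None, extra_pip_requirements=None):
    """Return the name of the single environment argument in use, or None.

    Raises ValueError when more than one of the three arguments is given.
    """
    candidates = (
        ("conda_env", conda_env),
        ("pip_requirements", pip_requirements),
        ("extra_pip_requirements", extra_pip_requirements),
    )
    given = [name for name, value in candidates if value is not None]
    if len(given) > 1:
        raise ValueError(f"Specify only one of: {', '.join(given)}")
    return given[0] if given else None
```

A caller could run this check on its own arguments before forwarding them to log_model() or save_model().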
mlflow.johnsnowlabs.save_model(spark_model, path, mlflow_model=None, conda_env=None, code_paths=None, dfs_tmpdir=None, sample_input=None, signature: mlflow.models.signature.ModelSignature = None, input_example: Union[pandas.core.frame.DataFrame, numpy.ndarray, dict, list, csr_matrix, csc_matrix, str, bytes, tuple] = None, pip_requirements=None, extra_pip_requirements=None, metadata=None, store_license=False)[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Save a Spark johnsnowlabs Model to a local path.

By default, this function saves models using the Spark MLlib persistence mechanism. Additionally, if a sample input is specified using the sample_input parameter, the model is also serialized in MLeap format and the MLeap flavor is added.

Parameters
  • store_license – If True, the license will be stored with the model and used when re-loading it.

  • spark_model – Either a pyspark.ml.pipeline.PipelineModel or nlu.NLUPipeline object to be saved. Every johnsnowlabs model is a PipelineModel and loadable as nlu.NLUPipeline.

  • path – Local path where the model is to be saved.

  • mlflow_model – MLflow model config this flavor is being added to.

  • conda_env

    Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file. If provided, this describes the environment this model should be run in. At minimum, it should specify the dependencies contained in get_default_conda_env(). If None, the default get_default_conda_env() environment is added to the model. The following is an example dictionary representation of a Conda environment:

    {
        'name': 'mlflow-env',
        'channels': ['defaults'],
        'dependencies': [
            'python=3.8.15',
            'johnsnowlabs'
        ]
    }
    

  • code_paths

    A list of local filesystem paths to Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded. Files declared as dependencies for a given model should have relative imports declared from a common root path if multiple files are defined with import dependencies between them to avoid import errors when loading the model.

    You can leave code_paths argument unset but set infer_code_paths to True to let MLflow infer the model code paths. See infer_code_paths argument doc for details.

    For a detailed explanation of code_paths functionality, recommended usage patterns and limitations, see the code_paths usage guide.

  • dfs_tmpdir – Temporary directory path on Distributed (Hadoop) File System (DFS) or local filesystem if running in local mode. The model is written in this destination and then copied to the requested local path. This is necessary as Spark ML models read from and write to DFS if running on a cluster. All temporary files created on the DFS are removed if this operation completes successfully. Defaults to /tmp/mlflow.

  • sample_input – A sample input that is used to add the MLeap flavor to the model. This must be a PySpark DataFrame that the model can evaluate. If sample_input is None, the MLeap flavor is not added.

  • signature

    ModelSignature describes model input and output Schema. The model signature can be inferred from datasets with valid model input (e.g. the training dataset with target column omitted) and valid model output (e.g. model predictions generated on the training dataset), for example:

    from mlflow.models.signature import infer_signature
    
    train = df.drop(columns=["target_label"])  # assumes df is a pandas DataFrame
    predictions = ...  # compute model predictions
    signature = infer_signature(train, predictions)
    

  • input_example – One or several instances of valid model input. The input example is used as a hint of what data to feed the model. It is either converted to a Pandas DataFrame and serialized to JSON using the Pandas split-oriented format, or, for a numpy array, serialized to JSON by converting it to a list. Bytes are base64-encoded. When the signature parameter is None, the input example is used to infer a model signature.

  • pip_requirements – Either an iterable of pip requirement strings (e.g. ["johnsnowlabs", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes the environment this model should be run in. If None, a default list of requirements is inferred by mlflow.models.infer_pip_requirements() from the current software environment. If the requirement inference fails, it falls back to using get_default_pip_requirements(). Both requirements and constraints are automatically parsed and written to requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model’s conda environment (conda.yaml) file.

  • extra_pip_requirements

    Either an iterable of pip requirement strings (e.g. ["pandas", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes additional pip requirements that are appended to a default set of pip requirements generated automatically based on the user’s current software environment. Both requirements and constraints are automatically parsed and written to requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model’s conda environment (conda.yaml) file.

    Warning

    The following arguments can’t be specified at the same time:

    • conda_env

    • pip_requirements

    • extra_pip_requirements

    This example demonstrates how to specify pip requirements using pip_requirements and extra_pip_requirements.

  • metadata – Custom metadata dictionary passed to the model and stored in the MLmodel file.
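The conda_env dictionary and the "pip section" mentioned under pip_requirements can be combined in one structure. The following sketch builds such a dictionary using the standard conda environment layout, where pip packages sit in a nested {"pip": [...]} entry. conda_env_with_pip is a hypothetical helper, its default values are copied from the example dictionary above, and the environment MLflow generates itself may differ.

```python
def conda_env_with_pip(pip_requirements, python_version="3.8.15", name="mlflow-env"):
    """Build a Conda environment dict whose pip section holds the given requirements."""
    return {
        "name": name,
        "channels": ["defaults"],
        "dependencies": [
            f"python={python_version}",
            "pip",  # pip itself must be a Conda dependency for the pip section to apply
            {"pip": list(pip_requirements)},
        ],
    }
```

The returned dictionary could be passed as conda_env, in which case pip_requirements and extra_pip_requirements would be left unset.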

Example
import os
import json
import mlflow
from johnsnowlabs import nlp

# Write your raw license.json string into the 'JOHNSNOWLABS_LICENSE_JSON' env variable
creds = {
    "AWS_ACCESS_KEY_ID": "...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "SPARK_NLP_LICENSE": "...",
    "SECRET": "...",
}
os.environ["JOHNSNOWLABS_LICENSE_JSON"] = json.dumps(creds)

# Download & Install Jars/Wheels if missing and Start a spark Session
nlp.start()

# load a model
model = nlp.load("en.classify.bert_sequence.covid_sentiment")
model.predict(["I hate covid", "I love covid"])

# Save model as pyfunc and johnsnowlabs format
mlflow.johnsnowlabs.save_model(model, "saved_model")
model = mlflow.johnsnowlabs.load_model("saved_model")
# Predict with reloaded model,
# supports datatypes defined in https://nlp.johnsnowlabs.com/docs/en/jsl/predict_api#supported-data-types
model.predict(["I hate covid", "I love covid"])
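The input_example parameter is described as being serialized in the Pandas split-oriented format, with bytes base64-encoded. The sketch below builds that layout by hand with the standard library to show its shape. to_split_json is a hypothetical helper, not an MLflow function. Pandas' own split output also carries an "index" key, which this sketch omits, and the exact file MLflow writes is not specified here.

```python
import base64
import json


def to_split_json(columns, rows):
    """Serialize rows as split-oriented JSON: column names and row data kept apart."""

    def encode(value):
        # Bytes are not JSON-serializable, so represent them as base64 text.
        if isinstance(value, bytes):
            return base64.b64encode(value).decode("ascii")
        return value

    payload = {
        "columns": list(columns),
        "data": [[encode(value) for value in row] for row in rows],
    }
    return json.dumps(payload)
```

For two text rows, to_split_json(["text"], [["I hate covid"], ["I love covid"]]) yields one "columns" list and one "data" list of rows.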