spaCy within MLflow
spaCy is the leading industrial-strength natural language processing library, designed from the ground up for production use. Created by Explosion AI, spaCy combines cutting-edge research with practical engineering to deliver fast, accurate, and scalable NLP solutions that power everything from chatbots and content analysis to document processing and knowledge extraction systems.
spaCy's production-first philosophy sets it apart from academic NLP libraries. With its streamlined API, extensive pre-trained models, and robust pipeline architecture, spaCy enables developers to build sophisticated NLP applications without sacrificing speed or maintainability.
Logging spaCy Models to MLflow

Basic Model Logging

MLflow provides native support for spaCy models through the mlflow.spacy.log_model() function:

```python
import mlflow
import spacy

# Load or train your spaCy model
nlp = spacy.load("en_core_web_sm")

# Log the model to MLflow
with mlflow.start_run():
    mlflow.spacy.log_model(nlp, name="spacy_model")
```
What Gets Automatically Captured
Model Components & Architecture

- Pipeline Components: All pipeline components (tokenizer, tagger, parser, NER, text categorizer)
- Model Configuration: Architecture details, hyperparameters, and component settings
- Component Metadata: Individual component configurations and performance metrics
- Custom Components: User-defined pipeline components and extensions

Dependencies & Environment

- spaCy Version: Exact spaCy version for reproducibility
- Python Environment: Complete environment specification with all dependencies
- Requirements: Automatic generation of pip requirements and conda environment
- Model Dependencies: Language models and custom extensions

Deployment Artifacts

- Complete Model: Full model serialization with vocabularies and weights
- Model Metadata: Model size, components, and performance characteristics
- Model Signatures: Input/output schemas for validation (when applicable)
Automatic PyFunc Flavor for Text Classification

When your spaCy model includes a TextCategorizer component, MLflow automatically adds the PyFunc flavor for easy deployment:

```python
import mlflow
import spacy
import pandas as pd

# Create a text classification pipeline
nlp = spacy.blank("en")
nlp.add_pipe("textcat")

# Add labels to the text categorizer
nlp.get_pipe("textcat").add_label("POSITIVE")
nlp.get_pipe("textcat").add_label("NEGATIVE")

# Train your model (training code omitted for brevity)

with mlflow.start_run():
    # Log model - PyFunc flavor added automatically
    model_info = mlflow.spacy.log_model(nlp, name="text_classifier")

# Load and use for inference
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

# Prepare input data as a DataFrame
test_data = pd.DataFrame({"text": ["This is great!", "This is terrible!"]})
predictions = loaded_model.predict(test_data)
print(predictions)
```
Text Classification Integration Details
Automatic PyFunc Generation

- Smart Detection: MLflow automatically detects TextCategorizer components
- DataFrame Input: The PyFunc wrapper accepts a pandas DataFrame with a text column
- Batch Processing: Efficient inference on multiple texts simultaneously
- Probability Scores: Returns prediction probabilities for all categories

Input/Output Format

- Input: pandas DataFrame with exactly one column containing text data
- Output: pandas DataFrame with a "predictions" column containing category probabilities
- Format: Each prediction is a dictionary with category names as keys and probabilities as values
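Concretely, a caller usually wants the top label out of each probability dictionary. The helper below is illustrative (top_category is not an MLflow or spaCy API) and assumes the output shape described above:

```python
import pandas as pd


def top_category(pred: dict) -> str:
    """Return the highest-probability category from one prediction dict."""
    return max(pred, key=pred.get)


# The PyFunc output shape described above: one probability dict per input row
predictions = pd.DataFrame(
    {
        "predictions": [
            {"POSITIVE": 0.91, "NEGATIVE": 0.09},
            {"POSITIVE": 0.12, "NEGATIVE": 0.88},
        ]
    }
)

labels = predictions["predictions"].apply(top_category)
print(labels.tolist())  # ['POSITIVE', 'NEGATIVE']
```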
Deployment Benefits

- Universal Interface: Use standard MLflow serving infrastructure
- Easy Integration: Compatible with MLflow's deployment tools and APIs
- Model Validation: Automatic input validation and error handling
- Monitoring: Integration with MLflow's model monitoring capabilities
Advanced spaCy Training with MLflow Integration

Custom Training Logger

spaCy's training system can be integrated with MLflow through custom loggers registered in spaCy's component registry:

```python
import sys
from typing import IO, Any, Callable, Dict, Optional, Tuple

import mlflow
import spacy
from spacy import Language


@spacy.registry.loggers("mlflow_logger.v1")
def mlflow_logger():
    """Custom MLflow logger for spaCy training integration."""

    def setup_logger(
        nlp: Language,
        stdout: IO = sys.stdout,
        stderr: IO = sys.stderr,
    ) -> Tuple[Callable, Callable]:
        def log_step(info: Optional[Dict[str, Any]]):
            """Called by spaCy for every evaluation step."""
            if info:
                step = info["step"]
                score = info["score"]
                metrics = {}

                # Log component-specific losses and scores
                for pipe_name in nlp.pipe_names:
                    if pipe_name in info["losses"]:
                        metrics[f"{pipe_name}_loss"] = info["losses"][pipe_name]
                        metrics[f"{pipe_name}_score"] = score

                # Log overall metrics
                metrics["overall_score"] = score
                mlflow.log_metrics(metrics, step=step)

        def finalize():
            """Called by spaCy after training completion."""
            # Log the final trained model
            mlflow.spacy.log_model(nlp, name="trained_model")
            mlflow.end_run()

        return log_step, finalize

    return setup_logger
```
Training Configuration Setup
Configuration File Integration

- Generate Base Configuration:

```shell
python -m spacy init config --pipeline textcat --lang en config.cfg
```

- Update Logger Configuration:

```ini
[training.logger]
@loggers = "mlflow_logger.v1"

[training]
max_steps = 1000
eval_frequency = 100
```

- Configure Data Paths:

```ini
[paths]
train = "./train.spacy"
dev = "./dev.spacy"
```
Advanced Logger Features

- Component-Level Tracking: Monitor individual pipeline component performance
- Custom Metrics: Log domain-specific evaluation metrics
- Training Dynamics: Track learning curves and convergence patterns
- Automatic Model Saving: Save best models based on validation performance
- Experiment Metadata: Log training configuration and hyperparameters
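The two callbacks returned by a logger's setup function follow a simple contract: log_step receives either an info dict or None, and finalize runs once after training. The stub below exercises that contract without spaCy or MLflow; the sink dict is a stand-in for mlflow.log_metrics, and make_logger is a hypothetical name for illustration:

```python
from typing import Any, Callable, Dict, Optional, Tuple


def make_logger(sink: dict) -> Tuple[Callable, Callable]:
    """Return a (log_step, finalize) pair mirroring a spaCy logger's contract."""

    def log_step(info: Optional[Dict[str, Any]]) -> None:
        # spaCy passes None on non-evaluation steps; a logger must tolerate it
        if info:
            sink[info["step"]] = {"score": info["score"], **info["losses"]}

    def finalize() -> None:
        # Runs once after training: the place to log the final model
        sink["finalized"] = True

    return log_step, finalize


metrics: dict = {}
log_step, finalize = make_logger(metrics)
log_step(None)  # non-evaluation step: nothing recorded
log_step({"step": 100, "score": 0.82, "losses": {"textcat": 0.35}})
finalize()
print(metrics)  # {100: {'score': 0.82, 'textcat': 0.35}, 'finalized': True}
```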
Complete Training Integration Example

Here's a comprehensive example showing spaCy training with MLflow integration:

```python
import mlflow
import spacy
from spacy.cli.train import train as spacy_train
from spacy.tokens import DocBin


def prepare_training_data():
    """Prepare sample training data for text classification."""
    # Sample data preparation
    train_data = [
        ("This movie is excellent!", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
        ("Terrible film, waste of time", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
        ("Amazing storyline and acting", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
        ("Boring and predictable", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ]

    # Convert to spaCy's binary training format
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        doc.cats = annotations["cats"]
        doc_bin.add(doc)
    return doc_bin


# Prepare and save training data
train_docs = prepare_training_data()
dev_docs = prepare_training_data()  # Use same data for simplicity
train_docs.to_disk("./train.spacy")
dev_docs.to_disk("./dev.spacy")

# Configuration content
config_content = """
[nlp]
lang = "en"
pipeline = ["textcat"]

[components]

[components.textcat]
factory = "textcat"

[training]
max_steps = 100
eval_frequency = 20

[training.logger]
@loggers = "mlflow_logger.v1"

[paths]
train = "./train.spacy"
dev = "./dev.spacy"
"""

# Write configuration file
with open("config.cfg", "w") as f:
    f.write(config_content)

# Start MLflow experiment
with mlflow.start_run(run_name="spacy_text_classification"):
    # Log training configuration
    mlflow.log_params(
        {
            "model_type": "text_classification",
            "pipeline": "textcat",
            "language": "en",
            "max_steps": 100,
            "eval_frequency": 20,
        }
    )

    # Train the model (this will use our custom logger)
    spacy_train("config.cfg")

    print("Training completed and logged to MLflow!")
```
Saving and Loading spaCy Models

Basic Model Operations

MLflow provides multiple ways to save and load spaCy models:

```python
import mlflow
import spacy

# Load a pre-trained model
nlp = spacy.load("en_core_web_sm")

# Save with MLflow
model_info = mlflow.spacy.log_model(nlp, name="spacy_model")

# Load back in native spaCy format
loaded_nlp = mlflow.spacy.load_model(model_info.model_uri)

# Use the loaded model
doc = loaded_nlp("This is a test sentence.")
for token in doc:
    print(f"{token.text}: {token.pos_}, {token.dep_}")
```
Loading Options and Use Cases
Native spaCy Loading

```python
# Full spaCy functionality - all pipeline components
nlp = mlflow.spacy.load_model(model_info.model_uri)

# Access all spaCy features
doc = nlp("Analyze this text completely.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
dependencies = [(token.text, token.dep_, token.head.text) for token in doc]
```

PyFunc Loading (Text Classification Only)

```python
import pandas as pd

# Simplified interface for text classification
classifier = mlflow.pyfunc.load_model(model_info.model_uri)

# DataFrame input required
test_data = pd.DataFrame({"text": ["Sample text to classify"]})
predictions = classifier.predict(test_data)
```
When to Use Each Approach

- Native spaCy: Full NLP pipeline access, custom components, advanced features
- PyFunc: Text classification deployment, simple inference, production serving
- Mixed Approach: Develop with the native flavor, deploy with PyFunc
Model Signatures for spaCy Models

Adding signatures to spaCy models improves documentation and enables validation:

```python
import mlflow
import pandas as pd
import spacy
from mlflow.models import infer_signature

# Load and prepare model
nlp = spacy.load("en_core_web_sm")

# For text classification models, create sample data
sample_input = pd.DataFrame({"text": ["This is a sample sentence for classification."]})

# If the model has a TextCategorizer, get predictions for the signature
if nlp.has_pipe("textcat"):
    # Create a wrapper for prediction
    class SpacyWrapper:
        def __init__(self, nlp):
            self.nlp = nlp

        def predict(self, df):
            results = []
            for text in df.iloc[:, 0]:
                doc = self.nlp(text)
                results.append({"predictions": doc.cats})
            return pd.DataFrame(results)

    wrapper = SpacyWrapper(nlp)
    sample_output = wrapper.predict(sample_input)
    signature = infer_signature(sample_input, sample_output)
else:
    signature = None

# Log model with signature
mlflow.spacy.log_model(
    nlp, name="spacy_model", signature=signature, input_example=sample_input
)
```
Manual Signature Definition
For complete control over your model signature:

```python
import mlflow
from mlflow.models import ModelSignature
from mlflow.types import ColSpec, Schema

# Define input schema for text classification
input_schema = Schema([ColSpec("string", "text")])

# Define output schema. "string" is a simplification here: each prediction is
# a dict of category probabilities, which has no direct scalar column type.
output_schema = Schema([ColSpec("string", "predictions")])

# Create signature
signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# Log model with manual signature
mlflow.spacy.log_model(nlp, name="model", signature=signature)
```
Manual signatures are useful when:
- You need precise control over input/output specifications
- Working with custom output formats
- The automatic inference doesn't capture your intended schema
- You want to document expected data types explicitly
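One thing a signature of this shape buys you is early rejection of malformed payloads. The function below is a plain-pandas illustration of that kind of check, not MLflow's actual enforcement code:

```python
import pandas as pd


def validate_input(df: pd.DataFrame) -> None:
    """Reject inputs that don't match the single-string-column schema above."""
    if list(df.columns) != ["text"]:
        raise ValueError(f"expected a single 'text' column, got {list(df.columns)}")
    if not all(isinstance(v, str) for v in df["text"]):
        raise ValueError("all values in 'text' must be strings")


validate_input(pd.DataFrame({"text": ["ok"]}))  # passes silently
try:
    validate_input(pd.DataFrame({"body": ["oops"]}))
except ValueError as err:
    print(err)  # expected a single 'text' column, got ['body']
```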
Advanced spaCy Tracking Patterns

Custom Component Tracking

Track custom spaCy components and their performance:

```python
import json

import mlflow
import spacy
from spacy import Language
from spacy.tokens import Doc


@Language.component("sentiment_analyzer")
def sentiment_analyzer(doc):
    """Custom component for sentiment analysis."""
    # Simple rule-based sentiment (replace with an actual ML model)
    positive_words = {"good", "great", "excellent", "amazing", "wonderful"}
    negative_words = {"bad", "terrible", "awful", "horrible", "worst"}

    pos_count = sum(1 for token in doc if token.lower_ in positive_words)
    neg_count = sum(1 for token in doc if token.lower_ in negative_words)

    if pos_count > neg_count:
        sentiment, score = "positive", 0.8
    elif neg_count > pos_count:
        sentiment, score = "negative", 0.8
    else:
        sentiment, score = "neutral", 0.5

    # Add sentiment as custom attributes
    doc._.sentiment = sentiment
    doc._.sentiment_score = score
    return doc


# Register custom extensions
Doc.set_extension("sentiment", default=None)
Doc.set_extension("sentiment_score", default=0.0)

# Create pipeline with custom component
nlp = spacy.blank("en")
nlp.add_pipe("sentiment_analyzer")

# Test texts paired with their expected sentiment
test_texts = [
    ("This is a great product!", "positive"),
    ("Terrible service, very bad.", "negative"),
    ("It's okay, nothing special.", "neutral"),
]

with mlflow.start_run():
    # Log component information
    mlflow.log_params(
        {
            "custom_components": ["sentiment_analyzer"],
            "pipeline": nlp.pipe_names,
            "model_version": "1.0",
        }
    )

    # Evaluate the custom component against the expected labels
    results = []
    correct_predictions = 0
    for text, expected in test_texts:
        doc = nlp(text)
        if doc._.sentiment == expected:
            correct_predictions += 1
        results.append(
            {"text": text, "sentiment": doc._.sentiment, "score": doc._.sentiment_score}
        )

    # Log evaluation metrics
    mlflow.log_metric("component_accuracy", correct_predictions / len(test_texts))

    # Log model with custom component
    mlflow.spacy.log_model(nlp, name="custom_sentiment_model")

    # Log evaluation results as an artifact
    with open("evaluation_results.json", "w") as f:
        json.dump(results, f, indent=2)
    mlflow.log_artifact("evaluation_results.json")
```
Multi-Language Model Tracking

Track experiments across different languages and models:

Multilingual Experiment Tracking

```python
import time

import mlflow
import spacy


def evaluate_multilingual_models():
    """Evaluate performance across multiple language models."""
    # Define language models to test
    models = {
        "en": "en_core_web_sm",
        "de": "de_core_news_sm",
        "fr": "fr_core_news_sm",
        "es": "es_core_news_sm",
    }

    # Sample texts for each language
    test_texts = {
        "en": "Apple Inc. is a technology company based in California.",
        "de": "Apple Inc. ist ein Technologieunternehmen in Kalifornien.",
        "fr": "Apple Inc. est une entreprise technologique basée en Californie.",
        "es": "Apple Inc. es una empresa de tecnología con sede en California.",
    }

    with mlflow.start_run(run_name="multilingual_comparison"):
        results = {}
        for lang, model_name in models.items():
            try:
                with mlflow.start_run(run_name=f"{lang}_model", nested=True):
                    # Load language-specific model
                    nlp = spacy.load(model_name)

                    # Log model information
                    mlflow.log_params(
                        {
                            "language": lang,
                            "model_name": model_name,
                            "pipeline_components": nlp.pipe_names,
                            "vocab_size": len(nlp.vocab),
                        }
                    )

                    # Process text and extract entities, timing the call
                    start = time.time()
                    doc = nlp(test_texts[lang])
                    processing_time = time.time() - start
                    entities = [(ent.text, ent.label_) for ent in doc.ents]

                    # Log results
                    mlflow.log_metrics(
                        {
                            "num_entities": len(entities),
                            "num_tokens": len(doc),
                            "processing_time": processing_time,
                        }
                    )

                    # Log the model
                    mlflow.spacy.log_model(nlp, name=f"{lang}_model")
                    results[lang] = {"entities": entities, "tokens": len(doc)}
            except OSError:
                print(f"Model {model_name} not available, skipping {lang}")

        # Log summary results
        mlflow.log_param("total_languages", len(results))
        if results:
            mlflow.log_metric(
                "avg_entities_per_lang",
                sum(len(r["entities"]) for r in results.values()) / len(results),
            )
        return results


# Run multilingual evaluation
results = evaluate_multilingual_models()
```
Benefits of Multilingual Tracking

- Cross-Language Comparison: Compare model performance across languages
- Unified Metrics: Track consistent metrics across different language models
- Model Selection: Choose the best models for multilingual applications
- Performance Analysis: Identify language-specific strengths and weaknesses
Pipeline Optimization Tracking

Track different pipeline configurations and optimizations:

```python
import time
from itertools import combinations

import mlflow
import spacy


def optimize_pipeline_configuration():
    """Test different pipeline configurations for optimal performance."""
    # Define pipeline variations to test
    base_components = ["tok2vec", "tagger", "parser", "ner"]
    optional_components = ["lemmatizer", "textcat"]

    # Build every combination of base plus optional components
    configurations = []
    for r in range(len(optional_components) + 1):
        for combo in combinations(optional_components, r):
            configurations.append(base_components + list(combo))

    with mlflow.start_run(run_name="pipeline_optimization"):
        best_config = None
        best_score = float("-inf")

        for i, components in enumerate(configurations):
            with mlflow.start_run(run_name=f"config_{i}", nested=True):
                # Create a blank model and add the requested components
                nlp = spacy.blank("en")
                pipeline_components = []
                for comp in components:
                    try:
                        nlp.add_pipe(comp)
                        pipeline_components.append(comp)
                    except Exception:
                        continue  # Skip components unavailable for this pipeline

                # Log configuration
                mlflow.log_params(
                    {
                        "components": pipeline_components,
                        "num_components": len(pipeline_components),
                        "config_id": i,
                    }
                )

                # Simulate performance testing
                test_text = "This is a test sentence for pipeline evaluation."
                start_time = time.time()
                nlp(test_text)
                processing_time = time.time() - start_time

                # Calculate a synthetic performance score
                performance_score = len(pipeline_components) * 10 - processing_time * 100

                # Log metrics
                mlflow.log_metrics(
                    {
                        "processing_time": processing_time,
                        "performance_score": performance_score,
                        "vocab_size": len(nlp.vocab),  # Simplified memory proxy
                    }
                )

                # Log model
                mlflow.spacy.log_model(nlp, name="pipeline_model")

                # Track best configuration
                if performance_score > best_score:
                    best_score = performance_score
                    best_config = pipeline_components

        # Log best configuration summary
        mlflow.log_params(
            {
                "best_config": best_config,
                "best_score": best_score,
                "total_configs_tested": len(configurations),
            }
        )
    return best_config, best_score


# Run pipeline optimization
best_config, score = optimize_pipeline_configuration()
print(f"Best configuration: {best_config} with score: {score}")
```
Production Deployment

Local Model Serving

Deploy your spaCy models locally using MLflow's serving infrastructure:

```python
# First, log your model with proper configuration
import mlflow
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

with mlflow.start_run() as run:
    # Create example input for the signature
    sample_input = pd.DataFrame({"text": ["Sample text for classification"]})

    # Log model with dependencies
    model_info = mlflow.spacy.log_model(
        nlp,
        name="spacy_model",
        input_example=sample_input,
        pip_requirements=["spacy>=3.0.0"],
    )

    # The format of this attribute is 'models:/<model_id>'
    model_uri = model_info.model_uri
```

Then deploy the model using the MLflow CLI:

```shell
# Serve the model locally (for text classification models with the PyFunc flavor)
mlflow models serve -m models:/<model_id> -p 5000

# Test the deployment
curl http://localhost:5000/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"text": "This is a great product!"}]}'
```
Advanced Deployment Options
The mlflow models serve command supports several options for spaCy models:

```shell
# Specify environment manager
mlflow models serve -m models:/<model_id> -p 5000 --env-manager conda

# Enable MLServer for enhanced performance
mlflow models serve -m models:/<model_id> -p 5000 --enable-mlserver

# Set custom host for network access
mlflow models serve -m models:/<model_id> -p 5000 --host 0.0.0.0
```

For production deployments, consider:

- Using MLServer (--enable-mlserver) for better performance and scalability
- Building Docker images with mlflow models build-docker
- Deploying to cloud platforms like Azure ML or Amazon SageMaker
- Setting up proper environment management and dependency isolation
- Implementing model monitoring and health checks
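For the Docker route mentioned above, a minimal sketch looks like this (the image name spacy-serving is arbitrary, <model_id> is a placeholder for your model's actual ID, and MLflow's serving images listen on port 8080 inside the container):

```shell
# Build a self-contained serving image for the logged model
mlflow models build-docker -m "models:/<model_id>" -n spacy-serving

# Run it, mapping the container's serving port to localhost:5000
docker run -p 5000:8080 spacy-serving

# Query it exactly like a locally served model
curl http://localhost:5000/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"text": "This is a great product!"}]}'
```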
Real-World Applications

The MLflow-spaCy integration excels across diverse NLP domains:

- Content Analysis: Track sentiment analysis, topic modeling, and content classification systems for media and publishing
- Healthcare NLP: Monitor clinical text processing, medical entity extraction, and diagnostic support systems
- Enterprise Search: Log document processing, information extraction, and knowledge management pipelines
- E-commerce Intelligence: Track product categorization, review analysis, and customer intent recognition
- Communications Processing: Monitor email classification, chatbot training, and customer service automation
- Legal Tech: Log contract analysis, document review, and legal entity recognition systems
- Multilingual Applications: Track translation quality, cross-lingual transfer, and international content processing
- Business Intelligence: Monitor text analytics, report generation, and automated insights extraction
Conclusion

The MLflow-spaCy integration provides a comprehensive solution for tracking, managing, and deploying production-grade NLP systems. By combining spaCy's industrial-strength capabilities with MLflow's experiment tracking, you create a workflow that is:

- Transparent: Every aspect of NLP model development is documented and trackable
- Reproducible: Experiments can be recreated exactly with proper environment management
- Comparable: Different approaches can be evaluated side-by-side with consistent metrics
- Scalable: From simple prototypes to enterprise-scale NLP systems
- Collaborative: Team members can share and build upon each other's NLP research and development
Whether you're building intelligent chatbots, analyzing customer feedback, or extracting insights from unstructured text, the MLflow-spaCy integration provides the foundation for organized, reproducible, and scalable NLP development that grows with your ambitions from prototype to production-scale deployment.