Artifacts are the data objects that flow between steps in a ZenML pipeline. Every input and output of a step is automatically versioned, tracked, and stored as an artifact. This creates a complete lineage of your ML workflows, from raw data to trained models.

What is an Artifact?

An artifact represents a versioned data object in ZenML. Artifacts are:
  • Automatically tracked: Every step input/output becomes an artifact
  • Versioned: Each artifact has a unique version with metadata
  • Type-aware: ZenML knows the data type and uses appropriate serialization
  • Lineage-tracked: You can trace which steps produced and consumed each artifact
  • Cached: Artifacts enable step caching for faster development

Artifact Creation

Artifacts are created automatically when steps produce outputs:
from zenml import step, pipeline
import pandas as pd

@step
def load_data() -> pd.DataFrame:
    """This step produces a DataFrame artifact."""
    return pd.read_csv("data.csv")

@step
def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    """This step consumes and produces DataFrame artifacts."""
    return data.dropna()

@pipeline
def data_pipeline():
    # Artifacts flow between steps automatically
    raw_data = load_data()  # Creates artifact version 1
    clean_data = preprocess(raw_data)  # Creates artifact version 2

Artifact Versioning

Every artifact is automatically versioned by ZenML:
from zenml import step
from typing import Annotated
from zenml.artifacts import ArtifactConfig
import pandas as pd

@step
def versioned_step() -> Annotated[
    pd.DataFrame,
    ArtifactConfig(
        name="customer_data",
        version="v1.0"  # Explicit version
    )
]:
    """Create a versioned artifact."""
    return load_customer_data()

Version Behavior

  • Auto-versioning: If no version is specified, ZenML auto-increments versions
  • Named versions: You can specify meaningful version names like “production” or “v2.1”
  • Numeric versions: Integer versions are auto-incremented (1, 2, 3, …)
  • Immutable: Once created, artifact versions cannot be modified
Artifact versions are immutable. If you run a step again with the same version name, it will fail. Use auto-versioning during development.
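As a mental model, the auto-increment rule can be sketched in a few lines of plain Python (this illustrates the behavior only, it is not ZenML's implementation):

```python
def next_version(existing_versions: list[str]) -> str:
    """Sketch of auto-incrementing: take the highest numeric
    version and add one; named versions like "production" are ignored."""
    numeric = [int(v) for v in existing_versions if v.isdigit()]
    return str(max(numeric) + 1) if numeric else "1"

print(next_version([]))                        # first run
print(next_version(["1", "2", "production"]))  # named versions don't affect the counter
```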

Artifact Configuration

Configure artifact properties using ArtifactConfig:
from typing import Annotated, Any
from zenml.artifacts import ArtifactConfig
from zenml.enums import ArtifactType
from zenml import step

@step
def configured_artifact_step() -> Annotated[
    Any,
    ArtifactConfig(
        name="trained_model",           # Custom name
        version="2.0",                   # Specific version
        tags=["production", "xgboost"],  # Tags for filtering
        artifact_type=ArtifactType.MODEL, # Semantic type
        run_metadata={                   # Custom metadata
            "accuracy": 0.95,
            "training_time": 3600,
            "framework": "xgboost"
        }
    )
]:
    """Step that produces a configured artifact."""
    model = train_model()
    return model

Artifact Types

ZenML supports several semantic artifact types from zenml.enums.ArtifactType:
  • ArtifactType.DATA - Datasets and data files
  • ArtifactType.MODEL - Machine learning models
  • ArtifactType.SERVICE - Deployed services
  • ArtifactType.DATA_ANALYSIS - Analysis results and reports
  • ArtifactType.BASE - Generic artifacts
from typing import Annotated, Any
from zenml import step
from zenml.artifacts import ArtifactConfig
from zenml.enums import ArtifactType

@step
def produce_model() -> Annotated[
    Any,
    ArtifactConfig(artifact_type=ArtifactType.MODEL)
]:
    return train_model()

Materializers

Materializers handle the serialization and deserialization of artifacts. ZenML automatically selects the appropriate materializer based on the type annotation.
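The selection mechanism can be pictured as a type-to-materializer registry that is consulted at runtime; the sketch below is purely illustrative (the registry and lookup function are hypothetical, not ZenML internals):

```python
import json
from typing import Any, Callable

# Hypothetical registry mapping Python types to (save, load) functions
MATERIALIZERS: dict[type, tuple[Callable, Callable]] = {
    dict: (json.dumps, json.loads),
    list: (json.dumps, json.loads),
}

def pick_materializer(data_type: type) -> tuple[Callable, Callable]:
    """Walk the MRO so subclasses can reuse a parent type's materializer."""
    for base in data_type.__mro__:
        if base in MATERIALIZERS:
            return MATERIALIZERS[base]
    raise TypeError(f"No materializer registered for {data_type}")

save, load = pick_materializer(dict)
assert load(save({"a": 1})) == {"a": 1}  # round trip through "storage"
```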

Built-in Materializers

ZenML includes materializers for common types:
  • Primitives: str, int, float, bool, dict, list
  • NumPy: np.ndarray
  • Pandas: pd.DataFrame, pd.Series
  • Scikit-learn: Models and transformers
  • PyTorch: torch.nn.Module
  • TensorFlow/Keras: Models and tensors
import pandas as pd
import numpy as np

@step
def data_step() -> pd.DataFrame:
    """ZenML uses built-in Pandas materializer."""
    return pd.DataFrame({"col1": [1, 2, 3]})

@step
def array_step() -> np.ndarray:
    """ZenML uses built-in NumPy materializer."""
    return np.array([1, 2, 3, 4, 5])

Custom Materializers

Create custom materializers for your own types:
from zenml.materializers import BaseMaterializer
from zenml.enums import ArtifactType
from zenml.io import fileio
import json
from typing import Type

class CustomObject:
    def __init__(self, data: dict):
        self.data = data

class CustomMaterializer(BaseMaterializer):
    """Materializer for CustomObject."""
    
    ASSOCIATED_TYPES = (CustomObject,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[CustomObject]) -> CustomObject:
        """Load artifact from storage."""
        with fileio.open(f"{self.uri}/data.json", "r") as f:
            data = json.load(f)
        return CustomObject(data)

    def save(self, obj: CustomObject) -> None:
        """Save artifact to storage."""
        with fileio.open(f"{self.uri}/data.json", "w") as f:
            json.dump(obj.data, f)

Using Custom Materializers

@step(output_materializers=CustomMaterializer)
def custom_output_step() -> CustomObject:
    """Step using custom materializer."""
    return CustomObject({"key": "value"})
Materializers are automatically discovered and registered when defined. You can also explicitly specify materializers per output using the output_materializers parameter.

Loading Artifacts

There are multiple ways to load artifacts in ZenML:

Within Steps

Artifacts flow automatically between steps:
@step
def step_a() -> pd.DataFrame:
    return pd.DataFrame({"data": [1, 2, 3]})

@step
def step_b(input_data: pd.DataFrame) -> dict:
    # Artifact automatically loaded and passed
    return {"rows": len(input_data)}

@pipeline
def auto_flow_pipeline():
    data = step_a()
    result = step_b(data)  # Artifact flows automatically

External Artifacts

Load artifacts from outside the current pipeline:
from typing import Any
from zenml import step, pipeline
from zenml.artifacts import ExternalArtifact

@step
def use_external_artifact(
    model: Any
) -> dict:
    """Use a model artifact from a different run."""
    predictions = model.predict(test_data)
    return {"predictions": predictions}

@pipeline
def inference_pipeline():
    # Load artifact by name from any previous run
    result = use_external_artifact(
        model=ExternalArtifact(name="production_model")
    )

Using the Client

Load artifacts programmatically outside pipelines:
from zenml.client import Client

# Get the client
client = Client()

# Load latest version of an artifact
artifact = client.get_artifact_version("customer_data")
data = artifact.load()

# Load specific version
artifact = client.get_artifact_version("customer_data", version="v1.0")
data = artifact.load()

# Load by ID
artifact = client.get_artifact_version(name_id_or_prefix="uuid-here")
data = artifact.load()

Artifact Metadata

Attach custom metadata to artifacts:
from zenml import step
from typing import Annotated
from zenml.artifacts import ArtifactConfig
import pandas as pd

@step
def metadata_step() -> Annotated[
    pd.DataFrame,
    ArtifactConfig(
        run_metadata={
            "num_rows": 1000,
            "num_features": 50,
            "data_source": "s3://bucket/data",
            "preprocessing_time": 120.5
        }
    )
]:
    """Produce artifact with rich metadata."""
    df = load_and_process_data()
    return df

Runtime Metadata

Log metadata dynamically within a step:
from zenml import step, log_artifact_metadata
import pandas as pd

@step
def dynamic_metadata_step() -> pd.DataFrame:
    """Log metadata at runtime."""
    df = load_data()
    
    # Log metadata about the artifact
    log_artifact_metadata(
        metadata={
            "row_count": len(df),
            "column_count": len(df.columns),
            "memory_usage": df.memory_usage(deep=True).sum(),
            "dtypes": {col: str(dt) for col, dt in df.dtypes.items()}  # stringify for serialization
        }
    )
    
    return df

Artifact Lineage

ZenML automatically tracks the complete lineage of artifacts:
from zenml.client import Client

client = Client()

# Get artifact
artifact = client.get_artifact_version("processed_data")

# Access producer step
producer_step = artifact.step
print(f"Produced by: {producer_step.name}")

# Access pipeline run
pipeline_run = artifact.run
print(f"Pipeline: {pipeline_run.name}")

# Get all steps that consumed this artifact
for consumer_step in client.list_run_steps(
    input_artifact_id=artifact.id
):
    print(f"Consumed by: {consumer_step.name}")

Artifact Visualization

ZenML can automatically generate visualizations for artifacts:
from zenml import step
from zenml.types import HTMLString, MarkdownString
import pandas as pd

@step(enable_artifact_visualization=True)
def visualized_step() -> pd.DataFrame:
    """Step with automatic visualization."""
    df = pd.DataFrame({
        "metric": ["accuracy", "precision", "recall"],
        "value": [0.95, 0.92, 0.89]
    })
    return df  # Automatically creates table visualization

@step
def html_visualization() -> HTMLString:
    """Return HTML for visualization."""
    html = """
    <div>
        <h2>Model Performance</h2>
        <p>Accuracy: 95%</p>
    </div>
    """
    return HTMLString(html)

@step
def markdown_visualization() -> MarkdownString:
    """Return Markdown for visualization."""
    markdown = """
    # Model Report
    
    - Accuracy: 95%
    - Training time: 2 hours
    - Dataset size: 1M rows
    """
    return MarkdownString(markdown)

Artifact Storage

Artifacts are stored in the artifact store configured in your stack:
from zenml.client import Client

client = Client()

# Get active stack's artifact store
stack = client.active_stack
artifact_store = stack.artifact_store

print(f"Artifacts stored at: {artifact_store.path}")

Storage Structure

Artifacts are organized by:
artifact_store/
├── <pipeline_id>/
│   ├── <step_id>/
│   │   ├── <artifact_id>/
│   │   │   ├── <version>/
│   │   │   │   ├── data files
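Under that layout, an artifact's storage URI is assembled from its identifiers, roughly like this (illustrative path construction with made-up IDs, not ZenML's actual code):

```python
from pathlib import PurePosixPath

def artifact_uri(root: str, pipeline_id: str, step_id: str,
                 artifact_id: str, version: str) -> str:
    """Join the hierarchy shown above into a single storage URI."""
    return str(PurePosixPath(root, pipeline_id, step_id, artifact_id, version))

uri = artifact_uri("artifact_store", "pipe-123", "step-456", "art-789", "1")
print(uri)  # artifact_store/pipe-123/step-456/art-789/1
```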

Artifact Caching

Artifacts enable intelligent caching:
@step(enable_cache=True)
def expensive_computation() -> pd.DataFrame:
    """This step's output is cached."""
    # Expensive operation
    result = process_large_dataset()
    return result

@pipeline
def cached_pipeline():
    # First run: executes expensive_computation
    data = expensive_computation()
    
    # Second run: uses cached artifact if inputs/code unchanged
    # expensive_computation is skipped!
Caching compares:
  • Input artifact versions
  • Step code hash
  • Step configuration
  • Parameter values
If everything matches, the cached artifact is reused.
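Conceptually, those four inputs are folded into a single cache key, and any change produces a new key that forces re-execution. A minimal sketch of the idea (the hashing scheme here is illustrative, not ZenML's exact implementation):

```python
import hashlib
import json

def cache_key(input_versions: dict, code_hash: str,
              config: dict, params: dict) -> str:
    """Hash everything caching compares; any change yields a new key."""
    payload = json.dumps(
        {"inputs": input_versions, "code": code_hash,
         "config": config, "params": params},
        sort_keys=True,  # make the key deterministic across runs
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key({"raw_data": "3"}, "abc123", {"cache": True}, {"lr": 0.01})
k2 = cache_key({"raw_data": "4"}, "abc123", {"cache": True}, {"lr": 0.01})
assert k1 != k2  # a new input artifact version invalidates the cache
```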

External Artifact Upload

Upload data to ZenML without running a pipeline:
from zenml import save_artifact
from zenml.artifacts import ExternalArtifact
import pandas as pd

# Upload data directly to the active artifact store
df = pd.read_csv("local_data.csv")

artifact = save_artifact(
    df,
    name="uploaded_dataset",
    version="v1",
    tags=["manual_upload", "testing"]
)

print(f"Uploaded artifact: {artifact.id}")

# Later, use in a pipeline
@pipeline
def use_uploaded_data():
    process_data(
        data=ExternalArtifact(name="uploaded_dataset", version="v1")
    )

Best Practices

Small Artifacts

Keep artifacts reasonably sized. For huge datasets, store paths or references instead of full data.
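For example, rather than returning gigabytes of rows from a step, you can write the data externally and return only a small reference object (the file name and reference shape here are hypothetical):

```python
import json
import tempfile
from pathlib import Path

def write_dataset_reference(rows: list[dict], out_dir: str) -> dict:
    """Write the full data to external storage and return only a
    lightweight reference artifact: path plus summary stats."""
    path = Path(out_dir) / "dataset.json"
    path.write_text(json.dumps(rows))
    return {"path": str(path), "num_rows": len(rows)}

with tempfile.TemporaryDirectory() as tmp:
    ref = write_dataset_reference([{"x": 1}, {"x": 2}], tmp)
    print(ref["num_rows"])  # 2
```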

Rich Metadata

Attach meaningful metadata to artifacts. It helps with debugging and understanding lineage.

Semantic Types

Use appropriate artifact types (MODEL, DATA, etc.) to categorize your artifacts meaningfully.

Version Strategically

Use semantic versions for production artifacts. Let auto-versioning handle development.
Next Steps

  • Steps - Learn how steps produce and consume artifacts
  • Pipelines - Understand artifact flow in pipelines
  • Stacks - Configure artifact storage
  • Materializers - Custom serialization for your types

Code Reference

  • ArtifactConfig: src/zenml/artifacts/artifact_config.py:28
  • BaseMaterializer: src/zenml/materializers/base_materializer.py:111
  • ExternalArtifact: src/zenml/artifacts/external_artifact.py
