Execution

Execution in Flyte is a first-class concept. Every workflow run — and each node and task within it — has its own execution record with lifecycle phases, inputs, outputs, and error information. Execution types are defined in flyteidl/core/execution.proto and flyteidl/admin/execution.proto.

Execution hierarchy

WorkflowExecution
  └── NodeExecution (one per node in the workflow)
        └── TaskExecution (one per attempt of each task node)

Each level has its own phase, timestamps, and data (inputs/outputs).

WorkflowExecution phases

// flyteidl/core/execution.proto
message WorkflowExecution {
  enum Phase {
    UNDEFINED = 0;   // Initial state, not yet processed
    QUEUED = 1;      // Accepted, waiting for resources
    RUNNING = 2;     // Actively executing nodes
    SUCCEEDING = 3;  // All nodes done, finalizing outputs
    SUCCEEDED = 4;   // Completed successfully
    FAILING = 5;     // Failure detected, cleaning up
    FAILED = 6;      // Terminal failure state
    ABORTED = 7;     // Terminated by user request
    TIMED_OUT = 8;   // Exceeded execution timeout
    ABORTING = 9;    // Abort in progress
  }
}

Phase	Terminal	Description
`UNDEFINED`	No	Default zero value, not yet processed by the engine.
`QUEUED`	No	Execution accepted, waiting for FlytePropeller to pick it up.
`RUNNING`	No	At least one node is executing.
`SUCCEEDING`	No	All nodes completed, engine is binding final outputs.
`SUCCEEDED`	Yes	Execution finished successfully with all outputs available.
`FAILING`	No	A failure was detected, engine is aborting running nodes.
`FAILED`	Yes	Execution failed. Check `closure.error` for details.
`ABORTED`	Yes	Terminated by `TerminateExecution` or an abort signal.
`TIMED_OUT`	Yes	Exceeded the configured execution timeout.
`ABORTING`	No	Termination requested, abort in progress.

NodeExecution phases

message NodeExecution {
  enum Phase {
    UNDEFINED = 0;
    QUEUED = 1;
    RUNNING = 2;
    SUCCEEDED = 3;
    FAILING = 4;
    FAILED = 5;
    ABORTED = 6;
    SKIPPED = 7;         // Branch not taken, gate not triggered
    TIMED_OUT = 8;
    DYNAMIC_RUNNING = 9; // Dynamic workflow node compiling sub-workflow
    RECOVERED = 10;      // Output recovered from cache
  }
}

SKIPPED is specific to branch nodes — branches that weren’t taken are skipped rather than failed. DYNAMIC_RUNNING indicates a dynamic workflow task is generating its sub-workflow graph at runtime. RECOVERED means the node’s output was found in the DataCatalog cache.

TaskExecution phases

message TaskExecution {
  enum Phase {
    UNDEFINED = 0;
    QUEUED = 1;
    RUNNING = 2;
    SUCCEEDED = 3;
    ABORTED = 4;
    FAILED = 5;
    INITIALIZING = 6;          // Pod is being created (ImagePull, Init containers)
    WAITING_FOR_RESOURCES = 7; // Backoff / resource quota exceeded
    RETRYABLE_FAILED = 8;      // Failed but will be retried
  }
}

INITIALIZING covers Kubernetes pod startup states: ErrImagePull, ContainerCreating, PodInitializing. WAITING_FOR_RESOURCES indicates a resource quota or node-pressure backoff.

ExecutionError

message ExecutionError {
  string code = 1;       // Error code grouping
  string message = 2;    // Detailed description with stack trace
  string error_uri = 3;  // URI for full error content

  enum ErrorKind {
    UNKNOWN = 0;
    USER = 1;    // Error caused by user code or configuration
    SYSTEM = 2;  // Error caused by platform/infrastructure
  }
  ErrorKind kind = 4;
  google.protobuf.Timestamp timestamp = 5;
  string worker = 6;     // Worker that generated the error
}

USER errors are caused by user code (e.g. Python exceptions, assertion failures). SYSTEM errors indicate platform-level failures (e.g. K8s node lost, OOM kill from system pressure).

QualityOfService

message QualityOfService {
  enum Tier {
    UNDEFINED = 0;
    HIGH = 1;    // Minimal queueing delay
    MEDIUM = 2;
    LOW = 3;     // Tolerates longer queueing
  }

  oneof designation {
    Tier tier = 1;
    QualityOfServiceSpec spec = 2; // Custom queueing_budget duration
  }
}

Creating an execution

Via REST

curl -X POST https://flyte.example.com/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "project": "flytesnacks",
    "domain": "development",
    "spec": {
      "launchPlan": {
        "project": "flytesnacks",
        "domain": "development",
        "name": "my_module.my_workflow",
        "version": "v1"
      },
      "metadata": {
        "mode": "SCHEDULED",
        "referenceExecution": null
      },
      "inputs": {
        "literals": {
          "name": {
            "scalar": {
              "primitive": { "stringValue": "world" }
            }
          }
        }
      },
      "qualityOfService": {
        "tier": "HIGH"
      }
    }
  }'

Via grpcurl

grpcurl -plaintext \
  -d '{
    "project": "flytesnacks",
    "domain": "development",
    "spec": {
      "launch_plan": {
        "project": "flytesnacks",
        "domain": "development",
        "name": "my_module.my_workflow",
        "version": "v1"
      }
    }
  }' \
  localhost:81 flyteidl.service.AdminService/CreateExecution

Via flytekit Python SDK

from flytekit.remote import FlyteRemote
from flytekit.configuration import Config

remote = FlyteRemote(
    config=Config.for_endpoint("flyte.example.com"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch the launch plan
lp = remote.fetch_launch_plan(
    name="my_module.my_workflow",
    version="v1",
)

# Create the execution
execution = remote.execute(
    lp,
    inputs={"name": "world"},
    execution_name="my-execution-001",
    wait=False,
)
print(f"Execution ID: {execution.id.name}")

Monitoring an execution

Polling for status

import time
from flytekit.remote import FlyteRemote
from flytekit.configuration import Config

remote = FlyteRemote(
    config=Config.for_endpoint("flyte.example.com"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch and wait for completion
execution = remote.fetch_execution(name="my-execution-001")
execution = remote.wait(execution, timeout=datetime.timedelta(minutes=30))

if execution.is_done:
    print(f"Final phase: {execution.closure.phase}")
    if execution.closure.phase == 4:  # SUCCEEDED
        print(f"Outputs: {execution.outputs}")
    else:
        print(f"Error: {execution.closure.error.message}")

Via REST

# Get execution status
curl -H "Authorization: Bearer $TOKEN" \
  "https://flyte.example.com/api/v1/executions/flytesnacks/development/my-execution-001"

# List node executions
curl -H "Authorization: Bearer $TOKEN" \
  "https://flyte.example.com/api/v1/node_executions/flytesnacks/development/my-execution-001"

# Get node execution data (inputs and outputs)
curl -H "Authorization: Bearer $TOKEN" \
  "https://flyte.example.com/api/v1/data/node_executions/flytesnacks/development/my-execution-001/n0"

Terminating an execution

Via REST

curl -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cause": "Cancelling stale run"}' \
  "https://flyte.example.com/api/v1/executions/flytesnacks/development/my-execution-001"

Via grpcurl

grpcurl -plaintext \
  -d '{
    "id": {
      "project": "flytesnacks",
      "domain": "development",
      "name": "my-execution-001"
    },
    "cause": "Cancelling stale run"
  }' \
  localhost:81 flyteidl.service.AdminService/TerminateExecution

Via Python SDK

remote.terminate(execution, cause="Cancelling stale run")

Recovering a failed execution

RecoverExecution resumes from the last known failure point, skipping nodes that already succeeded:

curl -X POST https://flyte.example.com/api/v1/executions/recover \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "id": {
      "project": "flytesnacks",
      "domain": "development",
      "name": "my-execution-001"
    },
    "name": "my-execution-001-recovered"
  }'

In recover mode, you cannot change inputs or the workflow version. Flyte replays completed nodes using cached outputs and only re-executes nodes that failed or were not reached.

TaskLog

Each task execution can expose log links for debugging:

message TaskLog {
  string uri = 1;          // URL to the log (e.g. Kubernetes log URL)
  string name = 2;         // display name (e.g. "Kubernetes Logs")
  MessageFormat message_format = 3; // UNKNOWN, CSV, or JSON
  google.protobuf.Duration ttl = 4; // log expiry duration
  bool ShowWhilePending = 5;
  bool HideOnceFinished = 6;
}

Log links are accessible via the GetTaskExecutionData response and are shown in the Flyte UI under each task execution.

gRPC / Protobuf API

Execution hierarchy

WorkflowExecution phases

NodeExecution phases

TaskExecution phases

ExecutionError

QualityOfService

Creating an execution

Via REST

Via grpcurl

Via flytekit Python SDK

Monitoring an execution

Polling for status

Via REST

Terminating an execution

Via REST

Via grpcurl

Via Python SDK

Recovering a failed execution

TaskLog

Build docs developers (and LLMs) love

gRPC / Protobuf API

​Execution hierarchy

​WorkflowExecution phases

​NodeExecution phases

​TaskExecution phases

​ExecutionError

​QualityOfService

​Creating an execution

​Via REST

​Via grpcurl

​Via flytekit Python SDK

​Monitoring an execution

​Polling for status

​Via REST

​Terminating an execution

​Via REST

​Via grpcurl

​Via Python SDK

​Recovering a failed execution

​TaskLog

Build docs developers (and LLMs) love

Execution hierarchy

WorkflowExecution phases

NodeExecution phases

TaskExecution phases

ExecutionError

QualityOfService

Creating an execution

Via REST

Via grpcurl

Via flytekit Python SDK

Monitoring an execution

Polling for status

Via REST

Terminating an execution

Via REST

Via grpcurl

Via Python SDK

Recovering a failed execution

TaskLog