Execution in Flyte is a first-class concept. Every workflow run — and each node and task within it — has its own execution record with lifecycle phases, inputs, outputs, and error information.
Execution types are defined in flyteidl/core/execution.proto and flyteidl/admin/execution.proto.
Execution hierarchy
WorkflowExecution
└── NodeExecution (one per node in the workflow)
└── TaskExecution (one per attempt of each task node)
Each level has its own phase, timestamps, and data (inputs/outputs).
WorkflowExecution phases
// flyteidl/core/execution.proto
message WorkflowExecution {
enum Phase {
UNDEFINED = 0; // Initial state, not yet processed
QUEUED = 1; // Accepted, waiting for resources
RUNNING = 2; // Actively executing nodes
SUCCEEDING = 3; // All nodes done, finalizing outputs
SUCCEEDED = 4; // Completed successfully
FAILING = 5; // Failure detected, cleaning up
FAILED = 6; // Terminal failure state
ABORTED = 7; // Terminated by user request
TIMED_OUT = 8; // Exceeded execution timeout
ABORTING = 9; // Abort in progress
}
}
| Phase | Terminal | Description |
|---|
UNDEFINED | No | Default zero value, not yet processed by the engine. |
QUEUED | No | Execution accepted, waiting for FlytePropeller to pick it up. |
RUNNING | No | At least one node is executing. |
SUCCEEDING | No | All nodes completed, engine is binding final outputs. |
SUCCEEDED | Yes | Execution finished successfully with all outputs available. |
FAILING | No | A failure was detected, engine is aborting running nodes. |
FAILED | Yes | Execution failed. Check closure.error for details. |
ABORTED | Yes | Terminated by TerminateExecution or an abort signal. |
TIMED_OUT | Yes | Exceeded the configured execution timeout. |
ABORTING | No | Termination requested, abort in progress. |
NodeExecution phases
message NodeExecution {
enum Phase {
UNDEFINED = 0;
QUEUED = 1;
RUNNING = 2;
SUCCEEDED = 3;
FAILING = 4;
FAILED = 5;
ABORTED = 6;
SKIPPED = 7; // Branch not taken, gate not triggered
TIMED_OUT = 8;
DYNAMIC_RUNNING = 9; // Dynamic workflow node compiling sub-workflow
RECOVERED = 10; // Output recovered from cache
}
}
SKIPPED is specific to branch nodes — branches that weren’t taken are skipped rather than failed. DYNAMIC_RUNNING indicates a dynamic workflow task is generating its sub-workflow graph at runtime. RECOVERED means the node’s output was found in the DataCatalog cache.
TaskExecution phases
message TaskExecution {
enum Phase {
UNDEFINED = 0;
QUEUED = 1;
RUNNING = 2;
SUCCEEDED = 3;
ABORTED = 4;
FAILED = 5;
INITIALIZING = 6; // Pod is being created (ImagePull, Init containers)
WAITING_FOR_RESOURCES = 7; // Backoff / resource quota exceeded
RETRYABLE_FAILED = 8; // Failed but will be retried
}
}
INITIALIZING covers Kubernetes pod startup states: ErrImagePull, ContainerCreating, PodInitializing. WAITING_FOR_RESOURCES indicates a resource quota or node-pressure backoff.
ExecutionError
message ExecutionError {
string code = 1; // Error code grouping
string message = 2; // Detailed description with stack trace
string error_uri = 3; // URI for full error content
enum ErrorKind {
UNKNOWN = 0;
USER = 1; // Error caused by user code or configuration
SYSTEM = 2; // Error caused by platform/infrastructure
}
ErrorKind kind = 4;
google.protobuf.Timestamp timestamp = 5;
string worker = 6; // Worker that generated the error
}
USER errors are caused by user code (e.g. Python exceptions, assertion failures). SYSTEM errors indicate platform-level failures (e.g. K8s node lost, OOM kill from system pressure).
QualityOfService
message QualityOfService {
enum Tier {
UNDEFINED = 0;
HIGH = 1; // Minimal queueing delay
MEDIUM = 2;
LOW = 3; // Tolerates longer queueing
}
oneof designation {
Tier tier = 1;
QualityOfServiceSpec spec = 2; // Custom queueing_budget duration
}
}
Creating an execution
Via REST
curl -X POST https://flyte.example.com/api/v1/executions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"project": "flytesnacks",
"domain": "development",
"spec": {
"launchPlan": {
"project": "flytesnacks",
"domain": "development",
"name": "my_module.my_workflow",
"version": "v1"
},
"metadata": {
"mode": "SCHEDULED",
"referenceExecution": null
},
"inputs": {
"literals": {
"name": {
"scalar": {
"primitive": { "stringValue": "world" }
}
}
}
},
"qualityOfService": {
"tier": "HIGH"
}
}
}'
Via grpcurl
grpcurl -plaintext \
-d '{
"project": "flytesnacks",
"domain": "development",
"spec": {
"launch_plan": {
"project": "flytesnacks",
"domain": "development",
"name": "my_module.my_workflow",
"version": "v1"
}
}
}' \
localhost:81 flyteidl.service.AdminService/CreateExecution
Via flytekit Python SDK
from flytekit.remote import FlyteRemote
from flytekit.configuration import Config
remote = FlyteRemote(
config=Config.for_endpoint("flyte.example.com"),
default_project="flytesnacks",
default_domain="development",
)
# Fetch the launch plan
lp = remote.fetch_launch_plan(
name="my_module.my_workflow",
version="v1",
)
# Create the execution
execution = remote.execute(
lp,
inputs={"name": "world"},
execution_name="my-execution-001",
wait=False,
)
print(f"Execution ID: {execution.id.name}")
Monitoring an execution
Polling for status
import time
from flytekit.remote import FlyteRemote
from flytekit.configuration import Config
remote = FlyteRemote(
config=Config.for_endpoint("flyte.example.com"),
default_project="flytesnacks",
default_domain="development",
)
# Fetch and wait for completion
execution = remote.fetch_execution(name="my-execution-001")
execution = remote.wait(execution, timeout=datetime.timedelta(minutes=30))
if execution.is_done:
print(f"Final phase: {execution.closure.phase}")
if execution.closure.phase == 4: # SUCCEEDED
print(f"Outputs: {execution.outputs}")
else:
print(f"Error: {execution.closure.error.message}")
Via REST
# Get execution status
curl -H "Authorization: Bearer $TOKEN" \
"https://flyte.example.com/api/v1/executions/flytesnacks/development/my-execution-001"
# List node executions
curl -H "Authorization: Bearer $TOKEN" \
"https://flyte.example.com/api/v1/node_executions/flytesnacks/development/my-execution-001"
# Get node execution data (inputs and outputs)
curl -H "Authorization: Bearer $TOKEN" \
"https://flyte.example.com/api/v1/data/node_executions/flytesnacks/development/my-execution-001/n0"
Terminating an execution
Via REST
curl -X DELETE \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"cause": "Cancelling stale run"}' \
"https://flyte.example.com/api/v1/executions/flytesnacks/development/my-execution-001"
Via grpcurl
grpcurl -plaintext \
-d '{
"id": {
"project": "flytesnacks",
"domain": "development",
"name": "my-execution-001"
},
"cause": "Cancelling stale run"
}' \
localhost:81 flyteidl.service.AdminService/TerminateExecution
Via Python SDK
remote.terminate(execution, cause="Cancelling stale run")
Recovering a failed execution
RecoverExecution resumes from the last known failure point, skipping nodes that already succeeded:
curl -X POST https://flyte.example.com/api/v1/executions/recover \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"id": {
"project": "flytesnacks",
"domain": "development",
"name": "my-execution-001"
},
"name": "my-execution-001-recovered"
}'
In recover mode, you cannot change inputs or the workflow version. Flyte replays completed nodes using cached outputs and only re-executes nodes that failed or were not reached.
TaskLog
Each task execution can expose log links for debugging:
message TaskLog {
string uri = 1; // URL to the log (e.g. Kubernetes log URL)
string name = 2; // display name (e.g. "Kubernetes Logs")
MessageFormat message_format = 3; // UNKNOWN, CSV, or JSON
google.protobuf.Duration ttl = 4; // log expiry duration
bool ShowWhilePending = 5;
bool HideOnceFinished = 6;
}
Log links are accessible via the GetTaskExecutionData response and are shown in the Flyte UI under each task execution.