Overview

The simulator generates traces in Chrome’s Trace Event Format for visualization in Perfetto. This allows you to see exactly what instructions are executing on each engine slot and how your scratch variables change over time.
Trace visualization only works in Chrome. If you encounter issues, you can drag trace.json directly onto https://ui.perfetto.dev/

What Tracing Does

When tracing is enabled, the simulator:
  • Records every instruction executed on each engine slot (alu, valu, load, store, flow)
  • Tracks scratch variable updates with their values over time
  • Generates a trace.json file in Chrome Trace Event Format
  • Organizes data by core, engine type, and slot number
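For orientation, here is a minimal sketch of what a single entry in a Trace Event Format file can look like. The keys ("name", "ph", "pid", "tid", "ts", "dur") are standard fields of the format. The helper name make_slot_event, the sample values, and the choice of a complete ("X") event for an instruction are assumptions for illustration, not a dump of what the simulator writes.

```python
import json


def make_slot_event(core, tid, cycle, name):
    """Build one complete ("X") event in Chrome Trace Event Format.

    Hypothetical helper for illustration only; the simulator's own
    event layout may differ.
    """
    return {
        "name": name,  # label shown on the slice in the viewer
        "ph": "X",     # "X" = complete event (a start plus a duration)
        "pid": core,   # process id: the docs describe one process per core
        "tid": tid,    # thread id: one thread per engine slot
        "ts": cycle,   # start timestamp; assumed here to be the cycle number
        "dur": 1,      # assumed duration of one cycle
    }


event = make_slot_event(core=0, tid=1, cycle=42, name="add")
print(json.dumps(event))
```

Perfetto groups events by "pid" and "tid", which is why organizing the data by core and slot maps naturally onto processes and threads.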

Generating a Trace

Run the trace test to generate a full-scale trace:
python perf_takehome.py Tests.test_kernel_trace
This test runs with forest_height=10, rounds=16, and batch_size=256 — the same parameters used for performance evaluation.

How It Works

Tracing is enabled by passing trace=True to the Machine constructor:
problem.py:178-204
value_trace = {}
machine = Machine(
    mem,
    kb.instrs,
    kb.debug_info(),
    n_cores=N_CORES,
    value_trace=value_trace,
    trace=True,  # Enables trace generation
)

Hot-Reloading Workflow

1. Run the trace test

python perf_takehome.py Tests.test_kernel_trace
This generates trace.json in your working directory.

2. Start the trace server

In a separate terminal tab:
python watch_trace.py
This starts a local server on port 8000 and opens your browser.

3. Open Perfetto

Click “Open Perfetto” in the browser tab that opens.

4. Make changes and re-run

  • Keep the browser tab open
  • Modify your kernel in perf_takehome.py
  • Re-run the trace test
  • The trace view automatically refreshes with your new trace
The hot-reloading workflow lets you iterate quickly without manually loading trace files each time.
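How watch_trace.py detects a fresh trace is not shown here. As a rough sketch of the general idea, a watcher could poll the modification time of trace.json and signal a reload when it changes. The function name trace_changed is hypothetical, and this is one possible approach rather than the script's actual implementation.

```python
import os


def trace_changed(path, last_mtime):
    """Return (changed, mtime) for the file at `path`.

    Hypothetical helper: a watcher could call this in a loop and ask the
    browser to reload whenever `changed` is True.
    """
    try:
        mtime = os.stat(path).st_mtime_ns
    except FileNotFoundError:
        # The trace has not been generated yet; nothing to reload.
        return False, last_mtime
    return mtime != last_mtime, mtime
```

A caller would keep the returned mtime and pass it back in on the next poll, for example `changed, last = trace_changed("trace.json", last)` with `last` initially set to None.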

Reading the Trace Output

Process Organization

The trace organizes execution into processes and threads:

Core Processes

Each core (0 to N_CORES-1) has its own process showing engine execution.

Scratch Processes

Each core has a “Core N Scratch” process showing variable updates.

Engine Slots

Within each core process, you’ll see threads for each engine slot:
problem.py:48-55
SLOT_LIMITS = {
    "alu": 12,
    "valu": 6,
    "load": 2,
    "store": 2,
    "flow": 1,
    "debug": 64,
}
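Each slot is labeled as {engine}-{slot_number}, such as:
  • alu-0 through alu-11 (12 scalar ALU slots)
  • valu-0 through valu-5 (6 valu slots)
  • load-0 and load-1 (2 load slots)
  • store-0 and store-1 (2 store slots)
  • flow-0 (1 flow control slot)
The debug engine has no threads in the trace, because setup_trace() skips it.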
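The labels can be derived directly from SLOT_LIMITS. The sketch below reproduces the naming scheme and counts the threads per core; slot_labels is a hypothetical helper, and skipping the debug engine mirrors what setup_trace() does in problem.py.

```python
# Copied from the SLOT_LIMITS excerpt in problem.py.
SLOT_LIMITS = {
    "alu": 12,
    "valu": 6,
    "load": 2,
    "store": 2,
    "flow": 1,
    "debug": 64,
}


def slot_labels(limits):
    """List the thread labels for one core, skipping the debug engine."""
    return [
        f"{engine}-{i}"
        for engine, limit in limits.items()
        if engine != "debug"
        for i in range(limit)
    ]


labels = slot_labels(SLOT_LIMITS)
print(len(labels))   # 23 engine-slot threads per core
print(labels[:3])    # ['alu-0', 'alu-1', 'alu-2']
```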

Scratch Variables

In the “Core N Scratch” process, each scratch variable gets its own thread showing when and how it changes:
perf_takehome.py:94-109
tmp1 = self.alloc_scratch("tmp1")
tmp2 = self.alloc_scratch("tmp2")
tmp3 = self.alloc_scratch("tmp3")
init_vars = [
    "rounds",
    "n_nodes",
    "batch_size",
    "forest_height",
    "forest_values_p",
    "inp_indices_p",
    "inp_values_p",
]
Each event shows the variable’s value at that cycle.

Understanding the Timeline

The X-axis represents cycle numbers. Each cycle can execute multiple instructions in parallel across different engine slots.

What to Look For

Instructions in the same cycle on different engine slots execute in parallel. Look for opportunities to pack more operations into each cycle.
Gaps in engine slots indicate unused parallelism. Can you move instructions to fill these slots?
If one engine is constantly full while others are empty, you may have a bottleneck on that engine.
Watch load and store slots to understand memory access patterns and potential optimization opportunities.
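To put a number on these observations, you could compute how much of each engine's capacity is used. The sketch below assumes you have already extracted (cycle, engine, slot) tuples from the trace and that cycles are numbered from 0; both the tuple layout and the engine_utilization helper are hypothetical.

```python
from collections import defaultdict


def engine_utilization(records, slot_limits):
    """Fraction of available slot-cycles that each engine actually used.

    records: iterable of (cycle, engine, slot) tuples, one per executed
             instruction (an assumed, pre-extracted form of the trace).
    slot_limits: mapping of engine name to slot count, e.g. SLOT_LIMITS.
    """
    records = list(records)
    if not records:
        return {}
    # Assumes cycles are numbered from 0, so the count is max + 1.
    n_cycles = max(cycle for cycle, _, _ in records) + 1
    used = defaultdict(set)
    for cycle, engine, slot in records:
        used[engine].add((cycle, slot))
    return {
        engine: len(used.get(engine, ())) / (limit * n_cycles)
        for engine, limit in slot_limits.items()
    }


sample = [(0, "alu", 0), (0, "alu", 1), (0, "load", 0), (1, "alu", 0)]
print(engine_utilization(sample, {"alu": 2, "load": 2}))
# {'alu': 0.75, 'load': 0.25}
```

An engine near 1.0 while the others sit near 0.0 points to the bottleneck case described above; uniformly low values suggest there is room to pack more work into each cycle.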

Trace Format Details

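The trace uses Chrome’s Trace Event Format. The excerpt below from setup_trace() writes the metadata events ("ph": "M") that name each core’s process and each engine slot’s thread: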
problem.py:151-177
def setup_trace(self):
    self.trace = open("trace.json", "w")
    self.trace.write("[")
    tid_counter = 0
    self.tids = {}
    for ci, core in enumerate(self.cores):
        self.trace.write(
            f'{{"name": "process_name", "ph": "M", "pid": {ci}, "tid": 0, "args": {{"name":"Core {ci}"}}}},' + '\n'
        )
        for name, limit in SLOT_LIMITS.items():
            if name == "debug":
                continue
            for i in range(limit):
                tid_counter += 1
                self.trace.write(
                    f'{{"name": "thread_name", "ph": "M", "pid": {ci}, "tid": {tid_counter}, "args": {{"name":"{name}-{i}"}}}},' + '\n'
                )
                self.tids[(ci, name, i)] = tid_counter
You can extend trace_post_step() or trace_slot() in problem.py to add custom trace information if needed.
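If you want to inspect the trace outside Perfetto, note that setup_trace() writes "[" followed by one event per line, each with a trailing comma, so a strict JSON parser may reject the file. The sketch below is a tolerant reader under that one-event-per-line assumption; load_events and thread_names are hypothetical helpers, not part of problem.py.

```python
import json


def load_events(text):
    """Parse trace text written as '[' followed by one JSON object per line.

    Tolerates the trailing comma on each line and a missing closing ']'.
    Assumes one event per line, as in the setup_trace() excerpt.
    """
    events = []
    for line in text.splitlines():
        line = line.strip().lstrip("[").rstrip("],")
        if line:
            events.append(json.loads(line))
    return events


def thread_names(events):
    """Map (pid, tid) to the name given by "thread_name" metadata events."""
    return {
        (e["pid"], e["tid"]): e["args"]["name"]
        for e in events
        if e.get("ph") == "M" and e.get("name") == "thread_name"
    }


sample = (
    '[{"name": "process_name", "ph": "M", "pid": 0, "tid": 0, "args": {"name":"Core 0"}},\n'
    '{"name": "thread_name", "ph": "M", "pid": 0, "tid": 1, "args": {"name":"alu-0"}},\n'
)
print(thread_names(load_events(sample)))  # {(0, 1): 'alu-0'}
```

For a real file, pass the contents of trace.json as `text`. The resulting map lets you translate the numeric tid on any event back to a label such as alu-0.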

Troubleshooting

If the browser tab doesn’t open automatically, manually navigate to http://localhost:8000

Common Issues

Issue | Solution
--- | ---
Browser tab opens but shows error | Ensure you’ve run the trace test first to generate trace.json
Trace doesn’t refresh | Check that watch_trace.py is still running and re-run the test
Can’t see scratch variables | Verify you’re using alloc_scratch() with a name parameter
Empty trace | Make sure trace=True is passed to the Machine constructor

Example: Debugging with Traces

When optimizing, use traces to:
  1. Identify unused cycles — Look for instruction bundles with empty engine slots
  2. Verify VLIW packing — Confirm multiple operations execute in the same cycle
  3. Track data flow — Follow a value through operations by watching scratch variables
  4. Compare implementations — Run traces before and after changes to visualize improvements
The trace viewer’s search and filter features are invaluable for focusing on specific variables or instruction types.
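For the before/after comparison, a small script can complement the visual check. The sketch below assumes each event's "ts" field holds the cycle number (consistent with the X-axis representing cycles, but not confirmed for every event type) and that events are already parsed into dictionaries; cycle_count and speedup are hypothetical helpers.

```python
def cycle_count(events):
    """Number of cycles spanned by a trace, taken from the largest "ts".

    Assumes "ts" holds a cycle number starting at 0. Events without a
    "ts" field (such as metadata events) are ignored.
    """
    stamps = [e["ts"] for e in events if "ts" in e]
    return max(stamps) + 1 if stamps else 0


def speedup(before, after):
    """Ratio of cycle counts; values above 1.0 mean `after` is faster."""
    after_cycles = cycle_count(after)
    if after_cycles == 0:
        raise ValueError("the 'after' trace contains no timed events")
    return cycle_count(before) / after_cycles


before = [{"ph": "M", "name": "thread_name"}] + [{"ph": "X", "ts": t} for t in range(4)]
after = [{"ph": "X", "ts": t} for t in range(2)]
print(speedup(before, after))  # 2.0
```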
