String Interning
String interning is an optimization that ensures identical strings share the same memory location, enabling fast pointer comparison.
What is Interning?
Interned strings form an interpreter-global set with two properties:
- No two interned strings have the same content
- Two interned strings can be compared using pointer equality (is)
Interning benefits:
- Dictionary lookups (attribute access, global variables)
- String comparisons in hot paths
- Memory usage (deduplication)
Example
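A minimal sketch of the two properties, using sys.intern() as the public entry point to dynamic interning. The strings are built at runtime with str.join so the compiler cannot merge them as constants. Whether the un-interned pair happens to share an object is a CPython implementation detail, so the sketch relies only on the interned pair.

```python
import sys

# Build two equal strings at runtime so they are not merged as
# compile-time constants.
a = "".join(["inter", "ning"])
b = "".join(["inter", "ning"])

# Equal content is guaranteed; shared identity is not.
same_content = (a == b)

# sys.intern() returns the one canonical object for this content.
ia = sys.intern(a)
ib = sys.intern(b)
same_object = (ia is ib)

print(same_content, same_object)
```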
Two Interning Mechanisms
CPython uses two different mechanisms:
1. Singletons
Statically allocated strings that always exist.
2. Dynamic Interning
Runtime interning of strings in an interpreter-wide dictionary.
Singletons
Latin-1 Single Characters
All 256 single-character latin-1 strings are pre-allocated.
Identifier Strings
Common identifiers are marked in the C source with the _Py_ID and _Py_STR macros.
Singleton Collection
Singletons are collected by make regen-global-objects:
- Scan CPython source for _Py_ID and _Py_STR macros
- Generate code with Tools/build/generate_global_objects.py
- Produce declaration, initialization, and finalization code
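The scanning step above can be sketched in a few lines of Python. This is not the code in Tools/build/generate_global_objects.py: the regular expression, the function name collect_singletons, and the sample C snippet are all assumptions made for illustration, and the real tool handles cases this sketch ignores.

```python
import re

# Assumed pattern: matches _Py_ID(name) and _Py_STR(name) with a bare
# identifier as the argument.
MACRO_RE = re.compile(r"\b_Py_(ID|STR)\(\s*(\w+)\s*\)")


def collect_singletons(c_source):
    """Return the distinct names used with each macro in c_source."""
    found = {"ID": set(), "STR": set()}
    for kind, name in MACRO_RE.findall(c_source):
        found[kind].add(name)  # sets deduplicate repeated uses
    return found


# Hypothetical C fragment, for illustration only.
SAMPLE = """
    PyObject *name = &_Py_ID(__name__);
    PyObject *doc = &_Py_ID(__doc__);
    PyObject *empty = &_Py_STR(empty);
    PyObject *again = &_Py_ID(__name__);
"""

result = collect_singletons(SAMPLE)
print(sorted(result["ID"]), sorted(result["STR"]))
```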
Singleton Storage
Stored in runtime-global table:
- Initialized at runtime startup
- Immutable until runtime finalization
- Shared across threads and interpreters without synchronization
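Of the singleton sets, the Latin-1 single characters are the easiest to observe from Python. The sketch below relies on CPython behaviour (other implementations need not share these objects) and checks that every way of producing a one-character Latin-1 string yields the same object.

```python
# Every code point in the Latin-1 range (0..255) maps to one shared object
# in CPython, so two separate chr() calls return the same string.
latin1_shared = all(chr(i) is chr(i) for i in range(256))

# Indexing a longer string yields the same singleton that chr() returns.
text = "abc"
index_shared = text[1] is chr(ord("b"))

print(latin1_shared, index_shared)
```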
The three singleton sets (latin-1 chars, _Py_ID, _Py_STR) are disjoint: no string belongs to more than one of them.
Dynamic Interning
All other interned strings are allocated dynamically.
Storage
Stored in interpreter-wide dictionary:
- Key and value reference the same object
- One dict entry per unique interned string
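The storage scheme above can be modelled with an ordinary Python dict. This is a toy model, not CPython's C implementation; intern_string and the interned dict are hypothetical names used only for this sketch.

```python
def intern_string(table, s):
    """Return the canonical object for s, registering s if it is new."""
    # On first sight, setdefault stores s as both key and value.
    # On later calls it returns the object that is already stored.
    return table.setdefault(s, s)


interned = {}
first = intern_string(interned, "".join(["sp", "am"]))
second = intern_string(interned, "".join(["sp", "am"]))

# Both calls yield the same object, and the table holds one entry whose
# key and value are that object.
print(first is second, len(interned))
```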
Static Allocation Flag
Dynamic strings have the static allocation flag cleared (_PyUnicode_STATE(s).statically_allocated is 0).
Immortality and Reference Counting
Invariant: Every immortal string is interned.
Never use _Py_SetImmortal() on a string directly! Use _PyUnicode_InternImmortal() instead, which handles interning correctly.
Mortal Interned Strings
The converse is NOT true: interned strings can be mortal. For mortal interned strings:
- The 2 references from the dict (key + value) are excluded from the refcount
- unicode_dealloc() removes the string from the interned dict
- At shutdown, clearing the dict adds the references back before deletion
When to Immortalize
Immortalize strings that live until interpreter shutdown:
- Strings in code objects
- Strings in marshal data
- Strings in compiler-generated constants
Rationale:
- Even with hot reloading or eval(), identifier count stays low
- Immortalizing prevents repeated allocation/deallocation
Internal API
Three internal interning functions:
_PyUnicode_InternMortal
- Takes ownership of reference (steals)
- Returns new reference via pointer update
- “Reference neutral” (refcount unchanged from caller’s view)
_PyUnicode_InternImmortal
- Takes ownership of reference
- Immortalizes the result
- Returns new reference
_PyUnicode_InternStatic
- Only for _Py_STR, _Py_ID, or single-byte strings
- Not for general use
All intern functions take a pointer to PyObject* and modify it in place. This enables reference ownership transfer.
Reference Neutrality
Each intern function:
- Steals the incoming reference (consumes it)
- Provides a new reference (creates it)
- Net effect: caller still has 1 reference
Critical: Never call intern functions with a borrowed reference! Always own the reference you pass.
Interning State
Stored in _PyUnicode_STATE(s).interned:
- SSTATE_NOT_INTERNED (0)
- SSTATE_INTERNED_MORTAL (1)
- SSTATE_INTERNED_IMMORTAL (2)
- SSTATE_INTERNED_IMMORTAL_STATIC (3)
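The four states, the in-place pointer update, and the restriction on _PyUnicode_InternStatic for dynamically allocated strings can be modelled together in Python. This is a toy model, not the C API. ToyString, the one-element list standing in for a PyObject** argument, and the table keyed by text are hypothetical. The transitions encoded here (mortal interning raises 0 to 1, immortal interning raises 0 or 1 to 2, and a state never goes back down) are an assumption based on what the function names suggest, and should be checked against Objects/unicodeobject.c.

```python
from enum import IntEnum


class State(IntEnum):
    # Numeric values match the SSTATE_* constants listed in this section.
    NOT_INTERNED = 0
    INTERNED_MORTAL = 1
    INTERNED_IMMORTAL = 2
    INTERNED_IMMORTAL_STATIC = 3


class ToyString:
    """Hypothetical stand-in for a dynamically allocated string object."""

    def __init__(self, text):
        self.text = text
        self.state = State.NOT_INTERNED


def _intern(cell, table, target):
    # cell is a one-element list standing in for a PyObject** argument.
    canonical = table.setdefault(cell[0].text, cell[0])
    # Assumed rule: the state only ever moves up, never down.
    canonical.state = max(canonical.state, target)
    cell[0] = canonical  # the caller's "pointer" is updated in place


def intern_mortal(cell, table):
    _intern(cell, table, State.INTERNED_MORTAL)


def intern_immortal(cell, table):
    _intern(cell, table, State.INTERNED_IMMORTAL)


def intern_static(cell, table):
    # Every ToyString models a dynamically allocated string, so this
    # function always fails in the toy model.
    raise ValueError("static interning of a dynamically allocated string")


table = {}
first = [ToyString("spam")]
second = [ToyString("spam")]
intern_mortal(first, table)      # state 0 -> 1
intern_immortal(second, table)   # existing entry promoted 1 -> 2
print(first[0] is second[0], first[0].state.name)
```

After both calls the two cells refer to the same object, which mirrors how the C functions replace the caller's pointer with the canonical string.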
State Transitions
For dynamically allocated strings, using _PyUnicode_InternStatic is an error.
Performance Impact
Dictionary Lookups
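Dictionary lookups check object identity before falling back to an equality test, which is why interned keys can skip the content comparison. String comparisons cannot be instrumented from Python, so the sketch below uses a hypothetical Key class that counts calls to __eq__ to make the shortcut visible. It illustrates the general dict behaviour, not the specialised string path in CPython.

```python
class Key:
    """Hypothetical key type that counts equality checks."""

    eq_calls = 0

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.name == other.name


stored = Key("attr")
table = {stored: 1}

# Lookup with the very same object: matched by identity alone.
table[stored]
calls_same_object = Key.eq_calls

# Lookup with an equal but distinct object: __eq__ must run.
table[Key("attr")]
calls_equal_object = Key.eq_calls - calls_same_object

print(calls_same_object, calls_equal_object)
```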
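Without interning, matching a lookup key against a stored key can require comparing string contents; with interned keys, a pointer comparison is enough.
Memory Savings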
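A rough way to see the deduplication effect is to count distinct objects before and after interning. The sketch below builds many equal strings at runtime. The count before interning depends on the implementation, so only the count afterwards is relied on.

```python
import sys

# Build many equal strings at runtime and keep them all alive in a list.
raw = ["".join(["status", "-", "ok"]) for _ in range(1000)]
distinct_before = len({id(s) for s in raw})

# Replace each one with its canonical interned object.
deduplicated = [sys.intern(s) for s in raw]
distinct_after = len({id(s) for s in deduplicated})

print(distinct_before, distinct_after)
```

Once the raw list is dropped, only the single interned object needs to stay in memory.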
Explicit Interning
Python exposes interning via sys.intern(), which returns the interned copy of its argument.
When to Use
Explicit interning is useful for:
- Many string comparisons
- Large number of duplicate strings
- Performance-critical code paths
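As an illustration of the comparison-heavy case, the sketch below interns tokens once when they are produced, then compares them by identity. KEYWORD_IF and count_ifs are hypothetical names. Identity comparison is safe here only because both sides come from sys.intern() and the keyword object stays alive for the whole run.

```python
import sys

# Canonical object for the keyword, kept alive as a module-level constant.
KEYWORD_IF = sys.intern("if")


def count_ifs(source):
    """Count occurrences of the token 'if' in whitespace-separated source."""
    # Intern each token once, at the point where it is created.
    tokens = [sys.intern(tok) for tok in source.split()]
    # Identity comparison is valid because both sides are interned.
    return sum(1 for tok in tokens if tok is KEYWORD_IF)


print(count_ifs("if x if y else if z"))
```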
Implementation Files
C Implementation
- Objects/unicodeobject.c - String object and interning logic
- Python/pylifecycle.c - Runtime initialization (intern singletons)
Singleton Generation
- Tools/build/generate_global_objects.py - Collect and generate singletons
- Include/internal/pycore_global_objects.h - Generated singleton declarations
Related Topics
- Source Code Structure - Where string implementation lives
- Garbage Collector - String memory management
- Code Objects - Interned strings in bytecode
