Lexical Analysis
A Python program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer (also known as the tokenizer). This chapter describes how the lexical analyzer produces these tokens. The lexical analyzer determines the program text’s encoding (UTF-8 by default), and decodes the text into source characters. If the text cannot be decoded, a SyntaxError is raised.
Line Structure
A Python program is divided into a number of logical lines.
Logical Lines
The end of a logical line is represented by the token NEWLINE. Statements cannot cross logical line boundaries except where NEWLINE is allowed by the syntax (e.g., between statements in compound statements). A logical line is constructed from one or more physical lines by following the explicit or implicit line joining rules.
Physical Lines
A physical line is a sequence of characters terminated by one of the following end-of-line sequences:
- The Unix form using ASCII LF (linefeed)
- The Windows form using the ASCII sequence CR LF (return followed by linefeed)
- The Classic Mac OS form using the ASCII CR (return) character
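As a rough illustration of the three end-of-line forms, the sketch below feeds a source string that mixes them to the built-in compile() and exec(). The variable names a, b and c are made up for the example.

```python
# Source text mixing the three end-of-line forms listed above.
# The names a, b and c are purely illustrative.
source = (
    "a = 1\r"      # Classic Mac OS form (CR)
    "b = 2\r\n"    # Windows form (CR LF)
    "c = 3\n"      # Unix form (LF)
)

namespace = {}
exec(compile(source, "<example>", "exec"), namespace)

# Each physical line was recognized as a separate statement.
print(namespace["a"], namespace["b"], namespace["c"])
```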
Comments
A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax.
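One way to see this rule in action is to run the standard library's tokenize module over a line that contains a hash character both inside a string literal and outside it. This is only a small sketch; the sample source line is invented for the example.

```python
import io
import tokenize

# Illustrative source: the first '#' is inside a string literal,
# the second one starts a comment.
source = 'label = "# not a comment"  # a real comment\n'

tokens = tokenize.generate_tokens(io.StringIO(source).readline)
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]

print(comments)
```

Only the text after the second hash character is reported as a COMMENT token.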
Encoding Declarations
If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration. The first group of this expression names the encoding of the source code file.
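The recommended forms of an encoding expression are:
# -*- coding: <encoding-name> -*-
which is also recognized by GNU Emacs, and
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM.
If no encoding declaration is found, the default encoding is UTF-8. If the implicit or explicit encoding of a file is UTF-8, an initial UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error.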
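The following sketch applies the regular expression quoted above to two sample comments, then uses tokenize.detect_encoding from the standard library to show how a declaration, a byte-order mark, and the default case are reported. The sample comments and byte strings are assumptions chosen for illustration, and detect_encoding reports normalized encoding names rather than the literal text of the declaration.

```python
import io
import re
import tokenize

# The pattern quoted in the text above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

# Illustrative comments in the Emacs and Vim styles.
found = {}
for comment in ("# -*- coding: latin-1 -*-", "# vim:fileencoding=utf-8"):
    found[comment] = CODING_RE.search(comment).group(1)
print(found)


def detect(raw: bytes) -> str:
    """Return the encoding name tokenize reports for raw source bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return encoding


declared = detect(b"# -*- coding: latin-1 -*-\nx = 1\n")
with_bom = detect(b"\xef\xbb\xbfx = 1\n")
default = detect(b"x = 1\n")
print(declared, with_bom, default)
```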
Explicit Line Joining
Two or more physical lines may be joined into logical lines using backslash characters (\), as follows: when a physical line ends in a backslash that is not part of a string literal or comment, it is joined with the following line, forming a single logical line, and the backslash and the following end-of-line character are deleted.
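A small sketch of the effect: counting NEWLINE tokens with the tokenize module gives the number of logical lines in a source string. The helper name count_logical_lines and the sample sources are hypothetical.

```python
import io
import tokenize


def count_logical_lines(source: str) -> int:
    """Count NEWLINE tokens, one per logical line (hypothetical helper)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)


# Two physical lines joined by a trailing backslash.
joined = "total = 1 + \\\n        2\n"
# Two physical lines with no joining.
separate = "first = 1\nsecond = 2\n"

print(count_logical_lines(joined), count_logical_lines(separate))
```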
Implicit Line Joining
Expressions in parentheses, square brackets or curly braces can be split over more than one physical line without using backslashes:
Blank Lines
A logical line that contains only spaces, tabs, formfeeds and possibly a comment is ignored (i.e., no NEWLINE token is generated). During interactive input of statements, handling of a blank line may differ depending on the implementation of the read-eval-print loop.
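The tokenize module makes this visible: line breaks that do not end a logical line are reported as NL tokens rather than NEWLINE tokens. That covers blank and comment-only lines as well as line breaks inside brackets. The sample source below is an assumption made for the sketch.

```python
import collections
import io
import tokenize

# Illustrative source: a list split over several physical lines,
# a blank line, a comment-only line, and one more statement.
source = (
    "values = [\n"
    "    1,\n"
    "    2,\n"
    "]\n"
    "\n"
    "# only a comment\n"
    "print(values)\n"
)

counts = collections.Counter(
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
)

# NEWLINE counts logical lines; NL counts the other line breaks.
print(counts["NEWLINE"], counts["NL"])
```

Seven physical lines produce only two NEWLINE tokens here, one for the list assignment and one for the print call.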
Indentation
Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line, which in turn is used to determine the grouping of statements. Tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight. The total number of spaces preceding the first non-blank character then determines the line’s indentation. Indentation is rejected as inconsistent if a source file mixes tabs and spaces in a way that makes the meaning dependent on the worth of a tab in spaces; a TabError is raised in that case.
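The tab replacement rule can be sketched as a short function. This is a simplified model, not the tokenizer's actual code: the name indentation_width is hypothetical, and the sketch does not attempt the tab/space consistency check that leads to TabError.

```python
def indentation_width(line: str) -> int:
    """Width of leading whitespace, with tabs advancing to a multiple of 8.

    Hypothetical helper; it models only the tab replacement rule.
    """
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            # Round up to the next multiple of eight (adds 1 to 8 columns).
            width = (width // 8 + 1) * 8
        else:
            break
    return width


for sample in ["    x", "\tx", "   \tx", "\t x", "        \tx"]:
    print(repr(sample), indentation_width(sample))
```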
The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack. Before the first line of the file is read, a single zero is pushed on the stack. The numbers pushed on the stack will always be strictly increasing from bottom to top.
At the beginning of each logical line, the line’s indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated.
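The stack procedure described above can be modeled in a few lines. The sketch takes a list of already computed indentation levels, one per logical line, and returns the resulting token sequence. The function name and the LINE@n placeholder entries are assumptions made for illustration; the real tokenizer works on source text and, at the end of the file, also generates a DEDENT for each remaining stack entry larger than zero, which this sketch mirrors.

```python
def indent_dedent_tokens(levels):
    """Model INDENT/DEDENT generation for a list of indentation levels.

    Hypothetical helper: "LINE@n" stands in for the tokens of a logical
    line whose indentation level is n.
    """
    stack = [0]  # a single zero is pushed before the first line
    out = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            out.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                raise IndentationError(
                    "unindent does not match any outer indentation level"
                )
            while stack[-1] > level:
                stack.pop()
                out.append("DEDENT")
        out.append(f"LINE@{level}")
    # End of file: close every block that is still open.
    while stack[-1] > 0:
        stack.pop()
        out.append("DEDENT")
    return out


print(indent_dedent_tokens([0, 4, 8, 0]))
print(indent_dedent_tokens([0, 4]))
```

A level such as 4 following levels 0 and 8 is rejected, because 4 never appeared on the stack.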
Names (Identifiers and Keywords)
NAME tokens represent identifiers, keywords, and soft keywords.
Names are composed of the following characters:
- Uppercase and lowercase letters (A-Z and a-z)
- The underscore (_)
- Digits (0 through 9), which cannot appear as the first character
- Non-ASCII characters (see below for details)
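A quick way to try these rules is the built-in str.isidentifier() method. The candidate names below are invented for the example. Note that isidentifier() only checks the character rules, so it also returns True for keywords such as "for".

```python
# Illustrative candidate names; only some follow the character rules.
candidates = ["spam", "_private", "x1", "1x", "naïve", "has-dash", ""]

results = {name: name.isidentifier() for name in candidates}
for name, ok in results.items():
    print(repr(name), ok)
```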
Keywords
The following names are used as reserved words, or keywords of the language, and cannot be used as ordinary identifiers:
Soft Keywords
Some names are only reserved under specific contexts. These are known as soft keywords:
- match, case, and _, when used in the match statement
- type, when used in the type statement
- lazy, when used before an import statement
Reserved Classes of Identifiers
Certain classes of identifiers (besides keywords) have special meanings. These classes are identified by the patterns of leading and trailing underscore characters:
- _*: Not imported by from module import *.
- _: In a case pattern within a match statement, _ denotes a wildcard. In the interactive interpreter, it holds the result of the last evaluation.
- __*__: System-defined names, informally known as “dunder” names. These names are defined by the interpreter and its implementation.
- __*: Class-private names. Names in this category, when used within the context of a class definition, are re-written to use a mangled form to help avoid name clashes.
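The rewriting of class-private names can be observed directly. In this minimal sketch, the class Account and its attribute are hypothetical; the point is only the name that ends up stored on the instance.

```python
class Account:
    """Hypothetical class used to show name mangling."""

    def __init__(self, balance):
        # Written as __balance, stored under a mangled name.
        self.__balance = balance

    def balance(self):
        # Inside the class body the short form still works.
        return self.__balance


account = Account(10)
print(vars(account))
print(account.balance())
```

The instance dictionary holds the attribute under the class-qualified name _Account__balance, not under __balance.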
Literals
Literals are notations for constant values of some built-in types.
String and Bytes Literals
String literals are text enclosed in single quotes (') or double quotes ("):
Inside a string literal, the backslash (\) character introduces an escape sequence:
Triple-Quoted Strings
Strings can also be enclosed in matching groups of three single or double quotes:
String Prefixes
String literals can have an optional prefix that influences how the content of the literal is parsed:
- b: Bytes literal
- r: Raw string
- f: Formatted string literal (“f-string”)
- t: Template string literal (“t-string”)
- u: No effect (allowed for backwards compatibility)
The r prefix can be combined with f, t or b.
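A few literals illustrate how the prefixes change what is parsed. The variable names are made up for the example, and the t prefix is left out because it is only accepted by recent Python versions. A triple-quoted string is included to show a literal that spans physical lines.

```python
raw = r"\n"                   # raw string: backslash and 'n', two characters
data = b"abc"                 # bytes literal
formatted = f"{1 + 1} items"  # formatted string literal
raw_bytes = rb"\x00"          # r combined with b: four bytes, no escape
legacy = u"text"              # u has no effect
multi = """first
second"""                     # triple-quoted string spanning two lines

print(len(raw), type(data).__name__, formatted, len(raw_bytes), legacy)
print(repr(multi))
```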
Escape Sequences
Unless an r or R prefix is present, escape sequences in string and bytes literals are interpreted according to rules similar to those used by Standard C:
| Escape Sequence | Meaning |
|---|---|
| \<newline> | Ignored end of line |
| \\ | Backslash |
| \' | Single quote |
| \" | Double quote |
| \a | ASCII Bell (BEL) |
| \b | ASCII Backspace (BS) |
| \f | ASCII Formfeed (FF) |
| \n | ASCII Linefeed (LF) |
| \r | ASCII Carriage Return (CR) |
| \t | ASCII Horizontal Tab (TAB) |
| \v | ASCII Vertical Tab (VT) |
| \ooo | Octal character |
| \xhh | Hexadecimal character |
| \N{name} | Named Unicode character |
| \uxxxx | Hexadecimal Unicode character (16-bit) |
| \Uxxxxxxxx | Hexadecimal Unicode character (32-bit) |
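Several rows of the table can be checked against each other: the sketch below spells the letter A in five different ways and also shows a backslash-newline being dropped. The dictionary keys are just labels chosen for the example.

```python
# Five spellings of the same character, "A" (code point 0x41).
samples = {
    "octal": "\101",
    "hex": "\x41",
    "unicode16": "\u0041",
    "unicode32": "\U00000041",
    "named": "\N{LATIN CAPITAL LETTER A}",
}
print(samples)

# A backslash at the end of a physical line is ignored along with
# the line break, so the two parts are adjacent in the result.
continued = "ab\
cd"
print(continued)

# Single-character escapes produce exactly one character each.
print(len("\t"), ord("\n"), len("\\"))
```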
Numeric Literals
NUMBER tokens represent numeric literals, of which there are three types: integers, floating-point numbers, and imaginary numbers.
Numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator - and the literal 1.
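The standard library's ast module shows this structure. In the sketch below, the expression -1 parses to a unary operation wrapped around the constant 1, not to a single constant. The example assumes Python 3.8 or later, where number literals are represented as ast.Constant nodes.

```python
import ast

tree = ast.parse("-1", mode="eval")
node = tree.body

# The top-level node is the unary minus; the literal is its operand.
print(type(node).__name__, type(node.op).__name__, node.operand.value)
```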
Integer Literals
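Integer literals denote whole numbers. Integers can also be specified in binary (base 2), octal (base 8), or hexadecimal (base 16) using the prefixes 0b, 0o and 0x, respectively.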
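For illustration, the following literals all denote the same whole number in different bases. The variable names are invented for the example.

```python
decimal = 255
binary = 0b11111111
octal = 0o377
hexadecimal = 0xFF

# All four spellings produce the same int object value.
print(decimal, binary, octal, hexadecimal)
```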
Floating-Point Literals
Floating-point literals denote approximations of real numbers:
Imaginary Literals
Imaginary literals denote complex numbers with a zero real part. The part before the j has the same syntax as a floating-point literal. The j suffix is case-insensitive.
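A few sample literals, chosen for this sketch, show the floating-point forms and how the j suffix builds on them.

```python
# Floating-point literals: with a fraction, an exponent, or both.
half = .5
thousand = 1e3
small = 2.5e-3

# Imaginary literals: a floating-point style number followed by j or J.
two_i = 2j
upper = 3.5J
big_i = 1e3j

print(half, thousand, small)
print(two_i, upper, big_i)
print(two_i.real, two_i.imag)
```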
Operators and Delimiters
The following tokens are operators:
A sequence of three consecutive periods (...) has a special meaning as an Ellipsis literal.
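As a small sketch, the tokenize module reports operators and delimiters with the generic type OP, and the exact_type attribute identifies which one was found. The one-line source below is an assumption made for the example. It shows that three periods form a single token.

```python
import io
import tokenize

source = "x = ...\n"

names = [
    tokenize.tok_name[tok.exact_type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.OP
]
print(names)

# The literal evaluates to the built-in Ellipsis singleton.
is_ellipsis = ... is Ellipsis
print(is_ellipsis)
```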