Lexical Analysis

A Python program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer (also known as the tokenizer). This chapter describes how the lexical analyzer produces these tokens. The lexical analyzer determines the program text’s encoding (UTF-8 by default), and decodes the text into source characters. If the text cannot be decoded, a SyntaxError is raised.

Line Structure

A Python program is divided into a number of logical lines.

Logical Lines

The end of a logical line is represented by the token NEWLINE. Statements cannot cross logical line boundaries except where NEWLINE is allowed by the syntax (e.g., between statements in compound statements). A logical line is constructed from one or more physical lines by following the explicit or implicit line joining rules.

Physical Lines

A physical line is a sequence of characters terminated by one of the following end-of-line sequences:
  • The Unix form using ASCII LF (linefeed)
  • The Windows form using the ASCII sequence CR LF (return followed by linefeed)
  • The Classic Mac OS form using the ASCII CR (return) character
Regardless of platform, each of these sequences is replaced by a single ASCII LF (linefeed) character. The end of input also serves as an implicit terminator for the final physical line.
newline: <ASCII LF> | <ASCII CR> <ASCII LF> | <ASCII CR>
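As a small illustration of the normalization described above, the sketch below compiles the same three statements using each end-of-line form. It assumes CPython's built-in compile(), which accepts all three forms in string input; the dictionary keys and variable names are purely illustrative.

```python
# The same three assignments, terminated by LF, CR LF, and CR respectively.
sources = {
    "unix": "a = 1\nb = 2\nc = 3\n",
    "windows": "a = 1\r\nb = 2\r\nc = 3\r\n",
    "classic_mac": "a = 1\rb = 2\rc = 3\r",
}

results = {}
for label, text in sources.items():
    namespace = {}
    # Each end-of-line form is treated as a single line terminator.
    exec(compile(text, f"<{label}>", "exec"), namespace)
    results[label] = (namespace["a"], namespace["b"], namespace["c"])
```

If the three forms are treated alike, every entry in results holds the same tuple of values.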

Comments

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax.

Encoding Declarations

If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration. The first group of this expression names the encoding of the source code file. The recommended forms of an encoding expression are:
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and:
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM. If no encoding declaration is found, the default encoding is UTF-8. If the implicit or explicit encoding of a file is UTF-8, an initial UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error.
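The detection rules above can be explored with the standard library's tokenize.detect_encoding() function. This is a sketch rather than a description of what the interpreter itself runs; the helper name declared_encoding and the sample byte strings are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding name reported by tokenize.detect_encoding()."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# No declaration at all: the default applies.
no_declaration = declared_encoding(b"x = 1\n")
# Emacs-style declaration on the first line.
emacs_style = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
# Vim-style declaration on the second line, after a shebang line.
vim_style = declared_encoding(
    b"#!/usr/bin/env python\n# vim:fileencoding=cp1252\nx = 1\n"
)
# UTF-8 byte-order mark with no declaration.
with_bom = declared_encoding(b"\xef\xbb\xbfx = 1\n")
```

Note that the module may report a normalized spelling of the encoding name (for example, a canonical name for latin-1) rather than the exact text of the declaration.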

Explicit Line Joining

Two or more physical lines may be joined into a single logical line using backslash characters (\). When a physical line ends in a backslash that is not part of a string literal or comment, it is joined with the following line to form a single logical line; the backslash and the end-of-line character that follows it are deleted. For example:
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
A line ending in a backslash cannot carry a comment. A backslash does not continue a comment. A backslash does not continue a token except for string literals. A backslash is illegal elsewhere on a line outside a string literal.
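The first two rules can be checked with a short experiment. The sketch below compiles one source string where the backslash is the last character on the line, and one where a comment follows the backslash; the variable names are illustrative.

```python
# Backslash is the last character of the physical line: the lines are joined.
joined = "total = 1 + \\\n        2\n"
# A comment follows the backslash, so the backslash no longer ends the line.
commented = "total = 1 + \\  # note\n        2\n"

namespace = {}
exec(compile(joined, "<joined>", "exec"), namespace)
joined_total = namespace["total"]

try:
    compile(commented, "<commented>", "exec")
    commented_error = None
except SyntaxError as exc:
    commented_error = exc
```

The second source is rejected because the backslash is followed by characters other than the end of the line.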

Implicit Line Joining

Expressions in parentheses, square brackets or curly braces can be split over more than one physical line without using backslashes:
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
Implicitly continued lines can carry comments. The indentation of the continuation lines is not important. Blank continuation lines are allowed. There is no NEWLINE token between implicit continuation lines.
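One way to observe that no NEWLINE token appears inside brackets is the standard library's tokenize module. This is a sketch under the assumption that tokenize mirrors the interpreter's lexer closely enough for this purpose; the module reports line breaks that do not end a logical line with its own NL token type, and the helper count_line_tokens is hypothetical.

```python
import io
import tokenize


def count_line_tokens(source):
    """Count NEWLINE (end of logical line) and NL (other line break) tokens."""
    counts = {"NEWLINE": 0, "NL": 0}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        name = tokenize.tok_name[tok.type]
        if name in counts:
            counts[name] += 1
    return counts


# One logical line spread over three physical lines, with a comment and a
# blank continuation line inside the brackets.
bracketed = "names = ['a',  # first\n\n         'b']\n"
counts = count_line_tokens(bracketed)
```

The whole bracketed statement produces a single NEWLINE token, at the closing physical line.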

Blank Lines

A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated). During interactive input of statements, handling of a blank line may differ depending on the implementation of the read-eval-print loop.

Indentation

Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line, which in turn is used to determine the grouping of statements. Tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight. The total number of spaces preceding the first non-blank character then determines the line’s indentation. Indentation is rejected as inconsistent if a source file mixes tabs and spaces in a way that makes the meaning dependent on the worth of a tab in spaces; a TabError is raised in that case.
The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack. Before the first line of the file is read, a single zero is pushed on the stack. The numbers pushed on the stack will always be strictly increasing from bottom to top.
At the beginning of each logical line, the line’s indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.
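The stack procedure can be sketched in a few lines of Python. This is a simplified model, not CPython's tokenizer: the function name indentation_tokens is hypothetical, each input string is assumed to be one complete logical line, and the tab/space consistency check that raises TabError is left out.

```python
def indentation_tokens(lines):
    """Return line bodies interleaved with 'INDENT' and 'DEDENT' markers."""
    stack = [0]  # a single zero is pushed before the first line is read
    tokens = []
    for line in lines:
        body = line.lstrip(" \t")
        if not body.strip() or body.startswith("#"):
            continue  # blank and comment-only lines do not affect indentation
        prefix = line[: len(line) - len(body)]
        # expandtabs(8) pads each tab to the next multiple of eight columns.
        level = len(prefix.expandtabs(8))
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                raise IndentationError(
                    "unindent does not match any outer indentation level"
                )
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append(body.rstrip())
    # At end of input, one DEDENT per remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


example = indentation_tokens(
    ["if a:", "    b = 1", "    if c:", "\td = 2", "e = 3"]
)
```

In the example, the tab on the fourth line counts as eight columns, so it opens a second level, and the final line closes both levels at once.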

Names (Identifiers and Keywords)

NAME tokens represent identifiers, keywords, and soft keywords. Names are composed of the following characters:
  • Uppercase and lowercase letters (A-Z and a-z)
  • The underscore (_)
  • Digits (0 through 9), which cannot appear as the first character
  • Non-ASCII characters (see below for details)
Names must contain at least one character, but have no upper length limit. Case is significant.
NAME:          name_start name_continue*
name_start:    "a"..."z" | "A"..."Z" | "_" | <non-ASCII character>
name_continue: name_start | "0"..."9"
identifier:    <NAME, except keywords>
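A quick way to test whether a string has the shape of a name is the built-in str.isidentifier() method. The candidate strings below are illustrative; note that the method checks only the character rules, so it also returns True for keywords.

```python
candidates = ["spam", "_private", "spam_2", "2spam", "naïve", "my-name", ""]

# Keep the strings that satisfy the character rules for names.
valid = [name for name in candidates if name.isidentifier()]

# Case is significant: these are two different names.
distinct = "Spam" != "spam"
```

The rejected strings start with a digit, contain a hyphen, or are empty.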

Keywords

The following names are used as reserved words, or keywords of the language, and cannot be used as ordinary identifiers:
False      await      else       import     pass
None       break      except     in         raise
True       class      finally    is         return
and        continue   for        lambda     try
as         def        from       nonlocal   while
assert     del        global     not        with
async      elif       if         or         yield
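The standard library's keyword module exposes this list programmatically. The sketch below checks a few sample names and shows that assigning to a keyword is rejected at compile time; the sample names are chosen only for illustration.

```python
import keyword

# True for reserved words, False for ordinary names.
is_reserved = {
    name: keyword.iskeyword(name)
    for name in ["lambda", "nonlocal", "print", "match"]
}

# A keyword cannot be used as an ordinary identifier.
try:
    compile("lambda = 1", "<demo>", "exec")
    assignment_error = None
except SyntaxError as exc:
    assignment_error = exc
```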

Soft Keywords

Some names are only reserved under specific contexts. These are known as soft keywords:
  • match, case, and _, when used in the match statement
  • type, when used in the type statement
  • lazy, when used before an import statement

Reserved Classes of Identifiers

  • _*: Not imported by from module import *.
  • _: In a case pattern within a match statement, _ denotes a wildcard. In the interactive interpreter, it holds the result of the last evaluation.
  • __*__: System-defined names, informally known as “dunder” names. These names are defined by the interpreter and its implementation.
  • __*: Class-private names. Names in this category, when used within the context of a class definition, are re-written to use a mangled form to help avoid name clashes.
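The rewriting of class-private names can be observed by inspecting an instance's attributes. The class Account and its attribute names are made up for this sketch.

```python
class Account:
    def __init__(self):
        self.__balance = 100    # two leading underscores: the name is mangled
        self._hint = "internal"  # one leading underscore: stored unchanged


account = Account()
stored_names = sorted(vars(account))
has_plain_name = hasattr(account, "__balance")
```

Outside the class body the attribute is reachable only under its mangled form, which prefixes the class name.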

Literals

Literals are notations for constant values of some built-in types.

String and Bytes Literals

String literals are text enclosed in single quotes (') or double quotes ("):
"spam"
'eggs'
The quote used to start the literal also terminates it. Inside a string literal, the backslash (\) character introduces an escape sequence:
print("Say \"Hello\" to everyone!")
# Output: Say "Hello" to everyone!

Triple-Quoted Strings

Strings can also be enclosed in matching groups of three single or double quotes:
"""This is a triple-quoted string."""
In triple-quoted literals, unescaped quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the literal. Unescaped newlines are also allowed and retained.
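A short example of both properties, with illustrative text: the literal below contains unescaped double quotes and an unescaped newline, and both are retained in the resulting string.

```python
text = """He said "hi",
then left."""

# The same value written as a single-line literal with escapes.
equivalent = "He said \"hi\",\nthen left."
```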

String Prefixes

String literals can have an optional prefix that influences how the content of the literal is parsed:
  • b: Bytes literal
  • r: Raw string
  • f: Formatted string literal (“f-string”)
  • t: Template string literal (“t-string”)
  • u: No effect (allowed for backwards compatibility)
Prefixes are case-insensitive. The r prefix can be combined with f, t or b.
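The effect of several prefixes can be compared side by side. This sketch leaves out the t prefix, on the assumption that it may not be available in every interpreter version; the variable names and sample text are illustrative.

```python
raw = r"C:\new"        # raw: the backslash and the n are both kept
cooked = "C:\new"      # no prefix: \n is interpreted as a single linefeed
raw_bytes = Rb"\x00"   # prefixes are case-insensitive, and r combines with b
data = b"abc"          # bytes literal
name = "world"
greeting = f"hello {name}"  # formatted string literal
legacy = u"spam"       # u has no effect
```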

Escape Sequences

Unless an r or R prefix is present, escape sequences in string and bytes literals are interpreted according to rules similar to those used by Standard C:
Escape Sequence   Meaning
\<newline>        Ignored end of line
\\                Backslash
\'                Single quote
\"                Double quote
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\v                ASCII Vertical Tab (VT)
\ooo              Octal character
\xhh              Hexadecimal character
\N{name}          Named Unicode character
\uxxxx            Hexadecimal Unicode character (16-bit)
\Uxxxxxxxx        Hexadecimal Unicode character (32-bit)
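As an illustration, the numeric and named escapes offer several spellings of the same character. The sketch below writes the letter A five ways and contrasts an interpreted escape with its raw counterpart; the dictionary keys are illustrative labels.

```python
spellings = {
    "octal": "\101",
    "hex": "\x41",
    "named": "\N{LATIN CAPITAL LETTER A}",
    "unicode16": "\u0041",
    "unicode32": "\U00000041",
}

tab_length = len("\t")       # one character: a horizontal tab
raw_tab_length = len(r"\t")  # two characters: a backslash and a t
```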

Numeric Literals

NUMBER tokens represent numeric literals, of which there are three types: integers, floating-point numbers, and imaginary numbers.
NUMBER: integer | floatnumber | imagnumber
Numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator - and the literal 1.
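The split between sign and literal is visible in the token stream. The sketch below uses the standard library's tokenize module on the text -1; the variable name tokens is illustrative.

```python
import io
import tokenize

# Collect (token type name, token text) pairs for the source text "-1".
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO("-1\n").readline)
]
```

The minus sign arrives as a separate operator token ahead of the NUMBER token.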

Integer Literals

Integer literals denote whole numbers:
7
2147483647
100_000_000_000
Underscores can be used to group digits for enhanced readability, and are ignored for determining the numeric value of the literal. Integers can be specified in binary (base 2), octal (base 8), or hexadecimal (base 16) using the prefixes 0b, 0o and 0x, respectively:
0b100110111
0o177
0xdeadbeef
integer:      decinteger | bininteger | octinteger | hexinteger | zerointeger
decinteger:   nonzerodigit (["_"] digit)*
bininteger:   "0" ("b" | "B") (["_"] bindigit)+
octinteger:   "0" ("o" | "O") (["_"] octdigit)+
hexinteger:   "0" ("x" | "X") (["_"] hexdigit)+
zerointeger:  "0"+ (["_"] "0")*
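A few literals from the text, evaluated to show that the notation does not affect the value. The last part of the sketch checks one consequence of the grammar above: a nonzero decimal literal cannot begin with 0, while a run of zeros is still a valid literal.

```python
grouped = 100_000_000_000
binary = 0b100110111
octal = 0o177
hexadecimal = 0xdeadbeef
hex_grouped = 0xDEAD_BEEF  # prefix digits are case-insensitive here
zeros = 00                 # matches the zerointeger rule

# "012" matches none of the integer rules, so it is rejected.
try:
    compile("012", "<demo>", "eval")
    leading_zero_error = None
except SyntaxError as exc:
    leading_zero_error = exc
```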

Floating-Point Literals

Floating-point literals denote approximations of real numbers:
3.14
10.
.001
1e100
3.14e-10
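They consist of an integer part and a fraction part separated by a decimal point; either part may be omitted, but not both (as in 10. and .001). An exponent may optionally follow, and when an exponent is present the decimal point may be omitted entirely (as in 1e100):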
floatnumber:
   | digitpart "." [digitpart] [exponent]
   | "." digitpart [exponent]
   | digitpart exponent
digitpart: digit (["_"] digit)*
exponent:  ("e" | "E") ["+" | "-"] digitpart
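Each alternative of the floatnumber rule can be matched to a concrete literal. The variable names in this sketch are illustrative.

```python
trailing_point = 10.   # digitpart "." with no fraction
leading_point = .001   # "." digitpart with no integer part
exponent_only = 1e3    # digitpart exponent, no decimal point
grouped = 1_0.0_1      # underscores are ignored for the value

all_floats = all(
    isinstance(value, float)
    for value in (trailing_point, leading_point, exponent_only, grouped)
)
```

Even 1e3, which has no decimal point, is a float rather than an integer.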

Imaginary Literals

Imaginary literals denote complex numbers with a zero real part:
3.14j
10j
.001j
1e100j
The number before the j has the same syntax as a floating-point literal, except that the decimal point may be omitted when there is only an integer part (as in 10j). The j suffix is case-insensitive.
imagnumber: (floatnumber | digitpart) ("j" | "J")
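A brief check of these properties, using illustrative variable names: the literal yields a complex value with a zero real part, and a complex number with a nonzero real part is written as an expression that adds a real number to an imaginary literal.

```python
imaginary = 10j
upper_suffix = 3.14J      # the suffix may be uppercase
combined = 1 + 2j         # an expression, not a single literal

parts = (imaginary.real, imaginary.imag)
```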

Operators and Delimiters

The following tokens are operators:
+       -       *       **      /       //      %      @
<<      >>      &       |       ^       ~       :=
<       >       <=      >=      ==      !=
The following tokens serve as delimiters in the grammar:
(       )       [       ]       {       }
,       :       .       ;       @       =       ->
+=      -=      *=      /=      //=     %=      @=
&=      |=      ^=      >>=     <<=     **=
The token ... (three consecutive periods) has a special meaning as an Ellipsis literal.
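The sketch below uses the standard library's tokenize module to list the operator and delimiter tokens in a short statement, which shows multi-character tokens such as **= being read as a single token. The helper name operator_strings is hypothetical.

```python
import io
import tokenize


def operator_strings(source):
    """Return the text of every operator or delimiter token in source."""
    return [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.OP
    ]


ops = operator_strings("x **= y // 2 >> 1\n")

# The three-period token evaluates to the built-in Ellipsis object.
ellipsis_value = ...
```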
