Tokenization
Llama 2 uses SentencePiece for tokenization: a language-independent subword tokenizer that operates on raw text in any language and produces a consistent vocabulary.

SentencePiece Tokenizer
The Tokenizer class wraps the SentencePiece model and exposes:
n_words: Vocabulary size (typically 32,000 tokens)
bos_id: Beginning-of-sequence token ID
eos_id: End-of-sequence token ID
pad_id: Padding token ID
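The wrapper's shape can be sketched as below. ToySPModel is a hypothetical in-memory stand-in for sentencepiece.SentencePieceProcessor, used here only so the interface is visible without a trained model file; a real Tokenizer would load the .model file instead.

```python
class ToySPModel:
    """Hypothetical stand-in for sentencepiece.SentencePieceProcessor,
    mimicking the accessors the wrapper reads at construction time."""
    def __init__(self):
        self.pieces = ["<unk>", "<s>", "</s>", "hello", "world"]

    def vocab_size(self):
        return len(self.pieces)

    def bos_id(self):
        return 1

    def eos_id(self):
        return 2

    def pad_id(self):
        return -1  # Llama 2's model defines no padding token


class Tokenizer:
    """Sketch of a wrapper exposing the fields described above."""
    def __init__(self, sp_model):
        self.sp_model = sp_model
        self.n_words = sp_model.vocab_size()  # 32,000 for the real model
        self.bos_id = sp_model.bos_id()
        self.eos_id = sp_model.eos_id()
        self.pad_id = sp_model.pad_id()


tok = Tokenizer(ToySPModel())
print(tok.n_words, tok.bos_id, tok.eos_id, tok.pad_id)  # 5 1 2 -1
```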
Encoding and Decoding
Encoding Text
The encode method converts text to token IDs, optionally adding BOS and/or EOS tokens:
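The core of that logic can be sketched as follows. The word-to-ID lookup (TOY_VOCAB) is a hypothetical stand-in for a call to the SentencePiece model; what matters is the BOS/EOS handling around it.

```python
from typing import List

# Hypothetical toy vocabulary standing in for a trained SentencePiece model.
TOY_VOCAB = {"hello": 3, "world": 4}
BOS_ID, EOS_ID = 1, 2


def encode(s: str, bos: bool, eos: bool) -> List[int]:
    """Sketch of encode: tokenize, then optionally prepend BOS / append EOS.
    A real Tokenizer would call self.sp_model.encode(s) here."""
    t = [TOY_VOCAB[w] for w in s.split()]  # stand-in for sp_model.encode(s)
    if bos:
        t = [BOS_ID] + t
    if eos:
        t = t + [EOS_ID]
    return t


print(encode("hello world", bos=True, eos=False))  # [1, 3, 4]
print(encode("hello world", bos=True, eos=True))   # [1, 3, 4, 2]
```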
Decoding Tokens
The decode method converts token IDs back to text:
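A minimal sketch of the reverse mapping, again with a hypothetical toy piece table in place of the real model. A real Tokenizer would call self.sp_model.decode(t), which also handles SentencePiece's whitespace markers.

```python
# Hypothetical inverse of the toy vocabulary above.
TOY_PIECES = {1: "<s>", 2: "</s>", 3: "hello", 4: "world"}
SPECIAL_IDS = {1, 2}  # BOS/EOS are control tokens, dropped on decode


def decode(t):
    """Sketch of decode: map IDs back to pieces, skipping control tokens,
    and join into text."""
    return " ".join(TOY_PIECES[i] for i in t if i not in SPECIAL_IDS)


print(decode([1, 3, 4, 2]))  # hello world
```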
Special Tokens for Chat
Llama 2 Chat models use special tags to format conversational prompts.

Chat Format Structure
Basic instruction format:

Chat Tokenization
The chat_completion method formats dialogs with the special tags:
The system message is wrapped in <<SYS>> tags and prepended to the first user message.
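That preprocessing step can be sketched as below, using the tag strings from the format. The merge_system helper is a hypothetical name for illustration; the print shows the resulting basic instruction format.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def merge_system(dialog):
    """Sketch: if the dialog starts with a system message, wrap it in
    <<SYS>> tags and fold it into the first user message."""
    if dialog and dialog[0]["role"] == "system":
        merged = {
            "role": dialog[1]["role"],
            "content": B_SYS + dialog[0]["content"] + E_SYS + dialog[1]["content"],
        }
        dialog = [merged] + dialog[2:]
    return dialog


d = merge_system([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
prompt = f"{B_INST} {d[0]['content']} {E_INST}"
print(prompt)
# [INST] <<SYS>>
# Be brief.
# <</SYS>>
#
# Hi [/INST]
```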
Full dialog encoding:
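The overall turn-by-turn encoding can be sketched as follows: each (user, assistant) pair becomes one BOS ... EOS segment, and a trailing user message gets BOS only, since the model generates the EOS itself. The byte-level encode lambda is a hypothetical stand-in for the SentencePiece model, and encode_dialog is an illustrative name.

```python
def encode_dialog(dialog, encode, bos_id=1, eos_id=2):
    """Sketch of full dialog encoding. `dialog` alternates user/assistant
    messages; `encode` is a text -> List[int] callable standing in for
    the SentencePiece model."""
    tokens = []
    # Completed (user, assistant) pairs: one BOS ... EOS segment each.
    for user, answer in zip(dialog[::2], dialog[1::2]):
        seg = f"[INST] {user['content'].strip()} [/INST] {answer['content'].strip()} "
        tokens += [bos_id] + encode(seg) + [eos_id]
    # Trailing user message awaiting a reply: BOS only, no EOS.
    if len(dialog) % 2 == 1:
        tokens += [bos_id] + encode(f"[INST] {dialog[-1]['content'].strip()} [/INST]")
    return tokens


# Hypothetical byte-level stand-in encoder, just for demonstration.
toy_encode = lambda s: list(s.encode())

out = encode_dialog(
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
        {"role": "user", "content": "Bye"},
    ],
    toy_encode,
)
print(out.count(1), out.count(2))  # two BOS segments, one EOS
```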