Week 1: LLM Foundations

Resources from Week 1 of the AI Engineering Course

I’m taking an AI Engineering course and capturing some notes and resources as I go. This week I was trying to understand how tokenizers handle multi-byte characters, like emojis and Chinese characters.

Tokenizers

Modern BPE-based LLM tokenizers typically handle multi-byte characters in one of two main ways: operating directly on raw bytes, or operating on Unicode characters after normalization.

Byte-Level BPE (Most Common)

The most popular approach, used by GPT-2/3/4 and many others, operates directly on bytes rather than characters:

  • UTF-8 encoding first: Multi-byte characters like 🤖 or 漢 are first encoded into UTF-8 bytes
  • Byte-level vocabulary: The tokenizer treats each byte (0-255) as a base unit
  • Merges learned from data: BPE then learns common byte sequences from training data
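
To make the merge step concrete, here is a minimal sketch of one byte-level BPE merge in Python. It is illustrative only, not how any production tokenizer is implemented: the sample text, the helper names, and the single merge are made up for the example, but the pair-counting idea is the core of BPE training.

```python
# Toy sketch of one byte-level BPE merge step (illustrative, not a real tokenizer).
# Base units are the 256 possible byte values; training repeatedly merges the
# most frequent adjacent pair into a new token id.
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if the sequence is too short."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with a single new id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "🤖🤖🤖 hello 🤖"
ids = list(text.encode("utf-8"))    # UTF-8 bytes are the starting token ids
pair = most_frequent_pair(ids)      # e.g. two bytes from the emoji's UTF-8 sequence
ids = merge(ids, pair, 256)         # new merge ids start after the 256 byte values
print(pair, ids)
```

A real trainer repeats this loop many thousands of times, which is how a frequent emoji’s four bytes can end up collapsed into a single token.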

Example: The emoji 🤖 encodes to 4 UTF-8 bytes: F0 9F A4 96

  • These might initially be 4 separate tokens
  • If this emoji is common in training data, these bytes might merge into a single token
  • If it’s rare, it stays as multiple byte tokens
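
You can check the byte breakdown directly in Python. The tiktoken library and its cl100k_base vocabulary (the GPT-4-era encoding) are my choice for the demo, since the notes above don’t name a specific tokenizer, so treat the printed token count as one example rather than a universal answer.

```python
# Inspect the UTF-8 bytes of the emoji, then see how one real vocabulary splits it.
# tiktoken and the cl100k_base encoding are assumptions for this demo.
import tiktoken  # pip install tiktoken

text = "🤖"
print(text.encode("utf-8").hex(" "))        # f0 9f a4 96

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)
print(len(tokens), tokens)                  # count depends on the learned merges
```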

Advantages of Byte-Level Approach

  • Universal handling: Can represent any byte sequence, even text that isn’t valid UTF-8
  • Fixed base vocabulary: Only the 256 byte values plus learned merges
  • No unknown tokens: Everything can be encoded somehow

The Challenge

Multi-byte characters often get fragmented:

  • Common Latin characters: usually 1 token
  • Kanji/Chinese characters: often 2-3 tokens each
  • Emojis: typically 1-4 tokens

This creates inefficiency for non-English text: the same content consumes more tokens, so it costs more and uses up the context window faster.
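
A quick way to see this is to compare tokens per character across scripts. The sketch below again assumes tiktoken with the cl100k_base vocabulary and uses arbitrary sample strings; the exact ratios vary with the tokenizer and the text.

```python
# Rough tokens-per-character comparison across scripts (sample strings are arbitrary).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "The robot waved hello to the crowd.",
    "Chinese": "机器人向人群挥手问好。",
    "Emoji":   "🤖👋🎉🙂",
}
for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name}: {n_tokens} tokens / {len(text)} chars "
          f"= {n_tokens / len(text):.2f} tokens per char")
```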

Modern Improvements

Newer tokenizers address this:

  • GPT-4’s tokenizer: Better merges for common CJK characters and emojis
  • SentencePiece (used by LLaMA, T5): Can work at character level with normalization
  • Training on diverse multilingual data helps the model learn useful merges for common multi-byte sequences

The key insight is that byte-level BPE is language-agnostic but can be inefficient for scripts that use many bytes per character.
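
One way to see the improvement is to run the same non-English strings through an older and a newer vocabulary. The sketch below assumes tiktoken, which ships both the GPT-2 ("gpt2") and GPT-4-era ("cl100k_base") encodings; the newer one generally fragments CJK text and emojis less, though exact counts depend on the input.

```python
# Compare how an older and a newer byte-level BPE vocabulary split the same strings.
import tiktoken

old_enc = tiktoken.get_encoding("gpt2")          # GPT-2 era vocabulary
new_enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 era vocabulary

for text in ["漢字とひらがな", "机器人", "🤖🎉"]:
    print(f"{text}: gpt2 -> {len(old_enc.encode(text))} tokens, "
          f"cl100k_base -> {len(new_enc.encode(text))} tokens")
```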

Resources

Linked resources from Week 1 of the AI Engineering Course by ByteByteAI.

Training Data Sources

Controlling Data Inputs

Tokenization

Transformers

The heart of modern LLMs.

Fine Tuning