Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

Training a language model from scratch using PyTorch and basic libraries is achievable with fundamental transformer architecture components (tokenizer, model architecture, training loop, and inference).

By Sean Weldon

Training Language Models from First Principles: An Educational Framework for Transformer Implementation

Abstract

This synthesis examines the fundamental architecture and training methodology for constructing language models from scratch using PyTorch and minimal dependencies. The work demonstrates that transformer-based models can be built from four essential components: tokenization systems, model architecture, training loops, and inference mechanisms. Through implementation of a GPT2-based model with 1.8 million parameters trained on character-level tokens from a one-million-character corpus, the analysis reveals that approximately 80% of production-level model development relies on these foundational building blocks. Key findings include the critical relationship between vocabulary size and minimum training data requirements (vocab_size² tokens), the modular composition of transformer blocks, and quantifiable loss progression patterns that indicate learning stages. This educational framework provides researchers with practical understanding of parameter allocation, architectural trade-offs, and the distinction between base model training and post-training enhancements for reasoning and multimodal capabilities.

1. Introduction

The proliferation of Large Language Models has transformed natural language processing capabilities, yet the fundamental mechanisms underlying these systems remain poorly understood by many practitioners. While production models contain hundreds of billions of parameters and require extensive computational infrastructure, the core architectural principles and training procedures remain consistent across scales. This synthesis presents a comprehensive examination of training language models from first principles, focusing on the essential components that enable next-token prediction.

The central thesis posits that transformer-based language models can be constructed using four fundamental building blocks: tokenization systems that convert text to numerical representations, model architecture implementing multi-head self-attention mechanisms, training loops with appropriate optimization strategies, and inference procedures that generate coherent text. Understanding these components provides researchers with foundational knowledge applicable to production-scale development, as the core principles demonstrated in a 1.8-million-parameter model extend directly to systems with billions of parameters.

This analysis examines a practical implementation trained on approximately one million characters using character-level tokenization to produce a 65-token vocabulary. The implementation requires modest computational resources—16GB RAM or Google Colab with free GPU access—and employs PyTorch with basic libraries rather than high-level abstraction frameworks like Hugging Face Transformers. This pedagogical approach, inspired by Andrej Karpathy's nanoGPT project, emphasizes transparency in architectural decisions and explicit parameter allocation. The subsequent sections establish theoretical foundations of transformer architecture, analyze tokenization strategies and their scaling implications, examine training dynamics through loss progression, and discuss extensions to multimodal and reasoning-enhanced models.

2. Background and Related Work

The decoder-only causal transformer architecture, exemplified by the GPT2 configuration, represents the dominant paradigm for autoregressive language modeling. Unlike recurrent architectures that process sequences sequentially, transformers employ parallel attention mechanisms that enable efficient training on modern hardware accelerators. This architectural choice fundamentally shapes both training efficiency and model capability.

The transformer comprises four synergistic components: multi-head self-attention mechanisms that allow models to attend to different linguistic features simultaneously, feed-forward networks (MLP blocks) that transform learned representations, residual connections that enable gradient flow through deep networks by adding incremental changes rather than complete transformations, and layer normalization that stabilizes activation magnitudes during training. These components combine in modular transformer blocks, each containing independent parameters that process information through successive layers.

The implementation examined here draws directly from the GPT2 architecture while operating at reduced scale for educational purposes. Production models employ additional optimizations—representing approximately 20% of development effort—focused on training efficiency and computational scaling. However, the fundamental architecture and training procedures remain consistent, validating the pedagogical approach of understanding core mechanisms through simplified implementations.

3. Core Analysis

3.1 Tokenization Strategy and Data Requirements

The selection of tokenization strategy fundamentally determines model efficiency and training data requirements. This implementation employs character-level tokenization, which yields only 65 unique tokens from the Shakespeare corpus—comprising lowercase letters, uppercase letters, punctuation, and special characters. While this approach simplifies implementation, it reveals critical scaling relationships.
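A minimal sketch of this character-level tokenizer, assuming the corpus sits in a local input.txt file; the encode/decode helper names are illustrative rather than the talk's exact code:

```python
# Build a character-level vocabulary and simple encode/decode helpers.
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                     # e.g. 65 unique characters for Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars), decode(encode("To be, or not to be")))
```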

The bigram concept proves essential for understanding data requirements: transformers must observe sufficient combinations of consecutive tokens to learn meaningful patterns. With 65 tokens, the model faces 4,225 possible bigrams (65² combinations). A practical rule of thumb suggests a minimum of vocab_size² training tokens for adequate coverage. The one-million-character corpus provides approximately 237× this minimum threshold, enabling reasonable convergence.
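As a quick sanity check, the heuristic and the corpus size line up as follows (illustrative arithmetic only):

```python
vocab_size = 65
corpus_tokens = 1_000_000          # character-level: one token per character
min_tokens = vocab_size ** 2       # 4,225 possible bigrams to cover
print(min_tokens, corpus_tokens / min_tokens)   # -> 4225, roughly 237x coverage
```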

In contrast, production models employ Byte-Pair Encoding (BPE), which typically generates vocabularies of 50,000 tokens. This approach offers substantial advantages: tokens carry semantic meaning (encoding common words or subword units), reducing the number of inference steps required for generation. However, BPE necessitates significantly larger training datasets—approximately 2.5 billion tokens minimum by the vocab_size² heuristic—and more sophisticated data processing infrastructure. Character-level tokenization, while pedagogically valuable, does not scale effectively to production systems due to the lack of semantic content in individual tokens.
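For contrast, a hypothetical BPE example using the tiktoken package (an assumption; it is not part of the implementation discussed here) shows how whole subwords map to single token ids:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")                 # GPT-2's BPE tokenizer
print(enc.n_vocab)                                  # roughly 50,000 tokens
ids = enc.encode("Training a language model from scratch")
print(len(ids), enc.decode(ids))                    # far fewer ids than characters
```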

3.2 Architecture Design and Parameter Allocation

The model architecture employs six transformer blocks, each containing multi-head attention with six heads and an embedding dimension of 384. This configuration yields approximately 1.8 million total parameters, distributed across three primary components: token embeddings (25,000 parameters), positional embeddings (98,304 parameters), and transformer block parameters (1.2 million parameters).

The embedding table size follows directly from vocabulary and embedding dimensions: 65 tokens × 384 dimensions = 24,960 parameters. This calculation reveals scaling implications—GPT2's 50,000-token vocabulary with the same embedding dimension would require 19.2 million parameters for embeddings alone, exceeding the entire parameter count of this educational model by an order of magnitude. The positional embedding table similarly scales with context window size: 256 positions × 384 dimensions = 98,304 parameters.
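The same arithmetic in PyTorch confirms the table sizes quoted above (variable names are illustrative):

```python
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 256, 384
tok_emb = nn.Embedding(vocab_size, n_embd)    # 65 * 384  = 24,960 parameters
pos_emb = nn.Embedding(block_size, n_embd)    # 256 * 384 = 98,304 parameters
print(sum(p.numel() for p in tok_emb.parameters()),
      sum(p.numel() for p in pos_emb.parameters()))
```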

Each transformer block implements the standard architecture: layer normalization followed by multi-head attention, another layer normalization, and an MLP block. The MLP block typically expands to 4× the embedding dimension in its hidden layer (384 → 1,536 → 384), enabling complex feature transformations. The attention mechanism requires four projection matrices per layer (query, key, value, and output), each of size embedding_dimension², with the individual heads splitting these projections among themselves. This quadratic scaling with embedding dimension represents a significant computational consideration for larger models.
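A sketch of one pre-norm transformer block under these assumptions, using PyTorch's built-in nn.MultiheadAttention for brevity; the talk's implementation likely writes the attention math by hand, so the class and argument names here are illustrative:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """LayerNorm -> causal multi-head attention -> LayerNorm -> MLP, with residual
    connections adding each sub-layer's output back onto its input."""
    def __init__(self, n_embd: int = 384, n_head: int = 6):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                 # 4x expansion: 384 -> 1,536 -> 384
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                          # residual: add an increment, don't replace
        x = x + self.mlp(self.ln2(x))             # residual around the MLP
        return x

out = Block()(torch.randn(1, 16, 384))            # (batch, time, embedding) in and out
```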

Residual connections prevent activation explosion by adding small incremental changes to the input rather than replacing it entirely. Without these connections, successive layer transformations could cause activations to grow exponentially (e.g., 0.5 → 2.5 → 12.5 → 62.5), destabilizing training. Layer normalization complements residual connections by scaling activations to maintain consistent magnitudes across layers, further stabilizing the training process.

3.3 Training Dynamics and Loss Progression

The training procedure employs cross-entropy loss for next-token prediction, where the model predicts tokens t₁ through tₙ₊₁ given tokens t₀ through tₙ. The AdamW optimizer with a cosine-decay learning-rate schedule represents the production standard. The schedule implements warm-up for 100 steps—starting from a low rate and gradually increasing—to prevent instability early in training, when parameters are still close to their random initialization. Following warm-up, cosine decay reduces the learning rate over the 5,000 total training steps.
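One way to implement this warm-up-plus-cosine schedule; the peak learning rate of 3e-4 is an assumption, not a value quoted in the talk:

```python
import math

def lr_at(step: int, max_lr: float = 3e-4, warmup: int = 100, total: int = 5000) -> float:
    """Linear warm-up for the first `warmup` steps, then cosine decay toward zero."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)   # 0 -> 1 over remaining steps
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Typical use with AdamW, updating the rate before each optimizer step:
# optimizer = torch.optim.AdamW(model.parameters(), lr=lr_at(0))
# for group in optimizer.param_groups: group["lr"] = lr_at(step)
```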

The batch configuration processes 64 sequences of 256 tokens each, yielding 16,384 tokens per training step. This batch size balances computational efficiency with gradient stability. Critically, both training and validation losses must be monitored to detect overfitting, where training loss continues decreasing while validation loss increases.
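A typical batch-sampling sketch for this configuration: targets are the inputs shifted one position to the right, so each of the 64 × 256 = 16,384 tokens per step contributes a next-token prediction. Function and variable names are illustrative:

```python
import torch

def get_batch(data: torch.Tensor, batch_size: int = 64, block_size: int = 256):
    """Sample random (input, target) sequence pairs from a 1-D tensor of token ids."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])          # tokens t_0 .. t_n
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])  # tokens t_1 .. t_n+1
    return x, y
```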

Loss progression provides quantifiable indicators of learning stages. Initial random prediction yields a loss of approximately 4.17 (ln 65, the cross-entropy in nats of a uniform distribution over 65 tokens). As the model learns character frequencies, loss decreases to approximately 3.3. Learning common character patterns reduces loss to 2.5, while word-formation capability corresponds to a loss of 1.5-2.0. A loss between 1.0 and 1.2 indicates decent performance with coherent text generation, while loss below 1.0 typically signals overfitting to the training data. Training on a Google Colab T4 GPU requires approximately 15 minutes to achieve reasonable results, demonstrating the accessibility of this educational approach.
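The initial-loss figure follows directly from the uniform baseline:

```python
import math
print(math.log(65))   # ~4.174 nats: cross-entropy of a uniform guess over 65 tokens
```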

3.4 Inference Mechanisms and Text Generation

Inference procedures critically determine output quality and diversity. Greedy decoding—selecting the highest probability token at each step—produces repetitive, uninteresting outputs and is explicitly not recommended for language model generation. Instead, temperature sampling enables selection of lower-probability tokens by adjusting the softmax temperature parameter. A temperature of 0.7 represents a standard middle ground, providing creativity while avoiding incoherence. Top-k sampling complements temperature by preventing selection of extremely unlikely tokens even at high temperatures, further improving generation quality.

The inference implementation converts model logits to probability distributions via softmax, then samples according to the temperature-adjusted distribution. A seed parameter enables reproducible generation by controlling random number generation, facilitating debugging and evaluation. This sampling-based approach, while computationally simple, produces substantially more engaging and diverse outputs than deterministic greedy decoding.
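A generation-loop sketch combining temperature scaling, top-k filtering, and a seed for reproducibility; the function signature and default values are assumptions rather than the exact generate.py code, and the model is assumed to return logits of shape (batch, time, vocab):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx: torch.Tensor, max_new_tokens: int,
             temperature: float = 0.7, top_k: int = 50, seed: int | None = None):
    if seed is not None:
        torch.manual_seed(seed)                       # reproducible sampling
    for _ in range(max_new_tokens):
        ctx = idx[:, -256:]                           # crop to the 256-token context window
        logits = model(ctx)[:, -1, :] / temperature   # last position, temperature-scaled
        if top_k is not None:
            k = min(top_k, logits.size(-1))
            v, _ = torch.topk(logits, k)
            logits[logits < v[:, [-1]]] = float("-inf")  # drop very unlikely tokens
        probs = F.softmax(logits, dim=-1)             # logits -> probability distribution
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)        # append and continue autoregressively
    return idx
```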

4. Technical Insights

The implementation reveals several actionable technical considerations for researchers developing language models. The relationship between vocabulary size and training data requirements follows the vocab_size² heuristic: a 65-token vocabulary requires approximately 4,225 training tokens for adequate bigram coverage, while a 50,000-token vocabulary necessitates roughly 2.5 billion tokens. This quadratic scaling fundamentally constrains tokenization strategy selection based on available training data.

Parameter allocation demonstrates clear scaling patterns. Embedding tables consume vocab_size × embedding_dimension parameters, while positional embeddings require block_size × embedding_dimension parameters. Transformer blocks dominate parameter count in larger models, with each attention layer requiring four projection matrices of size embedding_dimension² (split across its heads) and MLP blocks expanding to 4× the embedding dimension in their hidden layers. For the 384-dimensional embedding used here, increasing vocabulary from 65 to 50,000 tokens would increase embedding table size from 25,000 to 19.2 million parameters—a roughly 770× increase.

The modular nature of transformer blocks enables straightforward scaling: adding layers increases depth while maintaining architectural consistency. Each block operates independently with its own parameters, processing the output of the previous block through identical operations (layer norm → attention → layer norm → MLP). This modularity explains why the core implementation requires only a few hundred lines of code organized into three files: model.py (architecture), train.py (data loading and training), and generate.py (inference).

A critical limitation of this character-level approach involves semantic content: individual tokens lack meaning, requiring many inference steps to generate words. Production BPE-based models encode semantic units directly, substantially reducing generation time and improving efficiency. However, BPE implementation requires more sophisticated data processing infrastructure and larger training datasets, representing a practical trade-off between simplicity and performance.

5. Discussion

The findings demonstrate that fundamental transformer architecture and training procedures remain consistent across model scales, validating the educational approach of understanding core mechanisms through simplified implementations. The 80/20 principle—where 80% of development involves these foundational components and 20% involves optimization—suggests that researchers mastering these basics possess the essential knowledge for production-scale development.

Post-training procedures represent a critical but often underappreciated component of modern language models. Reasoning models, which demonstrate enhanced step-by-step problem-solving capabilities, typically share identical base architectures with standard models. Performance improvements derive from post-training data quality and fine-tuning approaches rather than architectural modifications. For instance, reasoning models train on chain-of-thought data labeled by domain experts (such as PhD students annotating reasoning steps), with special reasoning tokens added to context that allow the model to attend to intermediate reasoning during generation. Similarly, performance improvements between model versions (e.g., Gemini 3 to 3.1) primarily result from superior post-training data rather than architectural changes.

Multimodal extensions employ separate encoders (video encoders, audio encoders) that convert non-text inputs to embedding vectors matching the text embedding dimension. These encoder outputs are fed into the transformer in place of token embeddings, enabling it to process video or audio as vectors in the same space as text tokens. Audio processing requires additional complexity, converting waveforms to mel-spectrograms before tokenization and employing different loss functions (L2 loss for mel-spectrograms, KL divergence for model distillation) rather than cross-entropy. This architectural flexibility demonstrates how the core transformer mechanism generalizes beyond text.

Future investigation should examine the scaling laws governing the relationship between model size, training data quantity, and performance. Additionally, the trade-offs between different tokenization strategies at various scales warrant systematic analysis. The role of post-training data quality—particularly for reasoning and instruction-following capabilities—represents a critical area where relatively little public research exists, despite its substantial impact on model capability.

6. Conclusion

This synthesis establishes that transformer-based language models can be constructed from four essential building blocks: tokenization systems, model architecture implementing multi-head self-attention, training loops with appropriate optimization, and inference mechanisms. The practical implementation of a 1.8-million-parameter model demonstrates these principles while revealing critical scaling relationships, particularly the vocab_size² minimum data requirement and quadratic parameter scaling with embedding dimension.

The key technical contributions include quantifiable loss progression patterns indicating learning stages (4.17 for random prediction → 1.0-1.2 for coherent generation), explicit parameter allocation formulas enabling capacity planning, and the demonstration that core architectural principles remain consistent across model scales. The modular nature of transformer blocks, combined with residual connections and layer normalization, enables straightforward scaling from educational implementations to production systems with billions of parameters.

For practitioners, this framework provides actionable understanding of transformer mechanics applicable to production development. The code organization into model architecture, training loop, and inference components reflects standard industry practice. Researchers can extend this foundation to multimodal systems through encoder integration or enhance reasoning capabilities through post-training with chain-of-thought data. The accessibility of this implementation—requiring only modest computational resources and a few hundred lines of code—democratizes understanding of the fundamental mechanisms underlying modern language models.


About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
