The Mechanics of Pre-training
A Stochastic Optimization Perspective
Modern Large Language Models (LLMs) are not databases of knowledge; they are function approximators. Instead of hand-coding rules, Deep Learning feeds raw data into a network and adjusts its parameters until the network approximates the probability distribution of language.
This interactive essay explores the four-step cycle that powers this learning: Forward Pass, Loss Calculation, Backward Pass, and Optimization.
1. The Forward Pass
The Forward Pass is the computation of an output probability distribution given an input sequence. Data flows through layers of neurons. Each connection has a weight ($W$).
"The capital of France is [MASK]"
In the visualization above, the lines represent weights. Thicker lines indicate stronger connections. Initially, weights are random. Consequently, the output probabilities are noise: the model is as likely to predict "Rhubarb" as "Paris".
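The forward pass described above can be sketched in a few lines. This is a minimal illustration, not the essay's actual network: the four-word vocabulary, the three-number encoding of the prompt, and the single linear layer are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W):
    """One linear layer followed by softmax: logits[j] = sum_i x[i] * W[i][j]."""
    n_out = len(W[0])
    logits = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]
    return softmax(logits)

random.seed(0)
vocab = ["Paris", "London", "Rhubarb", "Berlin"]  # hypothetical toy vocabulary
x = [1.0, 0.5, -0.5]                               # hypothetical encoding of the prompt
# Small random initial weights: 3 input features x 4 vocabulary entries
W = [[random.gauss(0.0, 0.1) for _ in vocab] for _ in x]

probs = forward(x, W)
for word, p in zip(vocab, probs):
    print(f"{word:8s} {p:.3f}")
```

Because the initial weights are small and random, every word receives a probability not far from 1/4, so "Rhubarb" is roughly as plausible to the untrained model as "Paris".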
2. The Loss Function (Cross-Entropy)
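To improve, the network must quantify how "wrong" it is. We use Negative Log Likelihood, which is what cross-entropy reduces to when the target is a single correct word. If the true next word is "Paris", we only care about the probability assigned to "Paris" ($p_{\text{target}}$), and the loss is $L = -\log(p_{\text{target}})$.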
Drag the slider. Notice that as the probability of the correct word approaches 1.0, the Loss approaches 0. Conversely, if the model is confident but wrong (probability of the correct word near 0), the Loss grows without bound.
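The slider's behavior can be reproduced numerically. The sketch below assumes the natural logarithm (the usual convention) and a hypothetical helper named `nll`.

```python
import math

def nll(p_target):
    """Negative log likelihood of the correct word, given its predicted probability."""
    return -math.log(p_target)

# Sweep the probability assigned to the correct word from confident-right to confident-wrong
for p in [0.99, 0.5, 0.1, 0.001]:
    print(f"p_target = {p:<6} loss = {nll(p):.3f}")
```

The losses come out at roughly 0.010, 0.693, 2.303, and 6.908: each tenfold drop in the probability of the correct word adds about 2.3 to the loss.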
3. The Backward Pass (Backpropagation)
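Knowing the error is not enough; we must attribute blame. We calculate the Gradient ($\nabla$): the vector of partial derivatives of the Loss with respect to every weight in the network. Backpropagation computes all of these derivatives in a single backward sweep by applying the chain rule layer by layer.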
This amounts to a Sensitivity Analysis. If a weight has a gradient of large magnitude, a tiny "wiggle" to that weight causes a significant change in Loss; the sign of the gradient tells us which direction of wiggle increases the Loss and which decreases it. In the simulation above, hover over a connection to select it, then use the slider to manually adjust it and watch the Loss change.
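The "wiggle" can be made literal. The sketch below assumes a single linear layer followed by softmax (a stand-in, not the simulation's network), with made-up weights and inputs. It compares the analytic gradient, which for this setup is $x_i (p_j - y_j)$ with $y$ the one-hot target, against a finite-difference estimate obtained by nudging one weight at a time.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, W):
    """Forward pass: linear layer, then softmax."""
    logits = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    return softmax(logits)

def loss_fn(x, W, target):
    """Negative log likelihood of the target class."""
    return -math.log(predict(x, W)[target])

def analytic_grad(x, W, target):
    """dL/dW[i][j] = x[i] * (p[j] - 1) for the target column, x[i] * p[j] otherwise."""
    p = predict(x, W)
    return [[x[i] * (p[j] - (1.0 if j == target else 0.0)) for j in range(len(p))]
            for i in range(len(x))]

def numeric_grad(x, W, target, i, j, eps=1e-6):
    """Central finite difference: wiggle W[i][j] by +/- eps and measure the loss change."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (loss_fn(x, W_plus, target) - loss_fn(x, W_minus, target)) / (2 * eps)

# Hypothetical inputs and weights, chosen only for illustration
x = [1.0, 0.5, -0.5]
W = [[0.2, -0.1, 0.05, 0.0],
     [-0.3, 0.1, 0.2, -0.05],
     [0.1, 0.0, -0.2, 0.3]]
target = 0

grad = analytic_grad(x, W, target)
max_diff = max(abs(grad[i][j] - numeric_grad(x, W, target, i, j))
               for i in range(len(x)) for j in range(len(W[0])))
print(f"dL/dW[0][0] = {grad[0][0]:.4f}, max |analytic - numeric| = {max_diff:.2e}")
```

The entry `grad[0][0]` is negative here, meaning that increasing that weight lowers the loss. Real networks obtain the same quantities through backpropagation rather than by wiggling each weight, which would require two forward passes per parameter.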
4. Optimization (Gradient Descent)
We have the gradients. Now we update the parameters to minimize loss. We take a step in the direction opposite to the gradient, scaled by the Learning Rate $\eta$: $W \leftarrow W - \eta \nabla_W L$.
Try this: Set the Learning Rate very high (> 1.0). The ball will overshoot the valley and oscillate (instability). Set it too low (< 0.05), and progress becomes agonizingly slow.
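The same experiment can be run in code. The simulation's loss surface is not specified, so this sketch assumes the simplest possible valley, $f(x) = x^2$, for which any learning rate above 1.0 happens to diverge; other surfaces have other thresholds.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 starting from x0; returns every visited x."""
    x = x0
    path = [x]
    for _ in range(steps):
        grad = 2.0 * x     # derivative of x**2
        x = x - lr * grad  # step opposite to the gradient
        path.append(x)
    return path

for lr in [0.01, 0.3, 1.1]:
    final = gradient_descent(lr)[-1]
    print(f"lr = {lr:<5} x after 20 steps = {final:.4g}")
```

With `lr = 0.01` the ball has covered only about a third of the distance to the minimum after 20 steps; with `lr = 0.3` it is essentially at the bottom; with `lr = 1.1` each step flips its sign and increases its distance from the minimum, so it ends farther away than it started.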
5. Conclusion: The Training Loop
By repeating this cycle (Forward, Loss, Backward, Optimize) across many update steps, over what can amount to trillions of tokens of diverse data, the model learns the statistical structure of language.
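Putting the four steps together gives a complete, if tiny, training loop. This sketch assumes a single-layer softmax model, three made-up training examples with one-hot inputs, and one weight update per example (stochastic gradient descent); the function name `train` and all hyperparameters are illustrative choices, not values from the essay.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(data, n_in, n_out, lr=0.5, epochs=200, seed=0):
    """Run the Forward / Loss / Backward / Optimize cycle; return weights and loss history."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            # 1. Forward pass: linear layer, then softmax
            logits = [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_out)]
            p = softmax(logits)
            # 2. Loss: negative log likelihood of the correct class
            total += -math.log(p[target])
            for i in range(n_in):
                for j in range(n_out):
                    # 3. Backward pass: dL/dW[i][j] = x[i] * (p[j] - y[j])
                    grad = x[i] * (p[j] - (1.0 if j == target else 0.0))
                    # 4. Optimization: step opposite to the gradient
                    W[i][j] -= lr * grad
        history.append(total / len(data))
    return W, history

# Hypothetical dataset: each one-hot input should map to its own class
data = [([1.0, 0.0, 0.0], 0),
        ([0.0, 1.0, 0.0], 1),
        ([0.0, 0.0, 1.0], 2)]

W, history = train(data, n_in=3, n_out=3)
print(f"mean loss: first epoch {history[0]:.3f}, last epoch {history[-1]:.4f}")
```

The mean loss starts near $\log 3 \approx 1.10$, the value for a uniform guess over three classes, and falls steadily as the cycle repeats. Pre-training an LLM follows the same loop, with a far larger network, mini-batches instead of single examples, and typically a more elaborate optimizer.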