The Mechanics of Pre-training

A Stochastic Optimization Perspective

Modern Large Language Models (LLMs) are not databases of knowledge; they are function approximators. Instead of hand-coding rules, Deep Learning feeds raw data into a network that learns to approximate the complex probability distribution of language.

This interactive essay explores the four-step cycle that powers this learning: Forward Pass, Loss Calculation, Backward Pass, and Optimization.

1. The Forward Pass

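The Forward Pass is the computation of an output probability distribution given an input sequence. Data flows through layers of neurons, and each connection has a weight. In the two-layer example below, $W_1$ and $W_2$ are the weight matrices, $b_1$ and $b_2$ are the biases, and $\sigma$ is the final normalization that turns scores into probabilities (a softmax, when the output is a distribution over words).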

$$\hat{y} = \sigma(W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2)$$

Input: "The capital of France is [MASK]"


In the visualization above, the lines represent weights. Thicker lines indicate stronger connections. Initially, weights are random. Consequently, the output probabilities are noise: the model is about as likely to predict "Rhubarb" as "Paris".
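The forward-pass formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the essay's actual simulation: the three-word vocabulary, the input features `x`, the layer sizes, and the choice of softmax for $\sigma$ are all assumptions made for the example.

```python
import math
import random

def relu(v):
    # Element-wise max(0, x)
    return [max(0.0, x) for x in v]

def softmax(z):
    # Subtract the max score for numerical stability before exponentiating
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Matrix-vector product, with W stored as a list of rows
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, W1, b1, W2, b2):
    # y_hat = softmax(W2 . ReLU(W1 x + b1) + b2)
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    z = [a + b for a, b in zip(matvec(W2, h), b2)]
    return softmax(z)

random.seed(0)
vocab = ["Paris", "London", "Rhubarb"]  # hypothetical toy vocabulary
x = [1.0, 0.5, -0.3, 0.8]               # hypothetical input features

# Randomly initialised weights: 4 inputs -> 5 hidden units -> 3 words
W1 = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(5)]
b1 = [0.0] * 5
W2 = [[random.gauss(0, 0.1) for _ in range(5)] for _ in range(3)]
b2 = [0.0] * 3

probs = forward(x, W1, b1, W2, b2)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

With small random weights, the printed probabilities should come out close to uniform, so "Rhubarb" scores about as well as "Paris".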

2. The Loss Function (Cross-Entropy)

To improve, the network must quantify how "wrong" it is. We use the Negative Log Likelihood, which is what Cross-Entropy reduces to when the target is a single correct word. If the true next word is "Paris", we only care about the probability assigned to "Paris" ($p_{\text{target}}$).

$$\mathcal{L} = - \log(p_{\text{target}})$$


Drag the P(Paris) slider. As the probability of the correct word approaches 1.0, the Loss approaches 0. Conversely, as that probability approaches 0 (the model is confidently wrong), the Loss grows without bound.
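The same relationship can be checked numerically. The short sketch below evaluates the loss formula at a few probabilities; the helper name `nll` and the sample probabilities are chosen for illustration only.

```python
import math

def nll(p_target):
    """Negative log likelihood of the correct word, given its probability."""
    return -math.log(p_target)

# Loss shrinks toward 0 as P(Paris) approaches 1 and grows as it approaches 0
for p in (0.99, 0.5, 0.1, 0.001):
    print(f"P(Paris) = {p:<6} loss = {nll(p):.3f}")
```

For instance, a probability of 0.1 gives a loss of about 2.30, while 0.001 gives about 6.91.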

3. The Backward Pass (Backpropagation)

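Knowing the error is not enough; we must attribute blame. We calculate the Gradient ($\nabla \mathcal{L}$): the vector of partial derivatives of the Loss with respect to every weight in the network. The chain rule carries the error back from the output $\hat{y}$, through the hidden activation $h = \text{ReLU}(W_1 x + b_1)$, to each weight $w$.

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}$$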
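As a worked instance of the chain rule, assume the output layer is a softmax over scores $z = W_2 h + b_2$ (an assumption; the forward-pass formula leaves $\sigma$ unspecified), and let $t$ be the index of the target word. For the loss $\mathcal{L} = -\log(p_{\text{target}})$, the derivatives with respect to the scores and the output weights then take a simple form:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - \mathbb{1}[i = t], \qquad \frac{\partial \mathcal{L}}{\partial (W_2)_{ij}} = \left(\hat{y}_i - \mathbb{1}[i = t]\right) h_j$$

Each output weight is blamed in proportion to how far the predicted probability is from the target and to how active its input $h_j$ was.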


This amounts to a Sensitivity Analysis. If a weight has a large gradient (in absolute value), a tiny "wiggle" to that weight causes a significant change in Loss; the sign of the gradient tells us which direction of wiggle raises the Loss and which lowers it. In the simulation above, hover over a connection to select it, then use the slider to manually adjust it and watch the Loss change.
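The "wiggle" idea can be made concrete with a finite-difference check: nudge one weight up and down by a tiny amount, measure the change in loss, and compare that with the gradient from calculus. The sketch below assumes a single softmax output layer with made-up values for `h`, `W`, and `target`; the analytic formula used, $(p_i - \mathbb{1}[i = t]) \cdot h_j$, is the standard gradient for softmax combined with negative log likelihood.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(W, h, target):
    # Scores z = W h, then negative log likelihood of the target word
    z = [sum(w * x for w, x in zip(row, h)) for row in W]
    return -math.log(softmax(z)[target])

# Hypothetical values: 2 hidden activations feeding 3 output words
h = [1.0, -2.0]
W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
target = 0

# Analytic gradient: (p_i - 1[i == target]) * h_j
p = softmax([sum(w * x for w, x in zip(row, h)) for row in W])
analytic = [[(p[i] - (1.0 if i == target else 0.0)) * h[j]
             for j in range(len(h))] for i in range(len(W))]

def wiggle(i, j, eps=1e-5):
    # Central difference: nudge W[i][j] by +/- eps and measure the loss change
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (loss(W_plus, h, target) - loss(W_minus, h, target)) / (2 * eps)

for i in range(len(W)):
    for j in range(len(h)):
        print(f"W[{i}][{j}] analytic={analytic[i][j]:+.6f} "
              f"numerical={wiggle(i, j):+.6f}")
```

The two columns should agree to several decimal places. Backpropagation computes the analytic column for every weight in one pass, instead of wiggling each weight separately.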

4. Optimization (Gradient Descent)

We have the gradients. Now we update the parameters to minimize loss. We take a step in the direction opposite to the gradient.

$$W_{\text{new}} = W_{\text{old}} - \eta \cdot \nabla \mathcal{L}$$


Try this: Set the Learning Rate very high (> 1.0). The ball will overshoot the valley and oscillate (instability). Set it too low (< 0.05), and progress becomes agonizingly slow.
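The effect of the learning rate can be reproduced on a one-dimensional toy problem. The sketch below assumes the loss "valley" is $f(w) = \tfrac{1}{2} w^2$, whose gradient is simply $w$; this is an illustrative choice, not necessarily the surface used in the simulation.

```python
def descend(lr, w=4.0, steps=10):
    """Gradient descent on the toy loss f(w) = 0.5 * w**2 (gradient = w)."""
    path = [w]
    for _ in range(steps):
        grad = w            # derivative of 0.5 * w**2
        w = w - lr * grad   # step against the gradient
        path.append(w)
    return path

for lr in (0.01, 0.5, 1.5, 2.5):
    path = descend(lr)
    print(f"lr={lr:<4} first steps={[round(v, 3) for v in path[:4]]} "
          f"final w={path[-1]:.4f}")
```

On this particular surface each step multiplies $w$ by $(1 - \eta)$. A tiny rate barely moves, a moderate rate converges quickly, a rate above 1 jumps across the minimum and flips sign on every step, and a rate above 2 makes each jump larger than the last, so the iterate diverges.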

5. Conclusion: The Training Loop

By repeating this cycle (Forward, Loss, Backward, Optimize) over an enormous number of steps, on diverse data that can total trillions of tokens, the model learns the structure of language.
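Put together, the four steps form a loop. The sketch below trains a single softmax layer on one made-up example so that the whole cycle fits in a few lines; the input features, vocabulary size, learning rate, and step count are assumptions for illustration, and a real pre-training run differs in scale and in almost every detail.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

x = [1.0, 0.5, -0.3, 0.8]          # hypothetical input features
target = 0                         # index of "Paris" in a 3-word toy vocabulary
W = [[0.0] * 4 for _ in range(3)]  # start with no preference between words
lr = 0.5                           # learning rate (eta)

losses = []
for step in range(50):
    # 1. Forward pass: scores -> probabilities
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(z)
    # 2. Loss: negative log likelihood of the target word
    losses.append(-math.log(p[target]))
    # 3. Backward pass: gradient of the loss with respect to every weight
    grad = [[(p[i] - (1.0 if i == target else 0.0)) * xj for xj in x]
            for i in range(3)]
    # 4. Optimization: step against the gradient
    W = [[w - lr * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W, grad)]

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss equals $\log 3$ (about 1.10) because the three words start out equally likely, and it falls steadily as the probability assigned to the target word rises.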
