Before training, the "P" in GPT (Pre-trained) essentially stands for "Pointless." The network takes raw inputs (tokens) and pushes them through layers of "mixers" (neurons).
Initially, the parameters are random garbage, so the output is random garbage. It doesn't know facts; it just multiplies matrices.
Click to scramble the brain. It never knows the answer.
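To make "random garbage in the parameters, random garbage out" concrete, here is a minimal sketch of an untrained two-layer network. Everything in it is an assumption for illustration: the four-word vocabulary, the three-number `prompt_vector` standing in for a real prompt, and the layer sizes. It is not how GPT is actually built, just the same idea at toy scale.

```python
import math
import random

VOCAB = ["Paris", "London", "banana", "the"]  # toy vocabulary (assumption)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    """A weight matrix filled with random numbers: the untrained state."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward(x, weights):
    """One "mixer" layer: each output is a weighted sum of the inputs."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def predict(x, seed):
    """Build a fresh random network from the seed and run the input through it."""
    rng = random.Random(seed)
    hidden = [math.tanh(h) for h in forward(x, random_layer(len(x), 8, rng))]
    scores = forward(hidden, random_layer(8, len(VOCAB), rng))
    return softmax(scores)

# Hypothetical stand-in for the prompt "The capital of France is"
prompt_vector = [0.2, -1.0, 0.5]

# Each seed is one "scramble": same input, different random parameters.
for seed in range(3):
    probs = predict(prompt_vector, seed)
    print(seed, {word: round(p, 2) for word, p in zip(VOCAB, probs)})
```

Every seed produces a different probability spread over the same four words. None of them is "right" except by accident, because nothing has told the network what right means yet.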
We know the answer is "Paris." If the machine assigns "Paris" a low probability (e.g., 2%), we need a number that screams "BAD!"
This is the Loss Function (Cross-Entropy). As the probability of the correct answer goes down, the punishment (Loss) skyrockets.
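Here is a small sketch of that punishment curve. It assumes the simplest form of cross-entropy, where the loss for one prediction is the negative natural log of the probability the model gave to the correct answer. The function name `cross_entropy` and the sample probabilities are just illustrative choices.

```python
import math

def cross_entropy(prob_of_correct_answer):
    """Loss for one prediction: -ln(probability assigned to the right answer)."""
    return -math.log(prob_of_correct_answer)

# Walk the probability of "Paris" downward and watch the loss climb.
for p in (0.9, 0.5, 0.02, 0.001):
    print(f"P('Paris') = {p:<5}  loss = {cross_entropy(p):.2f}")
```

At 90% confidence the loss is close to zero. At the 2% from the example above it is roughly 3.9, and it keeps growing without limit as the probability approaches zero.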
We can't just yell at the machine. We use calculus (the Chain Rule) to calculate the Gradient. This tells us how much the loss would change if we nudged each specific connection, which is how we work out each connection's share of the blame for the error.
Hover over the connections below to see the "Sensitivity." Red lines are the "guilty" connections that need to change the most.
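The blame assignment can be sketched for a single neuron. This assumes a sigmoid neuron with a cross-entropy loss, where the chain rule collapses to a tidy formula: the gradient for each weight is (prediction minus target) times that weight's input. The weights and inputs below are made-up numbers, and the "nudge test" is only there to check the calculus against brute force.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(weights, inputs, target):
    """Cross-entropy loss of a single sigmoid neuron."""
    z = sum(w * x for w, x in zip(weights, inputs))
    p = sigmoid(z)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def gradient(weights, inputs, target):
    """Chain rule: dLoss/dw_i = dLoss/dp * dp/dz * dz/dw_i = (p - target) * x_i."""
    z = sum(w * x for w, x in zip(weights, inputs))
    p = sigmoid(z)
    return [(p - target) * x for x in inputs]

weights = [0.5, -1.5, 0.1]   # hypothetical connection strengths
inputs = [1.0, 2.0, 0.1]     # hypothetical input signals
target = 1.0                 # the correct answer

grads = gradient(weights, inputs, target)

# Sanity check: nudge each weight a hair and measure how the loss actually moves.
eps = 1e-6
for i, g in enumerate(grads):
    bumped = list(weights)
    bumped[i] += eps
    numeric = (loss(bumped, inputs, target) - loss(weights, inputs, target)) / eps
    print(f"w{i}: chain rule = {g:+.4f}, nudge test = {numeric:+.4f}")
```

The two columns should agree to several decimal places. The connection carrying the biggest input gets the biggest gradient, which makes it the "guiltiest" one in this toy setup.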
We have the gradients. The gradient points "uphill" (higher loss), so the opposite direction is "downhill" (lower loss). We nudge every parameter a tiny step in that downhill direction. This is Gradient Descent.
Below is a real, working single-neuron perceptron running in your browser. Watch the math happen.
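Outside the browser, the same loop can be sketched in a few lines of plain Python. This is a minimal version under stated assumptions: one sigmoid neuron, a toy dataset (logical OR), a learning rate of 0.5, and 2000 steps. None of those choices come from the demo itself; they are just small enough to read in one sitting.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (assumption): learn logical OR from two binary inputs.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(weights, bias):
    """Average cross-entropy over the dataset."""
    total = 0.0
    for x, y in DATA:
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(DATA)

def train(steps, learning_rate=0.5):
    weights, bias = [0.0, 0.0], 0.0
    history = [mean_loss(weights, bias)]
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in DATA:
            error = predict(weights, bias, x) - y  # gradient of the loss w.r.t. the pre-activation
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        # Step against the averaged gradient: that direction is "downhill".
        weights = [w - learning_rate * g / len(DATA) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(DATA)
        history.append(mean_loss(weights, bias))
    return weights, bias, history

weights, bias, history = train(steps=2000)
print(f"loss before: {history[0]:.3f}, after: {history[-1]:.3f}")
for x, y in DATA:
    print(x, "target", y, "prediction", round(predict(weights, bias, x), 2))
```

The loss starts at about 0.693 (a coin flip on every example) and shrinks as the steps accumulate. A larger learning rate takes bigger steps and can overshoot; a smaller one is safer but slower.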
The magic isn't in the math. The magic is that minimizing loss forces the model to learn the shape of language.
If we train too hard on specific sentences, we Overfit (Memorize). We want to Generalize.
Select a mode to see how the machine learns the data.
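The difference between memorizing and generalizing shows up as a gap between training error and error on unseen data. Here is a small sketch with numbers instead of sentences. The setup is entirely hypothetical: the hidden rule is y = 2x, the training labels have some noise baked in, and the held-out points are noise-free. The `memorizer` is a lookup table; the line is a plain least-squares fit.

```python
# Toy data (assumption): the hidden rule is y = 2x, but training labels carry noise.
TRAIN = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
TEST = [(x, 2.0 * x) for x in (0.4, 1.4, 2.4, 3.4)]  # unseen inputs, noise-free targets

def memorizer(x):
    """Overfit model: return the stored answer of the closest training example."""
    nearest = min(TRAIN, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def fit_line(points):
    """Simple model: least-squares straight line through the training data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_squared_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

line = fit_line(TRAIN)
for name, model in (("memorizer", memorizer), ("line", line)):
    print(f"{name:<10} train error = {mean_squared_error(model, TRAIN):.2f}, "
          f"test error = {mean_squared_error(model, TEST):.2f}")
```

The memorizer is perfect on the training data because it stored the noise along with the signal, and it pays for that on the unseen points. The line is slightly wrong on every training example, yet much closer on the new ones, because it captured the rule instead of the examples.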