Models with billions of parameters are becoming the norm, even for the smaller versions of large model families. But having a lot of parameters doesn’t necessarily mean a model has a lot of hidden layers! Let’s break down what parameters and hyperparameters are and how they relate to the size and architecture of these models.
What Are Parameters?
Parameters are the internal variables that a model learns during training. These include:
1. Weights: Each connection between neurons in the layers has an associated weight, which determines the strength and direction of the influence one neuron has on another.
2. Biases: Each neuron also has a bias term, which shifts the activation. It’s like an extra input that allows the model more flexibility in fitting the data.
In deep learning, each layer in a neural network has its own set of parameters (weights and biases), and these parameters are adjusted during training to minimize the model’s loss function (error). The more layers and neurons there are, the more parameters a model will have.
For example:
• If we have a layer of 100 neurons fully connected to another layer of 100 neurons, that connection alone contributes 10,000 weights (100 × 100), plus 100 biases for the receiving layer, as the sketch below shows.
• When we talk about an 8-billion-parameter model, that means the model has 8 billion individual values (weights and biases) that can be adjusted during training.
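To make the arithmetic concrete, here is a minimal sketch (assuming PyTorch; any deep learning framework exposes the same counts) of the 100-to-100 fully connected layer from the example above:

```python
import torch.nn as nn

# One fully connected layer mapping 100 inputs to 100 outputs.
layer = nn.Linear(in_features=100, out_features=100)

weights = layer.weight.numel()  # 100 x 100 = 10,000 weights
biases = layer.bias.numel()     # 100 biases, one per output neuron
print(weights, biases, weights + biases)  # 10000 100 10100
```

So even this single small layer has 10,100 learnable parameters; stacking many such layers (or much wider ones) is how counts climb into the billions.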
What Are Hyperparameters?
Hyperparameters are the external settings or configurations of the model that are chosen before training begins and are not updated during training. They control aspects like:
1. Learning Rate: Controls how large each weight update is. A high learning rate means bigger steps that can learn faster but may overshoot; a low rate means smaller, more stable updates.
2. Number of Layers: The depth of the model (how many layers it has), which affects its ability to learn complex patterns.
3. Batch Size: The number of samples the model processes at once before updating the weights.
4. Number of Neurons per Layer: The width of each layer, influencing the model’s capacity.
5. Dropout Rate: The fraction of units randomly dropped during training, a regularization technique that helps prevent overfitting.
Hyperparameters aren’t learned from the data themselves; instead, they set up the structure and rules under which the model learns its parameters, as the sketch below illustrates.
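Here is a minimal sketch (assuming PyTorch; names like hidden_size and num_layers are just illustrative) of how these settings are fixed up front, while the weights and biases they give rise to are what training actually updates:

```python
import torch
import torch.nn as nn

# Hyperparameters: chosen before training, never updated by it.
learning_rate = 1e-3
num_layers = 4
hidden_size = 256
batch_size = 32      # how many samples per weight update
dropout_rate = 0.1

# The model's parameters (weights and biases) are created from these
# hyperparameters; they are what the optimizer learns.
layers = []
in_size = 100
for _ in range(num_layers):
    layers += [nn.Linear(in_size, hidden_size), nn.ReLU(), nn.Dropout(dropout_rate)]
    in_size = hidden_size
layers.append(nn.Linear(in_size, 10))
model = nn.Sequential(*layers)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print(sum(p.numel() for p in model.parameters()))  # total learnable parameters
```

Changing any of the hyperparameters above changes how many parameters the model has or how they get trained, but the hyperparameters themselves never appear in `model.parameters()`.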
Parameters vs. Hyperparameters
• Parameters are learned by the model itself, adjusting based on the training data to minimize error.
• Hyperparameters are defined by the person building the model and impact the model’s overall training behavior and architecture.
Why 8 Billion Parameters?
In large language models (LLMs) like GPT-3, the sheer number of parameters allows the model to learn a vast array of patterns and relationships in data. It’s not so much about depth (the number of layers) as about width: the embedding dimensions, attention heads, and feed-forward blocks that let the model “understand” different aspects of language. These billions of parameters help the model capture context, syntax, and nuanced meanings, which is why such large models are so powerful on NLP tasks.
Layers vs. Parameters
• Layers are the successive transformations the data passes through from input to output; their count is the model’s depth.
• Parameters are all the adjustable parts (weights and biases) within these transformations.
Large models may have relatively few layers (roughly 12 to 96 in transformer-based models) yet still have billions of parameters, because each layer is very wide and densely connected, as the rough estimate below shows.
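As a back-of-envelope check (this is only a sketch: the 12·d² per-layer rule of thumb ignores biases, layer norms, and the embedding table, and the GPT-3 figures of 96 layers with a 12,288-dimensional hidden size are the commonly reported values, not an exact accounting):

```python
# Rough transformer parameter estimate.
# Each layer has ~4*d^2 attention weights (Q, K, V, output projections)
# plus ~8*d^2 feed-forward weights (two matrices with a 4x expansion),
# so roughly 12*d^2 per layer.
def approx_params(num_layers: int, d_model: int) -> int:
    return num_layers * 12 * d_model ** 2

# GPT-3-scale width (96 layers, d_model = 12288) lands near 175B,
# even though the layer count itself is modest.
print(f"{approx_params(96, 12288):,}")   # ~174,000,000,000
print(f"{approx_params(12, 768):,}")     # GPT-2-small scale, ~85,000,000
```

The layer count only grows by a factor of 8 between these two configurations, but because the hidden size enters the estimate squared, the parameter count grows by a factor of roughly 2,000.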