Not too long ago, having just a few hidden layers—two or three—was considered “deep,” and models were constrained by computational limits. But now, with advances in hardware (especially GPUs and TPUs), training techniques, and data availability, the depth and size of neural networks have scaled dramatically.
Evolution of “Deep” in Deep Learning
• Early Neural Networks: In the 80s and 90s, neural networks with one or two hidden layers were the norm. We could only work with a few dozen neurons per layer due to limited computing power, and training deep networks was challenging because of issues like the vanishing gradient problem.
• The Rise of “Deep” Networks: In the early 2010s, networks with three to ten layers became common. AlexNet (2012), with its eight learned layers, is often credited with sparking the deep learning revolution by achieving state-of-the-art results in image recognition.
• Modern Networks: Now, models like ResNet (2015) stack hundreds of layers, and newer architectures like Transformers scale less through raw depth than through sheer size, with, as you mentioned, seemingly endless numbers of parameters. Google’s “Switch Transformer” has over a trillion parameters, which is mind-blowing compared to older models. In addition, models like GPT-4 and large multimodal models (like PaLM-E) have continued to push the boundaries.
Why So Many Layers?
More layers let the network learn increasingly abstract representations of the data (see the code sketch after this list):
1. Shallow Layers capture simple features, like edges in images.
2. Mid Layers combine those simple features to capture patterns, like shapes.
3. Deep Layers capture complex, abstract features, allowing the network to understand high-level concepts (like “dog” or “cat” in image recognition).
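To make that progression concrete, here is a minimal sketch in PyTorch. It is purely illustrative: the class name, layer sizes, and block boundaries are my own choices rather than any particular published model, but the structure mirrors the shallow → mid → deep hierarchy described above.

```python
import torch
import torch.nn as nn

class TinyImageNet(nn.Module):
    """Illustrative CNN: each block operates on more abstract features than the last."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Shallow layers: respond to simple local features, such as edges.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Mid layers: combine edges into larger patterns, such as shapes.
        self.mid = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Deep layers: combine shapes into high-level concepts ("dog", "cat").
        self.deep = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.shallow(x)   # edges
        x = self.mid(x)       # shapes
        x = self.deep(x)      # concepts
        return self.classifier(x.flatten(1))

# Usage: a batch of four 64x64 RGB images -> two class scores per image.
logits = TinyImageNet()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```

Adding more blocks in the middle is what “going deeper” means in practice: each extra block gives the network another chance to recombine the previous level’s features into something more abstract.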
Are Gazillions of Layers Always Needed?
Not always! The optimal number of layers depends on the task and data complexity. While deep networks can capture complex patterns, they also require more training data and computing power and can become harder to interpret.
The field is also exploring ways to achieve powerful representations without necessarily adding tons of layers. Techniques like attention mechanisms and transformer architectures have shown that sometimes a smaller network with clever connections and parameter sharing can perform as well as, or even better than, a “gazillion-layer” network.
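As a rough illustration of those “clever connections,” here is a minimal sketch of scaled dot-product attention, the core operation of the transformer (assuming PyTorch; the tensor shapes are arbitrary). Rather than pushing information through ever more layers, each position attends directly to every other position, which is part of why moderately deep transformers can rival much deeper conventional stacks.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Classic attention: weight each value by how well its key matches the query."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # similarity of every query to every key
    weights = torch.softmax(scores, dim=-1)                   # normalize similarities into attention weights
    return weights @ v                                        # weighted sum of the values

# Usage: self-attention over a batch of 2 sequences, 10 tokens each, 64-dim representations.
x = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(x, x, x)  # q, k, v all come from the same input
print(out.shape)  # torch.Size([2, 10, 64])
```

Because every token can gather information from the whole sequence in a single step, the network does not need extreme depth just to move information around; depth can then be spent on building richer representations instead.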