1.8 TRILLION parameters across 120 layers, making it roughly 10 times larger than the 175B-parameter GPT-3!
16 EXPERTS in a mixture-of-experts setup, each with 111 BILLION parameters for its MLP!
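Curious how a model can hold 16 experts without running all of them on every token? Here's a toy mixture-of-experts sketch in PyTorch. All sizes are tiny stand-ins, and the top-2 routing is an assumption for illustration, not a confirmed GPT-4 detail.

```python
# Toy mixture-of-experts (MoE) layer: 16 expert MLPs, but only the top-k
# are run per token. Sizes are toy values; top-2 routing is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=2):
        super().__init__()
        # One MLP per expert. In the rumored GPT-4 numbers, 16 experts at
        # ~111B parameters each account for 16 x 111B = 1.776T, i.e. most
        # of the ~1.8T total.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):
        # x: (tokens, d_model). Score all experts, run only the top-k.
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = ToyMoELayer()
print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```

The whole point of the design: total parameter count (all 16 experts) can grow enormous, while per-token compute only pays for the experts a token is actually routed to.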
13 TRILLION tokens of training data, including text-based and code-based data, with some fine-tuning data from ScaleAI and from internal sources!
$63 MILLION in training costs, taking into account computational power and training time!
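Where could a number like $63 million come from? Here's a back-of-the-envelope sketch; the GPU count, duration, and hourly rate below are illustrative assumptions, not figures from this post.

```python
# Back-of-the-envelope training cost. All inputs are assumed for illustration.
gpus = 25_000               # assumed accelerator count
days = 100                  # assumed training duration
price_per_gpu_hour = 1.05   # assumed $/GPU-hour

gpu_hours = gpus * days * 24
cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ${cost / 1e6:.0f}M")
# 60,000,000 GPU-hours -> $63M
```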
3 TIMES MORE expensive to run than the 175B-parameter Davinci, due to the larger clusters required and lower utilization rates!
128 GPUs for inference, using 8-way tensor parallelism and 16-way pipeline parallelism!
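Those two numbers multiply out exactly: 8 tensor-parallel shards x 16 pipeline stages = 128 GPUs. Here's a minimal sketch of the layout; the flat-id-to-coordinate mapping is an assumed convention, handled internally by real serving stacks.

```python
# 8-way tensor x 16-way pipeline parallelism tiling a 128-GPU cluster.
# The id-to-coordinate convention below is an assumption for illustration.
TENSOR_PARALLEL = 8      # each weight matrix sharded across 8 GPUs
PIPELINE_PARALLEL = 16   # the 120 layers split into 16 sequential stages
assert TENSOR_PARALLEL * PIPELINE_PARALLEL == 128

def gpu_coords(gpu_id):
    """Map a flat GPU id to (tensor_rank, pipeline_stage)."""
    return gpu_id % TENSOR_PARALLEL, gpu_id // TENSOR_PARALLEL

layers_per_stage = 120 / PIPELINE_PARALLEL  # 7.5 -> stages hold 7 or 8 layers
print(gpu_coords(0), gpu_coords(127), layers_per_stage)  # (0, 0) (7, 15) 7.5
```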
VISION ENCODER for autonomous agents to read web pages and transcribe what's in images and videos, adding more parameters and fine-tuned with another 2 TRILLION tokens!
And, get this... GPT-5 might have 10 TIMES THE PARAMETERS of GPT-4! That could mean even larger embedding dimensions, more layers, and double the number of experts!