Cerebras, the company behind the world’s largest accelerator chip, the CS-2 Wafer Scale Engine, has announced a major milestone: training the world’s largest NLP (Natural Language Processing) AI model on a single device. While that in itself could mean many things (it wouldn’t be much of a record if the previous largest model had been trained on a smartwatch, for example), the model Cerebras trained reaches a staggering – and unprecedented – 20 billion parameters, all without the workload having to be scaled across multiple accelerators. That’s enough to fit the internet’s latest sensation, OpenAI’s 12-billion-parameter image-from-text generator, DALL-E.
The most important element of Cerebras’ achievement is the reduction in infrastructure and software-complexity requirements. A single CS-2 is a supercomputer in its own right. The Wafer Scale Engine-2 – which, as the name suggests, is etched onto a single 7nm wafer, with enough area for hundreds of mainstream chips – features 2.6 trillion transistors, 850,000 cores, and 40GB of on-chip cache, in a package that consumes around 15 kW.
Keeping an NLP model of up to 20 billion parameters on a single chip significantly reduces overhead – the cost of training across thousands of GPUs (and their associated hardware and scaling requirements) – while eliminating the technical difficulty of partitioning models across them. Cerebras says this is “one of the most painful aspects of NLP workloads,” one that can “sometimes take months to complete.”
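To see why 20 billion parameters on one device is notable, a back-of-the-envelope calculation helps: in fp16, each parameter takes 2 bytes, so the weights alone of a 20B-parameter model occupy 40 GB. The sketch below is purely illustrative arithmetic (training additionally needs activations, gradients, and optimizer state, which multiply this figure); the function name is ours, not from any Cerebras tool.

```python
# Back-of-the-envelope memory footprint for model weights alone.
# Illustrative only: real training also stores activations, gradients,
# and optimizer state on top of this.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """GB needed to hold the raw weights (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(20e9))   # 20B-parameter model -> 40.0 GB
print(weight_memory_gb(175e9))  # GPT-3 scale         -> 350.0 GB
```

At GPT-3 scale the weights alone already dwarf any single accelerator’s memory, which is why such models are normally sharded across many GPUs.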
That partitioning problem is bespoke: it is unique not only to each neural network being processed, but also to the specs of each GPU and the network that ties them all together – elements that must be worked out before the first training run ever begins. And the result cannot be ported across systems.
The raw numbers might make Cerebras’ achievement seem underwhelming – OpenAI’s GPT-3, an NLP model that can write entire articles that occasionally fool human readers, features a staggering 175 billion parameters. DeepMind’s Gopher, launched late last year, raises that number to 280 billion. The brains at Google Brain have even announced the training of a trillion-plus-parameter model, the Switch Transformer.
“In NLP, larger models are shown to be more accurate. But traditionally, only a very select few companies had the resources and expertise necessary to do the painstaking work of breaking up these large models and spreading them across hundreds or thousands of GPUs,” said Andrew Feldman, CEO and co-founder of Cerebras Systems. “As a result, very few companies could train large NLP models – it was too expensive, time-consuming, and inaccessible for the rest of the industry. Today we are proud to democratize access to GPT-3XL 1.3B, GPT-J 6B, GPT-3 13B, and GPT-NeoX 20B, enabling the entire AI ecosystem to set up large models in minutes and train them on a single CS-2.”
However, much like clock speed on the world’s best CPUs, parameter count is only one possible indicator of performance. Recently, better results have been achieved with fewer parameters – DeepMind’s Chinchilla, for example, routinely outperforms both GPT-3 and Gopher with just 70 billion of them. The goal is to work smarter, not harder. As such, Cerebras’ achievement is more significant than it first appears: researchers should be able to fit increasingly complex models on a single machine, and the company says its system has the potential to support models with hundreds of billions, even trillions, of parameters.
This explosion in the number of viable parameters builds on Cerebras’ Weight Streaming technology, which decouples compute and memory footprints, allowing memory to scale to whatever amount is needed to store the rapidly growing parameter counts of AI workloads. It cuts setup times from months to minutes and makes switching between models such as GPT-J and GPT-Neo possible “with just a few keystrokes.”
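The core idea of decoupling compute from memory can be sketched in a few lines: weights live in an external store whose capacity scales independently of the device, and compute proceeds layer by layer with only one layer’s weights resident at a time. This is a minimal conceptual sketch, not Cerebras’ actual implementation – all names here (`ExternalWeightStore`, `stream_forward`) are hypothetical.

```python
# Conceptual sketch of weight streaming: weights sit in an external store
# and are streamed to the compute device one layer at a time, so on-device
# memory does not grow with model size. NOT Cerebras' implementation;
# all names are hypothetical.

class ExternalWeightStore:
    """Holds per-layer weights off-device; its capacity scales independently."""
    def __init__(self, layer_weights):
        self._layers = layer_weights  # one weight object per layer

    def fetch(self, layer_idx):
        # In a real system this would be a bulk transfer, not a list index.
        return self._layers[layer_idx]

def stream_forward(store, n_layers, activations, apply_layer):
    """Run a forward pass with only one layer's weights resident at a time."""
    for i in range(n_layers):
        weights = store.fetch(i)                 # stream this layer's weights in
        activations = apply_layer(weights, activations)
    return activations

# Toy usage: each "layer" just adds its scalar weight to the activation.
store = ExternalWeightStore([1, 2, 3])
out = stream_forward(store, 3, 10, lambda w, a: a + w)
print(out)  # 16
```

The design point is that the store’s size, not the device’s memory, bounds the model – which is why the approach extends to hundreds of billions or trillions of parameters.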
“Cerebras’ ability to bring large language models to the masses with cost-efficient, easy access opens up an exciting new era in artificial intelligence. It gives organizations that can’t spend tens of millions an easy and inexpensive on-ramp to major-league NLP,” said Dan Olds, chief research officer at Intersect360 Research. “It will be interesting to see the new applications and discoveries CS-2 customers make as they train GPT-3 and GPT-J class models on massive datasets.”