Master of all Trades
Why Mixture of Experts (MoE) might be the best architecture to mimic the Human Brain and achieve Artificial General Intelligence.
Innovation occurs in waves. In the current wave of artificial intelligence, one particular neural network architecture has created ripples: the Mixture of Experts. Many companies and open-source projects have adopted this framework to launch a variety of new foundation models such as Mixtral, Grok-1, DBRX, Jamba, and many more. In fact, there are even rumors that OpenAI’s GPT-4 is based on a Mixture of Experts architecture. While the principle behind this architecture is not new (it was first proposed in the early 1990s), its resurgence and popularity might help predict the future of artificial intelligence.
The intuition behind Mixture of Experts is simple divide and conquer: break a problem down into specific sub-problems, assign each sub-problem to the “expert” best suited to solve it, and efficiently combine the results into a final answer. While Adam Smith may have been the first to observe how specialization leads to higher productivity in economics, the concept of division of labor is also an integral part of human biology. In some sense, our brain is itself a mixture of experts: several specialized structures, each responsible for a specific function, working together to give life to the human experience.
When you touch the back of your head at the spot where you feel a bump, you are touching the area directly above your occipital lobe. This lobe is one of the four primary segments of the gray matter, or cortex, of your brain; the others are the frontal, parietal, and temporal lobes. Each lobe performs specialized functions: decision-making (frontal), interpreting sensory information (parietal), processing vision (occipital), and understanding spoken language (temporal). Similarly, a Mixture of Experts model comprises smaller expert models (typically eight) combined through a "router" that takes the model inputs, selects the appropriate experts, and then produces the final output. This router acts much like the thalamus, the "relay station" of the brain, which connects sensory nerves to the correct parts of the cortex and carries return signals to the rest of the body.
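To make the analogy concrete, here is a minimal sketch of how such a layer could look in code, assuming a PyTorch-style setup with eight experts and a router that picks the top two for each token. The class names (SimpleExpert, MoELayer) and the simple token-by-token routing loop are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleExpert(nn.Module):
    """One 'expert': a small feed-forward network (a hypothetical stand-in)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, dim))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Router (the 'thalamus') scores all experts and sends each token to the top-k."""
    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            SimpleExpert(dim, hidden_dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # produces one score per expert
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.router(x)                    # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Simple (unoptimized) dispatch loop: real systems batch this per expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```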
💎 Mighty Metric
There are roughly 100 billion neurons in the Human Brain. By comparison, Google's Switch Transformer (one of the largest MoE models) has 1.6 trillion parameters!
The similarities between artificial intelligence models and the human brain don’t stop at the functional hierarchy of their components. In fact, the fundamental building blocks of all modern AI models are based on a simple processing unit: the neuron. Layers of neurons make up a neural network, which is an efficient way to meaningfully represent inputs, store relationships between those inputs, and perform complex mathematical operations. What is commonly referred to as "learning" is simply the process of adjusting those relationships based on large amounts of training data. Artificial neurons are, of course, inspired by the neurons in our brains: living cells that communicate with one another through a series of electrochemical processes. In our brain, the learning process is continuous but similar; neurons that consistently communicate with each other form a stronger connection. Indeed, neurons that fire together wire together.
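For readers who want to see "changing the relationships" in action, the toy sketch below trains a single artificial neuron with gradient descent. It is purely illustrative: the data, learning rate, and step count are made up, and PyTorch is used only for convenience.

```python
import torch

# Invented toy data: three inputs and the outputs we want the neuron to produce.
x = torch.tensor([[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]])
y = torch.tensor([[5.0], [4.0], [3.0]])

neuron = torch.nn.Linear(2, 1)                        # weights = the "relationships"
optimizer = torch.optim.SGD(neuron.parameters(), lr=0.05)

for step in range(200):
    loss = ((neuron(x) - y) ** 2).mean()              # how wrong the current weights are
    optimizer.zero_grad()
    loss.backward()                                   # direction to adjust each weight
    optimizer.step()                                  # "learning": update the relationships

print(neuron.weight.data, neuron.bias.data)           # the learned relationships
```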
While neural networks have been around for a long time and have shown great improvements in performance through different architectures and parameter sizes, Mixture of Experts models add new layers of sophistication beyond basic biology. From a cognitive science perspective, this architecture can be seen as an implementation of a key framework for describing human consciousness: the Global Workspace Theory. The theory tries to explain the properties of consciousness, whether human or non-human, and lists a set of indicators that a system must exhibit to be considered conscious. From the properties listed in the table below, one can see that the Mixture of Experts takes a giant leap toward making artificial intelligence more than just fancy number crunching and sets the stage for Artificial General Intelligence.
Practical Reasons for Mixture of Experts
Apart from reorganizing the flow of information in existing large language models, a Mixture of Experts also helps address important technical challenges:
One of the most important factors in improving model quality is the size of the model. A Mixture of Experts allows scaling up to a much larger parameter count without updating the weights (the relationships between neurons) of the entire model for every input; only the selected experts are involved. This sparse use of the model allows for compute-efficient training.
Additionally, the unique structure of a Mixture of Experts model allows for further optimization using a technique known as expert parallelism, in which different experts are placed on different GPUs and run independently.
Lastly, a Mixture of Experts model allows for faster inference and higher throughput once deployed, since only a small fraction of the total parameters is active for any given token, as the rough arithmetic in the sketch below illustrates.
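As a back-of-the-envelope illustration, the snippet below counts total versus active expert parameters for a hypothetical 8-expert, top-2 configuration. The dimensions are assumptions chosen only for the arithmetic and do not describe any specific model.

```python
# Hypothetical sizes, chosen only to illustrate the arithmetic of sparse activation.
dim, hidden_dim = 4096, 14336        # model width and expert hidden size (assumed)
num_experts, top_k = 8, 2            # 8 experts, 2 selected per token

params_per_expert = 2 * dim * hidden_dim          # two linear layers per expert
total_expert_params = num_experts * params_per_expert
active_expert_params = top_k * params_per_expert  # only the routed experts run

print(f"expert parameters (total):  {total_expert_params / 1e9:.2f} B")
print(f"expert parameters (active): {active_expert_params / 1e9:.2f} B "
      f"({100 * top_k / num_experts:.0f}% of expert weights per token)")
```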