Why the AI hype now?

The concept of AI is nothing new, so why the hype now?

In short: AI developers recently noticed that information is conveyed remarkably well in text form. Now you might think: “Isn’t that obvious? In school, kids are taught using natural language. Most books ever written consist mostly of sentences, and even math literature is full of text explanations.” Well, for a long time, teaching AI models was done differently and resembled demonstrations rather than explanations. With Large Language Models (LLMs), that has changed. But why?

So in detail …

You may have noticed that the hype was triggered by ChatGPT, or more precisely by the success of the GPT model family behind it. OpenAI’s original blog post and publication announcing GPT-2, released in early 2019, already contained the key observation, spelled out in the paper’s title: “Language Models are Unsupervised Multitask Learners”.

Although the model’s initial purpose was text completion, after training this huge neural network of 1.5 billion adjustable parameters on a massive amount of text collected from 8 million web pages, the researchers noticed that it had not only learned the syntax and structure of natural language for its main task of text completion, but also seemed to have memorized much of the specific information contained in the training material. So beyond learning syntax, it had internalized semantics!

Thus, given questions or instructions, the model was seemingly able to “understand” the meaning behind them and fulfill the wishes of the user, enabling it to be used for many different tasks. In fact, along the way, it even learned things like translation and coding, simply because the training material contained different languages and coding guides.

Before Large Language Models took the stage

For a long time, most machine learning models were developed with a specific task in mind, like translation, summarization, text classification, or named entity recognition, to name the most prominent examples from the natural language processing (NLP) realm. At that time, models were usually trained exclusively for their specific purpose, because it was widely believed that performance would only improve with increased specialization. The idea was to avoid “confusing” a model about its purpose and desired way of processing by teaching it only one well-defined way to act, like translating text from English to Spanish.

Teaching one model to translate into both Spanish and French was long believed to reduce performance and efficiency: the training process would involve showing it many English-Spanish and English-French text pairs, both of which adjust the same set of parameters within the model. Intuitively, many of these adjustments would counteract each other, with French translation examples undoing the settings that had previously been optimized for Spanish translation, and vice versa. In effect, training would become less efficient and the model’s performance would decrease in both languages. One might think of separating the adjusted parameters for the two languages, but that is pretty much equivalent to training two separate models.

That was the era when many researchers despised the term “artificial intelligence” and preferred “machine learning” - fittingly so, since the above-mentioned types of models were trained using what is called supervised learning, with given inputs and corresponding outputs (like an English sentence and its Spanish translation). Machine learning models were not expected to perform well on tasks they were not trained for, so calling it intelligence was a bit of a stretch (see this post for details) [#TODO link “What is the difference between AI and machine learning?“]. For the sake of completeness, there is also unsupervised learning, which can be useful for uncovering unknown relationships or categorizations in data, but it relies on models with different mechanisms and purposes, which are not discussed in this post (you can read this one for details on that) [#TODO post about “Supervised, unsupervised, self-supervised learning and reinforcement learning“, maybe title it “How is AI trained?“ or “How does AI learn?“].

What makes LLMs different

The exciting part about the GPT models and related architectures from the Transformer family (details below) [#TODO make a blog post about encoder-only, decoder-only, and encoder-decoder models and link here] is that they are not trained for any specific task other than general text completion (or missing-text completion in some cases, like Google’s BERT model, which was one of the first famous Transformers). Preparing training data for this is easy, because nothing has to be curated for any specific purpose (as opposed to pairing English and Spanish sentences with the same meaning in order to learn translation). Instead, using self-supervised learning, these models are fed existing natural text one word at a time, with the repetitive task of predicting the following word. Each prediction is compared to the actual next word in the text, and the prediction error is used to correct the model’s parameters, leading to better and better performance over time. Missing-word prediction works similarly, except that the model also sees the words that follow the masked (missing) text, rather than only the preceding text (as in causal language modeling, described just before and used by the likes of GPT).
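To make this a bit more concrete, here is a minimal, purely illustrative Python sketch of how a single sentence can be turned into self-supervised training examples - once for next-word prediction (causal language modeling, GPT-style) and once for masked-word prediction (BERT-style). The toy sentence and the word-level splitting are simplifications I chose for illustration; real models operate on subword tokens and on billions of such examples.

```python
# Illustrative sketch (not any particular model's code): how one sentence becomes
# self-supervised training examples. Words stand in for tokens here for simplicity.

text = "the cat sat on the mat"
tokens = text.split()

# Causal language modeling (GPT-style): predict the next word from the preceding words.
causal_examples = [
    (tokens[:i], tokens[i])          # (context so far, word to predict)
    for i in range(1, len(tokens))
]
for context, target in causal_examples:
    print("context =", context, "-> predict:", target)

# Masked language modeling (BERT-style): hide a word and predict it from both sides.
masked_position = 2                  # hide "sat"
masked_input = tokens.copy()
masked_input[masked_position] = "[MASK]"
print(masked_input, "-> predict:", tokens[masked_position])
```

Notice that no human had to label anything: the text itself provides both the inputs and the correct answers.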

How LLMs came to be

The popular training paradigm first shifted from supervised learning (with manually prepared input-output pairs) to self-supervised learning (using unlabeled training data, meaning the absence of hand-curated target outputs) with the development of recurrent neural networks (RNNs) and their refinements, most notably Long Short-Term Memory (LSTM) networks, which process inputs and produce outputs sequentially, one element after another. Such models predict outputs one word or phrase at a time, based on the input and what has already been produced as output. Because each step depends on the previous one, these sequential predictions could not sensibly be spread across multiple computers or computational cores, and processing large amounts of text took a long time. There was also the problem of retaining important information from long text passages, since the models tended to overwrite “old memories” with more recent information from later in the text.
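The sequential bottleneck is easy to see in code. Below is a minimal, generic recurrent cell in Python with NumPy (a bare-bones stand-in, not an LSTM and not tied to any library): each step consumes the hidden state from the previous step, so step t cannot start before step t-1 has finished, and the fixed-size hidden state is all the model remembers of the text so far. All sizes and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, embed_size = 8, 4
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1   # recurrent weights
W_x = rng.normal(size=(hidden_size, embed_size)) * 0.1    # input weights

# A toy "sentence" of 6 word embeddings.
inputs = rng.normal(size=(6, embed_size))

h = np.zeros(hidden_size)   # the fixed-size memory of everything read so far
for t, x_t in enumerate(inputs):
    # Step t needs h from step t-1, so this loop cannot be parallelized over time.
    h = np.tanh(W_h @ h + W_x @ x_t)
    print(f"step {t}: hidden state norm = {np.linalg.norm(h):.3f}")

# Whatever the first words contributed now only survives inside this one vector `h`,
# which is why long-range information is easily overwritten.
```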

The balance between required computational resources and model performance did not reach an acceptable level until the introduction of the Transformer architecture, which the GPT family and most modern language models are based on. Presented in 2017 by Google researchers and built entirely around attention mechanisms, it facilitated a shift from purely sequential processing to massive parallelization. Graphics cards and similar computer chips, which can run many small computations in parallel (at the same time), made it feasible to train architectures with many more adjustable parameters (weights) on much larger training datasets in a reasonable amount of time. On top of that, attention mechanisms (specifically self-attention and cross-attention) elegantly improved the model’s ability to associate and remember important information from anywhere in the text, essentially by connecting words with certain similarities or relations to each other during internal processing within the Transformer.
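For the technically curious, here is a minimal NumPy sketch of the core computation, scaled dot-product attention, as described in the 2017 Transformer paper. The query, key, and value matrices here are random stand-ins; in a real model they are produced by learned projections of the token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of all value rows,
    weighted by how well that row's query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity between every query and every key
    weights = softmax(scores, axis=-1)         # attention weights, each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 16                           # 5 tokens, 16-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # row i shows how much token i "attends" to every other token
```

Unlike the recurrent loop above, all positions are handled in a few matrix multiplications at once, which is exactly the kind of workload graphics cards excel at.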

Coinciding with the availability of increasingly powerful graphics processors with large enough memory (VRAM) to house the huge number of model parameters and derivatives required during training, the LLM and Generative AI era emerged. Generative AI more widely refers to AI models that generate new content, like text, images, or videos. This prominently also includes diffusion models for image generation, but they are not the focus of this article (see this one if you’re interested) [#TODO diffusion model article].

In the past few years, LLM architectures and their training datasets have continued to grow, with model performance and learning capacity increasing accordingly. Today, a so-called pre-training on trillions of words of free text collected from the internet is used to adjust what are often hundreds of billions of model parameters, costing millions of dollars in computational power to produce a powerful generalist model like GPT-4 from scratch. The result of pre-training is usually called a foundation model, which can be thought of as having good general knowledge, but no specialization in any specific domain.

Where LLMs stand today

Big names like OpenAI, Anthropic, Google, Meta, Microsoft, and others around the world are racing to produce the best foundation models and break records. The modern LLMs developed this way have long since outperformed classic, specialized models from supervised learning by a wide margin in all kinds of tasks, like translation, writing, and coding. For a continuously updated ranking of the best available models across different task categories, we recommend checking out the LMSys Chatbot Arena Leaderboard.

Existing LLMs can be improved further with more data through continued training, and they can be specialized for specific domains or tasks, most popularly with a training method called fine-tuning, where typically only a small portion of the model’s pre-trained parameters is adjusted (often the last layers - see this post for details on what that means) [#TODO: Make “neural networks architectures” post about neurons and layers, maybe name “What are AI architectures?“]. This keeps the general knowledge intact, but achieves the specialized output format or tone demonstrated in the fine-tuning dataset. The result of fine-tuning is often called a model checkpoint, which again can be trained further. If you want an LLM to act in a very specific way and prompt engineering doesn’t get the job done, then fine-tuning is the way to go, and many resource-efficient methods have been developed that can be run on a single consumer GPU, or even a CPU if you are patient. The most popular of these is Low-Rank Adaptation (LoRA). The details are beyond the scope of this article, but it relies on a clever piece of linear algebra - the weight updates are factorized into a product of two small, low-rank matrices - and is worth checking out if you are well-versed in linear algebra.
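As a rough illustration of the idea behind LoRA (a simplified sketch, not the original implementation or any library’s API): instead of updating a large pre-trained weight matrix W directly, two much smaller matrices A and B are trained, and their product is added onto the frozen W. The layer size, rank, and scaling factor below are made-up example values, chosen just to show the parameter savings.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 1024, 1024, 8         # example layer size and LoRA rank (real LLM layers are larger)

W = rng.normal(size=(d_out, d_in))     # pre-trained weight matrix, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01  # small trainable matrix
B = np.zeros((d_out, r))               # small trainable matrix, initialized to zero
alpha = 16                             # scaling factor for the low-rank update

# Effective weights during and after fine-tuning: frozen W plus a low-rank update.
W_effective = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tuning would train {full_params:,} parameters per layer")
print(f"LoRA trains only {lora_params:,} ({100 * lora_params / full_params:.2f}%)")
```

Because only A and B are updated, the memory and compute needed for training shrink dramatically, which is what makes consumer hardware viable.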

A word of caution about the LLM hype

If you’ve made it this far, you really deserve a valuable take-home message. So, here it is:

Despite the astounding results, it is important to remember that LLMs are, in essence, still just predicting the most likely text continuation of your question or assignment, based on everything they have “read” during training. In the end, they are statistical predictors for text pieces (so-called tokens, usually short words or word fragments), assembling their reply by sampling one token after another, based solely on the text they were shown during training.
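Here is a toy sketch of what “sampling the next token” means. The candidate tokens and scores are entirely made up (no real model is involved): a real LLM assigns a score to every token in its vocabulary, those scores are turned into probabilities, and one token is drawn at random according to them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up scores ("logits") a model might assign to candidate next tokens
# after the prompt "The capital of France is". Purely illustrative values.
candidate_tokens = [" Paris", " a", " located", " Lyon", " beautiful"]
logits = np.array([6.0, 2.5, 2.0, 1.0, 0.5])

probabilities = np.exp(logits) / np.exp(logits).sum()   # softmax: scores -> probabilities
for token, p in zip(candidate_tokens, probabilities):
    print(f"{token!r}: {p:.1%}")

# The reply is assembled by drawing one token at a time from this distribution,
# then feeding the extended text back in to predict the next token, and so on.
next_token = rng.choice(candidate_tokens, p=probabilities)
print("sampled continuation:", next_token)
```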

The LLMs we have today are not reasoning machines that base their answers on logic, real understanding, or the rules of nature. Instead, they give you the most likely text continuation for your input (with some variance due to their top_p, top_k, and temperature settings, but that’s a topic for another time). Newer models like OpenAI’s o1 line mimic reasoning by internally breaking a problem or task down and solving it in smaller steps, but each step is still taken using the predictive mechanism just described. Therefore, you should always use them with caution, limit your trust in their answers, and remember that they are not really intelligent - just great memorizers that have become good at guessing your next word (or that of the internet).

Some may argue that this is already intelligence, since smart people usually know a lot (having memorized lots of information). But focusing on the difference between being smart and being educated, LLMs are only the latter. They are very well educated, but lack the ability to creatively think and freely form new opinions and interpretations by themselves, independent of what they have been demonstrated during training. In this regard, judging them as one would judge people, they are educated but not really smart. So, while using LLMs is extremely helpful for many purposes, never forget to think for yourself and be smart about your usage of AI.