Why do Large Language Models like ChatGPT use Reinforcement Learning over Supervised Learning for fine-tuning?

Large Language Models (LLMs) like ChatGPT have taken the world by storm in the last few years. They have shown tremendous progress in natural language processing and have contributed to some significant economic and societal transformations. These models can generate meaningful text, answer questions, summarize long paragraphs, write code and emails, and much more.

To achieve such exceptional performance, LLMs use reinforcement learning for fine-tuning. Reinforcement Learning is a feedback-driven Machine Learning method built around a reward system: an agent learns to act in an environment by taking actions and observing their results, receiving positive feedback for good actions and a penalty for bad ones. LLMs like ChatGPT use Reinforcement Learning from Human Feedback (RLHF) to fine-tune the model so that its responses align with human preferences rather than merely imitating the training data.
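To make that reward-driven loop concrete, here is a minimal, purely illustrative Python sketch of an agent learning from positive and negative feedback. The actions, reward values, and update rule are invented for this example; this is not ChatGPT's actual training procedure.

```python
import random

# Minimal, illustrative sketch of the reward-driven feedback loop described above.
# The actions and reward values are invented; this is not ChatGPT's training code.
actions = ["good_response", "bad_response"]
value = {a: 0.0 for a in actions}   # the agent's running estimate of each action's worth
learning_rate = 0.1

def reward(action: str) -> float:
    # Positive feedback for a good action, a penalty for a bad one.
    return 1.0 if action == "good_response" else -1.0

for step in range(100):
    # Mostly pick the action with the highest estimated value, but explore occasionally.
    if random.random() < 0.2:
        action = random.choice(actions)
    else:
        action = max(value, key=value.get)
    observed = reward(action)
    # Nudge the estimate toward the observed reward (a simple bandit-style update).
    value[action] += learning_rate * (observed - value[action])

print(value)  # the "good" action ends up with the higher estimated value
```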


But why use Reinforcement Learning instead of Supervised Learning for fine-tuning? In this blog post, we will discuss five reasons why LLMs like ChatGPT use Reinforcement Learning.

Predicting Ranks vs. Producing Coherent Responses

Supervised Learning only learns to predict rank labels; it does not learn to produce coherent responses. The model simply learns to give high scores to responses that resemble the training set, even if they are not coherent. RLHF, on the other hand, trains a reward model to estimate the quality of the produced response itself rather than just a ranking score, and that estimate is used to steer generation toward responses that are coherent and make sense in context.
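As a rough illustration of this idea, the sketch below trains a toy reward model on pairwise human preferences so that it assigns a higher scalar quality score to the preferred response. The architecture, embedding size, and data are invented for this example; real RLHF reward models score full model outputs, not random vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a response embedding to a single scalar quality score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Invented "embeddings" standing in for responses that human labelers compared.
preferred = torch.randn(4, 16)   # responses the labelers ranked higher
rejected = torch.randn(4, 16)    # responses the labelers ranked lower

for _ in range(50):
    # Pairwise preference loss: push the preferred response's score above the rejected
    # one's, so the model learns to estimate quality rather than reproduce a rank label.
    loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```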

Cumulative Rewards for Coherent Conversations

Reinforcement Learning can optimize a cumulative reward over an entire conversation, something Supervised Learning fails to capture because its token-level loss scores each token in isolation. ChatGPT uses RLHF to fine-tune the model so that conversations remain coherent and flow naturally.
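The toy snippet below contrasts the two objectives: a supervised cross-entropy accumulated token by token versus a single sequence-level reward for the whole reply. The tensors and the stand-in reward model are invented for illustration only.

```python
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(6, vocab_size)           # invented model predictions for 6 reply tokens
targets = torch.randint(0, vocab_size, (6,))  # invented reference tokens

# Supervised fine-tuning: the loss is accumulated token by token,
# with no signal about whether the reply as a whole is coherent.
token_level_loss = F.cross_entropy(logits, targets)

# RLHF: a single scalar reward judges the entire reply at once, so credit
# is assigned to the whole sequence rather than to individual tokens.
def reward_model(reply_tokens: torch.Tensor) -> torch.Tensor:
    return torch.tensor(1.0)                  # stand-in for a learned reward model's score

sequence_level_reward = reward_model(targets)
print(token_level_loss.item(), sequence_level_reward.item())
```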

Context and Coherence of Entire Conversation

Supervised Learning uses cross-entropy to optimize a token-level loss, so it does not account for the context and coherence of the entire conversation. In conversation, negating a single word can completely change the meaning of a response, yet a token-level loss barely registers the difference. Supervised Learning alone is therefore not sufficient, and RLHF is needed to judge the context and coherence of the whole conversation.
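The following toy example illustrates the negation problem: two replies that differ by a single token look almost identical to a per-token comparison even though their meanings are opposite. The sentences and the naive matching metric are invented stand-ins for a token-level loss.

```python
# Two invented replies that differ by one token but have opposite meanings.
reference = ["the", "answer", "is", "correct"]
generated = ["the", "answer", "is", "not", "correct"]   # hypothetical model output

# A naive token-by-token comparison (a stand-in for a per-token loss) sees
# only a small mismatch, even though the meaning has flipped entirely.
matches = sum(1 for ref, gen in zip(reference, generated) if ref == gen)
print(f"{matches}/{len(reference)} positions match")    # prints "3/4 positions match"

# A sequence-level judge (the role of the RLHF reward model) evaluates the
# full sentence and can penalize the contradiction directly.
```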

Empirical Performance

While Supervised Learning can be used to train a model, empirical evidence shows that RLHF tends to perform better. The 2020 paper "Learning to Summarize from Human Feedback" (Stiennon et al.) showed that human evaluators preferred summaries from RLHF-trained models over those from supervised baselines. The reason is that the sequence-level reward captures qualities of the whole output that a token-level loss misses.

Combination of Supervised and Reinforcement Learning

LLMs like InstructGPT and ChatGPT use both Supervised Learning and Reinforcement Learning, and the combination is crucial for attaining optimal performance. The model is first fine-tuned with Supervised Learning and then further updated with RLHF. The supervised stage lets the model learn the basic structure and content of the task, while the RLHF stage refines its responses so they better match human preferences.
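A hedged sketch of that two-stage recipe is shown below: a supervised fine-tuning step on demonstration data, followed by a simple reinforcement update against a stand-in reward model (a REINFORCE-style objective rather than the full PPO loop used in practice). All names, dimensions, and data here are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# All names, dimensions, and data below are invented; a real pipeline fine-tunes a full
# language model and uses PPO with a learned reward model, which is omitted here.
vocab_size, hidden = 100, 32
policy = nn.Linear(hidden, vocab_size)            # stand-in for the language-model head
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1: supervised fine-tuning on demonstration data (token-level cross-entropy).
states = torch.randn(8, hidden)
demo_tokens = torch.randint(0, vocab_size, (8,))
sft_loss = F.cross_entropy(policy(states), demo_tokens)
optimizer.zero_grad()
sft_loss.backward()
optimizer.step()

# Stage 2: reinforcement step against a stand-in reward model (REINFORCE-style objective).
def reward_model(tokens: torch.Tensor) -> torch.Tensor:
    return torch.ones(tokens.shape[0])            # pretend every sampled reply scores 1.0

dist = torch.distributions.Categorical(logits=policy(states))
sampled = dist.sample()                           # sample "responses" from the current policy
rl_loss = -(reward_model(sampled) * dist.log_prob(sampled)).mean()
optimizer.zero_grad()
rl_loss.backward()
optimizer.step()
```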

In conclusion, Reinforcement Learning is crucial for fine-tuning Large Language Models like ChatGPT. The combination of Supervised and Reinforcement Learning is essential for optimal performance, but it is Reinforcement Learning that captures the context and coherence of an entire conversation through cumulative rewards. These models are constantly advancing, and we can expect to see even more impressive feats in the future.
