
5 reasons why large language models (LLMs) like ChatGPT use reinforcement learning instead of supervised learning for fine-tuning





With the huge success of generative artificial intelligence in recent months, large language models are continuously advancing and improving. These models are contributing to some noteworthy economic and social transformations. ChatGPT, the popular model developed by OpenAI, is a natural language processing model that lets users generate meaningful text much like a human would. It can also answer questions, summarize long paragraphs, write code and emails, and more. Other language models, such as the Pathways Language Model (PaLM) and Chinchilla, have also shown strong performance at imitating humans.

Large language models use reinforcement learning for fine-tuning. Reinforcement learning is a feedback-based machine learning method built around a reward system. An agent learns how to behave in an environment by performing actions and observing their results: it receives positive feedback (a reward) for every good action and a penalty for every bad one. LLMs like ChatGPT deliver outstanding performance thanks in large part to reinforcement learning.
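To picture this feedback loop, here is a minimal, purely illustrative sketch in Python. The actions, reward function, and update rule below are toy placeholders, not anything from an actual LLM training pipeline: the agent tries actions, receives a reward or penalty, and gradually shifts its preferences toward the rewarded behavior.

```python
import random

# Toy reward function standing in for the environment's feedback:
# a reward for the "good" action, a penalty for everything else.
def get_reward(action: str, desired: str) -> float:
    return 1.0 if action == desired else -1.0

actions = ["summarize", "translate", "answer"]
values = {a: 0.0 for a in actions}  # the agent's learned value for each action

for step in range(200):
    # Mostly pick the best-known action, but keep exploring occasionally.
    if random.random() < 0.2:
        action = random.choice(actions)
    else:
        action = max(values, key=values.get)
    reward = get_reward(action, desired="answer")
    values[action] += 0.1 * (reward - values[action])  # move the estimate toward the observed reward

print(values)  # the rewarded action ends up with the highest value
```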

ChatGPT uses Reinforcement Learning from Human Feedback (RLHF) to fine-tune the model while minimizing bias. But why not supervised learning? The RLHF setup starts from human-provided labels (rankings of candidate responses) that are used to train a reward model. So why can't those labels be used directly with a supervised learning approach? Sebastian Raschka, an AI and ML researcher, shared some reasons in a tweet as to why reinforcement learning is used for fine-tuning instead of supervised learning.
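To see where those labels fit in, here is a minimal sketch of the reward-modeling step of RLHF, in which human rankings of responses are turned into a scalar quality score. The tiny MLP, random feature vectors, and training loop below are simplifying assumptions; in practice the reward model is itself a large language model with a scalar output head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a reward model: a small MLP that maps a response
# representation to a single scalar score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical human preference data: for each prompt, a "chosen" (preferred)
# and a "rejected" response, already encoded as feature vectors.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for epoch in range(10):
    r_chosen = reward_model(chosen)      # scalar score for preferred responses
    r_rejected = reward_model(rejected)  # scalar score for dispreferred responses
    # Pairwise ranking loss: push the chosen score above the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```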


  1. The first reason not to use supervised learning is that it only predicts ranks; it does not produce coherent responses. The model would simply learn to give high scores to responses similar to the training set, even when they are not coherent. RLHF, on the other hand, trains a reward model to estimate the quality of the generated response rather than just a ranking score.
  2. Sebastian Raschka raises the idea of reframing the task as a constrained optimization problem using supervised learning, where the loss function combines the output-text loss with a reward-score term (see the first sketch after this list). This would yield better-quality generated responses and ranks, but it works reliably only when the goal is to produce correct question-answer pairs. Cumulative rewards are also needed to enable coherent conversations between the user and ChatGPT, which supervised learning alone cannot provide.
  3. The third reason not to opt for SL is that it uses cross-entropy to optimize the loss at the token level. At the token level, altering an individual word in a response may have only a small effect on the overall loss, yet in the complex task of generating coherent conversation, negating a single word can completely shift the meaning of the context (see the second sketch after this list). Relying on SL alone is therefore not enough, and RLHF is needed to account for the context and coherence of the entire conversation.
  4. Supervised learning can be used to train a model, but RLHF has been found to work better empirically. A 2020 paper, “Learning to Summarize from Human Feedback,” showed that RLHF outperforms SL, because RLHF accounts for cumulative rewards over coherent conversations, which SL fails to capture due to its token-level loss function.
  5. LLMs such as InstructGPT and ChatGPT use both supervised learning and reinforcement learning, and the combination of the two is crucial for optimal performance. The model is first fine-tuned with SL and then further updated with RLHF. The SL phase lets the model learn the basic structure and content of the task, while the RLHF phase refines the model’s responses for greater accuracy.
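For point 2, here is a hypothetical sketch of what such a combined supervised objective could look like: a token-level cross-entropy loss over the output text plus a reward-score term. The weighting factor and the random tensors standing in for real model outputs are illustrative assumptions, not an actual implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target_ids, reward_score, alpha=0.1):
    """Supervised text loss plus a reward term.

    logits:       (seq_len, vocab_size) model outputs for the generated response
    target_ids:   (seq_len,) reference token ids
    reward_score: scalar quality estimate for the whole response (e.g. from a reward model)
    alpha:        hypothetical weighting between the two terms
    """
    text_loss = F.cross_entropy(logits, target_ids)  # token-level supervised loss
    return text_loss - alpha * reward_score          # higher reward lowers the overall loss

# Toy usage with random tensors standing in for real model outputs.
logits = torch.randn(12, 1000)
target_ids = torch.randint(0, 1000, (12,))
print(combined_loss(logits, target_ids, reward_score=torch.tensor(0.8)))
```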
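For point 3, the toy example below uses made-up token ids and artificial logits to show why a token-level cross-entropy loss is insensitive to a single negated word: swapping one token out of fifty barely moves the average loss, even though that one word can flip the meaning of the whole response.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
seq_len = 50

# Made-up token ids for a 50-token reference response.
reference = torch.randint(0, vocab_size, (seq_len,))
# The same response with a single token swapped (e.g. a negation inserted).
negated = reference.clone()
negated[10] = (reference[10] + 1) % vocab_size

# Artificial model logits that put most probability mass on the reference tokens.
logits = torch.full((seq_len, vocab_size), -2.0)
logits[torch.arange(seq_len), reference] = 2.0

print(F.cross_entropy(logits, reference).item())  # loss on the reference response
print(F.cross_entropy(logits, negated).item())    # one token out of fifty differs, so the average loss barely moves
```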


Tanya Malhotra is a final-year student at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, coupled with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.












