Revolutionizing Content Generation: The Dynamic Duo of RLHF Unleashed - PPO and PEFT Fine-Tuned LLMs
PART - I
In this series of articles, we'll learn how RLHF (Reinforcement Learning from Human Feedback) is used to create less harmful content. Starting with the fundamentals of RLHF, we then use generative AI to learn how to summarize dialogue. We will also study zero-shot, one-shot, and few-shot inference.
RLHF
Reinforcement learning from human feedback (RLHF) is a machine learning approach that combines human guidance with reinforcement learning techniques, such as comparisons and rewards, to train an artificial intelligence (AI) agent.
Feedback from humans can be crucial at certain points in the development of an interactive or generative AI, such as a chatbot. Using human feedback on generated text helps make the model more effective, logical, and useful. In contrast to self-training alone, RLHF uses direct feedback from users and testers to improve the language model. RLHF is predominantly used in natural language processing (NLP) applications such as text-to-speech, summarization, chatbots, and conversational agents.
RLHF aims to develop language models that generate factually accurate and engaging text. It accomplishes this by first training a reward model, based on human feedback, to predict how humans would rate the quality of text generated by the language model. The language model is then fine-tuned using this reward model: it is rewarded for generating text that the reward model rates highly. RLHF also gives the model the ability to decline requests that fall outside acceptable bounds; for instance, models frequently refuse to produce content that promotes violence or is racist, sexist, or homophobic.
Training the LLM with RLHF
Training an LLM with RLHF typically includes the following stages:
Pretraining a language model (LM)
Preparing a dataset for human feedback and training a reward model
Fine-tuning the LM with reinforcement learning
Pretrained LLM : a model that has been trained on a very large corpus of data, often referred to as a base LLM. For example: flan-t5-base (a minimal loading sketch follows this list).
Instruct fine-tuned LLM : a model that has been fine-tuned on our custom data to perform specific tasks. For example: text summarization.
RLHF : depending on our preference, we can apply RLHF to either the base model or the instruction fine-tuned model; for the purposes of this article, I'm going to apply RLHF to the instruction fine-tuned LLM.
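As a quick illustration, a base model such as flan-t5-base can be loaded with the Hugging Face transformers library. This is a minimal sketch; the checkpoint name google/flan-t5-base and the variable names are assumptions for illustration, not code from this article.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the base (pretrained) LLM; flan-t5-base is a sequence-to-sequence model
base_model_name = "google/flan-t5-base"
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)

# This base checkpoint can be instruction fine-tuned on custom data
# (e.g. dialogue summarization) before RLHF is applied on top of it.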
Pretraining a language model (LM)
A pre-trained language model (LM) serves as the foundation for RLHF. The base model can be fine-tuned with additional text or specific conditions to better capture the structure and patterns of language. The goal of this stage is to enable the LM to generate reasonable text in response to a prompt.
Prepare dataset for human feedback
The first step in fine-tuning an LLM with RLHF is to choose a model to work with and use it to generate a dataset for human feedback. In the figure above, I've shown a fine-tuned LLM that receives a prompt and returns several completions (three in the example above). The next step is to gather feedback from human labelers on the completions generated by the LLM; this is the part of reinforcement learning that incorporates human feedback. We must first choose the criterion the humans will use to evaluate the completions. Once that is decided, we ask the labelers to assess each completion in the dataset according to that criterion.
Let's take a look at an example. In this case, the prompt is, "I like superhero movies and". We pass this prompt to the LLM, which then generates three different completions. The task for the human labelers is to rank the three completions in order of helpfulness, from most helpful to least helpful. The process is repeated for many prompt-completion sets in order to gather enough data to train the reward model that will eventually perform this task in place of humans. The same prompt-completion sets are typically given to multiple human labelers in order to establish consensus and lessen the impact of any poor labelers. Finally, we need to convert the ranking data into pairwise comparisons of completions before passing it to the reward model for training, as shown in the sketch below.
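As a rough sketch of that conversion step (the data structure and variable names below are hypothetical, not from the article), a single prompt's ranked completions can be expanded into (chosen, rejected) pairs like this:

from itertools import combinations

# Hypothetical example: three completions for one prompt, ranked by a labeler
# (1 = most helpful, 3 = least helpful).
ranked = [
    {"completion": "completion_a", "rank": 2},
    {"completion": "completion_b", "rank": 1},
    {"completion": "completion_c", "rank": 3},
]

# Build every pairwise comparison: the lower rank number is the preferred one.
pairs = []
for first, second in combinations(ranked, 2):
    preferred, rejected = (first, second) if first["rank"] < second["rank"] else (second, first)
    pairs.append({"chosen": preferred["completion"], "rejected": rejected["completion"]})

print(pairs)  # 3 completions -> 3 (chosen, rejected) pairs for reward-model training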
Training a reward model
The reward model can be either an instruction fine-tuned LLM or a pretrained LLM. Its objective is to predict a reward signal from the input prompt and the generated text.
For a given prompt x, the reward model learns to favour the human-preferred completion y_j by minimising -log sigmoid(r_j - r_k), the negative log sigmoid of the reward difference between the preferred and the rejected completion (a minimal sketch follows below). Once trained on the human-ranked prompt-completion pairs, the reward model can be used as a binary classifier that produces logits for the positive and negative classes. Logits are the unnormalized model outputs before any activation function is applied.
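A minimal sketch of this pairwise loss in PyTorch, assuming the reward model outputs one scalar reward per prompt-completion pair (the tensors below are illustrative placeholders, not real model outputs):

import torch
import torch.nn.functional as F

# r_j (rewards_chosen) and r_k (rewards_rejected): scalar rewards the model
# assigns to the preferred and the rejected completion for the same prompt.
rewards_chosen = torch.tensor([0.8, 1.2])     # illustrative values
rewards_rejected = torch.tensor([0.1, -0.3])

# Pairwise ranking loss: minimize -log(sigmoid(r_j - r_k)), which pushes the
# reward of the preferred completion above that of the rejected one.
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)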
Fine-tuning with reinforcement learning
We start with an instruct model that already performs well on our task of interest. We pass a prompt from our prompt dataset: here we pass "Artificial intelligence is for human" to the instruct model, which generates the completion "augmentation". Next, we send this completion together with the original prompt to the reward model as a prompt-completion pair. The reward model scores the pair based on the human feedback it was trained on; a higher value, like the 0.30 seen above, indicates a better-aligned response. The reinforcement learning algorithm receives this reward value for the prompt-completion pair and uses it to update the LLM's weights, steering it toward generating more aligned responses with higher rewards. If the process is working, we should see the reward improve with each iteration as the model generates text that is more closely in line with human preferences. This iterative process is repeated until the model meets the evaluation criteria. A rough sketch of one such update is shown below.
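Below is a rough, illustrative sketch of one such update using trl's classic PPOTrainer interface; the exact API varies between trl versions, and the hard-coded reward of 0.30 stands in for a score that would normally come from the trained reward model:

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Policy model being tuned, plus a frozen reference copy used for the KL penalty
policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), policy, ref_model, tokenizer)

# One prompt-completion pair
query = tokenizer.encode("Artificial intelligence is for human", return_tensors="pt")[0]
generated = policy.generate(query.unsqueeze(0), max_new_tokens=5)
response = generated[0][query.shape[0]:]  # keep only the newly generated tokens

# Reward for this pair (hard-coded here; normally produced by the reward model)
reward = [torch.tensor(0.30)]

# One PPO step: updates the policy weights toward higher-reward completions
stats = ppo_trainer.step([query], [response], reward)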
Instruction tuning
Instruction tuning is a technique that fine-tunes a language model on a collection of NLP tasks using instructions. Rather than training on a separate dataset for each task, the model is trained to complete tasks by following textual instructions. The model is fine-tuned on a collection of input and output examples for each task, which allows it to generalise to new tasks it hasn't been explicitly trained on, as long as a prompt describing the task is provided. Instruction tuning is especially useful when large datasets aren't readily available for certain tasks, since it improves the accuracy and efficiency of models. A hypothetical training record is sketched below.
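To make this concrete, here is what a single instruction-tuning record might look like; the record and its field names are hypothetical illustrations, not taken from any specific dataset:

# Hypothetical instruction-tuning example for a dialogue-summarization task
example = {
    "instruction": "Summarize the following conversation.",
    "input": "A: Are we still meeting at 5pm? B: Yes, see you at the cafe.",
    "output": "A and B confirm they will meet at the cafe at 5pm.",
}

# During fine-tuning, the instruction and input form the prompt and the model
# is trained to produce the output; mixing many such tasks teaches the model
# to follow instructions it has not seen before.
prompt = f"{example['instruction']}\n\n{example['input']}"
target = example["output"]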
Zero-shot learning
In zero-shot learning, a pre-trained LLM can generate responses to tasks it hasn't been specifically trained for. In this approach, the LLM is given an input text along with a prompt that describes the expected output in natural language. Even for prompts it hasn't been specifically trained on, a pre-trained model can use its general knowledge to generate coherent and relevant responses. Zero-shot learning can reduce the amount of time and data required while improving the efficiency and accuracy of NLP tasks. It is used in a variety of NLP tasks, including question answering, summarization, and text generation.
Zero Shot Inference
We will be using the pre-trained Large Language Model (LLM) meta-llama/Llama-2-7b-chat-hf from Hugging Face.
# Installing necessary packages
def get_necessary_packages_installed():
    # Install the packages needed for quantized model loading and generation
    # (uses notebook shell magic, so run this in Jupyter/Colab)
    !pip install -q -U trl transformers accelerate
    !pip install -q datasets bitsandbytes einops scipy
    print("<<<< Necessary Packages Installed >>>>")

get_necessary_packages_installed()
# Importing necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "meta-llama/Llama-2-7b-chat-hf"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
trust_remote_code=True
)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
def get_zeroshot_inference():
    # Zero-shot prompt: only the task description, no worked examples
    text = '''[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Can you summarize the following text :
Question answering (QA) systems are a type of natural language processing (NLP) technology that provide
precise and concise answers to questions posed in natural language. These systems have the potential to
revolutionize the way we access information and can be applied in a wide range of fields including education,
customer service, and health care.There are several approaches to building QA systems, including rule-based,
information retrieval, and machine learning-based approaches. Rule-based systems rely on predefined rules and
patterns to extract answers from a given text, while information retrieval systems use search algorithms to
retrieve relevant information from a large database. Machine learning-based systems, on the other hand, use
training data to learn to extract answers from text.One of the main challenges faced by QA systems is the need
to understand the context and intent behind a question. This requires the system to have a deep understanding of
the language and the ability to make inferences based on the given information. Another challenge is the need to
extract relevant information from a large and potentially unstructured dataset.Despite these challenges, QA
systems have a wide range of applications, including education, customer service, and health care.
.[/INST]'''
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=400)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
OUTPUT :
Question answering (QA) systems are a type of natural language processing (NLP) technology that provide precise and concise answers to questions posed in natural language. These systems have the potential to revolutionize the way we access information and can be applied in various fields such as education, customer service, and healthcare. There are different approaches to building QA systems, including rule-based, information retrieval, and machine learning-based approaches.
The main challenges faced by QA systems include the need to understand the context and intent behind a question, as well as the need to extract relevant information from a large and potentially unstructured dataset. Despite these challenges, QA systems have a wide range of applications and can be used in various industries.
One Shot Inference
In one-shot inference, the LLM is given one example of a prompt-response pair that matches our task before the actual prompt we want completed. This is called "in-context learning".
def get_oneshot_inference():
    # One-shot prompt: a single worked summarization example precedes the new request
    text = '''[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Can you summarize the following text :
Context: Question answering (QA) systems are a type of natural language processing (NLP) technology that provide
precise and concise answers to questions posed in natural language. These systems have the potential to
revolutionize the way we access information and can be applied in a wide range of fields including education,
customer service, and health care.There are several approaches to building QA systems, including rule-based,
information retrieval, and machine learning-based approaches. Rule-based systems rely on predefined rules and
patterns to extract answers from a given text, while information retrieval systems use search algorithms to
retrieve relevant information from a large database. Machine learning-based systems, on the other hand, use
training data to learn to extract answers from text.One of the main challenges faced by QA systems is the need
to understand the context and intent behind a question. This requires the system to have a deep understanding of
the language and the ability to make inferences based on the given information. Another challenge is the need to
extract relevant information from a large and potentially unstructured dataset.Despite these challenges, QA
systems have a wide range of applications, including education, customer service, and health care.
summary: Question answering (QA) systems are a type of natural language processing (NLP) technology that provide precise and concise answers to questions posed in natural language. rule-based,
information retrieval, and machine learning-based are different approaches in building QA systems.One of the main challenges faced by QA systems is the need
to understand the context and intent behind a question. Another challenge is the need to extract relevant information from a large and potentially unstructured dataset. QA
systems have a wide range of applications, including education, customer service, and health care. By understanding the current state of development and the potential impact of QA systems, we can better utilize these technologies to
improve various industries and enhance the way we access information.[/INST]
[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Can you summarize the following text
Context : Jupiter is the fifth planet from the Sun and the largest in the Solar System. It is a gas giant with a mass one-thousandth that of the Sun, but two-and-a-half times that of all the other planets in the Solar System combined. Jupiter is one of the brightest objects visible to the naked eye in the night sky, and has been known to ancient civilizations since before recorded history. It is named after the Roman god Jupiter.[19] When viewed from Earth, Jupiter can be bright enough for its reflected light to cast visible shadows,[20] and is on average the third-brightest natural object in the night sky after the Moon and Venus.
Jupiter is primarily composed of hydrogen with a quarter of its mass being helium, though helium comprises only about a tenth of the number of molecules. It may also have a rocky core of heavier elements,[21] but like the other giant planets, Jupiter lacks a well-defined solid surface. Because of its rapid rotation, the planet's shape is that of an oblate spheroid (it has a slight but noticeable bulge around the equator)
Summary :
.[/INST]'''
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=400)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
OUTPUT :
Jupiter is the largest planet in the Solar System, a gas giant with a mass one-thousandth that of the Sun but two-and-a-half times that of all other planets combined. It is named after the Roman god Jupiter and is known to ancient civilizations. Jupiter is bright enough to cast visible shadows when viewed from Earth and is on average the third-brightest natural object in the night sky after the Moon and Venus. The planet is primarily composed of hydrogen with a quarter of its mass being helium, and may also have a rocky core of heavier elements. Due to its rapid rotation, Jupiter has an oblate spheroid shape, with a slight bulge around the equator.
Few Shot Inference
In few-shot inference, the LLM is given several examples of prompt-response pairs that match our task before the actual prompt we want completed.
def get_few_shot_inference():
    # Few-shot prompt: several question-answering examples precede the new question
    text = '''[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Context: NLP Cloud was founded in 2021 when the team realized there was no easy way to reliably leverage Natural Language Processing in production.
Question: When was NLP Cloud founded?
Answer: 2021 [/INST]
[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Context: NLP Cloud developed their API by mid-2020 and they added many pre-trained open-source models since then.
Question: What did NLP Cloud develop?
Answer: API [/INST]
[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Context: All plans can be stopped anytime. You only pay for the time you used the service. In case of a downgrade, you will get a discount on your next invoice.
Question: When can plans be stopped?
Answer: Anytime [/INST]
[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Context: The main challenge with GPT-J is memory consumption. Using a GPU plan is recommended.
Question: Which plan is recommended for GPT-J?
Answer: GPU plan [/INST]
[INST]<<SYS>>
You are a helpful assistant that provides information based on the given query in context.
<</SYS>>
Context: The main advantage with GPT-J is less pricing.
Question: What is the main advantage of GPT-J?
Answer:
.[/INST]'''
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, min_length=1,max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
OUTPUT :
Less pricing