Revolutionizing Content Generation: The Dynamic Duo of RLHF Unleashed - PPO and PEFT Fine-Tuned LLMs
PART III
In this part, we will implement RLHF for text summarization. We will use the tatsu-lab/alpaca dataset from Hugging Face. This article is divided into three sections: the first deals with data preparation, the second with instruct fine-tuning, and the third with fine-tuning FLAN-T5 with Reinforcement Learning (PPO) and PEFT.
Section I : Data Preparation
We are going to take the tatsu-lab/alpaca dataset from Hugging Face.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used for instruction-tuning language models and making them follow instructions better.
Data Preprocessing
Since we are only interested in text summarization, we will filter the data to the records whose instruction matches one of the following:
“summarize the given passage.” or “find the main idea of the following passage” or “summary”.
#installing necessary libraries
!pip install torch==1.13.1 --quiet
!pip install torchdata==0.5.1 --quiet
!pip install \
transformers==4.27.2 \
datasets==2.11.0 \
evaluate==0.4.0 \
rouge_score==0.1.2 \
loralib==0.1.1 \
peft==0.3.0 --quiet
#importing necessary libraries
from datasets import load_dataset
import pandas as pd
train_dataset=load_dataset("tatsu-lab/alpaca",split="train")
def check_instruction(instruction):
    # Select records whose instruction column asks for a summary.
    instruction = instruction.lower()
    return "summarize the given passage." in instruction or \
           "find the main idea of the following passage" in instruction or \
           "summary" in instruction
selected_records = [record for record in train_dataset if check_instruction(record["instruction"])]
df = pd.DataFrame(selected_records)
#selecting particular columns
df_select=df[['input','output']]
#removing empty records
df1 = df_select[df_select['input'].str.len() > 0]
#separating out training data and testing data
train_data=df1[:178]
train_data.to_csv('train_alpaca_dataset_summary.csv',index=False)
test_data = df1[178:].reset_index(drop=True)
test_data.to_csv('test_alpaca_dataset_summary.csv',index=False)
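As a quick optional check (not part of the original flow), you can print the sizes of the two splits:
# Optional sanity check: confirm how many summarization records ended up in each split.
print(f"Training records: {train_data.shape[0]}, test records: {test_data.shape[0]}")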
Data Preparation Code link:
Section II : Instruct Fine-tuning
In this section, we are going to fine-tune FLAN-T5 using the PEFT method. train_alpaca_dataset_summary.csv is used for fine-tuning.
#importing necessary libraries
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
dataset=load_dataset("csv",data_files="/content/train_alpaca_dataset_summary.csv")
#loading the Model
model_name='google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
The following function reports the total number of model parameters and how many of them are trainable.
def trainable_model_parameters(model):
    # Count how many of the model's parameters are trainable.
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"
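For reference, you can call it on the base model before adding any adapters; at this point all of the original FLAN-T5 parameters are trainable.
print(trainable_model_parameters(original_model))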
The PEFT/LoRA model must be configured for fine-tuning with a new adapter layer/parameters. We use PEFT/LoRA to freeze the underlying LLM and train only the adapter. Examine the LoRA configuration below and note the rank (r) hyperparameter, which defines the rank/dimension of the adapter to be trained.
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
r=32, # Rank
lora_alpha=32,
target_modules=["q", "v"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)
Add the LoRA adapter layers/parameters to the original LLM to be trained.
peft_model = get_peft_model(original_model,
lora_config)
print(trainable_model_parameters(peft_model))
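The Trainer below expects a tokenized dataset named tokenized_datasets, which the code above does not create. Here is a minimal sketch of that step, assuming the input and output columns from the CSV produced in Section I; the prompt template and max_length padding are illustrative choices, not the article's exact setup.
def tokenize_function(example):
    # Wrap each passage with a summarization instruction, then tokenize inputs and targets.
    prompt = [f"Summarize the following passage.\n\n{text}\n\nSummary: " for text in example["input"]]
    example["input_ids"] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example["labels"] = tokenizer(example["output"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["input", "output"])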
Define training arguments and create a Trainer instance.
output_dir = f'./alpaca-summary-training-{str(int(time.time()))}'
peft_training_args = TrainingArguments(
output_dir=output_dir,
auto_find_batch_size=True,
learning_rate=1e-3, # Higher learning rate than full fine-tuning.
num_train_epochs=1,
logging_steps=1,
max_steps=1
)
peft_trainer = Trainer(
model=peft_model,
args=peft_training_args,
train_dataset=tokenized_datasets["train"],
)
Now everything is ready to train the PEFT adapter and save the model.
peft_trainer.train()
peft_model_path="./peft-alpaca-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)
Inferencing
#loading peft Model
from peft import PeftModel, PeftConfig
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
peft_model = PeftModel.from_pretrained(peft_model_base,
'/content/peft-alpaca-summary-checkpoint-local/',
torch_dtype=torch.bfloat16,
is_trainable=False)
#loading test data
dataset_test=load_dataset("csv",data_files="/content/test_alpaca_dataset_summary.csv")
dash_line = '*'.join('' for x in range(100))
#testing the result
index = 10
test_input_data = dataset_test['train'][index]['input']
baseline_human_summary = dataset_test['train'][index]['output']
prompt = f"""
Summarize the following conversation.
{test_input_data}
Summary: """
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=300, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
print(dash_line)
print(f'REFERENCE SUMMARY:\n{baseline_human_summary}')
print(dash_line)
print(f'PEFT MODEL SUMMARY: {peft_model_text_output}')
print(dash_line)
***************************************************************************************************
REFERENCE SUMMARY:
Cloud computing has grown rapidly in popularity, offering businesses a cost-effective way to access scalable services over the internet. It is used for a range of tasks from data analysis to customer management, making it suitable for businesses of all sizes.
***************************************************************************************************
PEFT MODEL SUMMARY: Cloud computing is a powerful and affordable way to make business processes more efficient.
***************************************************************************************************
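Since evaluate and rouge_score are already installed, you can optionally score the PEFT summary against the human reference. This is a small addition, not part of the original flow:
import evaluate
rouge = evaluate.load('rouge')
# Compare the generated summary with the human-written reference for this example.
rouge_scores = rouge.compute(predictions=[peft_model_text_output],
                             references=[baseline_human_summary])
print(rouge_scores)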
Saving the Model to Hub
from huggingface_hub import notebook_login
notebook_login()
peft_model.push_to_hub("flan-t5_fine_tuned_summarization_alpaca_updated_final")
Training Code link
Section III : Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT
In this section, we will use Meta AI's hate speech reward model to fine-tune a FLAN-T5 model to generate less toxic content. The reward model is a binary classifier that predicts "hate" or "not hate" for the provided text. To fine-tune and lower the model's toxicity, we will employ Proximal Policy Optimisation (PPO).
#importing necessary libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler
import torch
import evaluate
import numpy as np
import pandas as pd
# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()
Loading Data and FLAN-T5 Model
dataset_original = load_dataset("csv",data_files="/content/train_alpaca_dataset_summary.csv")
model_name="google/flan-t5-base"
The following function is used to find the number of trainable parameters.
def trainable_model_parameters(model):
    # Count how many of the model's parameters are trainable.
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"
We now load the PEFT model that we fine-tuned in Section II from the Hub.
lora_config = LoraConfig(
r=32, # Rank
lora_alpha=32,
target_modules=["q", "v"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(model,
'Sakil/flan-t5_fine_tuned_summarization_alpaca_updated_final',
lora_config=lora_config,
torch_dtype=torch.bfloat16,
device_map="auto",
is_trainable=True)
print(f'PEFT model parameters to be updated:\n{trainable_model_parameters(peft_model)}\n')
Prepare the Proximal Policy Optimization (PPO) model by passing in the instruct-fine-tuned PEFT model. PPO will be used to optimize the RL policy against the reward model.
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
torch_dtype=torch.bfloat16,
is_trainable=True)
print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)
Create a reference model from a frozen copy of the PPO model; it will not be fine-tuned. The reference model represents the LLM before detoxification, and none of its parameters are updated during PPO training.
ref_model = create_reference_model(ppo_model)
print(f'Reference model parameters to be updated:\n{trainable_model_parameters(ref_model)}\n')
Prepare Reward Model
For the reward model, we will employ Meta AI's RoBERTa-based hate speech model. This model outputs logits and predicts probabilities for two classes: nothate and hate. The logit for the nothate class is used as the positive reward signal, and the model is then fine-tuned with PPO based on these reward values.
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)
non_toxic_text = "I dont like the movie."
toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids
# Move the toxicity model to the same device as the input tensor
toxicity_model.to(toxicity_input_ids.device)
logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')
# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')
# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')
toxic_text = "Today is very bad weather in Bangalore,terrible"
toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids
logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')
# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')
# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')
device = 0 if torch.cuda.is_available() else "cpu"
sentiment_pipe = pipeline("sentiment-analysis",
model=toxicity_model_name,
device=device)
reward_logits_kwargs = {
"top_k": None, # Return all scores.
"function_to_apply": "none", # Set to "none" to retrieve raw logits.
"batch_size": 16
}
reward_probabilities_kwargs = {
"top_k": None, # Return all scores.
"function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
"batch_size": 16
}
print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))
Initialize PPOTrainer
A collator is required to initialize the PPOTrainer. It is a function that converts a list of dictionaries into a single dictionary of lists, which is the batch format the trainer expects.
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
The following function preprocesses the data and builds the dataset.
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):
    """
    Preprocess the dataset and split it into train and test parts.
    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Path or name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.
    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """
    # Load the dataset (only the "train" part is needed here).
    dataset = load_dataset("csv", data_files=dataset_name, split="train")
    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["input"]) > input_min_text_length and len(x["input"]) <= input_max_text_length, batched=False)
    # Prepare the tokenizer. Setting device_map="auto" allows switching between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
    def tokenize(sample):
        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.
{sample["input"]}
Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample
    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)
    return dataset_splits
dataset = build_dataset(model_name=model_name,
dataset_name='/content/train_alpaca_dataset_summary.csv',
input_min_text_length=200,
input_max_text_length=1000)
print(dataset)
Load the ppo_model and the tokenizer. We will also load a frozen version of the model (ref_model). The first model is optimized, while the second is used to calculate the KL-divergence from the starting point. This serves as an extra reward signal in PPO training, ensuring that the optimized model does not deviate too far from the original LLM.
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16
config = PPOConfig(
model_name=model_name,
learning_rate=learning_rate,
ppo_epochs=max_ppo_epochs,
mini_batch_size=mini_batch_size,
batch_size=batch_size
)
ppo_trainer = PPOTrainer(config=config,
model=ppo_model,
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset["train"],
data_collator=collator)
Fine-Tune the Model
The fine-tuning loop consists of the following main steps:
Get the query responses from the policy LLM (PEFT model).
Get sentiments for query/responses from hate speech RoBERTa model.
Optimize policy with PPO using the (query, response, reward) triplet.
The training run is working if we see the following metrics appearing:
objective/kl: minimize KL divergence
ppo/returns/mean: maximize mean returns
ppo/policy/advantages_mean: maximize advantages
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)
generation_kwargs = {
"min_length": 5,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True
}
reward_kwargs = {
"top_k": None, # Return all scores.
"function_to_apply": "none", # You want the raw logits without softmax.
"batch_size": 16
}
max_ppo_steps = 10
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break
    prompt_tensors = batch["input_ids"]
    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []
    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]
    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)
    # Use the `nothate` item because this is the score for the positive `nothate` class.
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]
    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))
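At this point you may also want to persist the PPO-tuned model, mirroring what we did in Section II. This is an optional step and the path is illustrative:
# Save the PPO-tuned policy and tokenizer locally (optional).
ppo_model_path = "./ppo-alpaca-summary-checkpoint-local"
ppo_trainer.model.save_pretrained(ppo_model_path)
tokenizer.save_pretrained(ppo_model_path)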
Evaluating Reward Model on Test data
batch_size = 16
compare_results = {}
df_batch = dataset["test"][0:batch_size]
compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]
summary_tensors_ref = []
summary_tensors = []
# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    # Response from the frozen reference model (before PPO).
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)
    # Response from the PPO-tuned model (after PPO).
    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)
# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]
# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]
texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
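Finally, you can summarize the change in reward across the batch and inspect the rows with the largest improvement. This is a small addition for readability, not part of the original code:
# Aggregate reward improvement and the examples that improved the most.
print(f'Mean reward before PPO: {np.mean(df_compare_results["reward_before"]):.3f}')
print(f'Mean reward after PPO: {np.mean(df_compare_results["reward_after"]):.3f}')
print(df_compare_results_sorted.head())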