
Conversation_template #917

Open
showgood880702 opened this issue Nov 6, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@showgood880702

Could you tell me how to use the conversation_template in the chatbot? I used a training dataset that follows the Llama-3 conversation_template, but there doesn't seem to be an argument for setting a conversation_template in chatbot.py. Should I use --prompt_structure to pass the Llama-3 template as an argument?

Also, when training on Llama-3, should my dataset always follow its conversation_template?

Thank you so much.

@wheresmyhair
Collaborator

Hi, first of all, thanks for your interest in LMFlow! Regarding your questions:

  1. conversation_template only works for model training (finetuning) with a conversation dataset (i.e., "type": "conversation" in the .json file). It is responsible for adding the special tokens, so you don't need to add them yourself for different models. See here for a dataset example, or you could run
cd data
bash download.sh alpaca

and take the json file in train_conversation as a reference.
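For reference, a conversation-type dataset roughly looks like the sketch below (the field names here are from memory of the LMFlow docs, so treat them as an assumption and verify against the downloaded example):

{
  "type": "conversation",
  "instances": [
    {
      "system": "You are a chatbot developed by LMFlow team.",
      "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."}
      ]
    }
  ]
}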

  2. For inference, you may try the following code, taken from the Llama HF repo, as a temporary workaround:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# The pipeline applies the model's chat template to `messages` automatically.
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
# `generated_text` holds the full message list; the last entry is the assistant reply.
print(outputs[0]["generated_text"][-1])

chatbot.py is outdated, and we're planning to upgrade it. As of now, it is not compatible with instruction/chat models. Sorry for the inconvenience.
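If you want more control than the pipeline gives you (e.g., to inspect the templated string), here is a minimal sketch using plain transformers, assuming the same Llama-3.2 checkpoint as above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# apply_chat_template renders the Llama-3 special tokens
# (<|start_header_id|>, <|eot_id|>, ...) so you don't write them by hand.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))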

@wheresmyhair wheresmyhair added the enhancement New feature or request label Nov 7, 2024
@showgood880702
Author

Thank you for the explanation. However, I'm still a bit confused about the conversation dataset structure. For the training dataset, should I put the templated data in as {"type": "text_only", "instances": [...]}? It confuses me how I'm supposed to put data into {"type": "conversation", "instances": []} when it has already been run through a conversation template.

@wheresmyhair
Collaborator

wheresmyhair commented Nov 8, 2024

If the data is already templated, you can choose based on the expected behavior.
The reason we designed the conversation dataset type is that we want to not only do the tokenization and templating but also mask the user inputs, system prompts, and tool information, since the model sees them all at once and there's no need to generate them autoregressively. In other words, you do not need to train_on_prompt. The conversation dataset also supports multi-round conversations, and the mask will look like [1,1,1,1,0,0,0,1,1,1,0,0,0] (1 = masked prompt token, 0 = assistant token) for, say, a conversation that has two rounds.
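To make the masking concrete, here is a toy sketch of the idea (illustrative only, not LMFlow's actual implementation; the token ids are made up):

# Loss mask for a two-round conversation: 1 = prompt/system token
# (ignored by the loss), 0 = assistant token (trained on).
input_ids = [101, 5, 6, 7, 30, 31, 32, 8, 9, 10, 40, 41, 42]
mask      = [  1, 1, 1, 1,  0,  0,  0, 1, 1,  1,  0,  0,  0]

# In practice, masked positions become -100 in the labels,
# which Hugging Face loss functions ignore.
labels = [tok if m == 0 else -100 for tok, m in zip(input_ids, mask)]
print(labels)  # [-100, -100, -100, -100, 30, 31, 32, -100, -100, -100, 40, 41, 42]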

You can use the text_only dataset type if you've already organized each conversation into one string. The json file should then look like:

{
  "type": "text_only",
  "instances": [
    {"text": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI am a chatbot developed by LMFlow team.<|eot_id|>"},
    {"text": "SOME_OTHER_TEMPLATED_TEXT_2"},
    {"text": "SOME_OTHER_TEMPLATED_TEXT_3"}
  ]
}

However, we cannot mask the prompt in this case, since it is extremely hard to parse out which tokens should be masked. In other words, you do train_on_prompt.

Alternatively, the text2text dataset type will mask all content in input. If it's a single-round conversation, this should be fine (there is no difference between a templated text2text dataset and a conversation dataset once you set conversation_template correctly).

{
  "type": "text2text",
  "instances": [
    {
      "input": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
      "output": "I am a chatbot developed by LMFlow team.<|eot_id|>"
    },
    {
      "input": "SAMPLE_INPUT_2",
      "output": "SAMPLE_OUTPUT_2"
    },
    {
      "input": "SAMPLE_INPUT_3",
      "output": "SAMPLE_OUTPUT_3"
    }
  ]
}

@showgood880702
Author

Thank you for your explanation.
I still have a question about how to build a chatbot. Will the way I build the chatbot differ depending on whether I use the "type": "conversation" dataset or the Llama-3 templated text? Also, the code below seems unable to hold a multi-round conversation:

import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
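(For reference, one way to get multi-round behavior out of the same pipeline is to append each assistant reply back onto messages before the next call; a minimal sketch under that assumption, not LMFlow's chatbot:)

import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
]
for user_input in ["Who are you?", "What is your ship called?"]:
    messages.append({"role": "user", "content": user_input})
    outputs = pipe(messages, max_new_tokens=256)
    # generated_text holds the whole conversation; the last entry is the new assistant message.
    reply = outputs[0]["generated_text"][-1]
    print(reply["content"])
    messages.append(reply)  # keep the reply so the next round has full history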
