Team motto: “I can’t fully understand what I can’t create.” This is a final project for the NLP course (CS554) at WPI. Team members: Zhiyang Zhang & Vivek Choudhary, Jixing Zhou, Amrut Savadatti Special guests: Ningcong Chen(Our Tech Consultant), Shivam Shinde(A Tech Blogger)

Code: https://github.com/Zhiyang-Z/CS554_NLP_Team7 ★ Our code is very direct, concise, without too many wraps, easy to understand with reference to this blog.

Stage 2: From GPT to ChatGPT

Genesis 2:18 “The Lord God said, “It is not good for the man to be alone. I will make a helper suitable for him.””

After Stage 2, model awakens to the presence of others, and begins to converse.(the appearance of Eve)

Stage 1: From Zero to GPT

Genesis 2:7 “Then the Lord God formed a man from the dust of the ground and breathed into his nostrils the breath of life, and the man became a living being.”

After Stage 1, the knowledgeable model speaks endlessly, unaware of any other soul.(the lonely Adam)

★ The most important training technics have been covered in Stage 1: From Zero to GPT. Technics used there are also suitable for Stage 2, Supervised Fine Tuning(SFT).

1. Finetune to converse

In this stage, we fine-tune the pretrained model to handle real conversations—specifically, multi-turn interactions. Unlike pretraining, where the model simply predicts the next token without true dialogue structure, supervised fine-tuning (SFT) teaches the model how to engage in back-and-forth conversation. As the name suggests, SFT requires training the model on datasets composed of conversational text. To help the model understand and differentiate the roles within a dialogue, we extend the tokenizer with special tokens that mark each speaker. In a typical human–machine conversation, there are two roles: the human and the assistant. We also need tokens to explicitly mark the start and end of each role’s message. In total, we introduce four new special tokens:

<|user_start|>: Mark human text start
<|user_end|>: Mark human text end
<|assistant_start|>: Mark machine text start
<|assistant_end|>: Mark machine text end

⚠️ Caveat 1: In today’s advanced language models, conversations often involve more than just two roles. Many models introduce extra special tokens to support specialized behaviors or task-specific prompting. For example:

<|system_start|> and <|system_end|>Used for system-level instructions that guide the model’s behavior—such as summarization, rewriting, or other task-specific control prompts.
<|code_start|> and <|code_end|> Used in code-generation scenarios to clearly separate natural-language text from programming-language code, helping the model understand which mode it should operate in.
<tools> and </tools> Used in AI agent frameworks to provide the model with a list of available tools or functions it can call during execution.

In our case, to keep things simple, we only use user and assistant roles.