Team motto: “I can’t fully understand what I can’t create.”

This is a final project for the NLP course (CS554) at WPI.

Team members: Zhiyang Zhang & Vivek Choudhary, Jixing Zhou, Amrut Savadatti
Special guests: Ningcong Chen (our tech consultant), Shivam Shinde (a tech blogger)

Code: https://github.com/Zhiyang-Z/CS554_NLP_Team7

★ Our code is direct and concise, with few layers of wrapping, so it is easy to follow alongside this blog.

Stage 1: From Zero to GPT

Genesis 2:7 “Then the Lord God formed a man from the dust of the ground and breathed into his nostrils the breath of life, and the man became a living being.”

After Stage 1, the knowledgeable model speaks endlessly, unaware of any other soul (the lonely Adam).

Stage 2: From GPT to ChatGPT

Genesis 2:18 “The Lord God said, ‘It is not good for the man to be alone. I will make a helper suitable for him.’”

After Stage 2, the model awakens to the presence of others and begins to converse (the appearance of Eve).

0. Intention

We believe you, like us, are amazed by today's ChatGPT: an incredibly knowledgeable system that communicates with humans in a remarkably natural way. It has already become an indispensable tool in our daily lives, and its capabilities suggest that machines are beginning to exhibit something akin to intelligence. Although there is no universally accepted definition of intelligence, we can't help but feel that ChatGPT possesses it, given its astonishing ability in language, knowledge, and reasoning. Curious about the mechanisms behind this system, we decided to explore how it works by reproducing GPT ourselves. In this blog, we will cover:

✔ Pretraining a GPT model from scratch.
✔ Turning the pretrained model into a ChatGPT-style assistant through Supervised Finetuning (SFT).
✘ Using Reinforcement Learning (RL) to adjust the model to human preferences (e.g., RLHF, DPO, GRPO). This stage is not covered in this blog.
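To make the difference between the first two stages concrete, here is a minimal sketch of how training targets are often built. It assumes the common convention that pretraining scores every next token, while SFT scores only the assistant's reply. The function names and toy token IDs are ours, made up for illustration, and are not taken from our repository. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy skips targets equal to this by default


def pretrain_targets(token_ids):
    """Next-token prediction: the input at position t is trained to predict token t+1."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets


def sft_targets(prompt_ids, reply_ids):
    """Same shift as pretraining, but prompt positions are excluded from the loss.

    Masking the prompt is an assumed (though common) convention.
    """
    inputs, targets = pretrain_targets(prompt_ids + reply_ids)
    # targets[i] is token i+1, so the first len(prompt_ids) - 1 targets are prompt tokens.
    n_masked = max(len(prompt_ids) - 1, 0)
    targets = [IGNORE_INDEX] * n_masked + targets[n_masked:]
    return inputs, targets


# Toy token IDs, purely hypothetical.
prompt = [11, 12, 13]
reply = [21, 22]

pre_in, pre_tgt = pretrain_targets(prompt + reply)
sft_in, sft_tgt = sft_targets(prompt, reply)

print(pre_tgt)  # every position contributes to the loss
print(sft_tgt)  # only the reply tokens contribute to the loss
```

The model and the shifted inputs are identical in both stages under this sketch. Only the set of positions that contribute to the loss changes.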

Our compact code structure for pretraining (pretrain branch):

CS554_NLP_Team7
├─ dataloaders
│  └─ fineweb.py (Dataset for pretrain)
├─ ddp
│  ├─ ddp_main.py (Worker processes branch out from the ddp_main func)
│  ├─ ddp_trainer.py (★★★IMPORTANT★★★ Training loop is here)
│  └─ ddp_utils.py (Routines for setting up parallel training)
├─ model
│  ├─ gpt.py (★★★IMPORTANT★★★ Model definition)
│  └─ rope.py (Funcs for rotary embedding)
├─ ddp_train.py (Launch from this file)
└─ pretrain_config.yaml (Configs for pretrain)

★ For brevity, we will not revisit the fundamental concepts of Transformers, as numerous high-quality explanations already exist online. Instead, this article focuses on common implementation pitfalls and subtle details that are easy to overlook. (For a solid introduction to Transformer basics, we recommend this one.)

1. Dataset