I recently implemented the GPT-2 124M parameter model in PyTorch. I took on this project to learn more about how an LLM is trained by implementing GPT-2 and training it from scratch, following Andrej Karpathy's Zero to Hero series on YouTube. The video I referred to is titled "Let's reproduce GPT-2 (124M)".

For anyone curious about my learning journey, I have added incremental Git commits reflecting what I learned in each iteration to this GitHub repository (Github_repo). For those interested, I have also uploaded the trained model weights to this Hugging Face repository (Model).

If you feel this is too long and would like me to walk you through it, please let me know in the comments or send me a DM. We can plan a call, a YouTube live, or something similar.

Model settings, training process and other info

I trained the model for 1 epoch, i.e. 19,073 steps over the 10B-token sample of the FineWeb-Edu dataset. I followed the exact hyperparameters used in Karpathy's nanoGPT, which in turn come from the GPT-3 paper. I used 3 A100s with 40 GB of VRAM each, and training took almost 6 hours to complete the epoch. The validation loss went from 10.9 down to 4.2 over that epoch. I also ran the HellaSwag eval, on which the model performs terribly: roughly 25% accuracy, which is about chance for its four-way multiple-choice format.

I rented the GPUs from Jarvis Labs (JarvisLabs.ai), which cost me around 2700 (roughly 4500 if I include the time I spent learning distributed model training across multiple GPUs :/).

I also saved the model's state_dict every 50 steps, which I feel is too much data (179 GB). I did this so I can visualize the transformer's learning process, especially how the attention layers evolve. It will take me some time to visualize those checkpoints and understand them in depth.
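To make the setup concrete, below is a minimal sketch of the shape of such a training loop, not my exact script. The learning-rate numbers (6e-4 warming up over 715 steps, cosine decay to 6e-5 over 19,073 steps), the AdamW betas, weight decay of 0.1, and gradient clipping at 1.0 are, as far as I recall, the nanoGPT defaults taken from the GPT-3 paper; the every-50-steps checkpointing is what produced the 179 GB of state_dicts. `TinyLM`, the random token batches, and the tiny batch/sequence sizes are placeholders so the snippet runs standalone (the real run uses the full GPT-2 module, the FineWeb-Edu data loader, gradient accumulation, and DDP across the 3 GPUs).

```python
import math, os, torch
import torch.nn.functional as F

# --- GPT-3-style hyperparameters, as in Karpathy's nanoGPT run ---
max_lr, min_lr = 6e-4, 6e-5   # cosine schedule endpoints
warmup_steps   = 715          # linear warmup
max_steps      = 19073        # 1 epoch over the 10B-token FineWeb-Edu sample
B, T           = 4, 64        # tiny demo batch/sequence, NOT the real sizes

class TinyLM(torch.nn.Module):
    """Placeholder for the actual GPT-2 module; returns (logits, loss)."""
    def __init__(self, vocab=50257, dim=64):
        super().__init__()
        self.emb  = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, x, y):
        logits = self.head(self.emb(x))
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        return logits, loss

def get_lr(step):
    """Linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * ratio)) * (max_lr - min_lr)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)
os.makedirs("checkpoints", exist_ok=True)

for step in range(max_steps):
    # random tokens stand in for the FineWeb-Edu data loader
    x = torch.randint(0, 50257, (B, T))
    y = torch.randint(0, 50257, (B, T))

    optimizer.zero_grad()
    _, loss = model(x, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    for group in optimizer.param_groups:
        group["lr"] = get_lr(step)
    optimizer.step()

    if step % 50 == 0:  # this cadence is what adds up to ~179 GB of checkpoints
        torch.save(model.state_dict(), f"checkpoints/step_{step:06d}.pt")
```

Saving full state_dicts this often is the simplest way to replay training later (e.g. to animate how the attention weights change), but storing only the attention parameters, or saving less frequently, would cut the disk usage dramatically.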

Summary of the flow of things I did from the beginning