Train A GPT-2 LLM, Using Only Pure C Code

[Andrej Karpathy] recently released llm.c, a project that focuses on LLM training in pure C, once again showing that working with these tools isn’t necessarily reliant on sprawling development environments. GPT-2 may be older, but it is perfectly relevant: it is the granddaddy of modern LLMs (large language models), with a clear lineage to today’s offerings.

LLMs are fantastically good at communicating despite not actually knowing what they are saying, and training them usually relies on the PyTorch deep learning library and its Python ecosystem. llm.c takes a simpler approach by implementing the neural network training algorithm for GPT-2 directly. The result is highly focused and surprisingly short: about a thousand lines of C in a single file, an elegant distillation that accomplishes the same thing as the bigger, clunkier toolchains. It can run entirely on a CPU, or it can take advantage of GPU acceleration where available.
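
To see the shape of such a training loop in miniature, here is a toy sketch of our own (not llm.c’s code or API): plain-C gradient descent fitting y = 2x + 1 with a single weight and bias. llm.c runs the same forward/backward/update cycle, just through GPT-2’s attention and MLP layers and millions of parameters.

    /* Toy gradient-descent training loop in plain C: fit y = 2x + 1.
       An illustration of the general technique, not llm.c code. */
    #include <stdio.h>

    int main(void) {
        float w = 0.0f, b = 0.0f;      /* the two parameters */
        float lr = 0.05f;              /* learning rate */
        float xs[4] = {0, 1, 2, 3};
        float ys[4] = {1, 3, 5, 7};    /* targets from y = 2x + 1 */

        for (int step = 0; step < 200; step++) {
            float dw = 0.0f, db = 0.0f, loss = 0.0f;
            for (int i = 0; i < 4; i++) {
                float yhat = w * xs[i] + b;        /* forward pass */
                float err  = yhat - ys[i];
                loss += err * err / 4.0f;          /* mean squared error */
                dw   += 2.0f * err * xs[i] / 4.0f; /* backward: dLoss/dw */
                db   += 2.0f * err / 4.0f;         /* backward: dLoss/db */
            }
            w -= lr * dw;                          /* SGD update */
            b -= lr * db;
            if (step % 50 == 0) printf("step %d: loss %f\n", step, loss);
        }
        printf("learned w = %f (want 2), b = %f (want 1)\n", w, b);
        return 0;
    }

The real thing swaps the two-parameter model for GPT-2 and the squared error for a cross-entropy loss over tokens, but the loop structure (forward, backward, update) is the same one behind the “train loss” lines in llm.c’s output.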

This isn’t the first time [Andrej Karpathy] has bent his considerable skills and understanding towards boiling down these sorts of concepts into bare-bones implementations. We previously covered a project of his that is the “hello world” of GPT, a tiny model that predicts the next bit in a given sequence and offers low-level insight into just how GPT (generative pre-trained transformer) models work.
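
To make the “predict the next bit” task concrete, here is a throwaway C snippet of our own: a simple frequency counter, emphatically not a transformer and not from [Karpathy]’s repo, that tackles the same toy task by tallying which bit tends to follow each two-bit context.

    /* Toy next-bit predictor: for each 2-bit context, count which bit
       follows most often. Illustrates the task, not a GPT. */
    #include <stdio.h>

    int main(void) {
        const char *seq = "0110110110110110"; /* training sequence */
        int counts[4][2] = {{0}};             /* counts[context][next bit] */

        for (int i = 2; seq[i]; i++) {
            int ctx = (seq[i-2] - '0') * 2 + (seq[i-1] - '0');
            counts[ctx][seq[i] - '0']++;
        }
        for (int ctx = 0; ctx < 4; ctx++) {   /* unseen contexts predict 0 */
            int pred = counts[ctx][1] > counts[ctx][0];
            printf("context %d%d -> predict %d (saw 0: %d, 1: %d)\n",
                   ctx >> 1, ctx & 1, pred, counts[ctx][0], counts[ctx][1]);
        }
        return 0;
    }

A GPT does the same job with learned attention weights instead of a lookup table, which is what makes the tiny-model version such a good teaching vehicle.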

7 thoughts on “Train A GPT-2 LLM, Using Only Pure C Code”

  1. Nice work, with plenty of documentation on training. Often these projects focus too much on the tricky tech stuff when what I want to start with is “What does it do?”, i.e. deployment. I can see something vague in the docs:
    step 1/74: train loss 4.367631 (80.639749 ms)
    step 2/74: train loss 4.031242 (77.378867 ms)
    step 3/74: train loss 4.034144 (77.315861 ms)
    step 4/74: train loss 3.859865 (77.357575 ms)
    ...
    step 72/74: train loss 3.085081 (78.850895 ms)
    step 73/74: train loss 3.668018 (78.197064 ms)
    step 74/74: train loss 3.467508 (78.009975 ms)
    val loss 3.516490
    generating:
    ---
    ?Where will you go?
    I take you wherefore I can, myself, and must.
    I cast off my beak, that I may look him up on the point;
    For on his rock shall he be opencast.

    My little nephew:
    Keep on with me, my

    … but a two-minute YouTube demo video would have been better, and then my interest level would be fired up to get into the details. (For the curious, a quick-start sketch follows the comments below.)

  2. But will it run on embedded? A great way to learn is to do! On the issue tracker there are “Good first issues” that anybody should be able to tackle. Despite all compiler optimizations, the CPU implementation also still runs about 6x slower than PyTorch’s CPU implementation. My favorite part is that you can listen to the source code in “electro swing”: https://x.com/dagelf/status/1777563438207631716

  3. Digital cognition uses combinations of inverters and Boolean logic to develop models. Bio-cognition uses stereo-specific chemistry, like the immune system and rubber bands, to generate memory-based analog models. Both can develop graphics and nomographs that model things (like an orrery) and can be used to analyze discrepancies. An orrery is a self-organizing map (SOM). In the semiconductor industry the cognoscenti use tectonic-oriented tools written in Deep “C” to know what is going on in the FAB. See Nickey Joe Atchison’s Texas Instruments and Cypress Semiconductor data analysis patents. The tectonic orreries actually “KNOW” what is going on.
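
For anyone wondering, like the first commenter, how to actually run it: the quick start in the llm.c README (as recalled at the time of writing; script names may have changed since, so check the repo) boils down to a handful of commands that fetch dependencies, tokenize the tinyshakespeare dataset, export the pretrained GPT-2 weights, then compile and run the pure-C trainer, producing the kind of loss log quoted above.

    pip install -r requirements.txt
    python prepro_tinyshakespeare.py
    python train_gpt2.py
    make train_gpt2
    OMP_NUM_THREADS=8 ./train_gpt2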
