
Let's Build a Modern LLM - Entirely from Scratch

While there are several open source versions of various AI applications out there, there is still very little available for doing the training itself.

This project will attempt to combine the features of several projects that were begun more than twenty years ago for other purposes, such as trying to predict the Lottery or turning audio into sheet music, into a fully integrated AI that is just as capable as anything that the not-so-open "you know who", or anyone else, is doing.

If you have been around these parts long enough, you know how the classic chatbots like ELIZA, PARRY, and MegaHAL work.  Then there were other important approaches to AI, like Perceptrons, SHRDLU, Mathematica, Watson, and so on.  Some of those things have open source versions, and some do not.  One thing that they all had in common, and perhaps still have, is that there always seems to be a "great breakthrough" in AI, which almost always looks like as great a leap forward as it would be if sustainable controlled fusion had also been solved.  And then of course, no, not really.  Better luck next time.

Then perhaps one of the biggest breakthroughs ever came with the landmark observation, "What if attention is all you really need?"  Purely attention-based systems, however they actually work, do have their own sets of problems, and among those, of course, is a lack of efficiency.  And yet there are some very interesting things that can be said about "attention" in and of itself!  So maybe a digression into the history of all of this will prove to be worthwhile.

Most people reading this were not even born when Don Lancaster wrote the article for the very first build-it-yourself "TV Typewriter", as if all that mattered at the time was putting "your message on the screen."  

Ah, the TV Typewriter!

Somewhere, somehow, back in the day, beyond the valley of the shadows, in the once upon a long time ago, long before I ever wanted to write a novel where every chapter began with "It was a dark and stormy night, and as the swamp thing staggered from the crypt, suddenly there was a need for words", there was, of course, this:

Now that I have your attention.  Yeah, that thing.  Back when Bill Gates was still in high school.  Then one thing happened, and another, and another, and we all know what happened.  Even so, Eliza had already come out in the 60's, and then SHRDLU amazed so many more, while requiring a PDP-10 to let a user interact with a wire-frame virtual world that accepted commands like "Pick up the red block and place it under the blue pyramid."

Now, all of a sudden, here it is, 2026, and vibe coding seems to be the new meme.  Or perhaps the new mess, since obviously, we are at a crossroads.  Some people are saying that up to 50% of white-collar jobs will be gone within the next couple of years, and then what?  Will AI also somehow eliminate the need for half the world's population?  Will the presence of data centers in rural areas lead to vanishing water supplies, dead fish, and no more birds?

What if open-source training can be done "at home", and if models could be shared, Wiki-style?  Would that work?  Would that solve the data center problem?

Well, that is all good and fine, if it should turn out to be possible, but how do we know what is or is not possible, that is, until we try some things for ourselves?  So let's take a look at something different, I think, from anything that anyone else is doing, at least insofar as "open source" is concerned.  What if we wanted to try to use AI, let's say, to try to "predict the Lottery!?!"

Now some people will be wondering, of course, what does trying to predict the Lottery have to do with LLMs?  Perhaps the answer will be obvious to some, while for others, we will need to take a very deep dive into the whole theory of just how "training an LLM works", that is, if it actually does.  As it turns out, there are certain Lottery systems that have a great deal in common with how LLMs work.  Well, maybe.  What if we had a program that looks at the draw history for a particular game that we are interested in, to find potential "hot pairs", which might be useful for creating more than just lists of so-called hot numbers?  Rather, we might want to find...


  • The Birthday Problem and other Roads Less Travelled

    glgorman, 05/31/2026 at 02:21

    Looking at the Fantasy Five Birthday problem vs. the Powerball birthday problem, and how to calculate the "Birthday Problem" for any Lotto-type game.  In the California Fantasy Five game, five numbers are drawn from 39.  So, if someone's birthday was 12-25, then what is the chance that both 12 and 25 will come up in a California Fantasy Five draw, vs. among the five white balls of the Powerball, which are drawn from 69 numbers?  According to my classic HP-15 calculator, the birthday odds for the California Fantasy Five game are approximately 1 in 74.10, whereas the "Birthday Probability" for the Powerball is approximately 1 in 234.60.

    One way to do this calculation is, of course, to use the "choose function C" to compute the value C(39,5)/C(37,3) for the California Fantasy Five, or likewise to compute C(69,5)/C(67,3) for the Powerball, obtaining the 234.60 value in that case, as stated.  So it would appear that the general case for any Lotto-type game that draws N numbers from a pool of M is C(M, N)/C(M-2, N-2), which takes a bit of multiplication and division to get done.  Or else we could generalize it a bit further, by noticing a pattern in the calculation when it is written out by hand: almost all of the factorial terms cancel, leaving M*(M-1)/(N*(N-1)), which also presents an opportunity to perform some optimization.

    Thus, it would appear that the odds of having two specific numbers come up in a five-numbered draw set taken from a pool of M numbers actually generalize to 1 in M*(M-1)/20.  This, of course, might have some interesting implications.
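
    As a quick check on the arithmetic above, here is a minimal Python sketch that computes the odds both ways, once with the choose function and once with the cancelled-down form.  The function names pair_odds and pair_odds_short are made up for this illustration and are not part of any existing project code.

```python
from math import comb

def pair_odds(pool, picks):
    """Return X such that two specific numbers both come up about 1 time in X draws.

    pool  = how many numbers the game draws from (39 for Fantasy Five)
    picks = how many numbers are drawn each time (5)
    """
    return comb(pool, picks) / comb(pool - 2, picks - 2)

def pair_odds_short(pool, picks):
    """Same value after the factorials cancel: M*(M-1) / (N*(N-1))."""
    return pool * (pool - 1) / (picks * (picks - 1))

if __name__ == "__main__":
    for name, pool in (("Fantasy Five", 39), ("Powerball white balls", 69)):
        print(f"{name}: 1 in {pair_odds(pool, 5):.2f} "
              f"(short form: 1 in {pair_odds_short(pool, 5):.2f})")
```

    The short form needs only two multiplications and one division, which starts to matter once the "pool" is a vocabulary of tens of thousands of symbols rather than 39 balls.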

    With an LLM, of course, we are going to want to replace some Lottery concepts like "draw set" with "reading frame," and we are going to want to think about the idea of grammars that might have tens of thousands of symbols or more, so that maybe problems of this type are going to turn out to be important when we think about memory allocation, thread pool size, and so on.  Yet drawing from the examples of "down the rabbit hole", "down the primrose path", "down the hatch," and "down the drain", it should be easy to contemplate that there should be some way of developing metrics for measuring the probability that two randomly chosen symbols will occur within some context, so as to be able to do some work on the problem of scaling things like "neuronal weights" associated with relevance, or significance, that is, within the framework of the hopefully more general transformer-based attentional systems.
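
    One hedged way to picture such a metric is to slide a fixed-size "reading frame" over a token stream and count how often each pair of symbols lands in the same frame.  The sketch below is only an illustration under that assumption: the function name frame_pair_counts, the whitespace tokenizer, and the frame size of 3 are all choices made here, not anything taken from an existing codebase.

```python
from collections import Counter
from itertools import combinations

def frame_pair_counts(tokens, frame):
    """Count, for each unordered pair of distinct symbols, the frames holding both.

    Returns (counts, n_frames), where counts maps an alphabetically
    sorted pair such as ("down", "the") to the number of frames in
    which both symbols appear.
    """
    counts = Counter()
    n_frames = max(len(tokens) - frame + 1, 0)
    for start in range(n_frames):
        symbols = sorted(set(tokens[start:start + frame]))
        counts.update(combinations(symbols, 2))
    return counts, n_frames

text = ("down the rabbit hole down the primrose path "
        "down the hatch down the drain")
tokens = text.split()
counts, n_frames = frame_pair_counts(tokens, frame=3)

if __name__ == "__main__":
    for pair, hits in counts.most_common(3):
        print(pair, f"{hits}/{n_frames} frames")
```

    Dividing a pair's count by the number of frames gives an observed co-occurrence rate, which could then be compared against the "pure chance" odds from the birthday calculation for the same pool and frame sizes.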

    Maybe.  Or we can go back to asking "What happened to the other dollar?"

  • Which way do we go from here?

    glgorman, 05/24/2026 at 16:23

    There is a whole bunch of stuff that I want to do, like talk about the "phone book trick from the movie Rain Man", and how there is an urban legend that it was actually done by the magician and memory expert Harry Lorayne, who was also one of the authors of a best-selling book entitled "The Memory Book."  Even if Lorayne didn't actually memorize the entire Manhattan telephone directory, it was said that he did memorize quite a few pages, just to prove that it could be done.  

    This is something that I want to talk about in a different context, i.e., with respect to Lottery systems, whether they work or not.  Yet they will have utility, I think, in finding interesting and novel ways to approach the problem of designing an LLM from scratch.

    Since this is going to take a while to explain, perhaps it would be best to start with an excerpt directly from the book, or just simply borrow a description for one popular system for memorizing numbers from the Wikipedia article on the so-called "Mnemonic Major System."

    So here is "The system", shamelessly borrowed from Wikipedia:

    --------------------------------------------------------------------------------------------------------------------------------------------------------------

    Numeral | Sounds (IPA) | Commonly associated letters | Remarks
    0 | /s/, /z/ | s, soft c, z | Zero begins with z (and /z/). Upper case S and Z, as well as lower case s and z, have zero vertical strokes each, as with the numeral 0. The alveolar fricatives /s/ and /z/ form a voiceless and voiced pair.
    1 | /t/, /d/, /θ/, /ð/ | t, d, th (as in thing and this) | Upper case T and D, as well as lower case t and d have one vertical stroke each, as with the numeral 1. The alveolar stops /t/ and /d/ form a voiceless and voiced pair, as do the similar-sounding dental fricatives /θ/ and /ð/, though some variant systems may omit the latter pair.
    2 | /n/ | n | Upper case N and lower case n each have two vertical strokes and two points on the baseline.
    3 | /m/ | m | Lower case m has three vertical strokes. Both upper case M and lower case m each have three points on the baseline and look like the numeral 3 on its side.
    4 | /r/ | r | Four ends with r (and /r/ in rhotic accents).
    5 | /l/ | l | L is the Roman numeral for 50. Among the five digits of one's left hand, the thumb and index fingers also form an L.
    6 | /tʃ/, /dʒ/, /ʒ/, /ʃ/ | ch (as in cheese), j, soft g, sh | Upper case G looks like the numeral 6 and lower case g looks like the numeral 6 rotated 180°. Lower case script j tends to have a lower loop, like the numeral 6. In some serif fonts, upper case CH, SH and ZH each have six serifs. CHurch has six letters. The postalveolar affricates /tʃ/ and /dʒ/ form a voiceless and voiced pair, as do the similar-sounding postalveolar fricatives /ʃ/ and /ʒ/.
    7 | /k/, /ɡ/ | k, hard c, q, hard g, ch (as in loch) | Both upper case K and lower case k look like two small 7s on their sides. In some fonts, the lower-right part of the upper case G looks like a 7. G is also the 7th letter of the alphabet. The velar stops /k/ and /ɡ/ form a voiceless and voiced pair.
    8 | /f/, /v/ | f, ph (as in phone), v | Lower case script f, which tends to have an upper and lower loop, looks like a figure-8. The labiodental fricatives /f/ and /v/ form a voiceless and voiced pair.
    9 | /p/, /b/ | p, b | Upper case P and lower case p look like the numeral 9 flipped horizontally. Lower case b looks like the numeral 9 turned 180°. The labial stops /p/ and /b/ form a voiceless and voiced pair.

    --------------------------------------------------------------------------------------------------------------------------------------------------------------

    Now what do we do with it?  Well, when taking a short phrase like "I LOVE A MYSTERY", or "I LIVE OUT MY DREAMS", we can easily see that the first sentence encodes the sequence 5-8-30-14 and the second sentence encodes the sequence 5-8-13-14-30, that is to say, if we have a constraint that what we want to do is to try to find a way to create some kind of "magic cookie" as it were, that somehow encodes a set...
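
    For anyone who wants to check the two example phrases by machine, here is a rough Python sketch of the encoding.  Keep in mind that the Major System is phonetic, while this sketch works from spelling alone, so it is only an approximation: the soft c and soft g rule (soft before e, i, or y), the handling of doubled letters, and the decision to ignore the letter x are all simplifying assumptions made here, and the name major_digits is hypothetical.

```python
# Letter-based approximation of the Mnemonic Major System.
DIGRAPHS = {"ch": "6", "sh": "6", "th": "1", "ph": "8"}
LETTERS = {
    "s": "0", "z": "0", "t": "1", "d": "1", "n": "2", "m": "3",
    "r": "4", "l": "5", "j": "6", "k": "7", "q": "7",
    "f": "8", "v": "8", "p": "9", "b": "9",
}
SOFTENERS = "eiy"  # assumed: c and g are "soft" before these letters

def major_digits(phrase):
    """Return the digit string that the consonants of a phrase encode."""
    digits = []
    for word in phrase.lower().split():
        i = 0
        while i < len(word):
            pair = word[i:i + 2]
            letter = word[i]
            following = word[i + 1:i + 2]
            if pair in DIGRAPHS:
                digits.append(DIGRAPHS[pair])
                i += 2
                continue
            soft = following != "" and following in SOFTENERS
            if letter == "c":
                digits.append("0" if soft else "7")
            elif letter == "g":
                digits.append("6" if soft else "7")
            elif letter in LETTERS:
                # A doubled consonant is sounded once, so count it once.
                if not (i > 0 and word[i - 1] == letter):
                    digits.append(LETTERS[letter])
            i += 1
    return "".join(digits)

if __name__ == "__main__":
    for phrase in ("I LOVE A MYSTERY", "I LIVE OUT MY DREAMS"):
        print(phrase, "->", major_digits(phrase))
```

    The sketch only produces the raw digit string; splitting that string into numbers such as 5-8-30-14 is still a judgment call made by hand, since the same digits could be grouped in more than one way.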


  • Another Day - Another 3000 Words: Maybe

    glgorman, 04/29/2026 at 21:29

    The way that I figure it, what if I could write 3000 words each day, of "new material" which is somehow relevant to training an LLM, by whatever means?  Then it would take "only" 333.33 days to get to one million words of hopefully, new, original, and just as important, potentially "useful" content.  Yet there are other things that an LLM might be trained on, like MIDI data derived from audio input.

    Or patterns in DNA, maybe?

    I mean, why not?  It seems like a reasonable thing to try, especially if the advent of vibe coding might somehow make the otherwise gargantuan task of retooling something like 100,000 lines of code, spread across at least 20 libraries, actually seem feasible.  Then maybe that is yet another domain for AI which is not only highly important but also relevant, even if such explorations seem to be underserved, at least within the Open Source community.

    I know that I am going to want to be doing things with "transformers", however they actually work.  Yet there is also the obvious, yet seemingly all too often overlooked, task of getting the data into the program to begin with.  Maybe that isn't as hard as it might seem.

    Yet there is also this idea of generating as much "metadata" as possible, maybe, based on traditional methods.  Not just because it is possible to do, but because, maybe, just maybe, the transformer model would be a lot more efficient if some of the initial heavy lifting were done according to standard coding techniques.

    Interestingly enough, when I took some Atari BASIC code and experimented with using a "genetic" algorithm to approximate a mathematical function, I found that I got better results if I ran 32 models in parallel, and let them compete with each other, let's say for 64 iterations each, than I was getting if I ran a single model for 2048 iterations.

    So I think that there are even more avenues that need to be explored, as far as how this sort of thing might work out in the long run.  Clearly, at least in this particular case, a competition among multiple models, somewhat in the spirit of the so-called "mixture of experts" approach, seems to work best.
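
    The original Atari BASIC code is not reproduced here, so what follows is only a stand-in sketch of the same kind of experiment in Python.  It assumes a very simple mutate-and-keep-the-best search that fits a cubic polynomial to sin(x), and it treats "competition" as nothing more than keeping the best of 32 independent runs.  The target function, the mutation size, and the names error and evolve are all assumptions made for the illustration.

```python
import math
import random

# Sample points on [0, pi] where the candidate is compared to sin(x).
XS = [i * math.pi / 20 for i in range(21)]

def error(coeffs):
    """Sum of squared differences between a polynomial and sin(x)."""
    total = 0.0
    for x in XS:
        value = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (value - math.sin(x)) ** 2
    return total

def evolve(seed, iterations):
    """Mutate one candidate repeatedly, keeping a child only if it is better."""
    rng = random.Random(seed)
    best = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    best_err = error(best)
    for _ in range(iterations):
        child = [c + rng.gauss(0.0, 0.1) for c in best]
        child_err = error(child)
        if child_err < best_err:
            best, best_err = child, child_err
    return best_err

# Same total budget of 2048 mutations, spent two different ways.
many_short = min(evolve(seed, 64) for seed in range(32))
one_long = evolve(0, 2048)

if __name__ == "__main__":
    print(f"best of 32 runs x 64 iterations: {many_short:.4f}")
    print(f"single run x 2048 iterations:    {one_long:.4f}")
```

    Which strategy wins will depend on the problem, the seeds, and the mutation size, so this is a harness for trying the comparison rather than evidence for either side.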

  • Somewhere in the Once Upon a Long Time Ago

    glgorman, 04/29/2026 at 02:36

    Right!  Whatever I just said. 

    Well, anyway.  

    What if this project isn't really about trying to predict the Lottery, and it is especially not about trying to predict any current game using a database of numbers from as long ago as 2012, which hasn't been updated since?  

    No, that's not the point.

    That is only the beginning.

    Really, an LLM, or so we are told, is not so much about words or numbers as it is said to be about "tokens".  Whatever those tokens represent is, well, whatever they represent.  Maybe macroblocks from a JPEG of a medical X-ray, or maybe waveform data from a seismograph, or "whatever."  That, of course, was the big surprise, with the discovery that "maybe attention is all you need."  Yet somehow, the "pairs of pairs" meme seems to have stuck with researchers as one of the best predictors, at least as far as text generation is concerned.  So does it make sense to try to, in effect, create some kind of "lottery-like" application that increases the number of balls in a particular game to some "reasonable number" like 65535, or something like that, along with an appropriate increase in the size of the "reading frame?"

    Maybe.  Maybe there are a lot of things that need to be tried in parallel, and it is just as important to come up with methodical techniques for formalizing the parallelization of whatever processes are being put to the test.  Something else comes to mind, such as whether to try to store the topology of a neural network in a traditional fashion, such as by extending the capabilities of SPICE or KiCad, or whether it is actually much easier to do something like that with a relational database like MySQL.  Yet that, of course, is where the newfound meme of "vibe coding" might come in handy, that is to say, as things grow, and grow, and grow.  Which they will.
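
    To make the relational-database option a little more concrete, here is a minimal sketch of one possible schema.  It uses SQLite from the Python standard library as a stand-in for MySQL, purely so that it runs without a server.  The table names neuron and synapse, their columns, and the helper fan_in are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE neuron (
        id    INTEGER PRIMARY KEY,
        layer INTEGER NOT NULL,
        bias  REAL    NOT NULL DEFAULT 0.0
    );
    CREATE TABLE synapse (
        src    INTEGER NOT NULL REFERENCES neuron(id),
        dst    INTEGER NOT NULL REFERENCES neuron(id),
        weight REAL    NOT NULL,
        PRIMARY KEY (src, dst)
    );
""")

# A toy network: two inputs in layer 0 feeding one neuron in layer 1.
conn.executemany(
    "INSERT INTO neuron (id, layer, bias) VALUES (?, ?, ?)",
    [(1, 0, 0.0), (2, 0, 0.0), (3, 1, 0.1)],
)
conn.executemany(
    "INSERT INTO synapse (src, dst, weight) VALUES (?, ?, ?)",
    [(1, 3, 0.5), (2, 3, -0.25)],
)

def fan_in(connection, neuron_id):
    """Return (source id, weight) for every connection feeding a neuron."""
    rows = connection.execute(
        "SELECT src, weight FROM synapse WHERE dst = ? ORDER BY src",
        (neuron_id,),
    )
    return rows.fetchall()

if __name__ == "__main__":
    print(fan_in(conn, 3))
```

    With the topology stored this way, questions like "what feeds this neuron?" become ordinary queries, though whether that scales to a network of any real size is exactly the kind of thing that would need to be tested.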

    Likewise, we might want to add or delete a source from the training set.

    So a "make system" is also needed.

    This thing is huge, but not impossible.  

    Yes, there will be backpropagation and gradient descent.

    Maybe even a seemingly infinite number of monkeys, all leaping about.  Having private "conversations" with each other, and so on.

    This is an ambitious project.

    In the meantime, thank you for choosing Johnny Cab.  Hope you enjoy the ride!
