May 25, 2023·edited May 25, 2023Liked by Jacob Bayless

Chatbots are random processes, so it's important to run each prompt at least a few times to see how consistent it is. (Ideally we'd use the API and do many trials.) I re-ran the river crossing prompt a couple more times on GPT-4 and reproduced your results, but worded differently each time. Interestingly, on the last attempt it explicitly said it recognized the problem:

---

This is a variant of the classic River Crossing puzzle, where you need to transport everyone across without certain groups being left together. Here's a possible solution to the problem:

First, the Earthling takes the Venutian to Earth via the wormhole, then the Earthling returns alone to the Galactic Federation. This leaves the Martian and the Moonling at the Galactic Federation, who are safe together, and the Venutian alone on Earth.

Next, the Earthling takes the Martian to Earth, leaving the Moonling alone at the Galactic Federation. But, she can't leave the Martian with the Venutian on Earth, so she leaves the Martian on Earth and takes the Venutian back to the Galactic Federation.

Then, she leaves the Venutian at the Galactic Federation and takes the Moonling to Earth. This leaves the Venutian and Martian at the Galactic Federation, who are safe together.

Finally, the Earthling returns to the Galactic Federation one last time to bring the Venutian to Earth.

Now, all of them, the Earthling, the Martian, the Venutian, and the Moonling are on Earth, and at no point were the Venutian and the Martian left alone together, nor the Venutian and the Moonling. The Earthling has successfully navigated the wormhole puzzle!

---

If someone wanted to explore this further, they could try varying the prompt to see when it breaks down. If we tell it that the the wolf is a vegetarian and the goat attacks wolves, does it get confused? What if it's given an unsolvable problem?

This is a little different because it's about in-context learning, but I recently saw an interesting paper about how if you give an LLM a problem with backwards labels, it will get confused. See: https://thegradient.pub/in-context-learning-in-context/

---

Edit: In general, I think this is a good post but the headline is overstated. It seems like the boundary between reasoning and statistics gets rather fuzzy once you allow for copying with substitutions, and LLM's are all about copying with substitutions. They're usually not copying word-for-word, but it's often unclear what level of abstraction they're working at.

Excellent points and I agree with everything you say. Inventing an entirely new riddle would be a stronger test. And yes, I didn't mean for the headline to be so provocative, and I didn't want to imply that GPT-4 is a general intelligence or capable of arbitrarily reflective thought processes or something. "Reasoning" isn't really a well-defined term; I mainly just wanted to make the point that transformers are nothing like a stochastic parrot, which is seemingly how most people are still imagining them.

I think Weiss' paper Thinking Like Transformers (linked from footnote 13) best describes the type of thought that transformers can do. It's still hard to say what level of abstraction GPT-4 is working at, with it being a large and unknown depth, but it's clear that transformers, in training, can learn abstractions, and learn how and when to symbolically manipulate those abstractions.

I'll just leave this uncompleted but provoking thought: if you can do sorting over abstract embeddings, then necessarily you can define comparison operator on those embeddings and construct an ordered set. Does this mean a transformer can learn to express coherent preferences over world states?

The inconsistent answers from human brain could be a flaw than merit. Human body introduces too much noises to the process, e.g., the surroundings, the health condition, there are all kinds of random inputs to the neural system contribute to the outcome. If LLMs hook up bunch of sensors, you could even make them the meaningful like warm and cold, energization or tiredness, it will perform more like human. But is that what humans want?

No it's not able to sort, it's merely applying statistics to the input.

Even Yann LeCun states that GPT-4 is a statistical correlations engine.

In simple terms, it sees multiple times a sequence of characters (say "42" for example) that it puts it in the output, and the output is ordered (sorted) because the neurons are reinforced to output "42" after "41", and "41" after "40" and so on and so forth, and keep the same amount. That's all it does!

The reason why it failed on 59 is because it has seen that number too few times at the first position is all.

*It's only a large statistical model*

You can easily determine if it had internally developed an algorithm to sort number by not giving one number in the input samples (e.g. 37) and train it, then give it 37 in some input and see whether it can sort it out or not.

I already know the answer: the model doesn't have a sense of ordering numbers expressed in base 10 (decimal), it just has a sense of patterns: it matches groups of digits and produce the output that will result in the lowest error rate. Again, that's only statistics.

Thanks for the thoughtful comment! I appreciate it, but I unfortunately have to take issue with the argument.

Your premise is correct; if you don't give the training data any inputs containing '37', it won't have any way to handle '37' when it shows up in an input sequence. But if I asked you to sort the sequence "5, 3, 3, 18, 5, 7, ᛝᛈ, 4, 8", you too would probably have a difficult time figuring out what to do with the "ᛝᛈ". In the experiment I described, each value is given its own token, and as far as nanoGPT is concerned they only just-so-happen to take the shape of Arab numerals; there is no built-in notion of ordering between tokens. The transformer needs to learn that for itself.

Humans learn the ordering of Arabic numerals in preschool, but that ordering was invented by Arab mathematicians, not baked into the fabric of the universe. A typical sorting algorithm written in C benefits from built-in hardware support for numeric datatypes, where the ordering of numbers is baked in at the transistor level via the comparison operators "<,>, <=, >=, ==" as well as arithmetic like "++" and "--". Without such operations defined, algorithms like Quicksort would not function. That's the blank-state starting point that our NanoGPT transformer gets initialized with, so let's cut it a bit of slack if it doesn't know how to sort a token it's never seen before.

I also wouldn't dismiss so easily the idea that "42" simply comes after "41" and "40" and so on. That's ordering, but sorting is much harder -- you need to make sure that if "42" showed up three times across in the input sequence, and "40" showed up twice, and "41" zero times, then the output contains the sequence "40, 40, 42, 42, 42", with not-too-many of any number and not-too-few of any number, and of course, never following a larger number with a smaller one. Anything else is incorrect. It might look like simple statistics at first glance, but when you actually dig into what's involved to get counting and sorting both correct at the same time, the curse of dimensionality quickly rears its ugly head. If we decompose the data in terms of correlations, then there are very high-dimensional n-gram correlations at work; we have to look across the entire input list to determine every single output character; nothing can be ignored. Try to sort a list 50 characters long, one character at a time with a pencil and no scratch paper, and you'll see the step-by-step process that a trained transformer has had embedded into its weights. It's not a simple one.

But even if we wanted to train on those high-dimensional correlations, there is not enough data to do so without exploiting the abstractions and symmetries inherent in the task. The chance that the training data somewhere contained any particular sub-sequence that matches the input random list becomes *vanishingly* small as the sub-sequence length grows; 6 numbers is about the limit. It's hard to overemphasize this point. With GPT-4, the training data is not publicly known so it's easy to imagine that it's near-infinite, and any clever-sounding response was actually buried in a Reddit comment somewhere. But in this experiment, I know exactly how many sequences the network trained on, because I generated them!

But even if we trained on a complete list of every possible number sequence of up to 127 characters, the relatively small number of neural network parameters in this sorting-GPT could not possibly be memorizing every correlation! There's simply not enough degrees of freedom.

This has been a long rebuttal, but just one more point: However we end up with some particular model weights -- whether through gradient descent and lots of data, or through compiling a provably-correct RASP program written by a thoughtful human scientist -- that all happens on the training side. Let's now set aside how we arrived at those values, and assume we are somehow given a model and a set of weights that, when run at temperature 0, correctly and deterministically sort all input lists given to it -- I would call that a "sorting program". Would you not? And if I happen to arrive at those weights by gradient descent, and you arrived it by compiling RASP code, it's still the same program either way, is it not?

I don't claim to have shown that nanoGPT produces nearly-equivalent weights to a compiled RASP sorting program (that's more time than I can afford to spend on this fun experiment), but I think it still cuts to the heart of the philosophical question. If a computer program generated by a statistical gradient-descent process ends up, by the unrelenting demands of compression, identifying and leveraging the key abstractions that existed in the data (such as notions of "count", "order", and "sort"), then it's the same to me.

With this lengthy explanation you're beating around the bush: because the constraints you've set are so high (sorting) what you've achieved is overfitting. The network is now ultra-specialized at producing a few correct outputs based on corresponding few select inputs, but it can't do more than that (like generalization).

If you change one or two things in the input, for example duplicate numbers or shuffle them around, the output certainly and invariably will be wrong because the input is then too far from what the overfit network is expecting.

So yeah you can seemingly make a NN learns anything complex and deterministic using overfitting, as long as you don't ask it to generalize in the same problem class. And that's what you've achieved here: it can produce the right sorted output for a fixed small set of inputs, and that's all there is.

(TLDR: they show that sorting is one of the operations for which the Transformer architecture is a natural prior, ie it can be expressed with just 1-2 layers if we sort token sequences, and about 3-4 for sequences of multi-token numbers)

Still impressive, but also means that it's not a particularly mind-blowing example, ie it's not inventing quicksort or the like, as the network doesn't even need has to do the pattern recognition and substitution as for the river crossing puzzle

I agree, in some ways that means the experiment is less surprising. Although, it seems that "Thinking Like Transformers" is still not widely acknowledged, so it's worth publicizing the result anyway because even people who are deep within the LLM research community still seem to be surprised by it. See for example here (https://arxiv.org/abs/2308.09687) where sorting was also used as a performance challenge.

I think, for my purposes, it still serves its essential purpose as a fundamental challenge to the "stochastic parrot" model of thinking about LLMs. That sorting can be expressed naturally as a RASP program makes it even more likely (by Ockham's razor) that a trained-to-sort-transformer generates its output through a process whose procedural steps, rather than just their input-output behaviour, are an approximation to a deterministic algorithm. And conversely very unlikely to be stochastically parroting number sequence completions that it remembers from training. Which, based on the storage size argument, would also have been impossible, at least for any formal interpretation of "stochastic parrot" that I can think of.

(And, just to level the playing field, I'll acknowledge that even a quicksort program doesn't sort an input list by inventing quicksort either -- it merely *executes* quicksort. Since I trained this network to output the sorted input, rather than outputting a program that sorts the input, it's only to be expected that the weights would *implement* a sorting program, which means the invention of that sorting program happened during the training process).

But yes I agree, in retrospect there are probably other tasks that would have been more impressive. Maybe I should try training it to do some other n-log-n task, like a Fourier transform. (Although perhaps that wouldn't impress a skeptic either, seeing as how it can be done by a piece of polished glass: https://en.wikipedia.org/wiki/Fourier_optics).

Sep 9, 2023·edited Sep 9, 2023Liked by Jacob Bayless

OK I've opened the Substack on my laptop and feel a bit silly having missed the footnote -- they're neither hoverable nor clickable on mobile >_<

Yeah I agree it is still a great example of the difference between Transformers and the Markov chain models (unsure about RNNs)

I find these papers to greatly expand upon the "Thinking like Transformers" line of thinking:

- if you need impressive examples, I think we can play with Transformers Learn Shortcuts to Automata (https://arxiv.org/pdf/2210.10749.pdf) where they show that not just sorting but a quite broad class of algorithms can be encoded in attention + scatter/gather operations, /and/ learned with SGD

...but those ain't "reflective" in the theory-of-mind sense, more like "effortlessly riding the bicycle" if you forgive me antropomorphizing the model, but if we modify the architecture a bit we get Trainable Transformer in Transformer (https://arxiv.org/pdf/2307.01189.pdf) - slight modifications suffice to model the learning/fine-tuning of another Transformer /during inference/ with <10x growth in number of parameters.

> even people who are deep within the LLM research community still seem to be surprised by it

This is sadly to be expected given the amount of noise and the overall amount of papers. In less newsworthy areas like compiler/programming language research I've seen communities that discover the same concepts independently, calling them different names but not citing each other.

And they def can't be accused of plagiarizing given the depth of the papers, more like it's genuinely hard to grok that the different line of thinking describes the same phenomena when they aren't grounded in physical world experiments but operate on abstract concepts which can't be related as easily

edited May 25, 2023Chatbots are random processes, so it's important to run each prompt at least a few times to see how consistent it is. (Ideally we'd use the API and do many trials.) I re-ran the river crossing prompt a couple more times on GPT-4 and reproduced your results, but worded differently each time. Interestingly, on the last attempt it explicitly said it recognized the problem:

---

This is a variant of the classic River Crossing puzzle, where you need to transport everyone across without certain groups being left together. Here's a possible solution to the problem:

First, the Earthling takes the Venutian to Earth via the wormhole, then the Earthling returns alone to the Galactic Federation. This leaves the Martian and the Moonling at the Galactic Federation, who are safe together, and the Venutian alone on Earth.

Next, the Earthling takes the Martian to Earth, leaving the Moonling alone at the Galactic Federation. But, she can't leave the Martian with the Venutian on Earth, so she leaves the Martian on Earth and takes the Venutian back to the Galactic Federation.

Then, she leaves the Venutian at the Galactic Federation and takes the Moonling to Earth. This leaves the Venutian and Martian at the Galactic Federation, who are safe together.

Finally, the Earthling returns to the Galactic Federation one last time to bring the Venutian to Earth.

Now, all of them, the Earthling, the Martian, the Venutian, and the Moonling are on Earth, and at no point were the Venutian and the Martian left alone together, nor the Venutian and the Moonling. The Earthling has successfully navigated the wormhole puzzle!

---

If someone wanted to explore this further, they could try varying the prompt to see when it breaks down. If we tell it that the the wolf is a vegetarian and the goat attacks wolves, does it get confused? What if it's given an unsolvable problem?

This is a little different because it's about in-context learning, but I recently saw an interesting paper about how if you give an LLM a problem with backwards labels, it will get confused. See: https://thegradient.pub/in-context-learning-in-context/

---

Edit: In general, I think this is a good post but the headline is overstated. It seems like the boundary between reasoning and statistics gets rather fuzzy once you allow for copying with substitutions, and LLM's are all about copying with substitutions. They're usually not copying word-for-word, but it's often unclear what level of abstraction they're working at.

Excellent points and I agree with everything you say. Inventing an entirely new riddle would be a stronger test. And yes, I didn't mean for the headline to be so provocative, and I didn't want to imply that GPT-4 is a general intelligence or capable of arbitrarily reflective thought processes or something. "Reasoning" isn't really a well-defined term; I mainly just wanted to make the point that transformers are nothing like a stochastic parrot, which is seemingly how most people are still imagining them.

I think Weiss' paper Thinking Like Transformers (linked from footnote 13) best describes the type of thought that transformers can do. It's still hard to say what level of abstraction GPT-4 is working at, with it being a large and unknown depth, but it's clear that transformers, in training, can learn abstractions, and learn how and when to symbolically manipulate those abstractions.

I'll just leave this uncompleted but provoking thought: if you can do sorting over abstract embeddings, then necessarily you can define comparison operator on those embeddings and construct an ordered set. Does this mean a transformer can learn to express coherent preferences over world states?

The inconsistent answers from human brain could be a flaw than merit. Human body introduces too much noises to the process, e.g., the surroundings, the health condition, there are all kinds of random inputs to the neural system contribute to the outcome. If LLMs hook up bunch of sensors, you could even make them the meaningful like warm and cold, energization or tiredness, it will perform more like human. But is that what humans want?

edited Apr 5No it's not able to sort, it's merely applying statistics to the input.

Even Yann LeCun states that GPT-4 is a statistical correlations engine.

In simple terms, it sees multiple times a sequence of characters (say "42" for example) that it puts it in the output, and the output is ordered (sorted) because the neurons are reinforced to output "42" after "41", and "41" after "40" and so on and so forth, and keep the same amount. That's all it does!

The reason why it failed on 59 is because it has seen that number too few times at the first position is all.

*It's only a large statistical model*

You can easily determine if it had internally developed an algorithm to sort number by not giving one number in the input samples (e.g. 37) and train it, then give it 37 in some input and see whether it can sort it out or not.

I already know the answer: the model doesn't have a sense of ordering numbers expressed in base 10 (decimal), it just has a sense of patterns: it matches groups of digits and produce the output that will result in the lowest error rate. Again, that's only statistics.

edited Apr 7Thanks for the thoughtful comment! I appreciate it, but I unfortunately have to take issue with the argument.

Your premise is correct; if you don't give the training data any inputs containing '37', it won't have any way to handle '37' when it shows up in an input sequence. But if I asked you to sort the sequence "5, 3, 3, 18, 5, 7, ᛝᛈ, 4, 8", you too would probably have a difficult time figuring out what to do with the "ᛝᛈ". In the experiment I described, each value is given its own token, and as far as nanoGPT is concerned they only just-so-happen to take the shape of Arab numerals; there is no built-in notion of ordering between tokens. The transformer needs to learn that for itself.

Humans learn the ordering of Arabic numerals in preschool, but that ordering was invented by Arab mathematicians, not baked into the fabric of the universe. A typical sorting algorithm written in C benefits from built-in hardware support for numeric datatypes, where the ordering of numbers is baked in at the transistor level via the comparison operators "<,>, <=, >=, ==" as well as arithmetic like "++" and "--". Without such operations defined, algorithms like Quicksort would not function. That's the blank-state starting point that our NanoGPT transformer gets initialized with, so let's cut it a bit of slack if it doesn't know how to sort a token it's never seen before.

I also wouldn't dismiss so easily the idea that "42" simply comes after "41" and "40" and so on. That's ordering, but sorting is much harder -- you need to make sure that if "42" showed up three times across in the input sequence, and "40" showed up twice, and "41" zero times, then the output contains the sequence "40, 40, 42, 42, 42", with not-too-many of any number and not-too-few of any number, and of course, never following a larger number with a smaller one. Anything else is incorrect. It might look like simple statistics at first glance, but when you actually dig into what's involved to get counting and sorting both correct at the same time, the curse of dimensionality quickly rears its ugly head. If we decompose the data in terms of correlations, then there are very high-dimensional n-gram correlations at work; we have to look across the entire input list to determine every single output character; nothing can be ignored. Try to sort a list 50 characters long, one character at a time with a pencil and no scratch paper, and you'll see the step-by-step process that a trained transformer has had embedded into its weights. It's not a simple one.

But even if we wanted to train on those high-dimensional correlations, there is not enough data to do so without exploiting the abstractions and symmetries inherent in the task. The chance that the training data somewhere contained any particular sub-sequence that matches the input random list becomes *vanishingly* small as the sub-sequence length grows; 6 numbers is about the limit. It's hard to overemphasize this point. With GPT-4, the training data is not publicly known so it's easy to imagine that it's near-infinite, and any clever-sounding response was actually buried in a Reddit comment somewhere. But in this experiment, I know exactly how many sequences the network trained on, because I generated them!

But even if we trained on a complete list of every possible number sequence of up to 127 characters, the relatively small number of neural network parameters in this sorting-GPT could not possibly be memorizing every correlation! There's simply not enough degrees of freedom.

This has been a long rebuttal, but just one more point: However we end up with some particular model weights -- whether through gradient descent and lots of data, or through compiling a provably-correct RASP program written by a thoughtful human scientist -- that all happens on the training side. Let's now set aside how we arrived at those values, and assume we are somehow given a model and a set of weights that, when run at temperature 0, correctly and deterministically sort all input lists given to it -- I would call that a "sorting program". Would you not? And if I happen to arrive at those weights by gradient descent, and you arrived it by compiling RASP code, it's still the same program either way, is it not?

I don't claim to have shown that nanoGPT produces nearly-equivalent weights to a compiled RASP sorting program (that's more time than I can afford to spend on this fun experiment), but I think it still cuts to the heart of the philosophical question. If a computer program generated by a statistical gradient-descent process ends up, by the unrelenting demands of compression, identifying and leveraging the key abstractions that existed in the data (such as notions of "count", "order", and "sort"), then it's the same to me.

edited Apr 7With this lengthy explanation you're beating around the bush: because the constraints you've set are so high (sorting) what you've achieved is overfitting. The network is now ultra-specialized at producing a few correct outputs based on corresponding few select inputs, but it can't do more than that (like generalization).

If you change one or two things in the input, for example duplicate numbers or shuffle them around, the output certainly and invariably will be wrong because the input is then too far from what the overfit network is expecting.

So yeah you can seemingly make a NN learns anything complex and deterministic using overfitting, as long as you don't ask it to generalize in the same problem class. And that's what you've achieved here: it can produce the right sorted output for a fixed small set of inputs, and that's all there is.

edited Apr 7What makes you think the network is overfitting? The sorting test inputs are not taken from the training data.

Great observations and a nice experiment design!

Have you seen the "Thinking like Transformers" article (https://arxiv.org/abs/2106.06981) and the RASP language?

(TLDR: they show that sorting is one of the operations for which the Transformer architecture is a natural prior, ie it can be expressed with just 1-2 layers if we sort token sequences, and about 3-4 for sequences of multi-token numbers)

Still impressive, but also means that it's not a particularly mind-blowing example, ie it's not inventing quicksort or the like, as the network doesn't even need has to do the pattern recognition and substitution as for the river crossing puzzle

Thank you, and indeed -- see footnote 13!

I agree, in some ways that means the experiment is less surprising. Although, it seems that "Thinking Like Transformers" is still not widely acknowledged, so it's worth publicizing the result anyway because even people who are deep within the LLM research community still seem to be surprised by it. See for example here (https://arxiv.org/abs/2308.09687) where sorting was also used as a performance challenge.

I think, for my purposes, it still serves its essential purpose as a fundamental challenge to the "stochastic parrot" model of thinking about LLMs. That sorting can be expressed naturally as a RASP program makes it even more likely (by Ockham's razor) that a trained-to-sort-transformer generates its output through a process whose procedural steps, rather than just their input-output behaviour, are an approximation to a deterministic algorithm. And conversely very unlikely to be stochastically parroting number sequence completions that it remembers from training. Which, based on the storage size argument, would also have been impossible, at least for any formal interpretation of "stochastic parrot" that I can think of.

(And, just to level the playing field, I'll acknowledge that even a quicksort program doesn't sort an input list by inventing quicksort either -- it merely *executes* quicksort. Since I trained this network to output the sorted input, rather than outputting a program that sorts the input, it's only to be expected that the weights would *implement* a sorting program, which means the invention of that sorting program happened during the training process).

But yes I agree, in retrospect there are probably other tasks that would have been more impressive. Maybe I should try training it to do some other n-log-n task, like a Fourier transform. (Although perhaps that wouldn't impress a skeptic either, seeing as how it can be done by a piece of polished glass: https://en.wikipedia.org/wiki/Fourier_optics).

edited Sep 9, 2023OK I've opened the Substack on my laptop and feel a bit silly having missed the footnote -- they're neither hoverable nor clickable on mobile >_<

Yeah I agree it is still a great example of the difference between Transformers and the Markov chain models (unsure about RNNs)

I find these papers to greatly expand upon the "Thinking like Transformers" line of thinking:

- if you need impressive examples, I think we can play with Transformers Learn Shortcuts to Automata (https://arxiv.org/pdf/2210.10749.pdf) where they show that not just sorting but a quite broad class of algorithms can be encoded in attention + scatter/gather operations, /and/ learned with SGD

...but those ain't "reflective" in the theory-of-mind sense, more like "effortlessly riding the bicycle" if you forgive me antropomorphizing the model, but if we modify the architecture a bit we get Trainable Transformer in Transformer (https://arxiv.org/pdf/2307.01189.pdf) - slight modifications suffice to model the learning/fine-tuning of another Transformer /during inference/ with <10x growth in number of parameters.

and an interesting interpretability toolkit

a) https://arxiv.org/pdf/2306.01128.pdf (Learning Transformer Programs, extract RASP-like code from transformer weights)

b) https://arxiv.org/pdf/2301.05062.pdf (Tracr, RASP-to-weights compiler)

> even people who are deep within the LLM research community still seem to be surprised by it

This is sadly to be expected given the amount of noise and the overall amount of papers. In less newsworthy areas like compiler/programming language research I've seen communities that discover the same concepts independently, calling them different names but not citing each other.

And they def can't be accused of plagiarizing given the depth of the papers, more like it's genuinely hard to grok that the different line of thinking describes the same phenomena when they aren't grounded in physical world experiments but operate on abstract concepts which can't be related as easily