Explanations and visualisations:
- Jurafsky & Martin, Speech and Language Processing, chapter 9
- Jay Alammar’s blog: The Illustrated DeepSeek-R1
- The Stanford Natural Language Inference (SNLI) Corpus
- Amy Perfors, Joshua B. Tenenbaum, Thomas L. Griffiths, Fei Xu: A tutorial introduction to Bayesian models of cognitive development

Source https://archive-ouverte.unige.ch/unige:38422
The goal of reasoning is to establish the truth value or the intent of a statement. Real events can also be seen as statements, since our cognitive system processes them and maps them to abstract internal representations.
Common sense reasoning is intertwined with language, but the relation is non-trivial. For example, the following single sentence, as an element of natural language, contains two statements.
Sentence:
Mary has a red car.
Statements:
A statement is a structure consisting of at least one entity and one relation.
Mary has a car.
The car is red.
Entities:
Mary, the car
Relations:
having (possession), being red (property)
To test common sense reasoning, we often ask questions. When ChatGPT was released as the first general-purpose conversational agent, users tested its reasoning capacities by asking:
What weighs more, 1 kg of feathers or 1 kg of iron?
In early versions, the system returned the answer that 1 kg of iron weighs more, clearly unable to take into account that the quantifier (1 kg) should override the common knowledge that, if we handle approximately the same volume of feathers and iron, the iron is much heavier. Interestingly, many humans would make the same mistake.
In the meantime, the models have improved, but they are still easily tricked. For example:
Q: if my friend has three cars, does she have two cars?
A: Yes. If your friend has three cars, then she does have two cars, because having three
necessarily includes having at least two (and one more).
But:
Q: Does a four-legged animal have three legs?
A: No — a four-legged animal does not have three legs.
Here is a slightly more sophisticated example:
Q: When Winston entered the room, the clock was ticking. Was the clock in the room?
A: Short answer: Not necessarily — but usually implied.
Some time ago, the answer to this prompt used to be a straight YES, but now it seems more correct: from this context alone, we do not actually know where the clock was. Many humans would respond YES too.
But:
Q: When Winston entered the room he saw the door was locked. Was Winston locked in?
A: If he had already entered before it was locked, he might be locked in.
If he saw it from outside or it was locked after he entered, he might not be locked in.
Most reasoning is about establishing what is true, but this is not all. Humans also apply reasoning to establish and maintain communication. For this, we use special forms of sentences:
We understand questions and instructions as prompts to act instead of evaluating their truth values.
Moreover, we also recognise implicit communicative intents. For example, if we see a piece of paper pasted on a wall, we understand that it carries a message for us. This understanding is the result of reasoning that we perform automatically and constantly.
We can say that formal logic differs from common sense reasoning in a similar way that formal languages differ from natural language. It is a system devised to establish explicit rules of reasoning: which conclusions can be drawn from which premises. The goal is to describe valid logical forms while disregarding the empirical truth of the statements. For example, the argument "all cars are red; Mary has a car; therefore Mary's car is red" is logically valid even though its first premise is empirically false.
Most of the time, reasoning requires assessing the probability that some statement is true. Consider the following example from Perfors et al.:
Imagine you see your friend Sally coughing. What could explain this? One possibility
(call it h(cold)) is that Sally has a cold; another (call it h(cancer)) is that she has
lung cancer; and yet another (call it h(heartburn)) is that she has heartburn. Intuitively,
in most contexts, h(cold) seems by far the most probable, and may even be the only one that
comes to mind consciously. Why? The likelihood favors h(cold) and h(cancer) over h(heartburn)
since colds and lung cancer cause coughing, while heartburn does not. The prior, however,
favors h(cold) and h(heartburn) over h(cancer): lung cancer is thankfully rare, while colds
and heartburn are common. Thus the posterior probability – the product of these two terms
– is high only for h(cold).
The term prior comes from the paradigm known as Bayesian reasoning, in which we estimate the probability of a hypothesis given an observation (also called data or evidence). This estimate can be expressed as a conditional probability (called the posterior). In our example, it looks like this:
p(cold|coughing) = (p(coughing|cold) x p(cold)) / p(coughing)
And its general form is:
P(A|B) = (P(B|A) x P(A)) / P(B)
Since coughing is observed, p(coughing) is the same for every hypothesis and does not affect their comparison, so we only need to compare the products p(coughing|cold) x p(cold), p(coughing|cancer) x p(cancer), and so on. To conclude that our friend has a cold, both the likelihood and the prior in the corresponding product need to be high. In this way, we introduce probability into assessing the truth of a statement.
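With purely illustrative numbers (assumed here for the sake of the calculation, not taken from any real data), the comparison could look like this:
p(coughing|cold) x p(cold) = 0.8 x 0.2 = 0.16
p(coughing|cancer) x p(cancer) = 0.8 x 0.001 = 0.0008
p(coughing|heartburn) x p(heartburn) = 0.01 x 0.2 = 0.002
Only h(cold) combines a high likelihood with a high prior, so only its posterior is high.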
This reasoning is the basis of one of the most popular machine learning algorithms: the Naive Bayes classifier. Applying the same logic, the algorithm assesses the probability of a label (hypothesis) given the input (data), as in the sketch below.
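A minimal sketch of this logic in Python, assuming a tiny invented training set (an illustration, not a production implementation): the classifier picks the label whose prior times likelihood is highest.

import math
from collections import Counter, defaultdict

# Invented toy training data: (text, label) pairs.
train = [
    ("I feel sick and I am coughing", "ill"),
    ("my throat hurts and I cough a lot", "ill"),
    ("what a sunny and pleasant day", "healthy"),
    ("I feel great and full of energy", "healthy"),
]

# Estimate the prior p(label) and the word counts behind the likelihood p(word|label).
label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for word in text.lower().split():
        word_counts[label][word] += 1
        vocab.add(word)

def log_posterior(text, label):
    # Unnormalised log-posterior: log p(label) + sum over words of log p(word|label).
    score = math.log(label_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.lower().split():
        # Add-one smoothing so that unseen words do not zero out the product.
        score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

query = "I am coughing today"
print(max(label_counts, key=lambda label: log_posterior(query, label)))  # prints "ill"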
In cognitive science, Bayesian reasoning is often considered to be a computational model of how humans learn from evidence. This model is hierarchical, which means that our prior beliefs originate in more abstract knowledge. For example, our estimate of p(cold) might come not only from how many times we have seen a cold but also from our general knowledge that humans can have this condition (while some other entities cannot). Once our conclusion about a particular case is validated, we update the more abstract knowledge too.

Based on the book Thinking, Fast and Slow by Daniel Kahneman.
This book shows quite convincingly that humans have two modes of thinking.
Fast: automatic, intuitive, effortless (System 1 in Kahneman's terms).
Slow: deliberate, effortful, analytical (System 2).
The following example shows a simple calculation that almost all people are able to perform:
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball.
How much does the ball cost? _____ cents
Yet, a great majority of people will first give the same wrong answer (10 cents), and only come up with the correct answer (5 cents) if they are asked to think more carefully.
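Working it out explicitly: if the ball costs b, the bat costs b + 1.00, so
b + (b + 1.00) = 1.10
2b = 0.10
b = 0.05
The intuitive answer of 10 cents would make the bat cost $1.10 and the total $1.20.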
This shows that we tend to be in the fast thinking mode by default, but we are able to switch to slow thinking in some circumstances.
Due to this, we often make mistakes in reasoning.

Source https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r
For now, the only way we can enrich LLMs with some reasoning skills is to give them long chains of human reasoning as examples (in the form of text data) and train them to imitate human behaviour in this way. The models are not able to generate and test hypotheses; they are currently designed to memorise and compress the observed data. When prompted, they act automatically, following the learned patterns, similar to any other programmed machine, just with much more variation in their responses to the input.
To be able to act when prompted, models need to see many examples of prompts and learn to act from the example responses to those prompts. While we might experience these learnt responses as reasoning, models are simply generating the next token given the history, just as they did during pre-training. Continued training on question-answer and instruction-response pairs is called instruction tuning.
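As an illustration, a single instruction-tuning example might look roughly like the sketch below; the field names and the prompt template are assumptions made for the sketch, not a specific dataset format.

# Hypothetical instruction-tuning example: a prompt paired with the desired response.
example = {
    "instruction": "Summarise the following sentence in five words.",
    "input": "Mary has a red car that she drives to work every day.",
    "output": "Mary drives her red car.",
}

# During instruction tuning, the pair is flattened into one token sequence and the
# model is still trained on next-token prediction, exactly as in pre-training.
training_text = (
    "Instruction:\n" + example["instruction"] + "\n"
    + "Input:\n" + example["input"] + "\n"
    + "Response:\n" + example["output"]
)
print(training_text)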
Preference alignment is another kind of additional training, in which models learn what kind of behaviour is good and what is not. Here, a separate model is trained using reinforcement learning techniques (learning from rewards instead of from mistakes) and then combined with the main model to produce a joint output.
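As an illustration, the training data for this step typically pairs a prompt with a preferred and a rejected response; the field names below are assumptions made for the sketch, not a specific dataset format.

# Hypothetical preference example: the reward signal comes from the fact that
# the "chosen" response should be scored higher than the "rejected" one.
preference_example = {
    "prompt": "What weighs more, 1 kg of feathers or 1 kg of iron?",
    "chosen": "They weigh the same: both are exactly 1 kg.",
    "rejected": "1 kg of iron weighs more, because iron is much heavier than feathers.",
}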
It has been observed recently that models can improve their answers without updating weights. This can be achieved in two ways: