Learning to Search: Amortized Reasoning in LLMs with GFlowNets

Many tasks we care about—chain‑of‑thought (CoT) reasoning, story infilling, tool‑augmented arithmetic—are instances of intractable posterior inference inside a pretrained LLM. Common fine‑tuning strategies such as supervised learning, PPO‑style RLHF, or DPO chase one high‑reward trajectory and ignore the rest, forfeiting diversity and reliability. This post explains how Generative Flow Networks (GFlowNets) turn LLM fine‑tuning into train‑time search: the model is taught to sample complete reasoning paths with probability proportional to their joint likelihood, thereby amortizing Bayesian inference. We weave together intuition, a toy demo, and results from Hu et al. (ICLR 2024) to show why GFlowNets can be a drop‑in alternative that is (i) more data‑efficient, (ii) more robust to reward mis‑specification, and (iii) a natural fit for model‑averaged predictions.

Problem & Motivation

In what sense is chain‑of‑thought reasoning a posterior‑inference problem?
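
One way to make the connection concrete (this is the framing of Hu et al., 2024, with notation lightly adapted): treat the question $X$, the latent reasoning chain $Z$, and the answer $Y$ as one sequence scored by the frozen pretrained LM. Producing a good chain of thought then amounts to sampling from the posterior over $Z$:

$$
p(Z \mid X, Y) \;=\; \frac{p_{\text{LM}}(Z \mid X)\, p_{\text{LM}}(Y \mid X, Z)}{\sum_{Z'} p_{\text{LM}}(Z' \mid X)\, p_{\text{LM}}(Y \mid X, Z')} \;\propto\; p_{\text{LM}}(X, Z, Y).
$$

The sum in the denominator runs over every possible reasoning chain, so exact inference is intractable. GFlowNet fine‑tuning amortizes it: we train a sampler whose probability of producing $Z$ is proportional to the unnormalized joint likelihood, so that a single forward sample approximates a posterior draw.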

Motivational Example

Can an LLM Random-Sample 0-100?
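
Before the theory, a quick probe you can run yourself. The sketch below (assuming the Hugging Face `transformers` API; the model name, prompt, and sample count are illustrative) asks a base LM for a uniformly random integer between 0 and 100, samples a few hundred completions, and tallies the answers. A calibrated sampler would put roughly 1% of its mass on each number; base LLMs typically concentrate it on a handful of favourites, which is exactly the gap between what a model can score and what it actually samples.

```python
import re
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices: any small causal LM and any phrasing of the request will do.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Pick a uniformly random integer between 0 and 100.\nNumber:"
inputs = tok(prompt, return_tensors="pt")

counts = Counter()
with torch.no_grad():
    for _ in range(200):  # sample completions and tally the numbers they contain
        out = model.generate(
            **inputs,
            do_sample=True,               # plain ancestral sampling
            max_new_tokens=4,
            pad_token_id=tok.eos_token_id,
        )
        text = tok.decode(out[0, inputs["input_ids"].shape[1]:])
        match = re.search(r"\d+", text)
        if match and 0 <= int(match.group()) <= 100:
            counts[int(match.group())] += 1

# A perfectly calibrated sampler would put roughly 1% of the mass on each number.
total = sum(counts.values())
for number, c in counts.most_common(10):
    print(f"{number:>3}: {c / total:.1%}")
```

The same probe, rerun after GFlowNet fine‑tuning against a uniform reward, is the toy demo referenced in the abstract.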

GFlowNets 101 — A Primer

Core Algorithm Box
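
A minimal PyTorch sketch of the trajectory‑balance objective, the simplest GFlowNet training loss. (Hu et al. (2024) train with a subtrajectory‑balance variant; the function and variable names below are ours, for illustration.) The policy being trained is the LLM itself, and the reward of a complete chain $Z$ is its unnormalized posterior weight under a frozen copy of the base model, e.g. $R(Z) = p_{\text{LM}}(Z \mid X)\, p_{\text{LM}}(Y \mid X, Z)$.

```python
import torch
import torch.nn.functional as F

def trajectory_balance_loss(policy_logits, tokens, log_reward, log_Z):
    """Trajectory-balance loss for one sampled reasoning chain.

    policy_logits : (T, V) logits of the policy (the LLM being fine-tuned)
                    at each of the T generation steps
    tokens        : (T,) token ids of the sampled chain Z
    log_reward    : scalar log R(Z), e.g. log p_LM(Z|X) + log p_LM(Y|X,Z)
                    under a frozen copy of the base model
    log_Z         : learnable scalar estimating the log partition function
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)         # (T, V)
    log_q = log_probs.gather(1, tokens.unsqueeze(1)).sum()   # log q_theta(Z)
    # Left-to-right generation gives each sequence a unique construction path,
    # so the backward-policy term of trajectory balance is 1 and drops out.
    # At the optimum, log q_theta(Z) = log R(Z) - log Z for every chain,
    # i.e. the policy samples chains with probability proportional to R(Z).
    return (log_Z + log_q - log_reward) ** 2


# Typical training step (sketch): sample Z from a tempered version of the
# current policy for exploration, score it with the frozen base model, then
# minimize the loss jointly w.r.t. the policy parameters and log_Z.
log_Z = torch.nn.Parameter(torch.zeros(()))
```

The key contrast with PPO‑style RLHF: instead of maximizing expected reward, which collapses onto a single mode, this objective is satisfied only when the sampler covers the whole posterior in proportion to reward, which is where the diversity and Bayesian‑model‑averaging benefits come from.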


Empirical Sections

Each subsection follows the same pattern: setup ➜ numbers ➜ one‑line “lesson”.

5.1 Sentence Continuation (likelihood–diversity trade‑off)

5.2 Story Infilling (ROCStories)

5.3 Subjectivity Classification (10–50 labels, EM‑style)

5.4 Tool‑Augmented Arithmetic (OOD length‑generalisation)

Comparison Table: GFlowNet vs MLE / PPO / DPO / STaR

Axes: diversity, sample‑efficiency, reward robustness, compute cost.

What We Learned & Limitations

Strengths: posterior coverage, Bayesian model averaging, and wins in low‑data regimes. Limitations: it needs a reasonably good reward LM; it does not add new knowledge to the model; exploration becomes difficult for very long sequences; and the experiments are limited to models of $\le$ 6B parameters.


Fine-tuning with Chain-of-thought Reasoning

Refer to . Chain-of-thought fine-tuning can degrade performance.

Prior research suggests that, despite generating reasoning steps before the final answer, LLMs may produce reasoning that does not align with their internal decision-making processes, as the two operate in different representational spaces (Tanneru et al., 2024; Agarwal et al., 2024; Rafailov et al., 2023).

Chain-of-thought Reasoning Without Prompting

Refer to . We can elicit chain-of-thought without prompting, but performance is limited by the combinatorial explosion of possible ways of thinking. (This will become a hoverable footnote.)
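
To make that footnote concrete, here is one simple way to elicit reasoning without a CoT prompt (a sketch only; the exact recipe varies across this line of work, and the model, prompt, and branch-at-the-first-token rule below are illustrative): branch over the top‑k candidate first tokens and continue each branch greedily, then inspect which branches surface a useful chain of thought. Even this shallow search covers only $k$ of the roughly $|V|^{T}$ possible length‑$T$ continuations, which is the combinatorial explosion mentioned above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model and prompt; the strategy is: branch over the top-k first
# tokens, then decode each branch greedily and inspect the continuations.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Q: I have 3 apples and buy 5 more. How many apples do I have?\nA:"
input_ids = tok(prompt, return_tensors="pt").input_ids

k = 10
with torch.no_grad():
    first_logits = model(input_ids).logits[0, -1]   # distribution over the first answer token
    topk_ids = first_logits.topk(k).indices         # branch over the k most likely first tokens

    for tid in topk_ids:
        branch = torch.cat([input_ids, tid.view(1, 1)], dim=1)
        out = model.generate(branch, do_sample=False, max_new_tokens=30,
                             pad_token_id=tok.eos_token_id)  # greedy continuation of each branch
        print(repr(tok.decode(out[0, input_ids.shape[1]:])))
```

The rest of this post treats GFlowNet fine‑tuning as the amortized version of this search: instead of enumerating branches at inference time, the sampler is trained so that high‑reward chains are the likely ones to be drawn in the first place.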