System 2 Reasoning Research: A Curated Map Through the Deliberative AI Landscape
Hook
While most developers obsess over prompt engineering, the path to truly intelligent AI might require resurrecting cognitive architectures from the 1980s—and one GitHub repository is building the roadmap.
Context
The AI community has a dirty secret: Large Language Models behave largely like System 1 thinkers. They're brilliant at pattern matching, generating fluent text, and making snap judgments, but they struggle with multi-step planning, self-correction, and the kind of deliberative reasoning humans use to solve novel problems. The distinction comes from dual-process theory, popularized by psychologist Daniel Kahneman: System 1 is fast, intuitive, and automatic; System 2 is slow, analytical, and effortful.
As LLM progress appears to plateau on benchmarks requiring deeper reasoning, researchers are rediscovering decades of work in cognitive architectures: systems like SOAR, ACT-R, and LIDA that were explicitly designed to model human deliberative thinking. Simultaneously, a new wave of agent frameworks (ReAct, AutoGPT) and reasoning-focused models (OpenAI's o1) attempts to add reasoning capabilities to LLMs through prompting, scaffolding, and additional training. The problem? This research is scattered across cognitive science journals from the 1980s, reinforcement learning conferences from the 2010s, and arXiv preprints from last week. The open-thought/system-2-research repository emerged as a community-driven attempt to map this fragmented landscape: a living bibliography that connects classical cognitive science with bleeding-edge LLM research, organized specifically around the question of how to build machines that can actually think, not just predict.
Technical Insight
The repository's architecture is deceptively simple: a hierarchical markdown document that functions as a taxonomic index. But its real value lies in how it categorizes the System 2 reasoning space. It divides the field into five major domains: classical cognitive architectures, LLM-based agent frameworks, reasoning enhancement techniques, meta-learning approaches, and test-time compute strategies.
The classical cognitive architectures section is particularly revealing for modern developers. SOAR (State, Operator, And Result), developed at Carnegie Mellon in 1983, implements a production system where rules fire based on working memory state: an explicit reasoning loop that modern agent frameworks are, in effect, reinventing. ACT-R (Adaptive Control of Thought-Rational) models human cognition through declarative and procedural knowledge modules, with subsymbolic spreading activation that loosely resembles attention mechanisms in transformers. For a developer building an LLM agent today, understanding these architectures helps explain why approaches like ReAct (Reasoning + Acting) work: they can be read as simplified versions of SOAR's deliberation cycle.
Consider how a classical SOAR-style reasoning loop maps to modern LLM agent design:
```python
# Classical SOAR-inspired cycle.
# Schematic only: helpers such as goal_achieved, match_rules,
# select_operator, and create_subgoal are left unimplemented.
class SOARAgent:
    def __init__(self):
        self.working_memory = {}
        self.production_rules = []

    def deliberation_cycle(self, goal):
        while not self.goal_achieved(goal):
            # Elaboration: match rules to current state
            applicable_rules = self.match_rules(self.working_memory)
            # Impasse detection: recognize when stuck, before selecting
            if not applicable_rules:
                self.create_subgoal()
                continue
            # Decision: select next operator
            selected_operator = self.select_operator(applicable_rules)
            # Application: execute and update state
            result = selected_operator.execute()
            self.working_memory.update(result)


# Modern LLM agent implementing similar pattern.
# Schematic only: task_complete, execute_action, detect_loop,
# detect_contradiction, and backtrack are left unimplemented.
class LLMReasoningAgent:
    def __init__(self, llm):
        self.llm = llm  # any object exposing generate(prompt) -> str
        self.context = []

    def reasoning_loop(self, task):
        while not self.task_complete(task):
            # Thought: deliberate about next action
            thought = self.llm.generate(
                f"Context: {self.context}\nTask: {task}\nThought:"
            )
            # Action: decide what to do
            action = self.llm.generate(
                f"{thought}\nAction (search/calculate/finish):"
            )
            # Observation: execute and update context
            observation = self.execute_action(action)
            self.context.append((thought, action, observation))
            # Reflection: detect reasoning failures
            if self.detect_loop() or self.detect_contradiction():
                self.backtrack()
```
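The two classes above are schematic and will not run without their helper methods. To make the match, select, apply cycle concrete, here is a minimal runnable sketch of a toy production system. The `Rule` dataclass, the `run_cycle` function, and the integer `priority` field (a crude stand-in for Soar's preference mechanism) are all illustrative assumptions, not Soar's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, int]


@dataclass
class Rule:
    name: str
    condition: Callable[[State], bool]  # when the rule may fire
    action: Callable[[State], State]    # returns updates to working memory
    priority: int = 0                   # hypothetical stand-in for preferences


def run_cycle(state: State, rules: list[Rule], goal: Callable[[State], bool],
              max_steps: int = 20) -> tuple[State, list[str]]:
    """Repeat match -> select -> apply until the goal holds or an impasse occurs."""
    trace: list[str] = []
    for _ in range(max_steps):
        if goal(state):
            break
        matched = [r for r in rules if r.condition(state)]
        if not matched:
            trace.append("impasse")  # a fuller system would create a subgoal here
            break
        chosen = max(matched, key=lambda r: r.priority)
        state = {**state, **chosen.action(state)}
        trace.append(chosen.name)
    return state, trace


# Toy task: raise a counter to 3, then mark the task done.
rules = [
    Rule("increment", lambda s: s["count"] < 3, lambda s: {"count": s["count"] + 1}),
    Rule("finish", lambda s: s["count"] >= 3 and not s["done"],
         lambda s: {"done": 1}, priority=1),
]
final_state, trace = run_cycle({"count": 0, "done": 0}, rules,
                               goal=lambda s: s["done"] == 1)
```

Here `trace` records three `increment` firings followed by `finish`. Starting from a state that no rule matches produces a single `impasse` entry instead, which is the point where a fuller architecture would open a subgoal.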
The repository frames models like OpenAI's o1 as test-time compute applied to System 2 reasoning. Rather than committing to the first answer it decodes, such a system spends extra inference on a long chain of thought or on exploring several reasoning paths. OpenAI has not published o1's internals, so the exact mechanism is speculation, but candidate techniques in this literature include beam search and AlphaGo-style Monte Carlo tree search with self-play. The linked papers on "Process Reward Models" show how to train a model to score individual reasoning steps, and "STaR (Self-Taught Reasoner)" shows how a model can bootstrap from its own rationales that reach correct answers. Together they give a system something like an internal critic, loosely analogous to ACT-R's conflict resolution mechanism.
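One of the simplest ways to spend test-time compute is to sample several reasoning paths and keep the one a step-level scorer likes best. The sketch below assumes best-of-N selection and aggregates step scores with `min`, so a single bad step sinks a path; both are illustrative design choices, not a description of how o1 or any specific paper works. `toy_step_scorer` is a hypothetical stand-in for a learned process reward model and only checks simple additions.

```python
from typing import Callable

Path = list[str]  # one candidate chain of reasoning steps


def score_path(path: Path, step_scorer: Callable[[str], float]) -> float:
    """Aggregate per-step scores; the weakest step determines the path score."""
    return min(step_scorer(step) for step in path)


def best_of_n(candidates: list[Path], step_scorer: Callable[[str], float]) -> Path:
    """Return the candidate whose weakest step scores highest."""
    return max(candidates, key=lambda p: score_path(p, step_scorer))


def toy_step_scorer(step: str) -> float:
    """Hypothetical scorer: 1.0 if a step of the form 'a + b = c' is correct, else 0.0."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


candidates = [
    ["2 + 3 = 5", "5 + 4 = 10"],  # second step is wrong
    ["2 + 3 = 5", "5 + 4 = 9"],   # every step checks out
    ["2 + 3 = 6", "6 + 4 = 10"],  # first step is wrong
]
best = best_of_n(candidates, toy_step_scorer)
```

Note that the third candidate reaches a self-consistent final line from a wrong first step. An outcome-only check on the last step would accept it, while the step-level scorer rejects it.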
The meta-learning section connects to a crucial insight: System 2 reasoning isn't just about better prompts; it's about systems that improve their own reasoning strategies. Papers on "Learning to Learn" and "Neural Program Synthesis" point toward architectures where agents discover heuristics, cache successful reasoning patterns, and transfer strategies across domains. Classical cognitive architectures pursued the same goal through mechanisms such as SOAR's chunking and procedural learning.
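The idea of caching successful reasoning patterns can be sketched in a few lines. The example below is a deliberately crude approximation of chunking: it stores the result of slow deliberation under an exact-match key, whereas Soar's chunking learns generalized rules. The `ChunkingSolver` class and its `deliberate` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, Hashable


class ChunkingSolver:
    """Caches results of slow deliberation so repeated situations are answered directly."""

    def __init__(self, deliberate: Callable[[Hashable], str]):
        self.deliberate = deliberate            # slow path, e.g. search or an LLM call
        self.chunks: dict[Hashable, str] = {}   # learned situation -> result shortcuts
        self.slow_calls = 0                     # how often the slow path actually ran

    def solve(self, situation: Hashable) -> str:
        if situation in self.chunks:            # fast path: a learned chunk applies
            return self.chunks[situation]
        self.slow_calls += 1
        result = self.deliberate(situation)
        self.chunks[situation] = result         # store the result as a new chunk
        return result


solver = ChunkingSolver(deliberate=lambda s: f"plan-for-{s}")
first = solver.solve("stack-blocks")
second = solver.solve("stack-blocks")  # served from the cache, no deliberation
```

After both calls, `solver.slow_calls` is 1. The hard part, which this sketch skips, is deciding when two situations are similar enough to share a chunk.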
What makes this repository particularly useful is its inclusion of recent multi-agent research. The limitations of single-model reasoning become obvious when you examine papers on debate frameworks, collaborative problem-solving, and agent societies. These approaches distribute System 2 reasoning across multiple LLM instances—one proposes solutions, another critiques, a third synthesizes—mirroring how human deliberation often requires dialogue and perspective-taking. The repository links to implementations like MetaGPT and AutoGen that make these patterns concrete.
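The propose, critique, synthesize pattern described above can be sketched without any framework. The code below is plain Python and does not use the MetaGPT or AutoGen APIs; the `Agent` type (prompt in, text out), the `deliberate` function, and the deterministic stub agents are assumptions made for illustration.

```python
from typing import Callable

Agent = Callable[[str], str]  # hypothetical interface: prompt in, text out


def deliberate(task: str, proposer: Agent, critic: Agent, synthesizer: Agent,
               rounds: int = 2) -> str:
    """Run propose -> critique for a few rounds, then merge the transcript into one answer."""
    transcript: list[str] = []
    proposal = proposer(task)
    for _ in range(rounds):
        critique = critic(proposal)
        transcript += [f"PROPOSAL: {proposal}", f"CRITIQUE: {critique}"]
        proposal = proposer(f"{task}\nRevise given: {critique}")
    transcript.append(f"PROPOSAL: {proposal}")
    return synthesizer("\n".join(transcript))


# Deterministic stubs standing in for three separate LLM instances.
def proposer(prompt: str) -> str:
    return "draft v2" if "Revise" in prompt else "draft v1"


def critic(proposal: str) -> str:
    return f"{proposal} lacks a base case"


def synthesizer(transcript: str) -> str:
    # Simplest possible synthesis: keep the final proposal.
    return transcript.splitlines()[-1].removeprefix("PROPOSAL: ")


answer = deliberate("sort a list recursively", proposer, critic, synthesizer)
```

In a real system each stub would wrap a model call with its own role prompt, and the synthesizer would weigh the whole transcript instead of taking the last proposal.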
The synthetic data section reveals a critical engineering insight: prompting alone is unlikely to get you to System 2 reasoning. Papers on "bootstrapping reasoning from synthetic CoT data" and "distilling reasoning into smaller models" report that training on structured reasoning traces, which exposes models to worked examples of deliberation, can produce better results than zero-shot prompting. This may help explain the appeal of approaches like Constitutional AI and RLHF with process rewards: they aim to teach the model reasoning processes, not just answer patterns.
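A common recipe for building such traces is to sample rationales and keep only those that end in the known correct answer, in the spirit of STaR's filtering step. The sketch below assumes a `sample` callback that returns a `(rationale, answer)` pair; that interface, the `build_reasoning_dataset` function, and the canned outputs are hypothetical and stand in for real model sampling.

```python
from typing import Callable


def build_reasoning_dataset(
    problems: list[tuple[str, str]],           # (question, gold answer)
    sample: Callable[[str], tuple[str, str]],  # question -> (rationale, answer)
    attempts: int = 3,
) -> list[dict[str, str]]:
    """Keep the first sampled rationale per question whose answer matches the gold answer."""
    dataset: list[dict[str, str]] = []
    for question, gold in problems:
        for _ in range(attempts):
            rationale, answer = sample(question)
            if answer == gold:
                dataset.append(
                    {"question": question, "rationale": rationale, "answer": answer}
                )
                break
    return dataset


# Canned outputs standing in for stochastic model samples.
canned = iter([
    ("7 * 6 is 7 * 5 + 7 = 41", "41"),  # wrong answer, discarded
    ("7 * 6 is 7 * 5 + 7 = 42", "42"),  # correct answer, kept
    ("9 - 4 = 6", "6"),                 # wrong on all three attempts
    ("9 - 4 = 6", "6"),
    ("9 - 4 = 6", "6"),
])
dataset = build_reasoning_dataset(
    [("What is 7 * 6?", "42"), ("What is 9 - 4?", "5")],
    sample=lambda question: next(canned),
)
```

Only the first question contributes a training example here. Filtering on the final answer does not guarantee the rationale itself is sound, which is one reason step-level reward models are studied alongside this approach.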
Gotcha
The repository's biggest limitation is also its defining characteristic: it's purely a reference index with minimal critical evaluation. You'll find links to 200+ papers and projects, but almost no guidance on which approaches actually work in production, which are academic curiosities, and which contradict each other. For example, the cognitive architecture section lists SOAR, ACT-R, OpenCog, LIDA, and Sigma without explaining that these represent fundamentally different theories of cognition—using them as interchangeable design patterns would be like mixing React's component model with Angular's dependency injection because both are "frontend frameworks."
The repository also suffers from recency bias and a GitHub-centric perspective. Classical papers from cognitive science journals are underrepresented compared to recent arXiv uploads, and practical engineering considerations (latency, cost, error handling, evaluation metrics) are almost absent. There's no executable code, no benchmarks, and no comparison of techniques on concrete tasks. If you're building a production reasoning system tomorrow, you'll need to read dozens of papers and run your own experiments to figure out what actually matters. The repository gives you a map of the territory but doesn't tell you which paths lead to cliffs and which to treasure.
Verdict
Use if: You're researching AI reasoning systems, designing LLM agent architectures, or writing a literature review on cognitive architectures and need a structured starting point that spans classical cognitive science to cutting-edge LLM research. This repository will save you weeks of bibliography mining and help you avoid reinventing SOAR with extra steps. Also valuable if you're skeptical that prompting alone will achieve AGI and want to understand what deliberative reasoning actually requires from an architectural perspective.
Skip if: You need ready-to-use code, detailed implementation guidance, or opinionated technical recommendations for production systems. This is a breadth-first survey, not a tutorial or framework. Also skip if you're looking for consensus: the repository presents competing approaches without synthesis, so you'll need domain expertise to evaluate tradeoffs yourself.