
Sensory-Motor Inference Engines for Reward-Free Agents

Motivation and problem framing

The core research aim you describe—an agent that learns a world model, explores without external reward shaping, and acquires increasingly robust control—sits at the intersection of predictive coding, the free-energy principle, active inference, and intrinsic motivation research.

A key failure mode you identified is real and widely discussed: if an agent’s only imperative is to reduce short-term sensory prediction error (or “surprise” naïvely construed), it can converge on low-variance, low-information states (the philosophical “dark room” style critique).

The active-inference literature’s central claim is that behavior (action selection) and perception/learning can be derived from the same variational objective: action is selected not by maximizing reward but by minimizing (expected) free energy under a generative model that encodes both sensory consequences and preferences/priors.

Predictive coding and the free-energy foundation

Predictive coding is commonly framed as hierarchical inference in a generative model: higher levels send predictions downward; lower levels return prediction errors upward; internal states are updated to reduce those errors.
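As a minimal sketch of that loop (a single fixed linear generative map, an illustrative simplification of ours rather than any specific published model), inference can be written as gradient descent of a latent estimate on the squared prediction error:

```python
import numpy as np

# Minimal predictive-coding sketch: a fixed linear generative map W predicts
# the input as W @ mu; inference is gradient descent of the latent estimate
# mu on the squared prediction error ||x - W @ mu||^2. Learning would apply
# the same error-driven principle to W itself. All numbers are made up.

W = np.array([[1.0, 0.0],       # generative weights: latent -> predicted input
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([1.0, 2.0, 0.0])   # observed input

mu = np.zeros(2)                # latent estimate held at the "higher" level
lr = 0.2
for _ in range(100):
    e = x - W @ mu              # bottom-up prediction error (residual)
    mu += lr * (W.T @ e)        # top-down estimate updated to explain the error

# mu converges to the least-squares explanation of x under the model W.
print(mu)
```

The point of the sketch is that “prediction error minimization” is just inference by gradient descent on a generative model, which is what licenses the free-energy reading below.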

Early computational accounts (e.g., the classic visual-cortex model by Rajesh Rao and Dana Ballard) formalized how feedback predictions and feedforward residuals can reproduce receptive-field phenomena and support efficient inference.

A major unifying move in this tradition is Karl Friston’s proposal that predictive coding can be derived as (approximate) gradient descent on variational free energy—linking “prediction error minimization” with Bayesian model-evidence maximization.

This matters for your project because it reframes “predictive coding” as more than next-step prediction: it is a general-purpose inference-and-learning mechanism in a hierarchical generative model, which can be extended to include action.

Active inference and intrinsic exploration via expected free energy

Active inference extends predictive coding by treating action as part of the same inferential process: perception updates predictions to explain sensory data, while action changes the world (and thus sensations) to make sensory data conform to predictions.

A widely cited formulation is “predictions, not commands” in motor control: descending signals encode proprioceptive predictions; peripheral reflex arcs (or low-level control loops) reduce proprioceptive prediction errors, producing movement.

For planning, active inference uses expected free energy (EFE) to evaluate policies. Critically for your “reward-free exploration” requirement, EFE decomposes into terms that correspond to extrinsic value (alignment with preferred outcomes) and epistemic value (expected information gain / uncertainty reduction).

This epistemic term is where “novelty seeking” can be made principled: in the EFE framing, information gain is not an add-on bonus but a consequence of minimizing expected free energy.

Several comparative/survey-style papers make the specific point that active inference can produce information-seeking behavior in the absence of an explicit reward signal, because epistemic value drives exploratory action to improve the agent’s model.

A further refinement is the “risk vs ambiguity” view: expected free energy can be written in a form that relates to risk (divergence from preferred outcomes) and ambiguity (expected uncertainty in observations given states), clarifying how agents may trade off goal satisfaction and uncertainty reduction.
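The two decompositions are algebraic rearrangements of one quantity, so they can be checked numerically for a single policy in a discrete setting. A hedged sketch (the matrix names A and C follow common discrete active-inference conventions for the likelihood and preference distributions; the numbers are made up):

```python
import numpy as np

# EFE for one policy: risk + ambiguity vs. -(extrinsic value) - (epistemic
# value). Both are computed from the same Q(s|pi), P(o|s), and P(o), and
# should agree exactly. Illustrative toy numbers, 2 observations x 2 states.

A  = np.array([[0.9, 0.2],      # P(o|s): observation likelihood
               [0.1, 0.8]])
qs = np.array([0.6, 0.4])       # Q(s|pi): predicted state belief under policy
C  = np.array([0.7, 0.3])       # P(o): prior preference over observations

qo = A @ qs                                  # predicted observations Q(o|pi)

# Risk + ambiguity form
risk      = np.sum(qo * np.log(qo / C))                  # KL[Q(o|pi) || P(o)]
ambiguity = -np.sum(qs * np.sum(A * np.log(A), axis=0))  # E_Q(s)[ H[P(o|s)] ]
G_risk_ambiguity = risk + ambiguity

# -(extrinsic) - (epistemic) form
extrinsic = np.sum(qo * np.log(C))                       # E_Q(o)[ ln P(o) ]
qs_post   = (A * qs) / qo[:, None]                       # Q(s|o) by Bayes' rule
info_gain = np.sum(qo * np.sum(qs_post * np.log(qs_post / qs), axis=1))
G_extrinsic_epistemic = -extrinsic - info_gain

print(G_risk_ambiguity, G_extrinsic_epistemic)           # equal up to rounding
```

Note that `info_gain` is an expected KL divergence and hence non-negative: the epistemic term can only make policies more attractive, which is exactly the principled exploration drive described above.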

Why pure “predict-the-next-observation” collapses and how the literature addresses it

The intuition “a pure predictor will do nothing to keep inputs predictable” is closely related to the free-energy principle’s “dark room problem,” which challenges the idea that surprise minimization alone could explain adaptive behavior.

Responses in the active inference literature emphasize that agents do not minimize surprise with respect to arbitrary sensory states. Rather, they minimize free energy relative to a structured generative model that includes prior expectations/preferences about viable states (e.g., not occupying a dark, atypical, non-ecological niche), plus epistemic imperatives that motivate information seeking.

There is also active debate about details of the action story in predictive processing—e.g., whether action primarily “quenches” prediction error or “prevents” it by shaping sampling—highlighting that implementation choices can reflect different theoretical emphases even within the same broad framework.

The practical takeaway for your R&D framing is that “predictive coding alone” can look passive if it is implemented as myopic next-input prediction without (i) epistemic value / information gain, (ii) priors/preferred outcomes (even minimal viability priors), and/or (iii) an action model that couples policy to future observations.

Intrinsic motivation mechanisms that align with your “novelty + control” thesis

Intrinsic motivation research in ML and developmental robotics has long explored mechanisms that generate exploration without externally specified rewards. A foundational overview in RL terms is associated with Andrew G. Barto, while developmental-robotics work by Pierre-Yves Oudeyer and colleagues emphasizes curiosity and “learning progress” as drivers of open-ended skill acquisition.

A major thread, associated with Jürgen Schmidhuber, frames curiosity as reward for improvement in prediction/compression (sometimes described as learning progress or compression progress), not merely for being surprised.

In mainstream deep-RL practice, many “curiosity” systems operationalize novelty as prediction error (or related surrogates). For example, Deepak Pathak et al. propose an intrinsic reward based on forward-model prediction error in a learned feature space to drive exploration when extrinsic rewards are sparse or absent.

However, a known pitfall of “prediction error as novelty” is the noisy-TV problem: if an observation source is stochastic and unlearnable, naive curiosity can be trapped chasing irreducible error (confusing aleatoric uncertainty with epistemic uncertainty). This failure mode is discussed explicitly in work connected to Yuri Burda and in later analyses of noise-robust exploration.
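The contrast between rewarding raw error and rewarding error reduction is easy to make concrete. A hedged sketch with made-up error traces (the helper name and window size are our choices, not from the cited work):

```python
import numpy as np

def learning_progress_reward(errors, window=5):
    """Intrinsic reward = recent drop in mean prediction error (illustrative)."""
    if len(errors) < 2 * window:
        return 0.0
    older  = np.mean(errors[-2 * window:-window])
    recent = np.mean(errors[-window:])
    return max(0.0, older - recent)   # positive only while the predictor improves

# A learnable signal: prediction error falls as the model trains.
learnable_errors = list(np.linspace(1.0, 0.1, 40))

# An unlearnable "noisy-TV" signal: error is pinned at its irreducible level.
# Raw prediction error stays high (naive curiosity keeps paying out), but
# learning progress is zero, so the agent is not trapped.
noisy_tv_errors = [0.5] * 40

print(learning_progress_reward(learnable_errors))   # positive
print(learning_progress_reward(noisy_tv_errors))    # 0.0
```

This is the sense in which progress-based curiosity separates epistemic uncertainty (reducible, rewarded) from aleatoric uncertainty (irreducible, ignored).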

Empowerment is the intrinsic motivation concept most directly aligned with your “keep options open / gain control to enable exploration” intuition. In the standard definition (introduced by Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv), empowerment is the channel capacity (maximal mutual information) between an agent’s actions (over a horizon) and its future sensed states—an explicitly agent-centric measure of potential influence/control.
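For a small discrete action-to-state channel, one-step empowerment can be computed exactly as channel capacity via the Blahut-Arimoto algorithm. The sketch below is illustrative (the channel matrices and iteration count are our choices), not a reference implementation:

```python
import numpy as np

def empowerment_bits(channel, iters=500):
    """Capacity (bits) of p(s'|a) via Blahut-Arimoto; rows index actions."""
    n_a = channel.shape[0]
    p_a = np.full(n_a, 1.0 / n_a)            # start from a uniform action prior
    for _ in range(iters):
        p_s = p_a @ channel                  # marginal over next states
        # per-action KL[p(s'|a) || p(s')], in bits
        kl = np.sum(channel * np.log2(channel / p_s), axis=1)
        p_a = p_a * np.exp2(kl)              # Blahut-Arimoto reweighting
        p_a /= p_a.sum()
    p_s = p_a @ channel
    kl = np.sum(channel * np.log2(channel / p_s), axis=1)
    return float(np.sum(p_a * kl))           # mutual information at optimum

eps = 0.01
# Two actions that reliably reach distinct states: empowerment near 1 bit.
effective = np.array([[1 - eps, eps],
                      [eps, 1 - eps]])
# Two actions with identical consequences: no influence, 0 bits.
useless = np.array([[0.5, 0.5],
                    [0.5, 0.5]])

print(empowerment_bits(effective), empowerment_bits(useless))
```

The two toy channels show why empowerment tracks “control” rather than “novelty”: a state where actions have distinguishable consequences is high-empowerment regardless of how surprising its observations are.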

The empowerment literature also highlights that “goal-agnostic” objectives can have surprising side effects in multi-agent settings; recent work shows empowerment maximization for one party can reduce another’s agency (“disempowerment”), which is relevant if your long-term ambitions include social/multi-agent scenarios.

Conceptually, active inference’s epistemic value (information gain) and empowerment (control capacity) are different quantities—one is about reducing uncertainty in beliefs/models; the other is about maximizing potential influence—but both can yield exploratory behavior and both can be expressed in information-theoretic terms.

Existing implementations and practical approximations relevant to a NumPy proof-of-concept

For “build it from scratch in NumPy” pragmatics, the most implementable form of active inference in a small proof-of-concept is often the discrete-state, discrete-time POMDP formulation: beliefs are vectors; likelihoods and transitions are matrices/tensors; inference becomes message passing / variational updates; planning becomes evaluating EFE over short policy horizons.
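Concretely, one act-and-observe step in this formulation is just matrix arithmetic. A hedged sketch (matrix names A and B follow common discrete active-inference conventions for the likelihood and per-action transitions; the numbers are made up):

```python
import numpy as np

# Discrete POMDP building blocks: a belief vector qs over hidden states, a
# likelihood matrix A with A[o, s] = P(o|s), and per-action transition
# matrices B with B[a][s', s] = P(s'|s, a). Illustrative toy numbers.

A = np.array([[0.8, 0.1],        # 2 observations x 2 hidden states
              [0.2, 0.9]])
B = np.array([[[0.9, 0.3],       # B[0]: transition under action 0
               [0.1, 0.7]],
              [[0.2, 0.6],       # B[1]: transition under action 1
               [0.8, 0.4]]])

def step_belief(qs, action, obs):
    """Predict through B[action], then update on obs via Bayes' rule with A."""
    prior = B[action] @ qs               # predictive belief over next state
    posterior = A[obs] * prior           # likelihood of obs times prior
    return posterior / posterior.sum()   # normalize

qs = np.array([0.5, 0.5])                # start maximally uncertain
qs = step_belief(qs, action=0, obs=1)
print(qs)                                # belief now favors the second state
```

Planning then amounts to rolling such belief updates forward under each candidate policy and scoring the predicted beliefs with EFE over a short horizon.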

A particularly relevant reference implementation is the pymdp ecosystem: it is explicitly a NumPy-based library for simulating active inference agents in discrete state spaces, with a peer-reviewed software paper and tutorials that build an agent “from scratch” in a grid-world and demonstrate epistemic (curiosity-like) behavior.

In the continuous-time/cortical predictive-coding lineage, there are also process-theory style accounts that map variational objectives to neuronal message passing and action selection, which can inform how you translate “predictive coding foundations” into concrete update equations.

Modern expository work comparing active inference to RL highlights a key “engineering translation”: active inference planning can be understood as solving a particular kind of entropy-regularized POMDP where prior preferences replace explicit rewards and epistemic value replaces ad hoc exploration bonuses—useful when you want principled exploration without reward design.

Simple environments and evaluation considerations for reward-free exploration

For an R&D testbed, your instinct to use a simple environment is aligned with the literature: early active inference demos typically use small POMDP/gridworld tasks precisely because they make generative-model specification and epistemic planning transparent.

If you still want a standard API wrapper to swap environments easily, the current maintained successor to OpenAI’s Gym is Gymnasium, maintained by the Farama Foundation, with documentation and an associated paper describing it as the actively maintained fork/successor to Gym.

For “reward-free” evaluation, the research landscape suggests separating at least three measurable outcomes: (i) predictive model quality (e.g., log likelihood / prediction error under the learned generative model), (ii) epistemic behavior (information gain / uncertainty reduction over hidden states or model parameters), and (iii) controllability/option-preservation proxies (empowerment estimates or reachable-state diversity). These map directly onto the epistemic/extrinsic decomposition of EFE and empowerment’s channel-capacity definition.
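Each of the three outcomes reduces to a few lines of NumPy over an episode log. A hedged sketch with made-up data (the variable names and the specific proxies chosen are our illustrative simplifications):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; small constant guards log(0)."""
    return -np.sum(p * np.log(p + 1e-12))

# (i) Predictive model quality: mean log-likelihood the model assigned to
#     the observations it actually received (toy per-step probabilities).
obs_probs = np.array([0.7, 0.9, 0.6, 0.8])
model_quality = np.mean(np.log(obs_probs))

# (ii) Epistemic behavior: entropy of the state belief at episode start
#      versus episode end (uncertainty reduction over hidden states).
qs_start = np.array([0.25, 0.25, 0.25, 0.25])
qs_end   = np.array([0.85, 0.05, 0.05, 0.05])
uncertainty_reduction = entropy(qs_start) - entropy(qs_end)

# (iii) Controllability proxy: diversity of distinct states visited
#       (a crude stand-in for empowerment / option preservation).
visited = np.array([0, 1, 1, 2, 3, 0, 2])
reachable_diversity = len(np.unique(visited))

print(model_quality, uncertainty_reduction, reachable_diversity)
```

Tracking all three separately makes it visible when an agent improves its model without exploring, or explores without gaining control, which a single scalar score would hide.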