The core research aim you describe (an agent that learns a world model, explores without external reward shaping, and acquires increasingly robust control) sits at the intersection of predictive coding, the free-energy principle, active inference, and intrinsic motivation research.
A key failure mode you identified is real and widely discussed: if an agent’s only imperative is to reduce short-term sensory prediction error (or “surprise”, naïvely construed), it can converge on low-variance, low-information states, the philosophical “dark room” critique.
The central claim of the active-inference literature is that behavior (action selection) and perception/learning can be derived from the same variational objective: action is optimized not by reward maximization but by minimizing (expected) free energy under a generative model that includes both sensory consequences and preferences/priors.
Predictive coding is commonly framed as hierarchical inference in a generative model: higher levels send predictions downward; lower levels return prediction errors upward; internal states are updated to reduce those errors.
Early computational accounts (e.g., the classic visual-cortex model of Rajesh Rao and Dana Ballard) formalized how feedback predictions and feedforward residuals can reproduce receptive-field phenomena and support efficient inference.
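To make the predictions-down / errors-up loop concrete, here is a minimal single-layer sketch in NumPy in the spirit of that lineage; the linear generative model, orthonormal weights, learning rate, and dimensions are illustrative assumptions, not the published Rao–Ballard architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generative model: the input x is produced by a few hidden causes.
n_input, n_hidden = 8, 3
W, _ = np.linalg.qr(rng.normal(size=(n_input, n_hidden)))  # generative weights
s_true = np.array([1.0, -0.5, 0.3])                        # hidden causes
x = W @ s_true                                             # observed input

# Predictive-coding inference: gradient descent on the squared prediction
# error ||x - W r||^2, i.e., top-down predictions vs. bottom-up residuals.
r = np.zeros(n_hidden)          # latent representation (internal state)
lr = 0.1
for _ in range(200):
    e = x - W @ r               # bottom-up prediction error
    r += lr * W.T @ e           # update internal state to reduce the error

print(np.linalg.norm(r - s_true))   # representation recovers the hidden causes
```

The same error signal can also drive learning of `W` (the Rao–Ballard model alternates state inference with weight updates); here the weights are held fixed to isolate the inference loop.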
A major unifying move in this tradition is Karl Friston’s proposal that predictive coding can be derived as (approximate) gradient descent on variational free energy, linking “prediction error minimization” with Bayesian model-evidence maximization.
This matters for your project because it reframes “predictive coding” as more than next-step prediction: it is a general-purpose inference-and-learning mechanism in a hierarchical generative model, which can be extended to include action.
Active inference extends predictive coding by treating action as part of the same inferential process: perception updates predictions to explain sensory data, while action changes the world (and thus sensations) to make sensory data conform to predictions.
A widely cited formulation in motor control is “predictions, not commands”: descending signals encode proprioceptive predictions, and peripheral reflex arcs (or low-level control loops) reduce proprioceptive prediction errors, producing movement.
For planning, active inference uses expected free energy (EFE) to evaluate policies. Critically for your “reward-free exploration” requirement, EFE decomposes into terms that correspond to extrinsic value (alignment with preferred outcomes) and epistemic value (expected information gain / uncertainty reduction).
This epistemic term is where “novelty seeking” can be made principled: in the EFE framing, information gain is not an add-on bonus but a consequence of minimizing expected free energy.
Several comparative and survey papers make the specific point that active inference can produce information-seeking behavior in the absence of an explicit reward signal, because epistemic value drives exploratory action to improve the agent’s model.
A further refinement is the “risk vs. ambiguity” view: expected free energy can be written in a form that separates risk (divergence of predicted outcomes from preferred outcomes) and ambiguity (expected uncertainty in observations given states), clarifying how agents trade off goal satisfaction against uncertainty reduction.
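As a concrete instance of the risk/ambiguity reading, here is a one-step NumPy sketch for a discrete POMDP; the matrices, preference vector, and policy-as-state-distribution shortcut are toy assumptions, and real formulations evaluate EFE over multi-step policy rollouts.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions (matching support assumed)."""
    return float(np.sum(p * (np.log(p + 1e-16) - np.log(q + 1e-16))))

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-16)))

def expected_free_energy(A, q_s, C):
    """One-step EFE in risk + ambiguity form.

    A   : (n_obs, n_states) likelihood matrix, columns sum to 1
    q_s : (n_states,) predicted state distribution under a policy
    C   : (n_obs,) distribution over preferred outcomes
    """
    q_o = A @ q_s                                   # predicted outcome distribution
    risk = kl(q_o, C)                               # divergence from preferences
    ambiguity = sum(q_s[s] * entropy(A[:, s]) for s in range(len(q_s)))
    return risk + ambiguity

# Two candidate policies over 2 states / 2 outcomes:
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])          # fairly unambiguous observations
C = np.array([0.99, 0.01])          # strong preference for outcome 0
policy_a = np.array([1.0, 0.0])     # this policy leads to state 0
policy_b = np.array([0.0, 1.0])     # this policy leads to state 1
print(expected_free_energy(A, policy_a, C))   # lower: preferred and unambiguous
print(expected_free_energy(A, policy_b, C))
```

Selecting the minimum-EFE policy here picks `policy_a`; with a flat `C`, only the ambiguity term would differ, which is how purely epistemic (preference-free) action selection falls out of the same expression.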
The intuition that “a pure predictor will do nothing to keep inputs predictable” is closely related to the free-energy principle’s “dark room problem”, which challenges the idea that surprise minimization alone could explain adaptive behavior.
Responses in the active inference literature emphasize that agents do not minimize surprise with respect to arbitrary sensory states: they minimize free energy relative to a structured generative model that includes prior expectations/preferences about viable states (e.g., not occupying a dark, atypical, non-ecological niche), plus epistemic imperatives that motivate information seeking.
There is also active debate about the details of the action story in predictive processing, e.g., whether action primarily “quenches” prediction error or “prevents” it by shaping sampling, highlighting that implementation choices can reflect different theoretical emphases even within the same broad framework.
The practical takeaway for your R&D framing is that “predictive coding alone” can look passive if it is implemented as myopic next-input prediction without (i) epistemic value / information gain, (ii) priors/preferred outcomes (even minimal viability priors), and/or (iii) an action model that couples policy to future observations.
Intrinsic motivation research in ML and developmental robotics has long explored mechanisms that generate exploration without externally specified rewards. A foundational overview in RL terms is associated with Andrew G. Barto, while developmental robotics work by Pierre-Yves Oudeyer and colleagues emphasizes curiosity and “learning progress” as drivers of open-ended skill acquisition.
A major thread, associated with Jürgen Schmidhuber, frames curiosity as reward for improvement in prediction/compression (sometimes described as learning progress or compression progress), not merely for being surprised.
In mainstream deep-RL practice, many “curiosity” systems operationalize novelty as prediction error (or related surrogates). For example, Deepak Pathak et al. propose an intrinsic reward based on forward-model prediction error in a learned feature space to drive exploration when extrinsic rewards are sparse or absent.
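A stripped-down illustration of the prediction-error-as-curiosity idea (not the full ICM architecture, which adds a learned feature encoder and an inverse model): a linear forward model whose squared error serves as the intrinsic reward. The dynamics, dimensions, and learning rate are toy assumptions; the point is that the reward signal decays as the dynamics become predictable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear forward model f(s, a) -> s'; its squared error is the intrinsic reward.
n_feat, n_act = 4, 2
M_true = rng.normal(size=(n_feat, n_feat + n_act))  # unknown true dynamics
M = np.zeros_like(M_true)                           # learned forward model
lr = 0.05

rewards = []
for step in range(500):
    s = rng.normal(size=n_feat)                     # sampled state features
    a = rng.normal(size=n_act)                      # sampled action
    x = np.concatenate([s, a])
    s_next = M_true @ x                             # environment transition
    err = s_next - M @ x                            # forward-model residual
    rewards.append(float(err @ err))                # intrinsic reward = pred. error
    M += lr * np.outer(err, x)                      # SGD update on the model

# As the dynamics become predictable, the intrinsic reward decays toward zero,
# which is what pushes a curious agent on to regions it cannot yet predict.
print(np.mean(rewards[:50]), np.mean(rewards[-50:]))
```

Note that this sketch also exhibits the noisy-TV failure discussed next: adding irreducible noise to `s_next` would leave a floor of intrinsic reward that never decays, trapping a naive error-chasing agent.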
However, a known pitfall of “prediction error as novelty” is the noisy-TV problem: if an observation source is stochastic and unlearnable, naive curiosity can be trapped chasing irreducible error (confusing aleatoric with epistemic uncertainty). This failure mode is discussed explicitly in work connected to Yuri Burda and in later analyses of noise-robust exploration.
Empowerment is the intrinsic motivation concept most directly aligned with your “keep options open / gain control to enable exploration” intuition. In the standard definition (introduced by Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv), empowerment is the channel capacity (maximum mutual information) between an agent’s actions over a horizon and its future sensed states: an explicitly agent-centric measure of potential influence/control.
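For intuition, empowerment is easy to compute exactly in tiny deterministic worlds, where the channel capacity from action sequences to final states reduces to the log of the number of distinct reachable states. A sketch (the corridor environment is an illustrative assumption):

```python
import numpy as np
from itertools import product

def empowerment_deterministic(step, s0, n_actions, horizon):
    """n-step empowerment for a deterministic environment.

    For a deterministic channel from action sequences to final states, the
    channel capacity max I(A^n; S') equals log2 of the number of distinct
    reachable final states (a uniform input over one sequence per state
    achieves it).
    """
    finals = set()
    for seq in product(range(n_actions), repeat=horizon):
        s = s0
        for a in seq:
            s = step(s, a)
        finals.add(s)
    return np.log2(len(finals))

# Toy 1-D corridor of 5 cells; actions: left, right, stay.
def step(s, a):
    return min(4, max(0, s + {0: -1, 1: 1, 2: 0}[a]))

# From the centre, 2 steps can reach all 5 cells; from a wall, only 3,
# so the centre is the more "empowered" state.
print(empowerment_deterministic(step, 2, 3, 2))   # log2(5)
print(empowerment_deterministic(step, 0, 3, 2))   # log2(3)
```

For stochastic channels the same quantity requires a capacity computation (e.g., the Blahut–Arimoto algorithm) rather than this reachable-set shortcut.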
The empowerment literature also highlights that “goal-agnostic” objectives can have surprising side effects in multi-agent settings; recent work shows that empowerment maximization for one party can reduce another’s agency (“disempowerment”), which is relevant if your long-term ambitions include social/multi-agent scenarios.
Conceptually, active inference’s epistemic value (information gain) and empowerment (control capacity) are different quantities: one is about reducing uncertainty in beliefs/models, the other about maximizing potential influence. But both can yield exploratory behavior, and both can be expressed in information-theoretic terms.
For “build it from scratch in NumPy” pragmatics, the most implementable form of active inference in a small proof-of-concept is the discrete-state, discrete-time POMDP formulation: beliefs are vectors; likelihoods and transitions are matrices/tensors; inference becomes message passing / variational updates; planning becomes evaluating EFE over short policy horizons.
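For the belief-update half of that recipe, a minimal exact filtering step might look like this; the array shapes follow one common convention (`A[o, s]`, `B[s_next, s, a]`) and are an assumption of this sketch, not a fixed standard:

```python
import numpy as np

def belief_update(belief, a, o, A, B):
    """One step of exact Bayesian filtering in a discrete POMDP.

    belief : (n_states,) current posterior over hidden states
    A      : (n_obs, n_states) observation likelihood, columns sum to 1
    B      : (n_states, n_states, n_actions) transitions, B[s_next, s, a]
    """
    prior = B[:, :, a] @ belief          # predict: push belief through dynamics
    post = A[o, :] * prior               # weight by likelihood of the observation
    return post / post.sum()             # renormalise

# Two hidden states, two observations, one "stay" action with slight slip.
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
B = np.array([[[0.9], [0.1]],
              [[0.1], [0.9]]])           # B[:, :, 0] keeps the state with p=0.9
b = np.array([0.5, 0.5])
for _ in range(3):
    b = belief_update(b, a=0, o=0, A=A, B=B)   # repeatedly observe o=0
print(b)                                  # belief concentrates on state 0
```

Planning then amounts to rolling such beliefs forward under each candidate policy and scoring the resulting predicted states and outcomes with EFE.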
A particularly relevant reference implementation is the pymdp ecosystem: it is explicitly a NumPy-based library for simulating active inference agents in discrete state spaces, with a peer-reviewed software paper and tutorials that build an agent “from scratch” in a grid world and demonstrate epistemic (curiosity-like) behavior.
In the continuous-time/cortical predictive-coding lineage, there are also process-theory accounts that map variational objectives to neuronal message passing and action selection, which can inform how you translate “predictive coding foundations” into concrete update equations.
Modern expository work comparing active inference to RL highlights a key engineering translation: active inference planning can be understood as solving a particular kind of entropy-regularized POMDP in which prior preferences replace explicit rewards and epistemic value replaces ad hoc exploration bonuses, which is useful when you want principled exploration without reward design.
For an R&D testbed, your instinct to use a simple environment is aligned with the literature: early active inference demos typically use small POMDP/gridworld tasks precisely because they make generative-model specification and epistemic planning transparent.
If you still want a standard API wrapper to swap environments easily, the currently maintained successor to OpenAI’s Gym is Gymnasium, maintained by the Farama Foundation, with documentation and an associated paper describing it as the actively maintained fork/successor to Gym.
For “reward-free” evaluation, the research landscape suggests separating at least three measurable outcomes: (i) predictive model quality (e.g., log likelihood / prediction error under the learned generative model), (ii) epistemic behavior (information gain / uncertainty reduction over hidden states or model parameters), and (iii) controllability/option-preservation proxies (empowerment estimates or reachable-state diversity). These map directly onto the epistemic/extrinsic decomposition of EFE and empowerment’s channel-capacity definition.
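Two of those metrics can be sketched directly from the discrete-POMDP quantities; the helper names and toy numbers below are illustrative assumptions:

```python
import numpy as np

def information_gain(prior, posterior):
    """Epistemic-behavior proxy: KL(posterior || prior) over hidden states,
    i.e., how much one observation moved the agent's beliefs."""
    return float(np.sum(posterior * (np.log(posterior + 1e-16)
                                     - np.log(prior + 1e-16))))

def avg_log_likelihood(A, beliefs, observations):
    """Model-quality proxy: mean log probability of each observation under
    the predicted outcome distribution q(o_t) = A @ belief_t."""
    lls = [np.log((A @ b)[o] + 1e-16) for b, o in zip(beliefs, observations)]
    return float(np.mean(lls))

# Illustrative numbers: an informative observation yields positive info gain.
prior = np.array([0.5, 0.5])
posterior = np.array([0.9, 0.1])
print(information_gain(prior, posterior))        # > 0

A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
print(avg_log_likelihood(A, [prior, posterior], [0, 0]))
```

The third outcome (controllability) can reuse an empowerment estimate or a simple count of distinct states reachable within a fixed horizon, tracked over training as the agent's control improves.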