Continuity

From Bandits to SPARCbandit: Toward Interpretable Learning

In May, development efforts shifted from low-level control to high-level policy learning, with the design and formalisation of SPARCbandit: a lightweight, explainable reinforcement learning agent tailored for planetary exploration. What began in late 2024 as a series of experiments with contextual multi-armed bandits evolved into a formal proposal and implementation framework suitable for conference submission.

SPARCbandit - short for Stochastic Planning & AutoRegression Contextual Bandit - was designed to address the limitations of previous contextual bandit implementations. Where earlier versions were reactive and short-sighted, SPARCbandit introduced several critical features, illustrated in the sketch that follows the list:

  • Eligibility Traces (TD(λ)) to propagate credit over time, enabling the agent to associate delayed rewards with prior decisions.
  • Adaptive Cost Tracking, using a baseline and autoregressive deviation filter to account for energy consumption, sensor drift, and terrain variance.
  • Latching Mechanism, allowing user-specified action preferences (e.g., preferred gaits) to remain in contention despite short-term suboptimality.
  • Planning via Dyna-style Updates, where real experiences are augmented with simulated transitions from a learned one-step model.
  • Action Selection with Cost-Aware Filtering, combining ε-greedy exploration with cost-based penalties for robustness under non-stationary conditions.
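
To make these mechanisms concrete, here is a minimal sketch of how they could fit together. All names below (`SparcBanditSketch`, `select_action`, `update`, and every hyperparameter default) are hypothetical illustrations rather than the actual v16 code: the sketch combines an autoregressive (AR(1)-style) cost baseline, a latching bonus that keeps user-preferred actions in contention, cost-aware ε-greedy selection, and a TD(λ) update with accumulating eligibility traces over linear features.

```python
import numpy as np

class SparcBanditSketch:
    """Hypothetical illustration of the mechanisms above; not the v16 code."""

    def __init__(self, n_features, n_actions, alpha=0.05, gamma=0.95,
                 lam=0.8, eps=0.1, cost_beta=0.9, latch_bonus=0.05,
                 preferred=(), seed=0):
        self.w = np.zeros((n_actions, n_features))  # linear Q weights per action
        self.z = np.zeros_like(self.w)              # eligibility traces for TD(lambda)
        self.cost_baseline = np.zeros(n_actions)    # AR(1)-filtered running cost
        self.alpha, self.gamma, self.lam, self.eps = alpha, gamma, lam, eps
        self.cost_beta = cost_beta        # autoregressive smoothing factor
        self.latch_bonus = latch_bonus    # keeps preferred actions in contention
        self.preferred = set(preferred)   # user-specified (latched) actions
        self.rng = np.random.default_rng(seed)

    def select_action(self, phi):
        """Cost-aware epsilon-greedy selection with latching."""
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.w)))
        score = self.w @ phi - self.cost_baseline  # penalise habitually costly actions
        for a in self.preferred:
            score[a] += self.latch_bonus           # latching: small persistent bonus
        return int(np.argmax(score))

    def update(self, phi, action, reward, cost, phi_next, done):
        # Adaptive cost tracking: AR(1)-style deviation filter on observed cost.
        self.cost_baseline[action] = (self.cost_beta * self.cost_baseline[action]
                                      + (1.0 - self.cost_beta) * cost)
        # Naive Q(lambda) with accumulating traces over linear features
        # (traces are not cut on exploratory actions, to keep the sketch short).
        q_sa = self.w[action] @ phi
        q_next = 0.0 if done else float(np.max(self.w @ phi_next))
        delta = reward + self.gamma * q_next - q_sa
        self.z *= self.gamma * self.lam
        self.z[action] += phi
        self.w += self.alpha * delta * self.z
        if done:
            self.z[:] = 0.0

# Hypothetical usage: 9 features, 4 gaits, with gait 2 latched as preferred.
agent = SparcBanditSketch(n_features=9, n_actions=4, preferred=(2,))
```

A Dyna-style planning step would, under the same assumptions, simply replay simulated (features, action, reward, cost, next-features) tuples drawn from a learned one-step model through `update()`, blending real and imagined experience.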

These features were incorporated into SPARCbandit v16, the most stable and refined version of the architecture to date. It supported both sequential learning (TD updates) and bandit-style decision-making, with Q-values parameterised as linear functions over Fourier basis features for interpretability.

At the same time, I began preparing the associated conference paper for iSpaRo 2025, outlining the theoretical motivation, implementation details, and performance benchmarks of SPARCbandit. The emphasis was not on raw performance - deep RL models would outperform it in that domain - but on transparency, resource-awareness, and deployability on constrained hardware such as the Raspberry Pi 4 aboard Continuity.

By the end of May, the core implementation had been tested on benchmark environments such as LunarLander and adapted to my own simulation framework. The algorithm showed stability, robustness to noise, and meaningful cost-sensitive behaviour - all while retaining traceable decision logic, which is essential for real-world deployment in mission-critical systems.
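
As a companion to the note on feature parameterisation above, the following sketch shows the standard way an order-n Fourier cosine basis is built for linear value-function approximation; `fourier_features` and its `order` parameter are illustrative assumptions, not the v16 feature code.

```python
import itertools
import numpy as np

def fourier_features(state, order=3):
    """Order-n Fourier cosine basis over a state normalised to [0, 1]^d.

    Each feature is cos(pi * c . s) for a coefficient vector c in
    {0, ..., order}^d, giving (order + 1)**d features for a d-dimensional
    state. Illustrative sketch only; not the actual v16 feature code.
    """
    s = np.asarray(state, dtype=float)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=int(s.size))))
    return np.cos(np.pi * coeffs @ s)

# Q(s, a) is then linear in these features, Q(s, a) = w[a] . phi(s), so each
# decision is traceable to a weighted sum of named cosine terms.
phi = fourier_features([0.2, 0.7], order=2)  # 9 features for a 2-D state
```

Because the feature count grows as (order + 1)**d in the state dimension d, this construction is practical mainly for low-dimensional contexts, which fits the resource-constrained, traceable-by-design goals described above.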