A Constraint-Aware Bandit

Posted in Roadmap by Giorgio on 28 February 2025

In February, the control architecture for Continuity took a substantial step forward with the formal integration of the bandit-based decision layer into the QuectoFSM state machine framework. This was the first implementation of what I termed Constrained Local Learning (CLL): a structure where bandit-based exploration is actively gated by contextual rules, and actions are only permitted when predefined safety and state conditions are satisfied.

The multi-armed bandit (MAB) component, now in version 3, moved beyond basic reward selection and hard-coded action cycling. Instead of optimising only low-level parameters (e.g., step height or cycle duration), the new system could also select from a list of gait patterns, each encoded as an arm with an associated performance history.

To regulate instability, the bandit’s reward function was overhauled. The original inverse formulation:

[Equation image: original inverse reward formulation]

was replaced by a log-scaled penalty system, better suited for sharply discouraging large attitude deviations while maintaining sensitivity around small corrections:

[Equation images: log-scaled penalty reward formulation]

This improved convergence and reduced erratic gait switches during transitions. QuectoFSM, acting as the high-level supervisor, dictated when the bandit was allowed to operate based on rules about sensor state, reward stability, and goal proximity. When active, the bandit used an ε-greedy strategy with updated exploration dynamics. Rule-based monitoring also included:

  • Patience thresholds for reward degradation,
  • Reset triggers for simulation stagnation or divergence,
  • Episode cutoffs based on step count or orientation instability.
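To make the gating idea concrete, here is a minimal sketch of an ε-greedy bandit over gait arms that only acts when a supervisor check passes. It is not the QuectoFSM code: the class names, the gait names, the thresholds (`max_tilt_deg`, `max_steps`, `patience`) and the multiplicative ε decay are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class GaitArm:
    """One bandit arm: a gait pattern plus its running performance history."""
    name: str
    pulls: int = 0
    mean_reward: float = 0.0

    def update(self, reward: float) -> None:
        # Incremental mean, so the history costs O(1) memory per arm.
        self.pulls += 1
        self.mean_reward += (reward - self.mean_reward) / self.pulls


@dataclass
class Context:
    """Hypothetical snapshot of what the supervisor inspects before each step."""
    sensors_ok: bool
    tilt_deg: float          # combined roll/pitch magnitude (assumed metric)
    steps_in_episode: int


class GatedBandit:
    """Epsilon-greedy selection that is only allowed when all rules pass."""

    def __init__(self, arm_names, epsilon=0.2, epsilon_decay=0.99,
                 epsilon_min=0.02, max_tilt_deg=25.0, max_steps=500,
                 patience=5, seed=None):
        self.arms = [GaitArm(name) for name in arm_names]
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.max_tilt_deg = max_tilt_deg
        self.max_steps = max_steps
        self.patience = patience
        self.rng = random.Random(seed)
        self.best_reward = float("-inf")
        self.bad_streak = 0

    def allowed(self, ctx: Context) -> bool:
        # Supervisor gate: sensor state, orientation stability, episode cutoff.
        return (ctx.sensors_ok
                and ctx.tilt_deg < self.max_tilt_deg
                and ctx.steps_in_episode < self.max_steps)

    def select(self, ctx: Context):
        """Return an arm index, or None when the gate is closed."""
        if not self.allowed(ctx):
            return None
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))       # explore
        return max(range(len(self.arms)),
                   key=lambda i: self.arms[i].mean_reward)  # exploit

    def update(self, arm_index: int, reward: float) -> bool:
        """Record a reward; return True when the patience rule asks for a reset."""
        self.arms[arm_index].update(reward)
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        if reward > self.best_reward:
            self.best_reward = reward
            self.bad_streak = 0
        else:
            self.bad_streak += 1
        if self.bad_streak >= self.patience:
            self.bad_streak = 0
            return True
        return False


if __name__ == "__main__":
    bandit = GatedBandit(["trot", "crawl", "wave"], seed=0)  # hypothetical gaits
    ctx = Context(sensors_ok=True, tilt_deg=4.0, steps_in_episode=10)
    arm = bandit.select(ctx)
    if arm is not None:
        needs_reset = bandit.update(arm, reward=0.8)
        print(bandit.arms[arm].name, needs_reset)
```

Returning `None` from `select` keeps the bandit passive whenever a rule fails, so the supervisor decides what happens next instead of the learner.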

At the same time, I began developing the mathematical model required for full Jacobian-driven reorientation. The system collected centre-of-mass (CoM) shift data from all possible leg positions (via offline simulation) and trained a degree-2 polynomial regression model using scikit-learn. This enabled the controller to predict the CoM for any given leg configuration in real time. The regression model was tested on both my laptop and the Raspberry Pi 4 on Continuity. Even on constrained hardware, inference remained comfortably below the 10 ms control budget, making it viable for onboard execution during future hardware trials.
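As a rough sketch of that regression step, the snippet below fits a degree-2 polynomial model with a scikit-learn pipeline and times single-sample inference. The data is synthetic and stands in for the offline simulation sweep; the 12-value input layout (assumed here as 4 legs with 3 joint angles each), the sample count, and the `predict_com` helper are illustrative assumptions, not the actual training setup.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic stand-in for the simulated sweep: each row is one leg configuration
# (assumed 12 joint angles), each target is a CoM offset (x, y, z).
n_samples, n_joints = 2000, 12
X = rng.uniform(-1.0, 1.0, size=(n_samples, n_joints))
W1 = rng.normal(scale=0.01, size=(n_joints, 3))
W2 = rng.normal(scale=0.005, size=(n_joints, 3))
y = X @ W1 + (X ** 2) @ W2  # purely quadratic response, for illustration only

# Degree-2 polynomial expansion followed by ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)


def predict_com(joint_angles):
    """Predict the CoM offset (x, y, z) for a single leg configuration."""
    sample = np.asarray(joint_angles, dtype=float).reshape(1, -1)
    return model.predict(sample)[0]


# Time repeated single-sample predictions to compare against a control budget.
q = X[0]
n_calls = 200
start = time.perf_counter()
for _ in range(n_calls):
    predict_com(q)
per_call_ms = (time.perf_counter() - start) / n_calls * 1e3
print(f"mean inference time: {per_call_ms:.3f} ms")
```

Running the same timing loop on the target board gives a direct comparison against the 10 ms budget, since single-sample latency rather than batch throughput is what matters inside a control loop.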