
State Space Models & Kalman Filter

Along comes Mamba, an evolution in sequence models based on state space models.

The Kalman filter is discussed in Chapter 7 of my book. As my book was published in the last quarter of 2023 and Mamba arrived in 2024, the chapter doesn’t discuss Mamba models. 👇

The goal of a state space model is to infer information about the state variables of a dynamic system, given the observations. The algorithm underlying SSMs is the Kalman filter.

[Figure: the Kalman filter]

State space models have their origins in control systems engineering. Underpinning SSMs are two equations: one describes the internal dynamics of a system, which aren’t directly observable, and the other describes how those internal dynamics relate to observable outputs. This formulation is extremely adaptable to a wide variety of multivariate time-series data.
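The two equations can be sketched as a minimal linear-Gaussian system. The matrices and noise scales below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D system: hidden state x_t = [position, velocity].
A = np.array([[1.0, 0.1],       # state equation: internal dynamics
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])      # observation equation: we see only position

x = np.zeros(2)
observations = []
for _ in range(50):
    x = A @ x + rng.normal(0, 0.05, size=2)   # x_t = A x_{t-1} + process noise
    y = C @ x + rng.normal(0, 0.2, size=1)    # y_t = C x_t + observation noise
    observations.append(y.item())
```

The velocity component is never observed directly; it only leaves a trace in how the observed position drifts, which is exactly the situation filtering is designed for.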

SSMs

State space models (SSMs) are a class of algorithms used to make predictions about dynamic systems by modeling how their internal states evolve over time through differential equations. SSMs are traditionally used in control theory. Real-world data, however, arrives at discrete time steps, so the continuous-time model must be discretized into a recurrence; discretization is one of the most important (if not the most important) steps in applying an SSM.
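Discretization can be sketched with the bilinear (Tustin) transform, one common way to turn a continuous SSM x' = Ax + Bu into a discrete recurrence (the toy matrices here are illustrative):

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of a continuous SSM x' = Ax + Bu."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (dt / 2) * A)
    A_bar = inv @ (I + (dt / 2) * A)   # discrete transition matrix
    B_bar = inv @ (dt * B)             # discrete input matrix
    return A_bar, B_bar

A = np.array([[-1.0, 0.0],             # toy stable continuous dynamics
              [0.0, -2.0]])
B = np.array([[1.0], [1.0]])
A_bar, B_bar = discretize_bilinear(A, B, dt=0.1)
```

After this step the model is a plain recurrence x_{t+1} = Ā x_t + B̄ u_t, which is why discretized SSMs behave like recurrent networks.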

While deterministic dynamics in discrete time can be handled by discretized ODEs or automata, the stochastic dynamics of systems are tackled by SSMs. The Kalman filter is the algorithm within the SSM framework for studying the state variables of a system as they evolve over time.
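One predict/update cycle of the Kalman filter can be sketched as follows; the usage at the bottom is a toy scalar example (constant true state, repeated readings), not an application from the text:

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One predict/update cycle of the linear Kalman filter.

    x, P : prior state mean and covariance
    y    : new observation
    A, C : transition and observation matrices
    Q, R : process and observation noise covariances
    """
    # Predict: propagate the estimate through the state equation.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new observation.
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

# Toy usage: estimate a constant scalar state from repeated readings of 1.0.
A = C = np.array([[1.0]])
Q, R = np.array([[1e-4]]), np.array([[0.1]])
x, P = np.zeros(1), np.eye(1)
for _ in range(50):
    x, P = kalman_step(x, P, np.array([1.0]), A, C, Q, R)
```

The gain K balances trust between the model prediction and the observation: large observation noise R shrinks K, so the filter leans on its own dynamics instead.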


Bayesian SSMs are typically used in macroeconometrics.

The state space representation of a time-series problem is a sequential analysis framework that typically includes tasks like filtering and smoothing.

Mamba

Mamba is a neural network architecture, derived from SSMs, used for language modeling and other sequence modeling tasks. The Mamba architecture’s fast inference speed and computational efficiency, particularly for long sequences, make it the first competitive alternative to the transformer architecture for autoregressive LLMs. The paper introducing the architecture argues that a fundamental problem of sequence modeling is compressing context into a smaller state. This requires content-aware reasoning: the model must memorize the relevant tokens, filter out the irrelevant ones, and know when to produce the correct output in the appropriate context.

[Figure: the Mamba architecture block]

The operation (× in the figure) following the selective SSM is element-wise multiplication, rather than a standard dot product. Here selection is a means of compression.

Mamba models are perhaps the first deep learning architecture to rival the efficacy of transformers on the task for which transformers are best known, which is language modeling. Instead of token-to-token attention as in transformers, Mamba uses selective SSMs that learn how to compress very long context into a fixed-size state. This yields linear-time scaling in sequence length, as opposed to the quadratic time of traditional transformers.
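The linear-time scan at the heart of this can be illustrated with a toy scalar recurrence. In Mamba the gates are learned functions of the input; here they are hand-set to show the "selective" behavior (this is a sketch of the idea, not the actual kernel):

```python
def selective_scan(a, b, u):
    """Toy scalar recurrence h_t = a_t * h_{t-1} + b_t * u_t.

    The gates a_t, b_t depend on the input ("selection"): a_t near 1
    keeps the state (memorize context), a_t near 0 resets it (forget).
    One pass over the sequence -> linear time, unlike pairwise attention.
    """
    h = 0.0
    out = []
    for a_t, b_t, u_t in zip(a, b, u):
        h = a_t * h + b_t * u_t
        out.append(h)
    return out

u = [1.0, 2.0, 3.0, 4.0]
a = [0.9, 0.9, 0.0, 0.9]   # a_t = 0 at step 3 discards all earlier context
b = [1.0, 1.0, 1.0, 1.0]
states = selective_scan(a, b, u)
```

Because each step touches only the running state, cost grows as O(n) in sequence length, whereas attention compares every token with every other token, giving O(n²).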