In a recent series I’ve been discussing the *lighthouse problem*: trying to identify the coordinates of a lighthouse at sea based on only sporadic recordings of its light hitting the shoreline. First, I walked through the powerful, but computationally expensive, purely Bayesian approach. Then I introduced the Kalman filter as an efficient alternative, and we saw how, even though it should be theoretically incapable of tackling that problem, it turned out to be easy to make it work very well with just a few small tweaks.

Even so, I’ve only demonstrated half of the Kalman filter’s power: the **update** step, which processes new data. The Kalman filter also includes, as part of its machinery, a **prediction** step, which takes the probability distribution as of timestep *k* and extrapolates a new distribution for timestep *k+1*. Because a lighthouse by its nature doesn’t move, that prediction step changed absolutely nothing, so I got away with neglecting it until now.

Today we’re going to replace the lighthouse with a moving *ship*. We’ll need to add prediction into our Kalman filter to keep up, and we’ll see how this, too, follows from the Bayesian probability rules.

By the way… I’m not actually obsessed with tracking down maritime objects! My motivation for this series is to familiarize you with how the Kalman filter approximates Bayesian reasoning, and the lighthouse problem is a testbed that is challenging-enough to keep it fun. But the problem is also more practical than it might seem. For example, it’s isomorphic to the challenge of locating the position of a radioactive tracer inside a body by using a detector array on the surface.1

So let’s reformulate the lighthouse problem to make a “ship” problem:

A ship is somewhere off a piece of straight coastline at position

α(t)along the shore and a distanceβout at sea. It is moving parallel to the shore at a constant, but unknown, velocityvm/s.It emits short highly collimated flashes at 1-second intervals but at random azimuths. These pulses are intercepted on the coast by photo-detectors that only record the time that a flash has occurred, but not the angle from which it came. Flashes so far have been recorded at positions {x₁…xₖ}. Where is the ship currently?

We could certainly make this problem even more complicated — the ship could move along a random vector; it could also accelerate; its acceleration could vary with time; we could have partial control of its acceleration; it could steer; bounce off obstacles; rise into the air; teleport unpredictably, and so on. But this already should be enough to introduce the idea of prediction, and then we can consider further mixups next time.

Here are the equations of motion for this ship:

*α(t) = α₀ + vt*— Moving at speed*v*along the shorelined

*v/*d*t = 0*— Speed doesn’t changed

*β/*d*t = 0*— Distance from shore doesn’t change

#### Bayes

Let’s quickly go through the Bayesian approach, so that we can compare how well the Kalman filter measures up to it. The computational difficulty is heating up as the new unknown parameter, *v*, introduces yet an additional dimension to consider. But as usual, we just need to start with a prior, then figure out what the parameters [*α*(*tₖ*)*, β,* *v*] say about the likelihood of observing data *xₖ*, and then rely on Bayes’ rule to do the hard work of flipping that around to get what the data implies about *α*(*tₖ*)*, β,* *v*.

Let’s start with Bayes’ theorem as always. Here is how it guides us to update on the latest measurement *x ₖ₊₁* to learn about the latest state

**y**= [

*α*(

*tₖ₊₁*)

*, β,*

*v*]:

And again, for those who don’t love to read math, here’s the English phrasing:

Or in even fewer words: * posterior = prior × likelihood*.

So to get our posterior describing the ship’s location, we need a prior and a likelihood. Let’s do it.

##### The once and future prior

Let’s keep the non-informative priors we used earlier: uniform for *α₀*, and log(*β*). Now we need a prior probability for *v*. I don’t think that the information in the problem gives us one unique choice, but remember, the most important thing about an uninformative prior is just to be sure we’re not prematurely ruling out possible true values. A quick search suggests the world maritime speed record was set in 1978 by Ken Warby’s *Spirit of Australia* at about 276 knots, or 511 km/hr, powered by a J34 fighter jet engine. At risk of understatement, our ship in question is *unlikely* to be going faster than that.

So for argument’s sake let’s make our prior for *v* Gaussian with a mean of 0 and a standard deviation of, say, ~250 km/hr (70 m/s). Combined with our priors for *α₀* and *β*, we have:

This is a good initial prior to start out with for updating on our very first data point *x₁*. **But now here’s the important part:** we need a prior for *every* update that we do, on *every* new data point, not just *t₀*. What about for *t₁*? For *t₂*? For *tₖ₊₁*?

Back when we were tracking a lighthouse, the only thing that was changing with time was our *knowledge* of the parameters, not their actual values. If a minute passed by with no new data, we would have no reason to change our beliefs about the lighthouse’s coordinates. So we were able to just recycle our posterior after updating on *x₁* to serve as our prior when updating on *x₂*. This gave us a very nice recurrence relationship.

With a moving ship, our posterior for *α*(*t₁*) no longer can double as a prior for *α*(*t₂*), because we *know* the ship’s position will have moved over that time interval. Even if our sensors lost power for a while and we had to extrapolate the ship’s position with *no* new data, we could do better than just guessing it stays still.

Fortunately, the system’s equations of motion show us how to make a much better guess: *α(t₂) = α*(*t₁*)* + v*Δ*t*. Plugging this into our prior allows us to look forward and adjust our probability distribution in accordance with *the way the system moves*:

To put this equation into words: whatever probability distribution we’d found as our posterior for *α*(*t₁*), we just shuffle it over sideways by *v*Δ*t* to get the prior for *α*(*t₂*). (We don’t know *v* for certain, so we’ll make up for that by brute force computation: iterating this process over all plausible velocities.2)

##### The likelihood

The direction of the light flash isn’t directly influenced by the ship’s speed, so supposing the ship’s next position [*α(tₖ₊₁*), *β*] is already enough to get the corresponding probability distribution of the next data point *xₖ₊₁*. Conditional on the next-timestep state, the likelihood is still just our old Cauchy equation.

##### And finally… lots of turning the crank

We’ve seen how Bayes’ rule splits up into two terms, representing the repetitive application of two procedures:

Generate a new prior probability by extrapolating forward from our old posterior, and

Generate a new posterior by updating on the new data

*xₖ₊₁*

Now getting the full Bayesian solution is just a matter of running that computation many, many, *many* times over a humongous grid of plausible positions and velocities.

I won’t easily be able to plot the entire 3D joint distribution, so I’ll collapse it down to just the distribution for the ship’s coordinates [*α(tₖ*)*, β*], and display *v* by only its marginal distribution.

I’ve added a colour gradient to the measurement histogram so that the recency of each measurement is visible. I don’t know about you, but I find it impressive that the algorithm can extract such a confident, accurate picture of the ship’s motion from such scattered data!

#### Kalman’s turn

It took about two hours for my Thinkpad to compute the above Bayesian solution. The Kalman filter reduces that to milliseconds.

The Kalman filter advances in two steps: *prediction*, then *update*. Last time, we saw how the Kalman filter’s update process comes directly from approximating Bayesian probabilities with a Gaussian. By now it should be no surprise that the prediction step does, too.

In order to get the prior probability for Bayes’ equation, we had to use the system dynamics to shift our old posterior distribution into a new prior for the next timestep. Since we weren’t *sure* what those dynamics are exactly, we just iterated the process over every possibility and turned the crank. The Kalman filter’s prediction step does this job while replacing brute force with finesse.

With the Kalman filter we’re representing our beliefs by a Gaussian, so we just need to keep track of a *mean* and an *uncertainty*. Working out what happens with the mean is pretty straightforward: our best guess for *α*(*t+*Δ*t*) would be to just take our guess for *α*(*t*) and shift it by *v*Δ*t*, using our best guess for *v*.

So what I *want* to write is: *μₐ,ₖ₊₁* = *μₐ,ₖ* + *μᵥ,ₖ*Δ*t*. And that would be correct, except that I was already using the symbol *μₐ,ₖ₊₁* for “the mean of *α* after the *update* on the *k+1th* data point”. We need a different symbol to represent the intermediate state that comes after the prediction, but before the update on new data. I’ll use ͞*μₐ,ₖ₊₁* to refer to “the mean value of *α* that is *predicted* for *tₖ₊₁*”. So, the state prediction for *α* is ͞*μₐ,ₖ₊₁* = *μₐ,ₖ* + *μᵥ,ₖ*Δ*t*.

We also need to apply this process to *the covariance *— the uncertainty. If you think about it, any uncertainty we have about *v* should cause a gradually-increasing uncertainty about the position *α*. Even if we knew *α*(*t₁*) with pinpoint accuracy, by time *t₂* our estimate would smear to a cloud if we didn’t know the speed of travel. As with all Gaussians, that cloud will be an ellipsoid shape. But it will be tilted in the *α-v *plane, because the errors should correlate: if the real *v* is higher than our best guess *μᵥ*, then the real *α(tₖ₊₁*) will be higher than our best guess *μₐ,ₖ₊₁* and vice-versa.

##### Enter the matrix

The Kalman filter has us write our system dynamics in terms of a *state transition matrix*, * F*, which expresses how the present state relates to the future state. This is part of what’s called a

*state-space*model.3 4

(If you’re wondering why we have log(

*β*) as a state variable instead of*β*,

When we transform our probability distribution with * F*, the new (predicted) mean evolves directly from the previous mean, via that state transition matrix:5 ͞

*=*

**μₖ₊₁***. Meanwhile the covariance undergoes the same transition, but it applies*

**Fμₖ***twice*, basically because variance is a squared quantity.6 7 The equation for the covariance of the prediction is ͞

**S**

*=*

**ₖ₊₁**

**F****S**

*.*

**ₖF**ᵀIn other words, the Kalman filter’s prediction is nothing magical or surprising; we just evolve our beliefs about the system state by following the very same dynamics that we believe govern *the system itself*. Because how would you make any better guess than that?

Then, after making the prediction, we update on the measurement just like we did in the previous article. With each timestep, the prediction grows the covariance slightly (as we’re always a bit more uncertain about the future than the present), and then the measurement shrinks it again.

**Here’s the Kalman filter at work**. The prediction step is shown in red, and the update step is in pink.

As it did last time, the Kalman filter tracks *extremely closely* to the ideal of direct Bayesian estimation. It’s tempting to think it’s just curve-fitting the Bayes output, but you have to remember it doesn’t actually have access to that data; both solutions are being computed independently. This might be especially surprising when you remember that we’re *still* voiding the warranty, with noise that is about as non-Gaussian as it gets.8

Source code is available on gitlab if you’d like to follow along.

Anyway, that’s basically the prediction step of the Kalman filter in a nutshell. There are still a few more features up its sleeve to handle more complicated systems with control inputs and process noise, but they don’t change the idea very much.

If the system is strongly nonlinear, then one matrix multiplication alone won’t be enough to pull a new prior from our old posterior, and the Kalman filter won’t give good results. But we can usually fix such limitations without changing the Bayesian spirit at the heart of the technique. Any transformation — linear or otherwise — that can forecast our knowledge from *t₁* to represent our most honest prior for the system state at *t₂* is fair game. Techniques like the Extended Kalman Filter and Unscented Kalman Filter extend the core ideas of the Kalman filter for the case of nonlinear systems.

#### Final thoughts

I know these short articles aren’t enough to offer a detailed mathematical understanding of how to *implement* a Kalman filter well for an arbitrary system, and they aren’t intended as tutorials. Fortunately, there are many other authors who have done much more complete, and much more respectable, jobs of that.9

I hope only to have planted some seeds for the *appreciation* of the Kalman filter’s usefulness, its power, its flexibility, and its deep connection to Bayesian probability, with as little math as I could get away with (though still quite a bit, sorry). I hope that this helps connect the dots for those who have struggled with more in-depth tutorials, or perhaps inspired you to “void the warranty” yourself, and experiment with pushing beyond a technique’s conventional limitations.10

And even if you never have any need to filter a noisy signal or estimate a parameter, even if you never have cause to use a Kalman filter for anything, I think that it’s still a concept worth knowing about. When an idea is so simple and so powerful, it tends to show up everywhere — especially in nature. (*Is the hippocampus a Kalman filter*?)

The Kalman filter, of course, only captures a small part of the depth of Bayesian reasoning. But in contrast to pure Bayesian logic, which tends to spiral off into infinite loops, halting problems, and other calculations that outlast the lifespan of the sun, the Kalman filter captures reality with incredible efficiency. That, in my opinion, has earned it the right to be respected as a basic building block of system theory, prediction, and even perception in general.

This post isn’t intended as professional engineering advice. If you *are* looking for professional engineering advice, please contact me with your requirements.

Applications of this technique that I’ve seen in the literature tend to use detectors on all sides. However by using the Cauchy distribution in the style of the lighthouse problem, a single flat detector plane should be sufficient to localize a radioactive particle in 3D space; depending on the source intensity, this could be accomplished pretty cheaply with a camera and a plastic scintillator sheet.

This requires discretizing the possible velocities into a grid, just as we did with position. As with any discretization process, there is a tradeoff to be made in terms of resolution vs. memory & computation time. There are ways to save some time and memory, such as refining the grid density to be denser in areas of high probability and sparser in areas of low probability.

*State-space form* is, basically, taking a system described by differential equations and, at least if that system is linear, organizing those equations in the following traditional format:

where **x** is called the *state vector* and contains all the data about the internal state of the system; **u** is the *input vector*, containing whatever inputs might be applied to the system; **y** is the *output vector*, containing whatever important properties we care about extracting from the system. (Note that in my articles I have been using ‘*x ’* to denote sensor measurements, not state, following the phrasing of the lighthouse problem).

**A, B, C**, and

**D**are matrices that describe all the connections between the state variables, their rates of change, and the inputs and outputs. If

**A, B, C**, and

**D**do not vary with time, the system is called

*time-invariant*.

This ship system has no inputs, so **B** and **D** are empty. If we’re interested in displaying the ship’s full state [*α*(*t*)*, β,* *v*], then **y** = **x**, so **C** is a 3x3 identity matrix. And last but not least, the system matrix **A** contains our system’s equations of motion:

Representing differential equations in state-space form is popular, compact, and versatile format. It serves as a common “user interface” to a lot of techniques and computer algorithms, including the Kalman filter. So the first step in many controls and modelling problems is to get a state-space representation of the system.

That said, there are *some* systems that are difficult to express this way. Nonlinear systems require relaxing the formatting constraints a little bit. Fractional-order systems, even when linear and time-invariant, sort of break the entire idea behind this approach, and are easier to just express as a list of transfer functions.

The state transition matrix **F** is essentially a discrete-time version of **A**. Rather than represent the relationship between * x(t)* and its rate-of-change, it represents the relationship between

*and its subsequent value*

**x**(tₖ)**x**(

*tₖ₊₁*). For time-invariant systems it can be calculated from

**A**by the

*matrix exponential*:

**F**= expm(

**A**Δ

*t*).

If we had data about any control inputs to our system **u*** ₖ*, we could incorporate that data into our prediction by expanding the equation a little bit:

*=*

**μₖ₊₁***+*

**Fμₖ****G**

*. The effect of the inputs*

**uₖ****u**

*on the state variables are quantified by*

**ₖ****G**, the

*input transition matrix*.

**G**serves the same role in the discrete-time model that

**B**serves in the continuous-time model, just like

**F**relates to

**A**.

To elaborate, if *X* is a random variable (such as our state vector) over which we have some probability distribution, and * F* is a linear transformation applied to

*X*, and E(

*X*) denotes the “expectation of

*X*” (which is another way of saying the

*mean*of

*X*, which we can also write

*μ*), then:

E(

) =**F**XE(**F***X*) =**F**μcov(

) = E( (**F**X) (**F**X -**F**μ)ᵀ) =**F**X -**F**μE( (**F***X - μ*) (*X - μ*)ᵀ)ᵀ =**F**cov(**F***X*)ᵀ**F**

But I think “*the transformation applies twice, because variance is a squared quantity*” is a bit more intuitive and much easier to remember.

By the way, we can also add in an fudge factor for additional uncertainty, called *process noise*, representing both unmeasured external disturbances as well as any further doubts we might have in the model itself. This just adds a little extra to the covariance with each prediction step. For example, we could use this if we thought the ship velocity might be actually varying a bit with time. Then even if we’ve pinpointed the velocity at time *t₁* with our measurements, we will lose some confidence in our forecast of that velocity for time *t₂*, as we ought to.

When a model is built on strong mathematical foundations, it might not make much sense to include additional process noise. For example, adding a process noise term to the covariance of *α* would suggest that (somehow?!) we aren’t confident that the *velocity* is the only thing that could contribute to *α* changing over time. Since the change in *α* over time *is the velocity by definition*, that seems like a mistake. If we wanted to represent a little extra uncertainty in the ship’s position for whatever reason, the measurement noise term might be a more sensible place to put it.

In fact it’s so far from Gaussian that even the *central limit theorem itself* doesn’t save us; while most random variables will eventually converge to Gaussian when you average them all together, averaging together *any* number of Cauchy-distributed variables *still* just leaves you with a Cauchy distribution no matter how much data you collect.

It’s actually surprising how much literature exists about the Kalman filter specifically, *especially* tutorials and derivations, and a lot of it is exceptionally well-presented and approachable. I really hesitated to write a series about it for that reason, because I didn’t want to add noise to an already clear signal. But I plan to refer to the Kalman filter a fair bit in future articles, and it seemed negligent to not at least provide a basic and intuitive, if slightly incomplete, foundation.

This is why I haven’t gone deep into the weeds on control inputs, process noise, how to appropriately derive the measurement covariance matrix, and so on. What I *really* want to cover here is not rote tutorials or dusty proofs, but rather less-trodden ground. Interesting questions that lurk in the night, ready to strike curious-minded engineers, whose answers are frustratingly hard to find on the internet or in literature.

(Rather like attaching a fighter jet engine to a boat in order to claim the world speed record.)

I'm wondering if in some situations, a Kalman filter simplifies to equations that are easy to implement? What do degenerate forms of it look like?