Rolling Shutter vs Global Shutter: The Sensor Tradeoff in Egocentric Tracking

This is a follow-up to our essay on SLAM and VIO in egocentric data. One of the failure modes we described was fast head motion blurring frames. This post goes deeper into the sensor-level decision that makes that problem dramatically worse, and why the obvious answer is often not the right one.

How A Camera Reads The World

A camera is an array of photodetectors that converts light into electrical signals, which are digitized and stored as images. At a simplified level, there is one photodetector for each pixel, though modern sensors are more complicated. The complexity here hides in one question: how do you read those signals out?

To complete a full frame, the value stored in each pixel's capacitor needs to be read. The achievable frame rate is fundamentally bottlenecked by the number of readout operations that need to be performed.

There is a shortcut here. Instead of reading the entire array at once, you read one row at a time, sweeping continuously down the sensor. This is what a rolling shutter does. While this increases achievable frame rates, it introduces a fundamental problem: not all rows of the image are exposed at the same time. Each row's capture time differs from the next by a constant offset, so the top of the frame and the bottom of the frame are literally looking at different moments in time.
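To make the row timing concrete, here is a minimal sketch of the per-row capture times; the 1080-row, 30 ms readout numbers are illustrative, not from any specific sensor:

```python
# Per-row capture times for a rolling shutter frame: a minimal sketch.
# The numbers (1080 rows, 30 ms readout) are illustrative assumptions.

def row_timestamps(frame_start_s, num_rows, readout_time_s):
    """Return the capture time of each row; rows are read top to bottom."""
    row_offset = readout_time_s / (num_rows - 1)  # constant inter-row delay
    return [frame_start_s + r * row_offset for r in range(num_rows)]

ts = row_timestamps(frame_start_s=0.0, num_rows=1080, readout_time_s=0.030)
skew = ts[-1] - ts[0]  # temporal skew between top and bottom rows
print(f"top-to-bottom skew: {skew * 1000:.1f} ms")  # 30.0 ms
```

Every pair of adjacent rows is offset by the same small delay, and the skew between the first and last row equals the full readout time.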

A global shutter, on the other hand, exposes every pixel simultaneously. The entire frame represents a single instant, and there is no temporal skew between rows. The obvious conclusion is that a global shutter is better, and for temporal consistency it is. But if a global shutter were strictly superior, all cameras would use one. The nuances here matter deeply for egocentric SLAM.

Global shutter vs rolling shutter mechanism diagram

Credits: Rolling shutter: Modelling, Optimization and Learning

Why The World Runs On Rolling Shutters

A global shutter camera needs a capacitor physically adjacent to every photodetector. Every pixel needs its own storage element to hold the charge during the simultaneous readout. A rolling shutter needs only one row of capacitors, shared sequentially across all rows.

This is a massive difference in silicon real estate. In a global shutter sensor, the capacitors compete with the photodetectors for physical space on the chip. The fraction of the sensor area that actually captures light - the fill factor - is significantly lower. Less light captured per pixel means a lower signal-to-noise ratio. The quantum efficiency of global shutters is lower than that of rolling shutters.

This has direct consequences. Global shutter sensors convert a smaller fraction of incoming photons into a signal and perform measurably worse in low-light conditions. Kitchens at night, dimly lit cabinets, and shaded workspaces are the environments where this difference is most apparent.

The dynamic range - the range of light intensities a sensor can reliably distinguish from the brightest highlight to the darkest shadow - is also higher in rolling shutter sensors. A kitchen scene with a bright window and a dark cabinet interior pushes dynamic range hard. Rolling shutter handles this better because more of the sensor area is devoted to light capture rather than charge storage.

Global shutter sensors also cost 2 to 5 times more than equivalent-resolution rolling shutter sensors. At the scale of thousands of data collection devices, a $15 sensor becomes a $50 sensor, which fundamentally changes the unit economics of the entire collection operation.

Rolling shutters also achieve higher frame rates at the same resolution because the row-sequential readout is faster than reading the entire array simultaneously. For egocentric capture, where 30 or more frames per second is desirable to handle fast head motion, this advantage compounds.

This plays out in the real world exactly as you would expect. Rolling shutter dominates the consumer market: every iPhone camera, every Android camera, and most webcams and action cameras use rolling shutters. Global shutter is a premium product used in machine vision, industrial inspection, and high-end cinematography, where the cost is justified by the application.

Does Rolling Shutter Break SLAM?

The sensor tradeoffs between rolling shutter and global shutter collide directly with the state tracking problem we described in the previous essay.

Recall the scenario we highlighted: a person is performing a kitchen task and their head turns at up to 120 degrees per second. At that angular velocity, the temporal offset between the top row and the bottom row of a rolling shutter frame creates a measurable geometric distortion: vertical lines skew while horizontal lines are largely unaffected, and the spatial relationships between features in different rows of the same frame become inconsistent. These rows were captured at different moments, and the camera was rotating between those moments. This is the classic rolling shutter wobble that appears in consumer video of fast-moving scenes.

Visual SLAM systems extract point features - corners, edges, texture gradients - and match them across frames to estimate camera motion. The feature detector assumes that all features in a single frame share the same camera pose. With global shutter, this assumption is correct. With rolling shutter, it is violated by design: the features in different rows of the same frame correspond to slightly different camera poses.

The magnitude of this error scales directly with two factors: the angular velocity of the camera and the time difference between the first and last row readout. For a typical phone camera with a rolling shutter readout time of 15 to 30 milliseconds, a head rotation at 120 degrees per second introduces approximately 1.8 to 3.6 degrees of rotational offset between the top and bottom of the frame. At arm's length - roughly half a meter - this translates to 1.5 to 3 centimeters of apparent positional error between features at the top and bottom of the image.
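These numbers are easy to reproduce. A back-of-envelope sketch using the same illustrative values (120 degrees per second, 15 to 30 ms readout, half a meter of depth):

```python
import math

# Back-of-envelope check of the error magnitudes above: rotation accumulated
# during the rolling shutter readout, and the apparent lateral shift it
# induces at arm's length. All inputs are the illustrative values from the text.

omega_deg_s = 120.0          # head angular velocity
readout_ms = [15.0, 30.0]    # first-to-last-row readout time
depth_m = 0.5                # roughly arm's length

for t_ms in readout_ms:
    offset_deg = omega_deg_s * t_ms / 1000.0                # rotation during readout
    error_m = depth_m * math.tan(math.radians(offset_deg))  # lateral shift at depth
    print(f"{t_ms:.0f} ms readout -> {offset_deg:.1f} deg offset, "
          f"{error_m * 100:.1f} cm apparent error")
```

The output matches the figures quoted above: roughly 1.8 to 3.6 degrees of rotational offset, or about 1.5 to 3 centimeters of apparent positional error at half a meter.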

The precision requirement from our last post: the end-effector position cannot be off by more than 1 to 2 centimeters. A single fast head turn on a rolling shutter can introduce more than 2 cm of error in a single frame - not from drift or IMU bias, but from the sensor itself reporting geometrically inconsistent data.

What Can Be Done About It

This section gets into the math of how rolling shutter compensation actually works. If you are here for the engineering tradeoff and not the derivation, skip ahead to “The Tradeoff.”

Let's go back to the fundamentals of what SLAM estimates and how. We are trying to estimate a pose - a 3D translation and a 3D rotation - for every frame that comes in. What if we could estimate a pose for every row of every image? Remember, every row of pixels was exposed at a different time. But estimating a pose from a single row of pixels is not feasible: we need to detect and match features, and we need a minimum of 5 matches to estimate the essential matrix, or a minimum of 3 points for a PnP problem. When under-constrained optimization problems like this come up in the literature, the time-tested solution is to introduce some regularization or some smoothing. Rolling shutter SLAM usually resorts to smoothing.

There are a few ways to do this smoothing:

  1. Discrete motion
  2. Continuous motion

Discrete Motion

Just imagine two camera frames whose poses are known. If we assume a smooth motion between the two frames, how can we estimate the pose of each image row? Let's assume the motion was a pure straight-line translation. If the image had $N$ rows, the pose of each row would be one of $N$ equidistant points sampled along that line. But how do we represent a general motion? To get there, let's look at how a rotation matrix can be represented. If we rotate by an angle $\Theta$ about an axis $\mathbf{n}$, the rotation matrix is given by Rodrigues' formula:

$$R = I + \sin\Theta\,[\mathbf{n}]_{\times} + (1 - \cos\Theta)\,[\mathbf{n}]_{\times}^{2}$$

We will not discuss its derivation here, but it's just a simple application of vector algebra. $[\mathbf{n}]_{\times}$ is the skew-symmetric matrix corresponding to the vector $\mathbf{n}$.

Let's assume that we were rotating with a constant angular velocity $\omega$ (i.e., $\Theta = \omega T$, where $T$ is the total time between the first row and the last row of the image). For the $i$-th row in between we can always write:

$$R_i = I + \sin(\omega s_i)\,[\mathbf{n}]_{\times} + (1 - \cos(\omega s_i))\,[\mathbf{n}]_{\times}^{2}$$

where $s_i$ is the time between the first row and the $i$-th row. Similarly, if we assume that the motion between the two frames had a linear velocity $v$, then the translation at the $i$-th row is:

$$t_i = v\, s_i$$
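Putting the two together, here is a minimal sketch of this constant-velocity row pose model. The function names are ours, and the example values (vertical yaw axis, slight lateral drift, middle row of a 1080-row frame with a 30 ms readout) are illustrative assumptions:

```python
import numpy as np

def skew(n):
    """Skew-symmetric matrix [n]_x such that skew(n) @ u == np.cross(n, u)."""
    return np.array([[0.0, -n[2], n[1]],
                     [n[2], 0.0, -n[0]],
                     [-n[1], n[0], 0.0]])

def row_pose(n, omega, v, s_i):
    """Pose of the i-th row under the constant-velocity model:
    Rodrigues rotation by omega * s_i about axis n, translation v * s_i."""
    K = skew(n / np.linalg.norm(n))
    theta = omega * s_i
    # Rodrigues' formula, evaluated at the row's time offset s_i
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    t = v * s_i
    return R, t

# Example: yaw at 120 deg/s about the vertical axis, 1 cm/s lateral drift,
# middle row (540 of 1080) of a frame with a 30 ms readout.
s_i = 0.030 * 540 / 1079
R, t = row_pose(np.array([0.0, 1.0, 0.0]),
                np.radians(120.0),
                np.array([0.01, 0.0, 0.0]),
                s_i)
```

Every row of the frame gets its own $(R_i, t_i)$ from just two shared velocity parameters plus the known row timing.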

When we are solving the pose between two frames, we can modify the bundle adjustment function to solve for this smooth motion. A general bundle adjustment cost function is defined as follows:

$$R^*, t^*, X^* = \arg\min_{R,\,t,\,X} \sum_{i=1}^{N} \left\| x_i - \pi(K, R, t, X_i) \right\|^2$$

where we minimize the distance between each feature point and the projection of its associated 3D point onto the image plane by optimizing over the rotation ($R$), the translation ($t$) and the 3D points ($X$). $K$ here is the camera intrinsics matrix and $\pi$ is the projection function that maps a 3D point onto the image plane. We can modify the bundle adjustment cost function to solve for $\omega$ and $v$ instead of $R$ and $t$:

$$v^*, \omega^*, X^* = \arg\min_{v,\,\omega,\,X} \sum_{i=1}^{N} \left\| x_i - \pi(X_i, P_{u_i}) \right\|^2$$

where $P_{u_i}$ is the projection matrix corresponding to the $i$-th feature point - the $R$ and $t$ for a feature point lying at row $k$ are constructed from the velocities $\omega$, $v$ and the scaling factor $s_k$. Thus, we can solve for the velocities that describe a smooth motion between the two frames.
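A minimal sketch of what that modified residual looks like, assuming a simple pinhole projection for $\pi$. The variable names and structure are ours, not from any particular SLAM system; a real system would feed these residuals into a nonlinear least-squares solver rather than evaluate them once:

```python
import numpy as np

def skew(n):
    """Skew-symmetric matrix [n]_x."""
    return np.array([[0.0, -n[2], n[1]],
                     [n[2], 0.0, -n[0]],
                     [-n[1], n[0], 0.0]])

def project(K, R, t, X):
    """Pinhole projection of a 3D point X under pose (R, t)."""
    p = K @ (R @ X + t)
    return p[:2] / p[2]

def rs_residuals(K, n, omega, v, points_3d, observations, row_times):
    """Reprojection residuals where each feature's pose is built from the
    shared velocities (omega, v) and that feature's own row time s_k."""
    Kx = skew(n / np.linalg.norm(n))
    res = []
    for X, x_obs, s_k in zip(points_3d, observations, row_times):
        th = omega * s_k
        R = np.eye(3) + np.sin(th) * Kx + (1.0 - np.cos(th)) * (Kx @ Kx)
        res.append(x_obs - project(K, R, v * s_k, X))
    return np.concatenate(res)

# Sanity check: with zero velocities, a perfectly observed point has zero residual.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
X = np.array([0.1, 0.2, 2.0])
obs = project(K, np.eye(3), np.zeros(3), X)
r = rs_residuals(K, np.array([0.0, 1.0, 0.0]), 0.0, np.zeros(3), [X], [obs], [0.012])
```

The key difference from a standard bundle adjustment residual is that no single pose is shared across the frame: each observation's pose is regenerated from $(\omega, v)$ at its own row time.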

Notice that in the above case, we only solve for $R$ and $t$ for one discrete segment of motion - if there are $N$ images in succession, the $N-1$ segments of $\omega$, $v$ need not obey any smoothness property in general. The velocities can jump erratically as we transition from the last row of frame $K$ to the first row of frame $K+1$ - quite the opposite of the smooth motions we see in physical reality.

Continuous Motion

How do we solve this in a cleaner way? What if we could solve for a continuous function that gives pose values for all timestamps across the entire trajectory? This is the continuous modelling paradigm. First, let's explain what spline interpolation is. At its core the idea is simple: imagine a stream of 1D points $x$ for which we know the desired values $y$. How do we fit a cubic function that flows through all these points in a smooth way? The points for which we know the desired values are called control points. For every two successive control points, we solve for a third-order polynomial. But how do we solve for a third-order polynomial with just two points? We make some assumptions on velocity and acceleration, i.e., we impose the following constraints:

  1. The starting and ending value of the function should match the two control points
  2. The velocity (first order derivative) at the end of segment 1 should match the velocity at the start of segment 2
  3. We make a similar assumption on acceleration (the second-order derivatives are equal at the end of segment 1 and at the start of segment 2)

Cubic spline interpolation diagram

Note that (2) and (3) simply encode the physical behaviour we expect in reality - the smooth motion at the end of segment 1 should continue into segment 2. Under these constraints, we can solve for a spline function, i.e., a chain of cubic polynomials connecting every pair of successive control points. When the same philosophy is applied to a rolling shutter SLAM system, the control points are timestamps along the trajectory, and we minimize a bundle adjustment cost on the spline-parameterized pose values at those control points. The continuity constraints yield a system of linear equations that can be solved for the spline coefficients of every segment of the camera trajectory.
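To make the continuity constraints concrete, here is a tiny natural cubic spline fit in 1D - equal knot spacing and free (zero second derivative) endpoints are assumed for brevity, where a real trajectory estimator would use more general knots and boundary conditions. The asserts check exactly the properties listed above:

```python
import numpy as np

# Tiny natural cubic spline fit: values match at the control points, and the
# first and second derivatives match across segment boundaries. Equal knot
# spacing h and free endpoints (M_0 = M_n = 0) are simplifying assumptions.

def natural_spline_second_derivs(y, h):
    """Solve the tridiagonal system for second derivatives M at the knots."""
    n = len(y) - 1
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[0, 0] = A[n, n] = 1.0                       # natural boundary: M_0 = M_n = 0
    for i in range(1, n):
        A[i, i - 1] = A[i, i + 1] = h
        A[i, i] = 4 * h
        b[i] = 6 * (y[i + 1] - 2 * y[i] + y[i - 1]) / h
    return np.linalg.solve(A, b)

def eval_segment(y, M, h, i, x):
    """Evaluate the cubic on segment [i, i+1] at local coordinate x in [0, h]."""
    a = (M[i + 1] - M[i]) / (6 * h)
    b = M[i] / 2
    c = (y[i + 1] - y[i]) / h - h * (M[i + 1] + 2 * M[i]) / 6
    return y[i] + c * x + b * x**2 + a * x**3

def eval_segment_deriv(y, M, h, i, x):
    """First derivative of the cubic on segment [i, i+1]."""
    c = (y[i + 1] - y[i]) / h - h * (M[i + 1] + 2 * M[i]) / 6
    return c + M[i] * x + (M[i + 1] - M[i]) / (2 * h) * x**2

y = np.array([0.0, 1.0, 0.5, 2.0])   # control point values, knots 1 apart
M = natural_spline_second_derivs(y, h=1.0)
# Constraint (1): the spline passes through every control point.
assert np.isclose(eval_segment(y, M, 1.0, 0, 1.0), y[1])
# Constraint (2): first derivatives match where segment 0 meets segment 1.
assert np.isclose(eval_segment_deriv(y, M, 1.0, 0, 1.0),
                  eval_segment_deriv(y, M, 1.0, 1, 0.0))
```

In a rolling shutter SLAM system the same machinery runs on 6-DoF poses rather than 1D values, but the structure - a small linear system tying each segment's coefficients to its neighbours - is the same.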

It should be clear from the above that continuous trajectory modelling is more powerful but more complicated, while a discrete motion model is simpler but less expressive. The full representational power of continuous trajectories is beyond the scope of this article, but it is worth pointing out that they give you a way to control the optimizable parameter count for long trajectories - the number and frequency of control points dictate how many parameters you solve for.

For dedicated capture hardware, as opposed to phones, global shutter sensors eliminate the problem entirely. The cost premium may be justified for high-value data collection operations where tracking quality is paramount and motion is highly dynamic. The dynamic nature of egocentric SLAM pushes you towards global shutter, whereas the cost efficiency of rolling shutter pulls you towards mass adoption. We think there is a fundamental limit to rolling shutter that deeper modeling cannot fully compensate for, but the design choice is not always obvious, especially because there are quality tradeoffs beyond cost.

The Tradeoff

The choice between rolling shutter and global shutter for egocentric data collection is not a technical question with a clean answer. It is a product engineering tradeoff that sits at the intersection of physics, economics, and scale.

Rolling shutter gives you better low-light image quality, lower cost, higher frame rates, and wide availability; global shutter gives you geometric consistency. There are some interesting product design and engineering tradeoffs here - e.g., at what per-device cost does the tracking quality improvement from global shutter pay for itself in reduced data rejection rates and higher downstream policy performance?

And it is one of many sensor-level decisions that most people building in this space are avoiding because they have never tried to maintain sub-centimeter tracking during a 120-degree-per-second head turn captured by a low-cost rolling-shutter sensor in a dimly lit space where the user is performing some interesting action.

For more details, reach us at abhishek@fpvlabs.ai

Follow us on X at @fpv_labs