A practical guide to 3D convolutional models for video classification
Exploiting the time dimension as a new augmentation space, and other training tricks
1. Limitations of image-level convolutional neural networks
Convolutional models have (generally) dominated deep learning for computer vision. Their central component, the kernel, enjoys several benefits, such as being parameter efficient and translation invariant with respect to image features. These models are typically trained at the level of individual images, usually tensors of shape (C, H, W) with training batches of shape (B, C, H, W).
For video-level classification, however, image-level training has a couple of limitations.
Given a sequence of probabilities for a particular classification task, determining the range in which to declare a positive signal at the video level is nontrivial. Some post-processing or smoothing will have to be done. For example, if we’re trying to detect the presence of a cat in a video, we may get a time series of image-level model outputs like the following:
Assuming the cat does lie within the range of positive signal we can see visually, this perfect image-level classifier still requires postprocessing to extract the time of classification. This can be done with a recurrent neural network or LSTM trained on the class probabilities, a simple filter, or some other postprocessing method. The problem, however, is that building a postprocessing method that generalizes at the video level ends up being really hard.
When false positives occur (and they always will), the probabilities are generally quite high, since cross entropy optimizes both the direction and magnitude of the logits. Neural networks are overconfident. This makes smoothing out these high-probability false positives difficult, especially if they occur over several adjacent frames.
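To make the problem concrete, here is a minimal sketch of the kind of simple filter I mean; the window size and threshold are arbitrary choices, and a short run of confident false positives can easily survive it.

```python
import numpy as np

def smooth_and_threshold(frame_probs, window=15, threshold=0.5):
    """Moving-average smoothing of per-frame probabilities, then thresholding.

    frame_probs: 1D array of image-level cat probabilities, one per frame.
    Returns a boolean array marking frames considered positive.
    """
    kernel = np.ones(window) / window
    smoothed = np.convolve(frame_probs, kernel, mode="same")
    return smoothed > threshold

# A burst of high-probability false positives over adjacent frames survives
# the filter, which is exactly what makes this hard to tune at the video level.
```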
An architecture that could work with less smoothing for classification across time segments but still take advantage of convolutional kernels would be nice, right?
2. Make your life easier, convolve across time
Note: We are using the term 2D convolutional neural network to refer to image-based models and 3D convolutional neural network to refer to video (or clip)-level models. In practice, the input to a 2D model is usually a color image, which is a three-dimensional volume. The input to a 3D model is really a sequence of those three-dimensional images, which is four-dimensional. The batched data is four- and five-dimensional, respectively.
Turns out we can keep using the same thing we already have: a convolutional neural network. We simply add a dimension to the kernel so it convolves across time as well as space. To understand the details behind this new kernel, check out this paper. Practically speaking, this means we can make predictions across clips of our video of arbitrary length.
This idea is super useful, especially for simplifying postprocessing.
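For a sense of the shapes involved, here is a minimal PyTorch sketch of a single 3D convolution over a clip; the channel, frame, and resolution numbers are arbitrary.

```python
import torch
import torch.nn as nn

# A single 3D convolution: kernel of size 3 along time, 3x3 in space.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=(1, 1, 1))

# One clip: batch of 1, 3 color channels, 16 frames, 112x112 pixels.
clip = torch.randn(1, 3, 16, 112, 112)  # (B, C, T, H, W)
features = conv3d(clip)
print(features.shape)  # torch.Size([1, 64, 16, 112, 112])
```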
Across a video of N frames, if our 3D model takes clips of T frames, we’ll only need to smooth out N/T predictions even if we densely sample every frame! But we don’t need to sample every frame (adjacent frames contain most of the same information). With a fixed T, we can split our video into K chunks (often called clips) and uniformly sample T frames from each chunk.
We could even uniformly sample T frames from across the whole video and make a single prediction. The point is, the temporal dimension is now completely flexible instead of being limited to a postprocessing step.
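As a rough sketch of the chunk-and-sample idea, assuming the video is already decoded into a tensor of frames (the helper below is hypothetical, not from any particular library):

```python
import torch

def sample_clips(frames, num_clips, frames_per_clip):
    """Split a video into num_clips chunks and uniformly sample
    frames_per_clip frames from each chunk.

    frames: tensor of shape (N, C, H, W) containing every decoded frame.
    Returns a tensor of shape (num_clips, C, frames_per_clip, H, W).
    """
    n = frames.shape[0]
    chunk_size = n // num_clips
    clips = []
    for k in range(num_clips):
        start = k * chunk_size
        # Uniformly spaced frame indices within this chunk.
        idx = torch.linspace(start, start + chunk_size - 1, frames_per_clip).long()
        clips.append(frames[idx].permute(1, 0, 2, 3))  # -> (C, T, H, W)
    return torch.stack(clips)

video = torch.randn(300, 3, 256, 256)  # e.g. a 10-second video at 30 fps
clips = sample_clips(video, num_clips=5, frames_per_clip=8)
print(clips.shape)  # torch.Size([5, 3, 8, 256, 256])
```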
Take our cat classification example. To determine whether a cat is in the video at all, we really only need a single uniformly sampled tensor from across the entire video. For more granular predictions, we’ll still need some smoothing, but in my experience having fewer predictions overall means proportionally less effort to generalize at the video level.
3D convolutional neural network architectures often look structurally similar to their 2D counterparts

The popular SlowFast model fuses two pathways, one for temporal and one for spatial feature learning, to better capture the benefits of temporal information without needing dense sampling along the time dimension.
3. Training tricks and implementation information
Batches to 3D models generally have the tensor shape (B, C, T, H, W), although commonly the channel and time dimensions are permuted (for example, in the torchvideo library) so the batches are (B, T, C, H, W) - batch, time, channel, height, width. That is, for the first batch element we have a list of T images, each with C channels.
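A quick shape check makes the convention concrete; a single permute moves between the two layouts.

```python
import torch

batch = torch.randn(4, 3, 16, 256, 256)   # (B, C, T, H, W)
permuted = batch.permute(0, 2, 1, 3, 4)   # (B, T, C, H, W)

print(batch.shape)        # torch.Size([4, 3, 16, 256, 256])
print(permuted.shape)     # torch.Size([4, 16, 3, 256, 256])

# permuted[0] is a list of T=16 images, each with C=3 channels.
print(permuted[0].shape)  # torch.Size([16, 3, 256, 256])
```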
Temporal augmentation such as random subsampling
To simulate different video speeds and jitter, and to keep your model from overfitting to common temporal features, we commonly sample more than T frames and then randomly downsample to the number we need.
When we don’t have many examples for a particular class, this goes a long way toward learning new temporal features.
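Here is one simple way to implement it; the 2x oversampling factor is just a choice for the sketch.

```python
import torch

def random_temporal_subsample(frames, target_t, oversample=2):
    """Sample oversample * target_t frames, then randomly keep target_t of them.

    frames: tensor of shape (N, C, H, W).
    Returns a tensor of shape (target_t, C, H, W) in temporal order.
    """
    n = frames.shape[0]
    # Uniformly spaced candidate indices across the whole span.
    candidates = torch.linspace(0, n - 1, oversample * target_t).long()
    # Randomly keep target_t of the candidates, preserving temporal order.
    keep = torch.randperm(candidates.shape[0])[:target_t].sort().values
    return frames[candidates[keep]]

clip = random_temporal_subsample(torch.randn(120, 3, 256, 256), target_t=16)
print(clip.shape)  # torch.Size([16, 3, 256, 256])
```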
Gradient accumulation to boost effective batch size
Since each batch element contains T images, your batch size will likely stay fairly small (4-8) when training on a standard GPU machine, even with relatively small images of 256x256. The research on the effect of batch size is far from settled, but my experience has been that both train and val loss are much smoother at the epoch level when the batch size is higher. To simulate a larger batch size under constrained GPU memory, we can use gradient accumulation: perform multiple backward passes, then update the model parameters on the sum of those gradients.
You can also increase the learning rate when you do this.
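A sketch of the accumulation loop; the tiny model and random data are stand-ins so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Stand-in 3D model, loss, and optimizer for the sketch.
model = nn.Sequential(nn.Conv3d(3, 8, 3, padding=1), nn.AdaptiveAvgPool3d(1),
                      nn.Flatten(), nn.Linear(8, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

accum_steps = 4  # effective batch size = accum_steps * per-step batch size
optimizer.zero_grad()

for step in range(16):
    clips = torch.randn(4, 3, 16, 64, 64)   # (B, C, T, H, W)
    labels = torch.randint(0, 2, (4,))
    loss = criterion(model(clips), labels)
    # Scale so the accumulated gradient matches one large-batch backward pass.
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```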
Label smoothing
This is a standard image-model trick. Try epsilon=0.1.
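In PyTorch this is built into the loss; label_smoothing is the epsilon above.

```python
import torch.nn as nn

# epsilon = 0.1: each target keeps 0.9 of its probability mass, and the rest
# is spread uniformly over the other classes.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```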
AugMix for image augmentation
There are lots of augmentations that work decently well for image-level models. For 3D models, however, there are many more potential features to learn, so very strong augmentation tends to make training difficult. AugMix uses some mildly fancy sampling to augment your training images without destroying your training distribution. The default settings have worked well for me.
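torchvision ships an AugMix transform. Here is roughly how I’d apply it frame by frame; note that it expects uint8 images, and augmenting each frame independently is a simplification (you may prefer augmentation that is consistent across time).

```python
import torch
from torchvision.transforms import AugMix

augmix = AugMix()  # default settings

# A clip of T uint8 frames, shape (T, C, H, W).
clip = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

# Augment frame by frame; each frame gets an independent AugMix chain.
augmented = torch.stack([augmix(frame) for frame in clip])
print(augmented.shape)  # torch.Size([16, 3, 256, 256])
```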
Rotation augmentation
Horizontal flips, vertical flips, and random rotations really help to boost your training set. They also simulate camera rotation, which is especially helpful if your training videos are shot at different rotations. Make sure to apply the same rotation to every image in the input tensor.
Similar to random rotations, the random perspective transformation simulates multiple camera angles.
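To keep the transform consistent across a clip, draw the random parameters once and apply them with torchvision’s functional API; a minimal sketch (the max rotation angle is an arbitrary choice):

```python
import random
import torch
import torchvision.transforms.functional as TF

def consistent_spatial_augment(clip, max_degrees=15):
    """Apply the same random flip and rotation to every frame of a clip.

    clip: tensor of shape (C, T, H, W). torchvision's functional ops treat
    leading dimensions as a batch, so one call on (T, C, H, W) transforms
    all frames identically.
    """
    frames = clip.permute(1, 0, 2, 3)  # (T, C, H, W)
    if random.random() < 0.5:
        frames = TF.hflip(frames)
    frames = TF.rotate(frames, random.uniform(-max_degrees, max_degrees))
    return frames.permute(1, 0, 2, 3)  # back to (C, T, H, W)

clip = torch.randn(3, 16, 256, 256)
print(consistent_spatial_augment(clip).shape)  # torch.Size([3, 16, 256, 256])
```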
Smaller learning rate
With or without gradient accumulation, the training dynamics of 3D models can be really noisy. Try a fairly low learning rate to start, such as 1e-6 or 5e-7 when using SGD, although I’ve found it’s okay to increase by an order of magnitude with Adam.
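For reference, those starting points translate to something like this; the stand-in model and the momentum value are assumptions for the sketch, not prescriptions.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(3, 8, 3)  # stand-in for your 3D model

# Conservative starting points; tune from here.
sgd_opt = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
adam_opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # roughly an order of magnitude higher
```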
General note: a lot of model training tricks are for squeezing out more validation accuracy or training more efficiently. 3D models tend to be finicky and overfit easily, and a lot of these tricks are focused more on getting your validation loss to behave nicely.
Hope this helped!