ResNet - Bottleneck Architecture
The Computational Challenge of Going Deep
As networks grow deeper, the computational cost of each residual block becomes a critical concern. A basic block with two $ 3 \times 3 $ convolutions at 256 channels costs:
$$ \text{FLOPs}_{\text{basic}} = 2 \times (256 \times 256 \times 3 \times 3) \times H \times W = 1{,}179{,}648 \times H \times W $$
Here "FLOPs" counts multiply-accumulate operations. On a $ 14 \times 14 $ feature map, 16 such blocks (the number of residual blocks in ResNet-50) would cost roughly 3.7 billion FLOPs, about as much as the entire ResNet-50. Scaling to deeper networks (101, 152 layers) with basic blocks at these channel counts would be computationally prohibitive.
The bottleneck block solves this by introducing a “compress, process, expand” pattern that reduces computation by a factor of roughly 17 at the same input/output channel count while maintaining or even increasing representational power.
The Bottleneck Architecture
The bottleneck block uses three convolutions instead of two, with a channel compression in the middle:
$$ y = F_{\text{bottleneck}}(x) + x $$
where $ F_{\text{bottleneck}} $ consists of:
Layer 1: $ 1 \times 1 $ convolution (compress)
$$ h_1 = \text{ReLU}(\text{BN}(W_1 \cdot x)) $$
Reduces channels from $ C $ to $ C/4 $. For example, 256 channels to 64.
Layer 2: $ 3 \times 3 $ convolution (process)
$$ h_2 = \text{ReLU}(\text{BN}(W_2 * h_1)) $$
Performs spatial processing at the reduced channel count. The $ 3 \times 3 $ kernel captures spatial patterns, but at $ C/4 $ channels instead of $ C $.
Layer 3: $ 1 \times 1 $ convolution (expand)
$$ h_3 = \text{BN}(W_3 \cdot h_2) $$
Expands channels back from $ C/4 $ to $ C $.
Note: no ReLU after this layer (the ReLU comes after the addition with the skip connection).
Skip connection and activation:
$$ y = \text{ReLU}(h_3 + x) $$
Why the Bottleneck Saves Computation
The key insight is that the expensive $ 3 \times 3 $ convolution operates on a compressed representation with $ C/4 $ channels instead of $ C $ channels.
Basic block FLOPs (two $ 3 \times 3 $ convolutions at $ C $ channels):
$$ \text{FLOPs}_{\text{basic}} = 2 \times C^2 \times 9 \times H \times W = 18C^2 HW $$
Bottleneck block FLOPs:
• $ 1 \times 1 $ compress ($ C \rightarrow C/4 $): $ C \times C/4 \times 1 \times HW = C^2 HW/4 $
• $ 3 \times 3 $ process ($ C/4 \rightarrow C/4 $): $ (C/4)^2 \times 9 \times HW = 9C^2 HW/16 $
• $ 1 \times 1 $ expand ($ C/4 \rightarrow C $): $ C/4 \times C \times 1 \times HW = C^2 HW/4 $
$$ \text{FLOPs}_{\text{bottleneck}} = C^2 HW \left(\frac{1}{4} + \frac{9}{16} + \frac{1}{4}\right) = C^2 HW \times \frac{17}{16} \approx 1.06 C^2 HW $$
Ratio:
$$ \frac{\text{FLOPs}_{\text{basic}}}{\text{FLOPs}_{\text{bottleneck}}} = \frac{18}{1.06} \approx 17 $$
The bottleneck block is roughly 17 times cheaper than the basic block at the same input/output channel count. These savings are what make ResNet-50, 101, and 152 practical.
Worked Example: Computational Savings
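Consider a bottleneck block with $ C = 256 $ channels on a $ 14 \times 14 $ feature map (an illustrative combination; in the standard ResNet-50, blocks with 256 input/output channels run at $ 56 \times 56 $):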
$ 1 \times 1 $ compress ($ 256 \rightarrow 64 $):
$$ 256 \times 64 \times 1 \times 196 = 3{,}211{,}264 \text{ FLOPs} $$
$ 3 \times 3 $ process ($ 64 \rightarrow 64 $):
$$ 64 \times 64 \times 9 \times 196 = 7{,}225{,}344 \text{ FLOPs} $$
$ 1 \times 1 $ expand ($ 64 \rightarrow 256 $):
$$ 64 \times 256 \times 1 \times 196 = 3{,}211{,}264 \text{ FLOPs} $$
Total bottleneck: $ 13{,}647{,}872 $ FLOPs
Equivalent basic block (two $ 3 \times 3 $ at 256 channels):
$$ 2 \times 256 \times 256 \times 9 \times 196 = 231{,}211{,}008 \text{ FLOPs} $$
Savings: $ 231\text{M} / 13.6\text{M} \approx 17\times $
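The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and it assumes stride-1 convolutions with "same" padding, no bias, and one FLOP per multiply-accumulate, as in the formulas of this section.

```python
from fractions import Fraction


def conv_flops(c_in, c_out, kernel, height, width):
    """Multiply-accumulates of a stride-1, same-padded, bias-free convolution."""
    return c_in * c_out * kernel * kernel * height * width


def bottleneck_flops(channels, height, width, reduction=4):
    """1x1 compress -> 3x3 process -> 1x1 expand."""
    mid = channels // reduction
    return (
        conv_flops(channels, mid, 1, height, width)
        + conv_flops(mid, mid, 3, height, width)
        + conv_flops(mid, channels, 1, height, width)
    )


def basic_flops(channels, height, width):
    """Two 3x3 convolutions at the full channel count."""
    return 2 * conv_flops(channels, channels, 3, height, width)


bottleneck = bottleneck_flops(256, 14, 14)
basic = basic_flops(256, 14, 14)
ratio = Fraction(basic, bottleneck)  # exact ratio, independent of H and W
print(bottleneck, basic, float(ratio))
```

For $ C = 256 $ on a $ 14 \times 14 $ map this gives 13,647,872 and 231,211,008, and the exact ratio is $ 288/17 \approx 16.9 $, which is the "roughly 17" quoted in the text.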
The $ 1 \times 1 $ Convolution
The $ 1 \times 1 $ convolution (pointwise convolution) is the key enabler of the bottleneck design. Understanding it is essential.
What a $ 1 \times 1 $ convolution does:
At each spatial position $ (h, w) $, the $ 1 \times 1 $ convolution applies a linear transformation to the channel vector:
$$ y_{hw} = W \cdot x_{hw} + b $$
where $ x_{hw} \in \mathbb{R}^{C_{in}} $ is the input feature vector at position $ (h, w) $, $ W \in \mathbb{R}^{C_{out} \times C_{in}} $ is the weight matrix, and $ y_{hw} \in \mathbb{R}^{C_{out}} $ is the output.
What it does NOT do:
• It does not look at neighboring spatial positions (no spatial receptive field)
• It does not capture spatial patterns (that is the $ 3 \times 3 $ convolution’s job)
What it is good at:
• Changing the number of channels (cheaply)
• Learning linear combinations of feature channels
• Channel mixing: each output channel is a weighted sum of all input channels
• Dimensionality reduction: compressing a high-dimensional channel vector into a lower-dimensional one
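To make the "linear map at every position" view concrete, here is a small pure-Python sketch of a $ 1 \times 1 $ convolution on nested lists. The function name `pointwise_conv` and the toy numbers are illustrative assumptions, not part of any library.

```python
def pointwise_conv(x, weight, bias):
    """Apply a 1x1 convolution to x, a nested list indexed as x[channel][row][col].

    weight is a C_out x C_in matrix and bias has C_out entries.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    out = []
    for o, row_w in enumerate(weight):
        # Each output value mixes the C_in channels at one position only.
        plane = [
            [
                bias[o] + sum(row_w[c] * x[c][h][w] for c in range(c_in))
                for w in range(width)
            ]
            for h in range(height)
        ]
        out.append(plane)
    return out


# Toy input: 3 channels on a 2x2 grid, compressed to 2 channels.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
weight = [[1.0, 0.0, 0.5], [0.0, -1.0, 1.0]]
bias = [0.0, 0.0]
y = pointwise_conv(x, weight, bias)
print(y)
```

Changing the input at one position changes the output only at that same position, which is exactly the "no spatial receptive field" property listed above.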
Lin et al. (2013) introduced $ 1 \times 1 $ convolutions in the Network-in-Network paper, and GoogLeNet/Inception popularized them for dimension reduction. ResNet’s bottleneck design makes them central to the architecture.
The Compression-Expansion Pattern
The bottleneck’s “compress, process, expand” pattern appears throughout deep learning:
In ResNet bottleneck blocks:
$$ C \xrightarrow{1 \times 1} C/4 \xrightarrow{3 \times 3} C/4 \xrightarrow{1 \times 1} C $$
Compress channels by 4x, do spatial processing cheaply, expand back.
In MobileNet inverted bottlenecks:
$$ C \xrightarrow{1 \times 1} 6C \xrightarrow{3 \times 3\, \text{depthwise}} 6C \xrightarrow{1 \times 1} C $$
The opposite direction: expand first, process, then compress. Called “inverted” because the wide part is in the middle.
In Transformer FFN:
$$ d_{model} \xrightarrow{W_1} 4 \times d_{model} \xrightarrow{\text{ReLU}} 4 \times d_{model} \xrightarrow{W_2} d_{model} $$
Expand by 4x, apply non-linearity, compress back.
The common principle: move to a different dimensionality for processing, then return to the original dimensionality.
The “working dimension” can be larger or smaller depending on the design goals (computational savings vs. representational richness).
Bottleneck vs. Basic: When to Use Which
The ResNet paper uses two different block types depending on model depth:
Basic block (used in ResNet-18 and ResNet-34):
• Two $ 3 \times 3 $ convolutions
• Simpler architecture
• Fewer total layers but more FLOPs per block
• Suitable for smaller models where computational budget allows
Bottleneck block (used in ResNet-50, 101, and 152):
• Three convolutions ($ 1 \times 1 $, $ 3 \times 3 $, $ 1 \times 1 $)
• More layers but fewer FLOPs per block
• Suitable for deeper models where computational efficiency is critical
The crossover point:
ResNet-34 (basic blocks): 21.8M parameters, 3.6 billion FLOPs
ResNet-50 (bottleneck blocks): 25.6M parameters, 3.8 billion FLOPs
Despite having 50 layers vs. 34, ResNet-50 has only slightly more parameters and FLOPs. This is because the bottleneck design is so much more efficient that 50 bottleneck layers cost about the same as 34 basic layers.
ResNet-50 also performs significantly better (76.0% vs. 73.3% top-1 accuracy on ImageNet), showing that the extra depth enabled by bottleneck efficiency translates directly into better performance.
Channel Counts in Practice
The standard ResNet bottleneck uses a 4:1 compression ratio:
Stage 1 ($ 56 \times 56 $):
• Input/output channels: 256
• Bottleneck channels: 64
• Compression: $ 256 \rightarrow 64 \rightarrow 64 \rightarrow 256 $
Stage 2 ($ 28 \times 28 $):
• Input/output channels: 512
• Bottleneck channels: 128
• Compression: $ 512 \rightarrow 128 \rightarrow 128 \rightarrow 512 $
Stage 3 ($ 14 \times 14 $):
• Input/output channels: 1024
• Bottleneck channels: 256
• Compression: $ 1024 \rightarrow 256 \rightarrow 256 \rightarrow 1024 $
Stage 4 ($ 7 \times 7 $):
• Input/output channels: 2048
• Bottleneck channels: 512
• Compression: $ 2048 \rightarrow 512 \rightarrow 512 \rightarrow 2048 $
Note that the input/output channels of bottleneck blocks are 4x the bottleneck channels. This is why ResNet-50 has 2048 channels in its final stage, compared to 512 for ResNet-34 (which uses basic blocks).
The Expansion Factor
The ratio $ C_{out} / C_{bn} $ is called the expansion factor. In the standard ResNet bottleneck, this factor is 4:
$$ C_{out} = 4 \times C_{bn} $$
Why 4?
This is an empirical choice that provides a good trade-off:
• Too small (e.g., expansion = 1): no compression advantage, essentially a basic block with an extra layer
• Too large (e.g., expansion = 16): aggressive compression might lose too much information in the bottleneck
• Expansion = 4: the $ 1 \times 1 $ layers are cheap, the $ 3 \times 3 $ layer is affordable, and the compression does not significantly hurt representational power
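The trade-off can be made quantitative under the same counting convention as before. With $ C_{bn} = C/e $ for expansion factor $ e $, the two $ 1 \times 1 $ layers cost $ 2C^2HW/e $ and the $ 3 \times 3 $ layer costs $ 9C^2HW/e^2 $. The sketch below (the helper name `relative_cost` is an assumption for illustration) tabulates the total in units of $ C^2 HW $ together with the share spent in the $ 3 \times 3 $ layer.

```python
from fractions import Fraction


def relative_cost(expansion):
    """Bottleneck cost in units of C^2 * H * W when C_out = expansion * C_bn."""
    e = Fraction(expansion)
    return 2 / e + 9 / e**2


for expansion in (1, 2, 4, 8, 16):
    cost = relative_cost(expansion)
    share_3x3 = (9 / Fraction(expansion) ** 2) / cost
    print(expansion, float(cost), float(share_3x3))
```

At $ e = 1 $ the block costs $ 11\,C^2HW $ and is dominated by the $ 3 \times 3 $ layer; at $ e = 4 $ it costs $ 17/16\,C^2HW $ with the $ 3 \times 3 $ layer taking $ 9/17 $ of it. Beyond that, the $ 1 \times 1 $ layers dominate and further compression buys less and less.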
Later architectures have experimented with different expansion factors:
• ResNeXt: uses grouped convolutions to increase effective width while keeping FLOPs constant
• EfficientNet: uses variable expansion factors (between 1 and 6) optimized by neural architecture search
• RegNet: systematically studies the design space and reports that its best-performing models use a bottleneck ratio close to 1, i.e. little or no compression
Information Flow Through the Bottleneck
What happens to the information as it passes through the three layers?
$ 1 \times 1 $ compress (information bottleneck):
The 256-dimensional feature vector is projected to 64 dimensions. This is a lossy compression: the network must learn which 64 linear combinations of the 256 features are most informative. The projection learns to preserve the features most relevant to the task.
$ 3 \times 3 $ process (spatial reasoning):
In the compressed space, the $ 3 \times 3 $ convolution looks at neighboring spatial positions. It can detect edges, textures, and patterns in the compressed feature space. Because the channel count is small (64), this operation is cheap.
$ 1 \times 1 $ expand (information recovery):
The 64-dimensional processed features are projected back to 256 dimensions. This expansion allows the network to “spread” the processed information across the full channel space, creating a rich output representation.
The skip connection ensures nothing is lost:
Even if the bottleneck’s compression is too aggressive and discards useful information, the skip connection preserves the full 256-dimensional input. The output $ y = F(x) + x $ has access to both the original information ($ x $) and the bottleneck’s processed contribution ($ F(x) $).
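The three stages and the skip connection can be traced end to end in a small pure-Python sketch. It is deliberately simplified: BatchNorm is omitted, weights are random, the tensor is tiny (8 channels compressed to 2 on a $ 4 \times 4 $ grid), and all helper names are hypothetical. It is meant to show the data flow, not to be an efficient or complete implementation.

```python
import random


def relu(x):
    return [[[max(0.0, v) for v in row] for row in plane] for plane in x]


def conv2d(x, weight, kernel):
    """Stride-1, zero-padded convolution; x[c][h][w], weight[o][c][kh][kw]."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    pad = kernel // 2
    out = []
    for w_o in weight:
        plane = [[0.0] * width for _ in range(height)]
        for h in range(height):
            for w in range(width):
                acc = 0.0
                for c in range(c_in):
                    for kh in range(kernel):
                        for kw in range(kernel):
                            hh, ww = h + kh - pad, w + kw - pad
                            if 0 <= hh < height and 0 <= ww < width:
                                acc += w_o[c][kh][kw] * x[c][hh][ww]
                plane[h][w] = acc
        out.append(plane)
    return out


def random_weight(rng, c_out, c_in, kernel):
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(kernel)]
              for _ in range(kernel)] for _ in range(c_in)] for _ in range(c_out)]


def bottleneck_forward(x, w1, w2, w3):
    """y = ReLU(F(x) + x), with BatchNorm omitted for brevity."""
    h1 = relu(conv2d(x, w1, 1))   # compress: C -> C/4
    h2 = relu(conv2d(h1, w2, 3))  # process:  C/4 -> C/4
    h3 = conv2d(h2, w3, 1)        # expand:   C/4 -> C (no ReLU yet)
    summed = [[[a + b for a, b in zip(r3, rx)] for r3, rx in zip(p3, px)]
              for p3, px in zip(h3, x)]
    return relu(summed)


rng = random.Random(0)
channels, mid, size = 8, 2, 4
x = [[[rng.uniform(0.0, 1.0) for _ in range(size)] for _ in range(size)]
     for _ in range(channels)]
w1 = random_weight(rng, mid, channels, 1)
w2 = random_weight(rng, mid, mid, 3)
w3 = random_weight(rng, channels, mid, 1)
y = bottleneck_forward(x, w1, w2, w3)
print(len(y), len(y[0]), len(y[0][0]))  # channels, height, width of the output
```

The output has the same shape as the input, which is what allows the addition with $ x $. If the expand weights are set to zero, the branch contributes nothing and the block returns its (non-negative) input unchanged: the skip connection carries the full-width signal regardless of what the bottleneck discards.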
Parameter Count Analysis
For a bottleneck block with input/output channels $ C $ and bottleneck channels $ C/4 $:
$ 1 \times 1 $ compress:
$$ P_1 = C \times C/4 = C^2/4 $$
$ 3 \times 3 $ process:
$$ P_2 = C/4 \times C/4 \times 9 = 9C^2/16 $$
$ 1 \times 1 $ expand:
$$ P_3 = C/4 \times C = C^2/4 $$
Total:
$$ P_{\text{bottleneck}} = C^2/4 + 9C^2/16 + C^2/4 = 17C^2/16 \approx 1.06C^2 $$
Basic block (two $ 3 \times 3 $):
$$ P_{\text{basic}} = 2 \times C^2 \times 9 = 18C^2 $$
Ratio:
$$ \frac{P_{\text{basic}}}{P_{\text{bottleneck}}} = \frac{18}{1.06} \approx 17 $$
The bottleneck has 17x fewer parameters than the basic block. This is the same ratio as the computational savings, because both FLOPs and parameters scale with the same product of channel dimensions and kernel size.
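The formulas above count convolution weights only. As a rough check of how much the BatchNorm layers add, the sketch below also counts one learnable scale and one shift per BatchNorm channel. The function names and the `with_bn` flag are assumptions for illustration; convolutions are taken to be bias-free.

```python
def conv_params(c_in, c_out, kernel):
    """Weights of a bias-free convolution."""
    return c_in * c_out * kernel * kernel


def bn_params(channels):
    """Learnable scale and shift of one BatchNorm layer."""
    return 2 * channels


def bottleneck_params(channels, reduction=4, with_bn=False):
    mid = channels // reduction
    total = (conv_params(channels, mid, 1)
             + conv_params(mid, mid, 3)
             + conv_params(mid, channels, 1))
    if with_bn:
        # One BatchNorm after each of the three convolutions.
        total += bn_params(mid) + bn_params(mid) + bn_params(channels)
    return total


def basic_params(channels, with_bn=False):
    total = 2 * conv_params(channels, channels, 3)
    if with_bn:
        total += 2 * bn_params(channels)
    return total


for with_bn in (False, True):
    print(with_bn, bottleneck_params(256, with_bn=with_bn),
          basic_params(256, with_bn=with_bn))
```

For $ C = 256 $ the bottleneck has 69,632 convolution weights (70,400 with BatchNorm) against 1,179,648 (1,180,672 with BatchNorm) for the basic block, so the BatchNorm parameters barely move the ratio.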
Bottleneck Blocks in ResNet Variants
The number of bottleneck blocks varies across ResNet configurations:
ResNet-50: 3 + 4 + 6 + 3 = 16 bottleneck blocks
ResNet-101: 3 + 4 + 23 + 3 = 33 bottleneck blocks
ResNet-152: 3 + 8 + 36 + 3 = 50 bottleneck blocks
The depth primarily increases in Stage 3 (the $ 14 \times 14 $ resolution stage). This is because Stage 3 has a good balance of spatial resolution (enough to capture meaningful spatial patterns) and manageable computation (not too large spatially, not too many channels).
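The names of the variants follow directly from these block counts: three convolutions per bottleneck block, plus the initial convolution and the final fully connected layer. A short sketch (the dictionary and function names are illustrative):

```python
STAGE_BLOCKS = {
    "ResNet-50": (3, 4, 6, 3),
    "ResNet-101": (3, 4, 23, 3),
    "ResNet-152": (3, 8, 36, 3),
}


def weighted_layers(blocks_per_stage, convs_per_block=3):
    """Stem convolution + block convolutions + final fully connected layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


for name, blocks in STAGE_BLOCKS.items():
    stage3_share = blocks[2] / sum(blocks)
    print(name, sum(blocks), weighted_layers(blocks), round(stage3_share, 2))
```

This reproduces 50, 101, and 152 layers from 16, 33, and 50 blocks, and shows the share of blocks sitting in Stage 3 growing from about 0.38 to about 0.7.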
Beyond the Standard Bottleneck
Several important architectures have modified the bottleneck design:
ResNeXt (2017):
Splits the bottleneck into $ G $ parallel low-dimensional paths and sums their outputs:
$$ F(x) = \sum_{i=1}^{G} W_3^{(i)} \cdot \text{ReLU}(\text{BN}(W_2^{(i)} * \text{ReLU}(\text{BN}(W_1^{(i)} \cdot x)))) $$
In practice this is implemented as a single bottleneck whose $ 3 \times 3 $ convolution is a grouped convolution with $ G $ groups. This increases the “cardinality” (number of parallel paths) while keeping FLOPs roughly constant. ResNeXt-50 outperforms ResNet-50 at the same computational budget.
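A parameter count shows how the budget stays nearly constant. A grouped convolution with $ G $ groups connects each output channel to only $ C_{in}/G $ input channels, so its weight count drops by a factor of $ G $, and that saving can be spent on a wider middle layer. The sketch below compares a standard 256-64-256 bottleneck with the 32-group, width-128 configuration (often written 32×4d) used for ResNeXt-50; the function names are assumptions for illustration.

```python
def grouped_conv_params(c_in, c_out, kernel, groups=1):
    """Weights of a bias-free convolution split into `groups` independent groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * kernel * kernel


def block_params(outer, width, groups):
    """1x1 compress, (grouped) 3x3 process, 1x1 expand."""
    return (grouped_conv_params(outer, width, 1)
            + grouped_conv_params(width, width, 3, groups)
            + grouped_conv_params(width, outer, 1))


resnet_block = block_params(outer=256, width=64, groups=1)
resnext_block = block_params(outer=256, width=128, groups=32)
print(resnet_block, resnext_block)
```

The two blocks come out at 69,632 and 70,144 weights respectively, even though the grouped version has twice the middle width.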
SE-ResNet (Squeeze-and-Excitation, 2018):
Adds a channel attention mechanism after the bottleneck:
$$ y = F(x) \cdot \text{SE}(F(x)) + x $$
The SE module learns to re-weight channels based on their importance, adding very few parameters.
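A minimal sketch of the re-weighting step, assuming the usual squeeze (global average pool), excitation (two small fully connected layers with ReLU and sigmoid), and scale sequence. Biases are omitted, and the function name `se_scale` and the toy weights are hypothetical.

```python
import math


def se_scale(features, w_reduce, w_expand):
    """Re-weight channels of features[c][h][w] with a squeeze-and-excitation gate."""
    # Squeeze: global average pool, one scalar per channel.
    squeezed = [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
                for plane in features]
    # Excitation: two small fully connected layers, ReLU then sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every position of channel c by its gate in (0, 1).
    scaled = [[[g * v for v in row] for row in plane]
              for g, plane in zip(gates, features)]
    return scaled, gates


features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
w_reduce = [[0.5, 0.25]]      # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channel gates
scaled, gates = se_scale(features, w_reduce, w_expand)
print(gates)
```

The gate is one scalar per channel, so the product in the formula above is a channel-wise multiplication broadcast over all spatial positions.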
EfficientNet (2019):
Uses neural architecture search to find optimal bottleneck configurations (expansion ratio, kernel size, channel count) for each block in the network.
These extensions all build on the bottleneck’s core idea of compress-process-expand, showing how fundamental and versatile this design pattern is.
Gradient Flow Through the Bottleneck
The gradient flow through a bottleneck block follows the same residual principle as basic blocks, but with an additional layer in the residual branch:
Forward:
$$ y = h_3 + x = (\text{BN}(W_3 \cdot \text{ReLU}(\text{BN}(W_2 * \text{ReLU}(\text{BN}(W_1 \cdot x)))))) + x $$
Backward (through skip connection):
$$ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(\frac{\partial h_3}{\partial x} + I\right) $$
The identity term $ I $ provides the gradient highway, just as in basic blocks. (The final ReLU after the addition is left out of this expression; it only adds an elementwise mask on the incoming gradient.) The gradient through the residual branch now passes through three layers instead of two, but the skip connection ensures that even if this three-layer gradient is small, the overall gradient remains healthy.
The three-layer gradient path:
$$ \frac{\partial h_3}{\partial x} = \frac{\partial h_3}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial x} $$
Each factor involves a convolution Jacobian and a BatchNorm Jacobian, and the last two also include a ReLU mask (there is no ReLU on $ h_3 $). With three multiplicative factors instead of two (as in basic blocks), the gradient through the bottleneck branch may be smaller. However, the skip connection compensates: the identity gradient is independent of the number of layers in the branch.
This is precisely why the bottleneck design works: the skip connection’s gradient is always the identity, regardless of how many layers are stacked in the residual branch. The computational savings of the bottleneck come without any cost to gradient flow.
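A scalar toy model illustrates the point. Below, each "layer" of the branch is just a multiplication by a small slope followed by a ReLU (no ReLU after the last one), and the gradient is estimated by central differences. This is only a sketch under those simplifying assumptions; the function names are hypothetical.

```python
def branch(x, slopes):
    """Scalar stand-in for the residual branch: ReLU after all but the last layer."""
    h = x
    for i, a in enumerate(slopes):
        h = a * h
        if i < len(slopes) - 1:
            h = max(0.0, h)
    return h


def numerical_grad(fn, x, eps=1e-6):
    """Central-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


slopes = (0.1, 0.1, 0.1)  # three weak layers: the branch gradient is 0.1**3
x0 = 1.0
grad_branch = numerical_grad(lambda x: branch(x, slopes), x0)
grad_block = numerical_grad(lambda x: branch(x, slopes) + x, x0)
print(grad_branch, grad_block)
```

The branch alone passes back a gradient of about 0.001, while the block with its skip connection passes back about 1.001: the extra layer shrinks the branch term but leaves the identity term untouched.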
The Role of Batch Normalization in Bottleneck Blocks
Batch normalization appears three times in each bottleneck block, once after each convolution:
$$ h_1 = \text{ReLU}(\text{BN}_1(W_1 \cdot x)) $$
$$ h_2 = \text{ReLU}(\text{BN}_2(W_2 * h_1)) $$
$$ h_3 = \text{BN}_3(W_3 \cdot h_2) $$
Each BatchNorm serves a specific purpose:
• $ \text{BN}_1 $ (after $ 1 \times 1 $ compress): Normalizes the compressed representation before ReLU. Without this, the distribution shift from the channel reduction could cause most values to be negative (zeroed by ReLU), wasting capacity.
• $ \text{BN}_2 $ (after $ 3 \times 3 $ process): Normalizes the spatially processed features. The $ 3 \times 3 $ convolution aggregates neighboring positions, which can create features with larger variance. BatchNorm resets the scale.
• $ \text{BN}_3 $ (after $ 1 \times 1 $ expand): Normalizes the expanded representation before it is added to the skip connection. This is crucial: if $ h_3 $ has a very different scale from $ x $, the addition $ h_3 + x $ would be dominated by whichever term is larger. BatchNorm ensures both terms contribute meaningfully.
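The "resets the scale" behaviour can be seen on a single channel. The sketch below implements training-mode batch normalization over a list of activations (batch statistics, no running averages); the function name and the example values are assumptions for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations over the batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Branch activations with a much larger scale than a typical skip input.
raw_branch = [120.0, -80.0, 40.0, -80.0]
normalized = batch_norm(raw_branch)
print(normalized)
```

After normalization the values have zero mean and (up to `eps`) unit variance, so with the default `gamma` and `beta` the branch output is on a scale comparable to a normalized skip input rather than overwhelming it in the sum $ h_3 + x $.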
Note on the final ReLU:
The ReLU after the addition ($ y = \text{ReLU}(h_3 + x) $) is applied to the combined output. This means no BatchNorm appears between the addition and the final activation, keeping the skip connection as clean as possible (only one non-linearity on the shortcut path).
Why Deeper Networks Need Bottleneck Blocks
The relationship between network depth and block type is not arbitrary:
Shallow networks (18-34 layers) use basic blocks because:
• Few enough blocks that the higher per-block cost is affordable
• Simpler gradient flow through only two layers per block
• Sufficient capacity without channel expansion ($ 4\times $)
Deep networks (50+ layers) use bottleneck blocks because:
• Many blocks require low per-block cost for tractability
• The $ 17\times $ computational savings per block accumulate dramatically over 50+ layers
• The $ 4\times $ channel expansion provides more capacity per parameter
• The three-layer structure is not a disadvantage when skip connections handle gradient flow
The depth-efficiency trade-off:
As a rough upper bound, replacing every bottleneck block in ResNet-152 with a basic block at the same input/output channel count would multiply the cost of the blocks by about 17:
$$ \text{FLOPs} \approx 17 \times 11.3\text{B} \approx 192\text{B FLOPs} $$
This would be impractical. The bottleneck design makes 152-layer networks feasible at only 11.3B FLOPs, roughly three times the cost of a 34-layer basic network (3.6B FLOPs) rather than more than fifty times.