Fourier Neural Operators (FNOs)[1] have been quite successful, to say the least, with a variety of practical applications: high-resolution weather forecasting, Carbon Capture and Storage (CCS) simulations, industrial-scale automotive aerodynamics, and material deformation, to name a few[2].
Here's a quick overview of FNOs $-$ recall that a Neural Operator is a composition of layers, each updating a function $v^l$ via[3]
$$ v^{l+1}(\vx) := \sigma \left (\vW v^l(\vx)+ \int_\fX k(\vx,\vz) v^l (\vz) d \vz \right ) $$
where $\sigma$ is the usual pointwise nonlinearity and $k$ is some kernel function. The motivation here is that, to learn an operator as a mapping between functions, we have to go beyond pointwise transformations like $\vW v(\vx)$. The integral term in the above equation defines a family of transforms called kernel integral transforms, enabling the network to learn a more general form of the input-output functional mapping. Integrating over the input domain $\fX \subset \R^d$ ensures that the network is not tied to a fixed number of input function measurements or to their locations.
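To make this concrete, here's a minimal sketch of one such layer on a uniform 1D grid, where the integral is approximated by an average over the grid points. The small MLP `kernel_net` parameterizing $k(\vx, \vz)$ is a hypothetical stand-in for illustration, not the implementation from the papers.

```python
import torch
import torch.nn as nn

class KernelIntegralLayer(nn.Module):
    """A rough sketch of one kernel integral layer on a uniform 1D grid."""
    def __init__(self, d_v: int, hidden: int = 64):
        super().__init__()
        self.d_v = d_v
        self.W = nn.Linear(d_v, d_v, bias=False)        # pointwise term W v(x)
        self.kernel_net = nn.Sequential(                # k(x, z) as a d_v x d_v matrix
            nn.Linear(2, hidden), nn.GELU(), nn.Linear(hidden, d_v * d_v))

    def forward(self, v: torch.Tensor, grid: torch.Tensor):
        # v: (n, d_v) function values at n grid points; grid: (n, 1) point locations
        n = grid.shape[0]
        pairs = torch.cat([grid.repeat_interleave(n, 0),   # all (x, z) pairs
                           grid.repeat(n, 1)], dim=-1)
        k = self.kernel_net(pairs).view(n, n, self.d_v, self.d_v)
        # Riemann-sum approximation of the integral of k(x, z) v(z) dz over the grid
        integral = torch.einsum("xzij,zj->xi", k, v) / n
        return torch.relu(self.W(v) + integral)
```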
Notice that the integral term is a convolution, provided the kernel $k$ is stationary (translation invariant), i.e., $k(\vx, \vz) = k(\vx - \vz)$. This immediately suggests using the Fourier transform to turn the convolution into a multiplication in frequency space, which is computationally much cheaper than computing the integral directly.
$$ \int_\fX k(\vx, \vz) v(\vz) d\vz = \int_\fX k(\vx-\vz) v(\vz) d\vz = \fF^{-1}(\fF(k) \cdot \fF(v))(\vx) $$
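A quick 1D sanity check of this identity (on a periodic grid, where the convolution is circular):

```python
import torch

n = 64
k, v = torch.randn(n), torch.randn(n)

# Direct circular convolution: (k * v)[i] = sum_j k[(i - j) mod n] v[j]
idx = (torch.arange(n).unsqueeze(1) - torch.arange(n).unsqueeze(0)) % n
direct = (k[idx] * v.unsqueeze(0)).sum(dim=1)

# Same thing via the convolution theorem: F^{-1}(F(k) . F(v))
via_fft = torch.fft.ifft(torch.fft.fft(k) * torch.fft.fft(v)).real

print(torch.allclose(direct, via_fft, atol=1e-4))  # True
```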
Therefore, instead of designing a network to learn the kernel $k: \fX \to \R^{d_v \x d_v}$, we now directly learn its Fourier transform $\fF(k)$, represented as $R_{\phi}$.
$$ v^{l+1} (\vx) := \sigma \left (\vW v^l(\vx)+ \fF^{-1}(R_{\phi} \cdot \fF(v^l))(\vx)\right ) $$
Here, I would like to point out that the above equation has the form of a residual layer (you may have to squint hard). I find it funny that DNN researchers have this obsession with showing that their layer architecture looks like a residual layer. Anyhow.
Here, we note a couple of things $-$ FNOs work on uniformly discretized domains $\fX$; and the kernel $k$ is assumed to be periodic and bandlimited.
- A uniformly discretized domain implies that we can use the FFT and its inverse (IFFT).
- A periodic kernel means that we can parameterize $R_\phi$ in terms of discrete Fourier modes $p$, and a bandlimited kernel means that we can truncate the Fourier series at some maximum mode $p_{\text{max}}$, resulting in a finite-dimensional representation of $R_\phi$ $-$ a complex-valued tensor of size $(p_{\text{max}} \x d_v \x d_v)$[4] (see the 1D sketch after this list).
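To see what that parameterization looks like in code, here's a minimal 1D sketch (my own, not from the library): $R_\phi$ is a complex tensor of shape $(p_{\text{max}}, d_v, d_v)$ that multiplies only the first $p_{\text{max}}$ Fourier modes of $v^l$; every mode above $p_{\text{max}}$ is simply dropped.

```python
import torch
import torch.nn as nn

d_v, p_max, n = 8, 12, 128
R_phi = nn.Parameter(torch.randn(p_max, d_v, d_v, dtype=torch.cfloat))

v = torch.randn(d_v, n)                 # v^l sampled on a uniform 1D grid
v_ft = torch.fft.rfft(v)                # (d_v, n//2 + 1) Fourier coefficients
out_ft = torch.zeros(d_v, n // 2 + 1, dtype=torch.cfloat)
# Multiply the kept modes by R_phi; the rest stay zero (truncation)
out_ft[:, :p_max] = torch.einsum("pij,jp->ip", R_phi, v_ft[:, :p_max])
out = torch.fft.irfft(out_ft, n=n)      # back on the physical grid, (d_v, n)
```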
These assumptions (three in total, including stationarity) are crucial for an efficient practical implementation of FNOs. In the actual implementation, there is a neat little trick[5] for computing this Fourier multiplication. Refer to the typical implementation for 2D inputs shown below $-$ $R$ is nowhere to be seen! And what are those `weights1` and `weights2`?
```python
# Reference: https://github.com/neuraloperator/neuraloperator/blob/13c7f112549bfcafe09a4c5512a90206141b3511/neuralop/layers/spectral_convolution.py#L526
import torch
import torch.nn as nn


class FourierLayer2D(nn.Module):
    def __init__(self,
                 d_in: int,
                 d_out: int,
                 modes1: int,
                 modes2: int):
        """
        Fourier layer for 2D inputs. Computes the following operation:
            v_{l+1} = F^{-1}( R * F(v_l) )
        where F is the Fourier transform and R holds the kernel's Fourier coefficients.

        Args:
            d_in: int: Dimension of the input channels
            d_out: int: Dimension of the output channels
            modes1: int: Max number of Fourier modes in the x-direction
            modes2: int: Max number of Fourier modes in the y-direction
        """
        super().__init__()
        self.d_in = d_in
        self.d_out = d_out
        self.modes1 = modes1
        self.modes2 = modes2
        self.scale = 1 / (d_in * d_out)
        # Set the complex weights that parameterize the
        # kernel's Fourier coefficients R.
        self.weights1 = nn.Parameter(self.scale *
                                     torch.rand(d_in, d_out,
                                                self.modes1, self.modes2,
                                                dtype=torch.cfloat))
        self.weights2 = nn.Parameter(self.scale *
                                     torch.rand(d_in, d_out,
                                                self.modes1, self.modes2,
                                                dtype=torch.cfloat))

    def forward(self, v: torch.Tensor):
        batchsize, M, N = v.shape[0], v.shape[-2], v.shape[-1]
        # Compute the Fourier transform of the input tensor v
        # (the output of the previous layer).
        v_ft = torch.fft.rfft2(v)
        # Truncate the Fourier modes and multiply with the weight matrices.
        out_ft = torch.zeros(batchsize, self.d_out, M, N // 2 + 1,
                             dtype=torch.cfloat, device=v.device)
        out_ft[:, :, :self.modes1, :self.modes2] = \
            self.compl_mul2d(v_ft[:, :, :self.modes1, :self.modes2], self.weights1)
        out_ft[:, :, -self.modes1:, :self.modes2] = \
            self.compl_mul2d(v_ft[:, :, -self.modes1:, :self.modes2], self.weights2)
        # Inverse Fourier transform to get the layer outputs back in
        # the physical space of the input.
        v = torch.fft.irfft2(out_ft, s=(M, N))
        return v

    def compl_mul2d(self, tensor1: torch.Tensor, tensor2: torch.Tensor):
        # (batch, d_in, x, y), (d_in, d_out, x, y) -> (batch, d_out, x, y)
        return torch.einsum("bixy,ioxy->boxy", tensor1, tensor2)
```
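Before unpacking the code, a quick (made-up) usage example $-$ because only a fixed number of modes is kept, the very same weights can be applied to inputs sampled at different resolutions:

```python
layer = FourierLayer2D(d_in=3, d_out=3, modes1=12, modes2=12)
coarse = torch.randn(4, 3, 64, 64)      # batch of 4, 3 channels, 64 x 64 grid
fine = torch.randn(4, 3, 256, 256)      # same layer, finer 256 x 256 grid
print(layer(coarse).shape)              # torch.Size([4, 3, 64, 64])
print(layer(fine).shape)                # torch.Size([4, 3, 256, 256])
```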
Let's try to understand what's happening here. The 2D output tensor $\vv \in \R^{d_v \x M \x N}$ of the previous layer is the input to the current layer and is Fourier transformed using `rfft2`. Since the input is real-valued, its Fourier transform is Hermitian, and therefore it is more efficient to use `rfft2` instead of `fft2`. As a result, $\fF(\vv) \in \C^{d_v \x M \x (N/2 + 1)}$, and correspondingly the result of the spectral multiplication in the layer update above also lives in $\C^{d_v \x M \x (N/2 + 1)}$ (note the reduced size along the last dimension, and hence fewer parameters).
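A quick shape check makes the point concrete (again, just an illustration):

```python
import torch

v = torch.randn(8, 64, 64)           # (d_v, M, N), real-valued
full = torch.fft.fft2(v)             # (8, 64, 64), complex, half of it redundant
half = torch.fft.rfft2(v)            # (8, 64, 33) = (d_v, M, N//2 + 1)
print(full.shape, half.shape)
print(torch.allclose(full[..., :33], half, atol=1e-3))  # rfft2 keeps the non-redundant half
```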
Now, the trick is in understanding that during truncation we have to consider both the positive and negative Fourier modes; the actual range of truncation is $[-p_{\text{max}}, p_{\text{max}}]$. If the input domain is discretized into $M$ and $N$ points along the respective dimensions, we truncate the Fourier modes as `v_ft[:, :, :self.modes1, :self.modes2]` (positive modes) and `v_ft[:, :, -self.modes1:, :self.modes2]` (negative modes). (Since `rfft2` stores only the non-negative frequencies along the last dimension, the positive/negative split is needed only along the first spatial dimension.) This can be verified from the figure above. Therefore, in practice, instead of one $R$ matrix, we decompose it into two matrices $R_1, R_2 \in \C^{p_{\text{max}} \x q_{\text{max}} \x d_v \x d_v}$, corresponding to the positive and negative Fourier modes respectively. This results in a more efficient implementation that extends to higher dimensions as well[6].
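The mode ordering that makes those two slices the right ones can be checked directly with `torch.fft.fftfreq`:

```python
# Along the first (non-rfft) axis, the FFT output is ordered
# [0, 1, ..., M/2, -(M/2 - 1), ..., -1], so the lowest-|frequency| modes
# sit at the two ends of the axis.
import torch

M = 8
print(torch.fft.fftfreq(M, d=1/M))
# tensor([ 0.,  1.,  2.,  3., -4., -3., -2., -1.])
# -> v_ft[:, :, :modes1]  picks frequencies 0 .. modes1-1,
#    v_ft[:, :, -modes1:] picks frequencies -modes1 .. -1.
```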
[1] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., & Anandkumar, A. (2020). Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895.
[2] For a more detailed presentation of the applications of Neural Operators in general, refer to: Azizzadenesheli, K., Kovachki, N., Li, Z., Liu-Schiaffini, M., Kossaifi, J., & Anandkumar, A. (2024). Neural operators for accelerating scientific simulations and design. Nature Reviews Physics, 1-9.
[3] The general form was described in: Li, Z., et al. (2020). Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485.
[4] The bandlimited assumption may not always hold, and yet we truncate to $p_{\text{max}}$ modes. Hence, it is usually recommended to tune the maximum frequency $p_{\text{max}}$ for each application. The original FNO paper recommends $p_{\text{max}, j} = 12$ for each channel/feature dimension $j$.
[5] A clean implementation of the Fourier multiplication can be found here.
[6] This trick is explained in: Kossaifi, J., Kovachki, N., Azizzadenesheli, K., & Anandkumar, A. (2023). Multi-grid tensorized Fourier neural operator for high-resolution PDEs. arXiv preprint arXiv:2310.00120.