### Speaker

### Description

Computing the permanent of a matrix finds an important application in the context of boson sampling. Using the BB/FG permanent formula with a reflected binary Gray code, we implemented an FPGA design aimed at maximizing the use of logic and DSP resources to increase the parallelism and reducing the time complexity from $\mathcal{O}(n.2^{n-1})$ to $\mathcal{O}(n.2^{n-3})$. This can be furthered reduced by half using both cards of the FPGA in a dual array mode of operation. To allow the design to scale up to 40x40 matrices and achieve a speed of 280MHz, we used a properly pipelined state-of-the-art rectangular multiplier algorithm to match the structure of the DSP units of the FPGA.

In practical boson sampling configurations, there will be photons sharing an optical mode, which mathematically refers to computing the permanent of a matrix with independent repeated rows and repeated columns. This allows for a generalization of BB/FG using binomial coefficients to simplify multiplicities across rows or columns:

\begin{equation}

\text{rp}(\boldsymbol{A}, \boldsymbol{M}, \boldsymbol{N})=\frac{1}{2^{n-1}} \sum\limits_{\boldsymbol{\Delta}} \bigg(\prod\limits_{k=1}^{m} (-1)^{\Delta_k} {M_k \choose \Delta_k}\bigg) \prod\limits_{j=1}^{m} \bigg(\sum\limits_{k=1}^{m}

\left(M_k-2\Delta_k\right)a_{k, j}\bigg)^{N_j}

\end{equation}

where $A$ is a square matrix describing the interferometer with $m$ modes, $\boldsymbol{M}$ and $\boldsymbol{N}$ are the row and column multiplicities respectively such that the photon count $n=\sum\limits_{r\in R}=\sum\limits_{c\in C}$ and $\Delta$ is an n-ary Gray code, required for efficient computation. Computing binomial coefficients efficiently presents design challenges on the FPGA. By constructing a large enough loop, this can be resolved but it requires special logic around the n-ary Gray code. Certain techniques like Guan codes are insufficient as they lack the reflection property being computable with simple logic so we based our method on a dynamic programming technique. We extended this approach to stagger the Gray code at precise even intervals based on the loop length, incurring a constant initialization summation cost, and then a streamlined operation. To avoid division, we implemented division via multiplication by "magic numbers". To account for the BB/FG "anchor" row and allow the $4$ parallel operations to proceed with a simultaneous "smooth" same row update, we reduce the smallest multiplicity up to three times to maintain optimal complexity.

All our implementations were designed to automatically reset on completion, providing for batching capability, an important optimization for FPGA designs and fitting perfectly into the context of boson sampling, where many permanents are computed and the batch size equals the matrix size. As we used fixed point arithmetic, we conducted accuracy testing against a CPU infinite precision calculator. We benchmarked against similar maximally efficient implementations on CPU. These are a part of the piquassoboost extension to the piquasso library. We measured two important metrics against the CPU: the matrix size cross-over threshold due to FPGA initialization time delay and performance speed-up. For non-repeated permanents, we achieve an 8.86x speed-up over CPU with a cross-over threshold at 16x16 matrices, which batching reduces it to 9x9 matrices. For repeated row/column permanents, it is dependent upon the number of photons. For 20 photons, the speed up is 5.9x and cross-over 25x25, while with batching the speed-up is 15x and the cross-over is 15x15. Our effective equivalent if long double floating point operations were used is $\frac{(C_A+C_M)*280*10^6}{10^9}$ where $C_A=2A=2*(40+4)$ and $C_M=4M+2A=6*4*39$ represent complex addition and multiplication respectively, yielding $285.5$ GFLOPS.