<p>correlation.zone, Coral Westoby, RA @ CTSRD/CHERI</p>
<h1>Graphcore: headerless packets and the perfect NOC (2020-01-25)</h1>
<h1 id="introduction">Introduction</h1>
<p>Network on chip (NoC) designs have been relatively isolated from the software that runs using them, treated as an implementation detail while the instruction set and memory model are exposed to the programmer. Graphcore has totally exposed the NoC to the program, and in doing so routes packets without headers.</p>
<h1 id="graphcore">Graphcore?</h1>
<p>Graphcore is a Bristol, UK-based chip-design startup developing an ML accelerator that matches current NVIDIA parts on direct matrix-multiply performance while (probably) maintaining that same performance on less structured workloads. This is in contrast to the standard GPGPU model, which suffers significant performance degradation on programs with unpredictable control flow.</p>
<p>Chip multiprocessors need to ensure the interconnect can sustain high instruction throughput for the cores at the minimum area and power cost for a target workload. For a CPU design, this will often take the form of an interconnected series of tradeoffs between cache size and behaviour, interconnect design around coherency, bisection bandwidth, topology, and routing strategy, and the downstream memory controllers. This is made significantly more difficult by the CPU market demanding improved performance, often for the same binaries, so chip designers are constrained not only by technological limitations but also by the optimisation decisions of the past.</p>
<p>Prior general-purpose chip designs have targeted high single-thread performance at the expense of parallelism or bisection bandwidth. This has led to common low-area network-on-chip designs: busses for very low core counts, ring designs for ~10-core chips (IBM CELL, 8-core Xeons), then (modified) mesh layouts with packet routing for larger designs.</p>
<p>GPGPU designs modify this by adding many more processing elements than the memory controllers could support for CPU code, and relying on the programmer to allocate tasks that are able to saturate the vast number of ALUs without requiring communication or irregular memory access to fetch new data or code. The typical example is a matrix multiply, where for the naive implementation the kernel size is constant, caches can be fully utilised, and memory access is totally regular.</p>
<h1 id="the-graphcore-and-celerity-network-designs">The Graphcore and Celerity network designs</h1>
<p>The Graphcore targets a different design point: moderately branching, highly parallel, extremely bandwidth-intensive algorithms. A possible example is PageRank on a dense graph, where the classic implementation sends an update along every edge on each iteration. This allows a very small amount of data to overwhelm nearly any interconnect.</p>
<p>Another chip design targeting the same tightly integrated mesh of weak cores is the Celerity manycore RISC-V accelerator chip. Celerity is a tiered design incorporating 5 Rocket and 496 Vanilla-5 cores plus a specialisation layer. The 496 Vanilla-5 cores form a mesh that provides a useful design point comparison with the Graphcore, but performance comparisons would be misleading as the Graphcore uses an entire ~800mm² TSMC 16nm reticle, whereas the Celerity mesh is taped out into 15.24 mm² of silicon.</p>
<p>I believe both networks use a 2D grid of connections between routers, and a single core per router design. Where Graphcore and Celerity differ is the programming model exposed by this mesh.</p>
<p>Celerity packets are single flits, including a 32bit message header and 32bit payload. There is no wormhole routing or virtual circuits - each packet/flit is individually routed. This enables transfers to instantly hit full bandwidth, but conversely halves the peak bandwidth for a given number of wires vs a virtual circuit design. This is an aggressive but not revolutionary design point, suitable for the very tight delivery timelines within the DARPA Circuit Realization At Faster Timescales (CRAFT) project.</p>
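<p>The single-flit format above can be sketched in a few lines of Python. This is purely illustrative: the text only states that a flit is a 32-bit header plus a 32-bit payload, so the header field layout below (8-bit destination x/y coordinates, remainder reserved) is an assumption, not the real Celerity encoding.</p>

```python
# Toy model of a Celerity-style single-flit packet: 32-bit header, 32-bit
# payload, each flit routed independently (no virtual circuits).
# The header layout (dest_x, dest_y, reserved) is hypothetical.

def pack_flit(dest_x: int, dest_y: int, payload: int) -> int:
    """Pack a 64-bit flit: [header:32 | payload:32]."""
    assert 0 <= dest_x < 2**8 and 0 <= dest_y < 2**8
    assert 0 <= payload < 2**32
    header = (dest_x << 24) | (dest_y << 16)  # low 16 header bits reserved
    return (header << 32) | payload

def unpack_flit(flit: int):
    """Recover (dest_x, dest_y, payload) from a packed flit."""
    header = flit >> 32
    return (header >> 24) & 0xFF, (header >> 16) & 0xFF, flit & 0xFFFFFFFF
```

Because every flit carries its own destination, a router can make a forwarding decision from the flit alone, which is exactly the property the headerless Graphcore design gives up.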
<p>Graphcore also uses a mesh NoC, but emulates an all-to-all crossbar at O(N) wire cost with compiler assistance. A Graphcore packet is also a single flit, but it has no header at all, making it impossible to route without external help. Instead, the cores themselves implement a totally predetermined routing pattern during the communication portions of the program.</p>
<p>The Graphcore programming model follows a Bulk Synchronous Parallel (BSP) model: on N cores, N threads make progress up to a communications point and then block. Once all threads have reached the sync point, the entire processor transitions into a communications period simultaneously.</p>
<p>During the communications period, cores operate in lockstep. Each clock period, a core may send a flit, store a flit from one port into a register slot, reconfigure its router’s forwarding table, or all of the above.</p>
<p>The on-chip interconnect is time-deterministic and uncontended. Using this property, the Graphcore compiler produces code for the communication phase that emits and receives messages over the interconnect according to a core-local schedule. There is no mechanism for a core to be notified when it receives a flit: the program uses a global clock within the communication period to count cycles, and directly copies the data on the input bus into local cache on the cycle a valid message is scheduled to be received.</p>
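<p>A minimal sketch of this cycle-counted receive discipline, assuming a fixed known source-to-destination latency. This is a toy model, not Graphcore’s implementation: the <code class="language-plaintext highlighter-rouge">LATENCY</code> constant, the dict-based bus, and the schedule shape are all assumptions for illustration.</p>

```python
# Schedule-driven communication on a time-deterministic link: there is no
# "message arrived" signal, the receiver simply reads its input bus on the
# cycle the compiler knows the flit will be there.

LATENCY = 4  # assumed fixed source->destination latency, in cycles

def run(send_schedule, total_cycles):
    """send_schedule: {send_cycle: payload}. Returns {recv_cycle: payload}."""
    bus = {}       # cycle -> value present on the receiver's input bus
    received = {}
    for cycle, payload in send_schedule.items():
        bus[cycle + LATENCY] = payload          # deterministic arrival time
    # The receiver's compiler-generated schedule: copy the bus into local
    # memory exactly at send_cycle + LATENCY, and ignore it otherwise.
    for cycle in range(total_cycles):
        if cycle in bus:
            received[cycle] = bus[cycle]
    return received
```

Because both sides count cycles from the same global clock, the schedule alone is enough to pair up every send with its receive.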
<p>The transmission latency depends on the source and destination tile ids. A recent benchmark paper demonstrated very regular patterns in the latency, which probably enables effective compiler heuristics for arranging communicating threads.</p>
<p>In order to correctly route packets without a header, each intermediary core reconfigures its router to correctly forward incoming data. The compiler has produced a complete schedule for this phase and so can emit this instruction sequence. This is how the Graphcore interconnect can appear to be a full crossbar, which would cost O(N^2) wires, whilst only using O(N) wire resources: compiler scheduling avoids packet conflicts by spreading traffic across the many possible routes to each core. Since each core can only send unicast messages, the performance difference is not observable.</p>
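<p>The hop-by-hop steering can be sketched on a 1-D chain of routers. Here the “compiler” precomputes, for every router and cycle, how to forward, so the flit itself needs no header. The one-hop-per-cycle timing and the table format are assumptions for illustration, not the real instruction encoding.</p>

```python
# Headerless routing on a 1-D chain: each router consults a precomputed
# (router, cycle) -> forwarding entry instead of a packet header.

def compile_route(src, dst, start_cycle):
    """Emit {(router, cycle): ('forward', next_router)} steering src -> dst."""
    table = {}
    step = 1 if dst > src else -1
    cycle = start_cycle
    for r in range(src, dst, step):
        table[(r, cycle)] = ("forward", r + step)  # this router, this cycle
        cycle += 1
    return table

def simulate(table, src, start_cycle):
    """Follow the schedule; return (final_router, arrival_cycle)."""
    pos, cycle = src, start_cycle
    while (pos, cycle) in table:       # router applies its scheduled config
        _, pos = table[(pos, cycle)]
        cycle += 1
    return pos, cycle
```

Conflict avoidance then becomes a compile-time constraint: no two flits may claim the same (router, cycle) entry, which the scheduler can satisfy by shifting routes in space or time.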
<p>Graphcore emphasises that the BSP model need not sacrifice performance, as it may be impossible to power both the cores and the interconnect simultaneously. This sounds implausible, as the cores occupy significantly more area and operate on the same clock. A more accurate statement may be that during the communication phase the ALUs are powered down and the routers powered up, with the cores still performing instruction decode and register operations for the communications sequence.</p>
<p>The Graphcore design is a significant innovation in hardware/compiler codesign and will offer a unique performance tradeoff, with programmers able to use huge-bandwidth, ultra-low-latency interconnects for massively parallel programs: an offering that was previously assumed to be impossible!</p>
<h3 id="references">References</h3>
<p>https://fuse.wikichip.org/news/3217/a-look-at-celeritys-second-gen-496-core-risc-v-mesh-noc/</p>
<p>Synchronisation in a Multi-Tile processing arrangement, GB2569269, S. Knowles, A. Alexander, 2017</p>
<h1>Benchmarking Tensorflow’s autograph for arbitrary code (2019-07-28)</h1>
<p>With the speed of light unfortunately fixed, and the corollary that heterogeneous architectures with task-specific data movement offer the path to higher FLOPs, moving existing CPU software to post-single-thread platforms is the price to pay for continuing post-2000s Moore’s Law<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
<p>TensorFlow ships a tool for automatically creating TF graphs from fairly general Python functions<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. This gives us an easy method for testing the overhead of “porting” existing Python code to any TF target by converting it directly into a graph.</p>
<p>To test the compile and runtime performance, I grabbed a pure Python implementation of a cryptographic algorithm from [https://github.com/ajalt/python-sha1]. This was chosen, despite SHA1 being easily hardware-accelerated, because it should produce control-flow graphs of reasonable depth.</p>
<p>The compilation and runtime performance was benchmarked with the following loop, with SHA1 iteration depths from 1 to 1000.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">iters</span> <span class="ow">in</span> <span class="n">iter_counts</span><span class="p">:</span>
    <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
    <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">Graph</span><span class="p">().</span><span class="n">as_default</span><span class="p">():</span>
        <span class="n">hfinal</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="n">h0</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
            <span class="n">hfinal</span> <span class="o">=</span> <span class="n">tf_sha1</span><span class="p">(</span><span class="n">hfinal</span><span class="p">)</span>
        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
            <span class="n">result</span> <span class="o">=</span> <span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">hfinal</span><span class="p">)</span>
    <span class="n">dt</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span><span class="o">-</span><span class="n">t0</span>
    <span class="k">print</span><span class="p">(</span><span class="n">iters</span><span class="p">,</span> <span class="s">"iterations took"</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="s">"seconds"</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/tf_plot.png" alt="plot" /></p>
<p>The SHA1 code was approximately 70 lines long. If you were to throw arbitrary Python into Autograph like this, you could reasonably expect around 21 seconds of conversion time per kloc of Python.</p>
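<p>A back-of-envelope check of that rate, assuming conversion time scales linearly with source length. The per-call figure below is an assumption chosen to be consistent with the quoted 21 s/kloc, not a number reported in the post.</p>

```python
# Scale a per-function conversion time (~70-line SHA1 source) up to
# seconds per thousand lines of Python.

lines = 70
seconds_per_call = 1.47          # assumed measured Autograph conversion time
per_kloc = seconds_per_call / (lines / 1000)
print(round(per_kloc), "seconds per kloc")
```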
<p>Just do it properly.</p>
<p>References:</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>[https://herbsutter.com/welcome-to-the-jungle/] <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>[https://medium.com/tensorflow/autograph-converts-python-into-tensorflow-graphs-b2a871f87ec7] <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>10G Network Router in Haskell (2019-01-14)</h1>
<p>As part of Noa Z’s course, we modified the NetFPGA SUME reference design to support cut-through low-latency switching at 10G line rate.</p>
<p>As part of this project a new output port lookup (OPL) module was written in CλaSH, a Haskell-derived DSL that compiles to Verilog. The pipelined design was exceptionally succinct compared to the reference Verilog design.</p>
<p>This document is a literate Haskell walk-through of the OPL module, for use as a building block for other pipelined systems. The code cannot be used as-is without the closed NetFPGA SUME codebase, so imports and external references are deliberately omitted, with the salient details given in the text.</p>
<h2 id="cut-through-switching">Cut-through switching</h2>
<p>A cut-through ethernet switch has a latency that is not dependent on the length of the packet. Cut-through designs commonly have lower average latency and much lower worst-case latency than store-and-forward designs. The advent of FEC at 40G and above speeds obsoletes cut-through designs, as the FEC code, located at the end of the packet, is required to decode the destination MAC for layer-2 routing.</p>
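<p>A rough worked example of why cut-through wins on latency: store-and-forward must buffer the whole frame before forwarding, paying one full serialisation delay per hop, whereas cut-through only needs the leading octets that carry the destination MAC. The exact byte counts below are standard ethernet figures, but the comparison is a simplification that ignores switching and wire delay.</p>

```python
# Per-hop forwarding delay at 10 Gb/s: whole-frame buffering vs waiting
# only for the destination MAC at the head of the frame.

LINE_RATE = 10e9                          # bits per second
frame_bits = 1518 * 8                     # maximum standard ethernet frame
dst_mac_bits = 6 * 8                      # dst MAC is the first 6 octets

store_and_forward = frame_bits / LINE_RATE     # ~1.21 microseconds per hop
cut_through = dst_mac_bits / LINE_RATE         # ~4.8 nanoseconds per hop
print(store_and_forward, cut_through)
```

For a maximum-size frame the per-hop gap is over two orders of magnitude, and unlike cut-through, the store-and-forward penalty grows with frame length.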
<h2 id="the-architecture-of-the-opl">The architecture of the OPL</h2>
<p>The Xilinx MACs use the AXI-stream protocol for transmitting data. The MACs use a 156.25MHz internal clock, and this design runs at that rate without gearboxes for simplicity.</p>
<h3 id="axi-stream-and-pipeline-depth">AXI Stream and pipeline depth</h3>
<p>A sample of the data on an AXI-stream bus is given by the following type:</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="kr">type</span> <span class="kt">Tdata</span> <span class="o">=</span> <span class="kt">BitVector</span> <span class="mi">64</span> <span class="c1">-- 156.25MHz * 64b = 10g (with preamble overhead)</span>
<span class="kr">type</span> <span class="kt">Tkeep</span> <span class="o">=</span> <span class="kt">BitVector</span> <span class="mi">8</span>
<span class="c1">-- tuser defined elsewhere</span>
<span class="kr">type</span> <span class="kt">Tvalid</span> <span class="o">=</span> <span class="kt">Bit</span>
<span class="kr">type</span> <span class="kt">Tready</span> <span class="o">=</span> <span class="kt">Bit</span>
<span class="kr">type</span> <span class="kt">Tlast</span> <span class="o">=</span> <span class="kt">Bit</span>
<span class="kr">data</span> <span class="kt">Stream</span> <span class="o">=</span> <span class="kt">Stream</span> <span class="p">{</span> <span class="c1">-- AXI stream single bus width sample</span>
<span class="n">tdata</span> <span class="o">::</span> <span class="kt">Tdata</span><span class="p">,</span> <span class="c1">-- packet data</span>
<span class="n">tkeep</span> <span class="o">::</span> <span class="kt">Tkeep</span><span class="p">,</span> <span class="c1">-- 8-bit-granularity valid lines</span>
<span class="n">tuser</span> <span class="o">::</span> <span class="kt">Tuser</span><span class="p">,</span> <span class="c1">-- metadata</span>
<span class="n">tvalid</span> <span class="o">::</span> <span class="kt">Tvalid</span><span class="p">,</span> <span class="c1">-- global tkeep</span>
<span class="n">tready</span> <span class="o">::</span> <span class="kt">Tready</span><span class="p">,</span> <span class="c1">-- backpressure signal</span>
<span class="n">tlast</span> <span class="o">::</span> <span class="kt">Tlast</span> <span class="c1">-- metadata for end of burst</span>
<span class="p">}</span> <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">)</span>
<span class="c1">-- it is useful to have empty instances to create instances from</span>
<span class="n">emptyStream</span> <span class="o">::</span> <span class="kt">Stream</span> <span class="o">=</span> <span class="kt">Stream</span> <span class="p">{</span> <span class="n">tdata</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tkeep</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tuser</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tvalid</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tready</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">tlast</span><span class="o">=</span><span class="mi">0</span> <span class="p">}</span></code></pre></figure>
<p>Our OPL module has 8 cycles of latency: 3 clocks are required to get the source MAC, 4 more are needed to interleave the 4 ports’ access to the MAC->PORT mapping module, and a final clock is used to simplify the output logic. The source MAC is required in order to act as a <em>learning</em> switch: we may need to update our MAC->PORT mapping with this source, so the lookup cannot begin until we know it.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<h3 id="pipeline-declaration">Pipeline declaration</h3>
<p>The pipeline type contains the AXI stream connections within the module, as well as all the metadata the OPL has calculated so far. The module is a “conveyor belt” design that moves the data in a <code class="language-plaintext highlighter-rouge">PipelineStage</code> forward every cycle, modifying the data as it goes.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="kr">data</span> <span class="kt">PipelineStage</span> <span class="o">=</span> <span class="kt">PipelineStage</span> <span class="p">{</span>
<span class="n">packet</span> <span class="o">::</span> <span class="kt">Stream</span><span class="p">,</span>
<span class="n">pktValid</span> <span class="o">::</span> <span class="kt">Bool</span><span class="p">,</span>
<span class="n">pktLast</span> <span class="o">::</span> <span class="kt">Bool</span><span class="p">,</span>
<span class="n">flitCount</span> <span class="o">::</span> <span class="kt">Unsigned</span> <span class="mi">8</span><span class="p">,</span>
<span class="n">dstMac</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">48</span><span class="p">),</span>
<span class="n">srcMac</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">48</span><span class="p">),</span>
<span class="n">dstPorts</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">8</span><span class="p">),</span>
<span class="n">srcPorts</span> <span class="o">::</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">8</span><span class="p">),</span>
<span class="n">packetLen</span> <span class="o">::</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">16</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">pipelineDefault</span> <span class="o">=</span> <span class="kt">PipelineStage</span> <span class="p">{</span> <span class="n">packet</span> <span class="o">=</span> <span class="n">emptyStream</span><span class="p">,</span>
<span class="n">pktValid</span> <span class="o">=</span> <span class="kt">False</span><span class="p">,</span> <span class="c1">-- first flit?</span>
<span class="n">pktLast</span> <span class="o">=</span> <span class="kt">False</span><span class="p">,</span>
<span class="n">flitCount</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
<span class="n">dstMac</span> <span class="o">=</span> <span class="kt">Nothing</span><span class="p">,</span>
<span class="n">srcMac</span> <span class="o">=</span> <span class="kt">Nothing</span><span class="p">,</span>
<span class="n">dstPorts</span> <span class="o">=</span> <span class="kt">Nothing</span><span class="p">,</span>
<span class="n">srcPorts</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
<span class="n">packetLen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">}</span>
<span class="kr">type</span> <span class="kt">OPLState</span> <span class="o">=</span> <span class="kt">Vec</span> <span class="mi">7</span> <span class="kt">PipelineStage</span></code></pre></figure>
<p>This writing style keeps all the data and metadata together. The downside is much wider datapaths in CλaSH than would be required simply to buffer the AXI stream for 8 clocks. The CλaSH compiler does not attempt any wiring simplification, but the redundant wiring should be removed at elaboration time.</p>
<p>The Haskell Maybe type maps exactly to a Verilog type with an additional valid wire. If a Maybe type is exposed, it is easy enough to make combinatorial packers and unpackers that maintain idiomatic styles in both languages.</p>
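<p>The Maybe-to-valid-wire correspondence can be modelled in a few lines of Python: a Haskell <code class="language-plaintext highlighter-rouge">Maybe (BitVector n)</code> lowers to an n-bit data bus plus one valid bit. These pack/unpack helpers are illustrative only and are not part of the design.</p>

```python
# Software model of lowering Maybe to (valid, data) hardware signals.

def pack_maybe(m, width):
    """Maybe value (None or int) -> (valid bit, data bus)."""
    if m is None:
        return 0, 0                  # data lines are don't-care; drive 0
    assert 0 <= m < 2**width
    return 1, m

def unpack_maybe(valid, data):
    """(valid bit, data bus) -> Maybe value."""
    return data if valid else None
```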
<h3 id="packet-routing-functions">Packet routing functions</h3>
<p>Due to the destination mapper being a bottleneck in our design, we disabled support for the NetFPGA host ports. We can use the Maybe combinators to set the DMA destination bits to zero only if they were valid.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="c1">-- to avoid blocking the design, we need to never send to DMA</span>
<span class="c1">-- all odd bits are the DMA ports</span>
<span class="n">stripDMA</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">8</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Maybe</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">8</span><span class="p">)</span>
<span class="n">stripDMA</span> <span class="n">dst</span> <span class="o">=</span> <span class="n">maybe</span> <span class="kt">Nothing</span> <span class="p">(</span><span class="nf">\</span><span class="n">p</span> <span class="o">-></span> <span class="kt">Just</span> <span class="p">(</span><span class="n">p</span> <span class="o">.&.</span> <span class="p">(</span><span class="o">$$</span><span class="p">(</span><span class="n">bLit</span> <span class="s">"01010101"</span><span class="p">)</span> <span class="o">::</span> <span class="kt">BitVector</span> <span class="mi">8</span><span class="p">)))</span> <span class="n">dst</span></code></pre></figure>
<p>Next we route the packets based upon the metadata in the current pipeline stage; Haskell pattern matching makes this kind of if-else logic clear. Packets addressed to the broadcast MAC are sent to all ports; otherwise, if the MAC was not found in the lookup table, we send to all ports except the source. If we have a successful lookup, we use that routing.</p>
<p>Without the INLINE pragma, every Haskell function generates a new Verilog module. For small functions like this it is simpler to fold all the logic into one module.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">extractDst</span> <span class="o">::</span> <span class="kt">TCAMreply</span> <span class="o">-></span> <span class="kt">PipelineStage</span> <span class="o">-></span> <span class="kt">Maybe</span> <span class="p">(</span><span class="kt">BitVector</span> <span class="mi">8</span><span class="p">)</span>
<span class="cp">{-# INLINE extractDst #-}</span> <span class="c1">-- we use inline to reduce the number of floating modules in verilog</span>
<span class="n">extractDst</span> <span class="kr">_</span> <span class="kt">PipelineStage</span><span class="p">{</span><span class="n">dstMac</span> <span class="o">=</span> <span class="kt">Just</span> <span class="mh">0xffffffffffff</span><span class="p">}</span> <span class="o">=</span> <span class="kt">Just</span> <span class="p">(</span><span class="o">$$</span><span class="p">(</span><span class="n">bLit</span> <span class="s">"11111111"</span><span class="p">)</span> <span class="o">::</span> <span class="kt">BitVector</span> <span class="mi">8</span><span class="p">)</span> <span class="c1">-- bcast</span>
<span class="n">extractDst</span> <span class="kt">TCAMreply</span><span class="p">{</span> <span class="n">lut_miss</span> <span class="o">=</span> <span class="mi">1</span> <span class="p">}</span> <span class="n">state</span> <span class="o">=</span> <span class="kt">Just</span> <span class="p">(</span><span class="n">complement</span> <span class="p">(</span><span class="n">srcPorts</span> <span class="n">state</span><span class="p">))</span>
<span class="n">extractDst</span> <span class="kt">TCAMreply</span><span class="p">{</span> <span class="n">lut_hit</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">dst_ports</span><span class="o">=</span><span class="n">dst</span> <span class="p">}</span> <span class="kr">_</span> <span class="o">=</span> <span class="kt">Just</span> <span class="n">dst</span>
<span class="n">extractDst</span> <span class="kr">_</span> <span class="kr">_</span> <span class="o">=</span> <span class="kt">Nothing</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">applyTcamRpy</code> applies the result of a lookup to the current pipeline stage if this flit is the first one in a packet. Otherwise, we replicate the destination from the previous flit. This ensures that packets are not fragmented.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">applyTcamRpy</span> <span class="o">::</span> <span class="kt">PipelineStage</span> <span class="o">-></span> <span class="kt">Bit</span> <span class="o">-></span> <span class="kt">TCAMreply</span> <span class="o">-></span> <span class="kt">PipelineStage</span>
<span class="cp">{-# INLINE applyTcamRpy #-}</span> <span class="c1">-- we use inline to reduce the number of floating modules in verilog</span>
<span class="n">applyTcamRpy</span> <span class="n">now</span> <span class="n">rpyenb</span> <span class="n">iTCAM</span> <span class="o">=</span> <span class="n">now</span> <span class="p">{</span> <span class="n">dstPorts</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">headFlit</span> <span class="n">now</span> <span class="o">&&</span> <span class="n">btb</span> <span class="n">rpyenb</span>
<span class="kr">then</span> <span class="n">stripDMA</span> <span class="p">(</span><span class="n">extractDst</span> <span class="n">iTCAM</span> <span class="n">now</span><span class="p">)</span> <span class="c1">-- add the new dst</span>
<span class="kr">else</span> <span class="n">dstPorts</span> <span class="n">now</span> <span class="p">}</span> <span class="c1">-- no rpy, keep looping</span></code></pre></figure>
<h3 id="the-lookup-pipeline">The lookup pipeline</h3>
<p>The final function is also the largest. At 100 lines of Haskell, it declares the action on each pipeline stage. The type is</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">opl_pass_mealy</span> <span class="o">::</span> <span class="kt">OPLState</span> <span class="o">-></span> <span class="p">(</span><span class="kt">Stream</span><span class="p">,</span> <span class="kt">TCAMreply</span><span class="p">,</span> <span class="kt">Bit</span><span class="p">,</span> <span class="kt">Bit</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="kt">OPLState</span><span class="p">,</span> <span class="p">(</span><span class="kt">Stream</span><span class="p">,</span> <span class="kt">TCAMrequest</span><span class="p">))</span>
<span class="n">opl_pass_mealy</span> <span class="n">state</span> <span class="p">(</span><span class="n">iAXI</span><span class="p">,</span> <span class="n">iTCAM</span><span class="p">,</span> <span class="n">reqenb</span><span class="p">,</span> <span class="n">rpyenb</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">s0</span> <span class="o">++</span> <span class="n">s1</span> <span class="o">++</span> <span class="n">s2</span> <span class="o">++</span> <span class="n">s3</span> <span class="o">++</span> <span class="n">s4</span> <span class="o">++</span> <span class="n">s5</span> <span class="o">++</span> <span class="n">s6</span><span class="p">,</span> <span class="p">(</span><span class="n">oAXI</span><span class="p">,</span> <span class="n">oTCAM</span><span class="p">))</span>
<span class="kr">where</span></code></pre></figure>
<p>The body is a large <code class="language-plaintext highlighter-rouge">where</code> block defining the next values of pipeline stages <code class="language-plaintext highlighter-rouge">s0</code>-<code class="language-plaintext highlighter-rouge">s6</code>, and the connection to the later modules.</p>
<p>Our first act is to update the metadata in the pipeline as soon as flits enter. We need to segment packets and extract the dstMAC and part of the srcMAC depending on the current flit number.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">s0</span> <span class="o">=</span>
<span class="kr">let</span>
<span class="n">valid</span> <span class="o">=</span> <span class="n">tkeep</span> <span class="n">iAXI</span> <span class="o">></span> <span class="mi">0</span> <span class="c1">-- there is valid data in this.</span>
<span class="n">flitn</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">not</span> <span class="n">valid</span> <span class="kr">then</span> <span class="mi">0</span> <span class="c1">-- invalid, so not a flit</span>
<span class="kr">else</span> <span class="kr">if</span> <span class="n">btb</span> <span class="p">(</span><span class="n">tlast</span> <span class="p">(</span><span class="n">packet</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)))</span> <span class="kr">then</span> <span class="mi">1</span> <span class="c1">-- last packet was the end, so this is a new packet</span>
<span class="kr">else</span> <span class="p">(</span><span class="n">flitCount</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="c1">-- valid, so incr the count. 1 is the head</span>
<span class="c1">-- is_this_last = False -- we know that from tlast, added by the 10G port</span>
<span class="c1">-- first 6 octets</span>
<span class="n">dstmac</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">flitn</span> <span class="o">==</span> <span class="mi">1</span> <span class="kr">then</span> <span class="n">slice</span> <span class="n">d47</span> <span class="n">d0</span> <span class="p">(</span><span class="n">tdata</span> <span class="n">iAXI</span><span class="p">)</span> <span class="kr">else</span> <span class="p">(</span><span class="n">fromJust</span> <span class="p">(</span><span class="n">dstMac</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)))</span>
<span class="n">srcmacPartial</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">flitn</span> <span class="o">==</span> <span class="mi">1</span> <span class="kr">then</span>
<span class="n">slice</span> <span class="n">d63</span> <span class="n">d48</span> <span class="p">(</span><span class="n">tdata</span> <span class="n">iAXI</span><span class="p">)</span>
<span class="kr">else</span>
<span class="n">slice</span> <span class="n">d15</span> <span class="n">d0</span> <span class="p">(</span><span class="n">fromJust</span> <span class="p">(</span><span class="n">srcMac</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)))</span>
<span class="n">srcPorts</span> <span class="o">=</span> <span class="n">slice</span> <span class="n">d23</span> <span class="n">d16</span> <span class="p">(</span><span class="n">tuser</span> <span class="n">iAXI</span><span class="p">)</span>
<span class="kr">in</span> <span class="n">singleton</span> <span class="n">pipelineDefault</span> <span class="p">{</span> <span class="n">packet</span> <span class="o">=</span> <span class="n">iAXI</span><span class="p">,</span>
<span class="n">pktValid</span> <span class="o">=</span> <span class="n">valid</span><span class="p">,</span>
<span class="n">flitCount</span> <span class="o">=</span> <span class="n">flitn</span><span class="p">,</span>
<span class="n">dstMac</span> <span class="o">=</span> <span class="kt">Just</span> <span class="n">dstmac</span><span class="p">,</span>
<span class="n">srcMac</span> <span class="o">=</span> <span class="kt">Just</span> <span class="p">(</span> <span class="p">(</span><span class="mi">0</span> <span class="o">::</span> <span class="kt">BitVector</span> <span class="mi">32</span><span class="p">)</span> <span class="o">++#</span> <span class="n">srcmacPartial</span> <span class="p">),</span>
<span class="n">srcPorts</span> <span class="o">=</span> <span class="n">srcPorts</span> <span class="p">}</span></code></pre></figure>
<p>By stage 2 we know the full source MAC and can prepare a lookup request. If the packet’s EtherType field encodes a length, we trust it. Otherwise we fall back on a length that a prior module may have filled in; failing that, we assume the maximum length in wide use. This stage also shows the general pattern of performing a calculation only when this is the first flit of a packet, as determined earlier.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="c1">-- stage 2: add src mac, - r0</span>
<span class="c1">-- if headFlit, then packet that just came in is flit 2 - has src and len</span>
<span class="n">s1</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">headFlit</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)</span> <span class="kr">then</span>
<span class="n">singleton</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">srcMac</span> <span class="o">=</span> <span class="kt">Just</span> <span class="p">(</span> <span class="p">(</span><span class="n">slice</span> <span class="n">d31</span> <span class="n">d0</span> <span class="p">(</span><span class="n">tdata</span> <span class="n">iAXI</span><span class="p">))</span> <span class="o">++#</span>
<span class="p">(</span><span class="n">slice</span> <span class="n">d15</span> <span class="n">d0</span> <span class="p">(</span><span class="n">fromJust</span> <span class="p">(</span><span class="n">srcMac</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">))))</span>
<span class="p">),</span>
<span class="n">packetLen</span> <span class="o">=</span> <span class="kr">let</span>
<span class="n">etype</span> <span class="o">=</span> <span class="n">slice</span> <span class="n">d31</span> <span class="n">d16</span> <span class="p">(</span><span class="n">tdata</span> <span class="n">iAXI</span><span class="p">)</span>
<span class="kr">in</span>
<span class="c1">-- etype was protocol indicator, we can't determine the length of frame without store/forward</span>
<span class="kr">if</span> <span class="n">etype</span> <span class="o"><=</span> <span class="mi">1500</span> <span class="kr">then</span> <span class="n">etype</span>
<span class="kr">else</span> <span class="kr">if</span> <span class="n">slice</span> <span class="n">d15</span> <span class="n">d0</span> <span class="p">(</span><span class="n">tuser</span> <span class="n">iAXI</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span> <span class="kr">then</span> <span class="n">slice</span> <span class="n">d15</span> <span class="n">d0</span> <span class="p">(</span><span class="n">tuser</span> <span class="n">iAXI</span><span class="p">)</span>
<span class="kr">else</span> <span class="mi">9000</span> <span class="c1">-- max length</span>
<span class="p">}</span>
<span class="kr">else</span>
<span class="n">singleton</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">srcMac</span> <span class="o">=</span> <span class="n">srcMac</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">1</span><span class="p">),</span> <span class="n">packetLen</span> <span class="o">=</span> <span class="n">packetLen</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">1</span><span class="p">)</span> <span class="p">}</span></code></pre></figure>
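<p>The length decision above reduces to a small pure function. Here is a plain-Python restatement (illustrative only — the field names mirror the Clash record, and this is software, not the hardware):</p>

```python
JUMBO_MAX = 9000  # largest frame length in wide use

def packet_len(etype, tuser_len):
    """Stage-2 length logic, restated in software.

    Per IEEE 802.3, an EtherType value <= 1500 is itself the frame
    length; larger values name a protocol, so the length cannot be
    known without store-and-forward, and we fall back on a value an
    earlier module may have written into TUSER.
    """
    if etype <= 1500:
        return etype       # the field is a length
    if tuser_len > 0:
        return tuser_len   # filled in by a prior module
    return JUMBO_MAX       # assume the worst case
```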
<p>The next four stages are identical. An outside module controls which of the four ports the reply from the mapping module affects, so we repeat this block four times to ensure we catch the reply.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">s2</span> <span class="o">=</span> <span class="n">singleton</span> <span class="p">(</span><span class="n">applyTcamRpy</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">1</span><span class="p">)</span> <span class="n">rpyenb</span> <span class="n">iTCAM</span><span class="p">)</span>
<span class="n">s3</span> <span class="o">=</span> <span class="n">singleton</span> <span class="p">(</span><span class="n">applyTcamRpy</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">2</span><span class="p">)</span> <span class="n">rpyenb</span> <span class="n">iTCAM</span><span class="p">)</span>
<span class="n">s4</span> <span class="o">=</span> <span class="n">singleton</span> <span class="p">(</span><span class="n">applyTcamRpy</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">3</span><span class="p">)</span> <span class="n">rpyenb</span> <span class="n">iTCAM</span><span class="p">)</span>
<span class="n">s5</span> <span class="o">=</span> <span class="n">singleton</span> <span class="p">(</span><span class="n">applyTcamRpy</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">4</span><span class="p">)</span> <span class="n">rpyenb</span> <span class="n">iTCAM</span><span class="p">)</span></code></pre></figure>
<p>Finally we put all our metadata into the TUSER part of the pipeline.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">s6</span> <span class="o">=</span> <span class="kr">let</span> <span class="c1">-- add the output port and other tuser stuff</span>
<span class="n">s5now</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">s6now</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">6</span><span class="p">)</span>
<span class="kr">in</span> <span class="kr">let</span>
<span class="n">newtuser</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">headFlit</span> <span class="n">s5now</span> <span class="kr">then</span>
<span class="p">(</span><span class="n">slice</span> <span class="n">d127</span> <span class="n">d32</span> <span class="p">(</span><span class="n">tuser</span> <span class="p">(</span><span class="n">packet</span> <span class="n">s5now</span><span class="p">)))</span> <span class="o">++#</span>
<span class="p">(</span><span class="n">fromJust</span> <span class="p">(</span><span class="n">dstPorts</span> <span class="n">s5now</span><span class="p">))</span> <span class="o">++#</span>
<span class="p">(</span><span class="n">srcPorts</span> <span class="n">s5now</span><span class="p">)</span> <span class="o">++#</span>
<span class="p">(</span><span class="n">packetLen</span> <span class="n">s5now</span><span class="p">)</span>
<span class="kr">else</span> <span class="kr">if</span> <span class="p">(</span><span class="n">pktValid</span> <span class="n">s5now</span><span class="p">)</span> <span class="kr">then</span> <span class="c1">-- loop it</span>
<span class="n">tuser</span> <span class="p">(</span><span class="n">packet</span> <span class="n">s6now</span><span class="p">)</span> <span class="c1">-- prev packet</span>
<span class="kr">else</span>
<span class="p">(</span><span class="mi">0</span> <span class="o">::</span> <span class="kt">Tuser</span><span class="p">)</span>
<span class="kr">in</span>
<span class="n">singleton</span> <span class="n">s5now</span> <span class="p">{</span> <span class="n">packet</span> <span class="o">=</span> <span class="p">(</span><span class="n">packet</span> <span class="n">s5now</span><span class="p">)</span> <span class="p">{</span> <span class="n">tuser</span> <span class="o">=</span> <span class="n">newtuser</span> <span class="p">}}</span>
<span class="n">oAXI</span> <span class="o">=</span> <span class="p">(</span><span class="n">packet</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">6</span><span class="p">))</span> <span class="c1">-- cycle after assignment to s6 for any given flit</span></code></pre></figure>
<p>The only remaining logic is to use an outside signal to send mapping lookup requests at the right time.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">oTCAM</span> <span class="o">=</span>
<span class="kr">let</span>
<span class="c1">-- if pkt in fst pipeline stage is flitcount elem [1, 2, 3, 4] then we have the head, this is idx</span>
<span class="n">headloc</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">flitCount</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">2</span> <span class="o">&&</span> <span class="n">flitCount</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">)</span> <span class="o"><=</span> <span class="mi">5</span> <span class="kr">then</span>
<span class="kt">Just</span> <span class="p">((</span><span class="n">flitCount</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="mi">0</span><span class="p">))</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="kr">else</span> <span class="kt">Nothing</span>
<span class="kr">in</span>
<span class="kr">if</span> <span class="n">isJust</span> <span class="n">headloc</span> <span class="o">&&</span> <span class="n">btb</span> <span class="n">reqenb</span>
<span class="kr">then</span> <span class="kt">TCAMrequest</span> <span class="p">{</span> <span class="n">lookup_req</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
<span class="n">dst_mac</span> <span class="o">=</span> <span class="n">fromJust</span> <span class="p">(</span><span class="n">dstMac</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="p">(</span><span class="n">fromJust</span> <span class="n">headloc</span><span class="p">))),</span>
<span class="n">src_mac</span> <span class="o">=</span> <span class="n">fromJust</span> <span class="p">(</span><span class="n">srcMac</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="p">(</span><span class="n">fromJust</span> <span class="n">headloc</span><span class="p">))),</span>
<span class="n">src_port</span> <span class="o">=</span> <span class="p">(</span><span class="n">srcPorts</span> <span class="p">(</span><span class="n">state</span> <span class="o">!!</span> <span class="p">(</span><span class="n">fromJust</span> <span class="n">headloc</span><span class="p">)))</span> <span class="p">}</span>
<span class="kr">else</span> <span class="n">tcam_r_default</span></code></pre></figure>
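<p>The timing of the request is the interesting part: if the flit entering stage 0 has a count between 2 and 5, the head flit of that packet currently sits at pipeline index <code class="language-plaintext highlighter-rouge">flitCount - 1</code>. A hypothetical Python restatement of that window calculation:</p>

```python
def head_location(flit_count):
    """Pipeline index of the head flit, if a TCAM request can still
    be formed.

    The head entered the pipeline at count 1 and moves one stage per
    flit, so during counts 2..5 it occupies stages 1..4; beyond that
    it has left the request window.
    """
    if 2 <= flit_count <= 5:
        return flit_count - 1
    return None  # no request can be issued this cycle
```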
<p>And, whilst fairly terse, that is the full lookup logic for a 10G 4-port L2 switch. The benefits of maintaining a clear state machine may be felt even more keenly when attempting some kinds of low-latency L3 switching.</p>
<h3 id="outside-module-declaration">Outside module declaration</h3>
<p>We define the initial state of the module as an empty pipeline (this also allows a reset wire to fully reset the module).
Next we use an <code class="language-plaintext highlighter-rouge">ANN</code> annotation to subdivide the ports into the logic we need for outside interfacing, declare a clock and a reset that may be asynchronous, and mark this function as the compile target by assigning to <code class="language-plaintext highlighter-rouge">topEntity</code>.</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">oplInitalState</span> <span class="o">::</span> <span class="kt">OPLState</span> <span class="o">=</span> <span class="n">replicate</span> <span class="p">(</span><span class="kt">SNat</span> <span class="o">::</span> <span class="kt">SNat</span> <span class="mi">7</span><span class="p">)</span> <span class="n">pipelineDefault</span>
<span class="cp">{-# ANN topEntity
(Synthesize
{ t_name = "OPL"
, t_inputs = [
PortName "clk",
PortName "rst",
PortProduct "" [
PortProduct "" [PortName "I_DATA", PortName "I_KEEP", PortName "I_USER",
PortName "I_VALID", PortName "I_READY", PortName "I_LAST"],
PortProduct "" [PortName "TCAM_I_PORTS", PortName "TCAM_DONE",
PortName "TCAM_MISS", PortName "TCAM_HIT"],
-- PortName "TCAM_RPY",
PortName "REQ_ENB",
PortName "RPY_ENB"
]
]
, t_output = PortProduct "" [PortProduct "" [PortName "O_DATA", PortName "O_KEEP", PortName "O_USER",
PortName "O_VALID", PortName "O_READY", PortName "O_LAST"],
PortProduct "" [PortName "TCAM_O_DST_MAC", PortName "TCAM_O_SRC_MAC",
PortName "TCAM_O_SRC_PORT", PortName "TCAM_O_LOOKUP_REQ"]
-- PortName "TCAM_REQ"
]
}) #-}</span>
<span class="n">topEntity</span>
<span class="o">::</span> <span class="kt">Clock</span> <span class="kt">System</span> <span class="kt">Source</span>
<span class="o">-></span> <span class="kt">Reset</span> <span class="kt">System</span> <span class="kt">Asynchronous</span>
<span class="o">-></span> <span class="kt">Signal</span> <span class="kt">System</span> <span class="p">(</span><span class="kt">Stream</span><span class="p">,</span> <span class="kt">TCAMreply</span><span class="p">,</span> <span class="kt">Bit</span><span class="p">,</span> <span class="kt">Bit</span><span class="p">)</span>
<span class="o">-></span> <span class="kt">Signal</span> <span class="kt">System</span> <span class="p">(</span><span class="kt">Stream</span><span class="p">,</span> <span class="kt">TCAMrequest</span><span class="p">)</span>
<span class="n">topEntity</span> <span class="o">=</span> <span class="n">exposeClockReset</span> <span class="p">(</span><span class="n">mealy</span> <span class="n">opl_pass_mealy</span> <span class="n">oplInitalState</span><span class="p">)</span></code></pre></figure>
<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The bandwidth of our switch is limited by the mapping lookup rate and this latency. As it turns out, we could not add more ports, as we are limited in both latency and bandwidth to the mapping module. The mapper is fully scheduled with no free clock lookups. Simultaneously, the OPL is the same length as the width of the output port MAC. If it took 9 clocks to identify the destination, the first 64B of the packet would have to be delivered to the MAC without a destination. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>As part of Noa Z’s course, we modified the NetFPGA SUME reference design to support cut-through low latency switching at 10G line rate.FPGA Quantum Simulation2019-01-14T15:06:15+00:002019-01-14T15:06:15+00:00/haskell/fpga/2019/01/14/FPGA-quantum-simulation<h1 id="a-high-performance-quantum-circuit-simulator-in-cλash">A high performance quantum circuit simulator in CλaSH</h1>
<h2 id="introduction">Introduction</h2>
<p>For my master’s project, I aimed to develop an FPGA-based quantum simulator, with the objective of running >30 qubits at reasonable depth on a single Amazon F1 instance. Whilst the full-scale tests never happened, a tested general simulator was produced and validated at 16 qubits on Zynq hardware. This series of blog posts will describe the simulation method and the use of CλaSH as a tool for rapid development of hardware networks.</p>
<p>Quantum simulation would be better termed quantum <em>emulation</em>: hopefully real QCs will come along, and we will then want to call the process of simulating physical systems on a QC “simulation”. But, with few exceptions, “simulation” has stuck, and I hope this will not be too confusing for readers from 2025.</p>
<h2 id="contents">Contents</h2>
<ol>
<li>Quantum circuits (skip this if you have QC background)</li>
<li>The recursive simulation algorithm</li>
<li>Hardware layout</li>
<li>Benchmarks</li>
</ol>
<h2 id="the-simulation-method">The simulation method</h2>
<p>A quantum state on N qubits contains information about the relative magnitude and phase of all 2^N possible bitstrings. In the general, and common, case all 2^N values need to be stored as complex numbers. For even a 50 qubit simulation this requires a prohibitively large amount of memory (on the order of petabytes, and growing exponentially). Google has proposed a 72 qubit chip, and if it is able to run a square circuit this is likely to demonstrate true quantum advantage.</p>
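<p>The memory claim is easy to make concrete: a dense state vector stores 2^N complex amplitudes, 16 bytes each at double precision. A quick sanity check (the numbers follow directly from that formula; nothing else is assumed):</p>

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector of N qubits:
    2^N complex double-precision amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude

# 30 qubits already need ~17 GB; 50 qubits need ~18 PB.
gb_30 = state_vector_bytes(30) / 1e9   # ≈ 17.2
pb_50 = state_vector_bytes(50) / 1e15  # ≈ 18.0
```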
<p>In order to push the boundaries of simulation groups from Google, Alibaba and Oxford use tensor network methods to simulate up to 52 qubit systems on datacenter scale computers.</p>
<p>We use a method of simulation inspired by the path integral formalism of QM. The likelihood of sampling a given final state is determined by the sum of the amplitudes of all paths that could possibly result in that state. This contrasts with the time-evolution view of QM, where an initial state is evolved in some environment and the final sample probabilities are then found from the final state.</p>
<p>For large physical systems performing the integral over all possible priors can be challenging. For quantum circuits however it is straightforward - at each layer in the circuit there is a gate of low arity. The only states that can contribute are the states that when acted upon by that gate, give the target state.</p>
<p>This leads to a succinct <code class="language-plaintext highlighter-rouge">backwardsEvaluate</code> function for finding the amplitude of a given basis vector after applying a circuit:</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="c1">-- backwardsEvaluate circuit initial_state target_state</span>
<span class="n">backwardsEvaluate</span> <span class="kt">[]</span> <span class="n">i</span> <span class="n">t</span>
<span class="o">|</span> <span class="n">i</span> <span class="o">==</span> <span class="n">t</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="n">backwardsEvaluate</span> <span class="p">(</span><span class="n">gate</span><span class="o">:</span><span class="n">xs</span><span class="p">)</span> <span class="n">i</span> <span class="n">t</span> <span class="o">=</span>
<span class="kr">let</span>
<span class="n">prior_states</span> <span class="o">=</span> <span class="p">(</span><span class="n">possiblePriors</span> <span class="n">gate</span> <span class="n">t</span><span class="p">)</span>
<span class="kr">in</span> <span class="kr">let</span>
<span class="n">prior_amplitudes</span> <span class="o">=</span> <span class="n">map</span> <span class="p">(</span><span class="n">backwardsEvaluate</span> <span class="n">xs</span> <span class="n">i</span><span class="p">)</span> <span class="n">prior_states</span>
<span class="kr">in</span>
<span class="c1">-- the final amplitude is the sum of prior amplitudes</span>
<span class="c1">-- multiplied by the action of the gate on the prior state</span>
<span class="n">sum</span> <span class="p">(</span><span class="n">zipWith</span> <span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="n">prior_amplitudes</span> <span class="p">(</span><span class="n">map</span> <span class="n">gate</span> <span class="n">prior_states</span><span class="p">))</span></code></pre></figure>
<p>This function works in tandem with a forwards evaluator to get the action of a circuit on an initial state.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">simulate</span><span class="p">(</span><span class="n">circuit</span><span class="p">,</span> <span class="n">state</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span> <span class="c1"># zero is the |00...00> state.</span>
    <span class="n">initial</span> <span class="o">=</span> <span class="n">state</span>
    <span class="k">for</span> <span class="n">depth</span><span class="p">,</span> <span class="n">gate</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">circuit</span><span class="p">):</span>
        <span class="n">successors</span> <span class="o">=</span> <span class="n">act</span><span class="p">(</span><span class="n">gate</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
        <span class="c1"># amplitude of each successor after the circuit prefix so far</span>
        <span class="n">amplitudes</span> <span class="o">=</span> <span class="p">[</span><span class="n">backwardsEvaluate</span><span class="p">(</span><span class="n">circuit</span><span class="p">[:</span><span class="n">depth</span> <span class="o">+</span> <span class="mi">1</span><span class="p">],</span> <span class="n">initial</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
                      <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">successors</span><span class="p">]</span>
        <span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choices</span><span class="p">(</span><span class="n">successors</span><span class="p">,</span>
                                 <span class="n">weights</span><span class="o">=</span><span class="p">[</span><span class="nb">abs</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">amplitudes</span><span class="p">],</span>
                                 <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">state</span><span class="p">,</span> <span class="n">amplitudes</span><span class="p">[</span><span class="n">successors</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">state</span><span class="p">)]</span></code></pre></figure>
<p>With a bit of work you can show that this function will return samples of the final state vector in proportion to the corresponding probability. The method also gives the true amplitude of each sampled state, so to sample the full vector you can call this function repeatedly until the sum of the squared amplitude magnitudes nears unity. Avoiding previously sampled elements of the state is an exercise for the reader!</p>
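<p>That stopping rule can be sketched as a driver loop. The sampler below is a stand-in with a fixed two-state distribution so the sketch is self-contained; in the real design each draw would come from the forwards evaluator:</p>

```python
import random

def sample_until_covered(simulate, threshold=0.99, max_draws=10_000):
    """Draw final states until the recovered probability mass
    (sum of |amplitude|^2 over distinct states) nears unity."""
    seen = {}  # state -> amplitude; re-draws are not double counted
    for _ in range(max_draws):
        state, amp = simulate()
        seen[state] = amp
        if sum(abs(a) ** 2 for a in seen.values()) >= threshold:
            break
    return seen

# Stand-in sampler: two basis states with known amplitudes.
amps = {0: 0.6 + 0j, 3: 0.8j}  # |0.6|^2 + |0.8|^2 = 1

def fake_simulate():
    s = random.choice(list(amps))
    return s, amps[s]

recovered = sample_until_covered(fake_simulate)
```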
<p>The forwards function is presented in Python in order to emphasise that this function is not performance critical. In fact, in our implementation this function runs on the Zynq ARM core with only <code class="language-plaintext highlighter-rouge">backwardsEvaluate</code> built in hardware.</p>
<h3 id="performance-and-tweaking">Performance and tweaking</h3>
<p>This method, if the <code class="language-plaintext highlighter-rouge">backwardsEvaluate</code> function runs in a depth-first manner, will use space linear in the depth of the circuit. The tradeoff is that circuit runtime becomes both exponential in width and depth. The runtime can be reduced to that of the naïve matrix multiplication method by memoizing <code class="language-plaintext highlighter-rouge">backwardsEvaluate</code> - and if the circuit contains separable states they will never appear in the cache resulting in memory use potentially lower than naïve methods.</p>
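<p>In software form the memoization is one decorator away. A toy sketch under stated assumptions: gates are represented as maps from a target basis state to its (prior state, coefficient) pairs, and a single-qubit Hadamard stands in for a real circuit — the cache is what collapses the exponential tree of recursive calls:</p>

```python
import math
from functools import lru_cache

INV_SQRT2 = 1 / math.sqrt(2)

# Toy gate model: gate[target] -> [(prior_state, coefficient), ...]
H = {0: [(0, INV_SQRT2), (1, INV_SQRT2)],
     1: [(0, INV_SQRT2), (1, -INV_SQRT2)]}

def make_backwards_evaluate(circuit):
    @lru_cache(maxsize=None)
    def backwards(depth, target):
        """Amplitude of `target` after the first `depth` gates,
        starting from the |0> basis state."""
        if depth == 0:
            return 1.0 if target == 0 else 0.0
        return sum(coeff * backwards(depth - 1, prior)
                   for prior, coeff in circuit[depth - 1][target])
    return backwards

amplitude = make_backwards_evaluate([H, H])(2, 0)  # H·H = I, so ~1.0
```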
<p>The advantage this method poses for FPGA instantiation is that by distributing the recursive calls to <code class="language-plaintext highlighter-rouge">backwardsEvaluate</code> in the form of a tree caches can be inserted in ways that provide high physical locality of reference. This allows for very wide memory parallelism and a computational structure that can be mapped to any fabric layout or BRAM availability.</p>
<p>As compared to a direct matrix method this avoids a bottleneck on DRAM access, a key limiting factor for high performance FPGA designs.</p>
<h2 id="the-hardware">The hardware</h2>
<p>The FPGA modules are written in CλaSH, with an ad-hoc wiring generator in Python and a Verilator test suite. Functionality was also verified on a Xilinx Zynq chip at nearly 100 MHz.</p>
<p>CλaSH generates synthesizable Verilog from Haskell functions of type <code class="language-plaintext highlighter-rouge">State -> Input -> (State, Output)</code>, where the state is the full internal state of your module from clock to clock. The type of the top-level function that the CPU calls is:</p>
<figure class="highlight"><pre><code class="language-haskell" data-lang="haskell"><span class="n">findamp_mealy_N</span> <span class="o">::</span> <span class="kt">KnownNat</span> <span class="n">n</span> <span class="o">=></span> <span class="kt">ModuleState</span> <span class="n">n</span> <span class="o">-></span> <span class="kt">Input</span> <span class="o">-></span> <span class="p">(</span><span class="kt">ModuleState</span> <span class="n">n</span><span class="p">,</span> <span class="kt">Output</span><span class="p">)</span></code></pre></figure>
<p>We use KnownNat to add compile-time parameters to the module. In this case, the modules maintain a stack of evaluations to complete and <code class="language-plaintext highlighter-rouge">n</code> is the size of this stack.</p>
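<p>The mealy abstraction is worth internalizing: CλaSH turns a pure step function into a clocked circuit, and the same function can be simulated in software by folding it over an input trace. A sketch, with a toy running-sum step standing in for the real module state:</p>

```python
def mealy(step, state, inputs):
    """Software model of a mealy machine: thread the state through
    a pure step function, one input element per clock cycle."""
    outputs = []
    for x in inputs:
        state, out = step(state, x)
        outputs.append(out)
    return outputs

def running_sum(acc, x):
    """Toy step function of shape State -> Input -> (State, Output)."""
    return acc + x, acc + x

totals = mealy(running_sum, 0, [1, 2, 3, 4])  # [1, 3, 6, 10]
```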
<h2 id="benchmarks">Benchmarks</h2>
<p>The hardware design was validated on a Zynq development board running Linux. The target device was the xc7z020clg400 at a -1 speed grade. Due to area limitations, and in order to generate a fully entangled intermediate state, a circuit of width 4 and depth 12 was used. The full circuit executed in 3µs, with timing correctly predicted by the RTL model.</p>
<h3 id="scaling-estimation">Scaling estimation</h3>
<p>In order to scale the design to multiple FPGA blades, we need to consider the total bandwidth that may be consumed communicating between parts of the design.</p>
<p>Amazon offers the F1 instance type, with up to 8 FPGA blades consisting of Xilinx Virtex UltraScale+ VU9P FPGAs with 2,586k logic cells. The blades are interconnected via a 400 Gbps bidirectional ring interconnect[@Amazon_EC2_F1_Instances]. In the worst case, a single blade may consist of many low-depth <code class="language-plaintext highlighter-rouge">FindAmp</code> modules with minimum-sized stack buffers.</p>
<p>If the modules are responsible for evaluations of depth 2, they will be able to process a new request every 11 cycles. For this design point, where the entire fabric is consuming bandwidth, the bandwidth requirement exceeds the available bandwidth by 70%. More reasonable layouts should therefore not be bandwidth limited on the Amazon FPGA service.</p>A high performance quantum circuit simulator in CλaSH
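<p>The bandwidth check behind that conclusion is a one-liner. The real figure depends on the request width, clock rate, and module count of the actual design, so the example parameters below are purely illustrative placeholders — they are not the numbers behind the 70% result:</p>

```python
def fabric_demand_gbps(n_modules, request_bits, f_clk_hz, cycles_per_request):
    """Aggregate traffic if every module issues one request of
    `request_bits` bits every `cycles_per_request` cycles.
    All parameters are illustrative, not the design's actuals."""
    return n_modules * request_bits * f_clk_hz / cycles_per_request / 1e9

RING_GBPS = 400  # the F1 ring interconnect
# e.g. one hypothetical module: 128-bit requests every 11 cycles at 250 MHz
per_module = fabric_demand_gbps(1, 128, 250e6, 11)  # ≈ 2.9 Gbps per module
```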