As part of Noa Z’s course, we modified the NetFPGA SUME reference design to support cut-through, low-latency switching at 10G line rate.
As part of this project, a new output port lookup (OPL) module was written in CλaSH, a Haskell-derived DSL that compiles to Verilog. The pipelined design was exceptionally succinct compared to the reference Verilog design.
This document is a literate Haskell walk-through of the OPL module, intended as a building block for other pipelined systems. The code cannot be used as-is without the closed NetFPGA SUME codebase, so imports and external references are deliberately omitted, with salient details given in the text.
A cut-through Ethernet switch has a latency that does not depend on the length of the packet. Cut-through designs commonly have lower average latency and much lower worst-case latency than store-and-forward designs. The advent of FEC at 40G and above makes cut-through designs obsolete, as the FEC code, located at the end of the packet, is required to decode the destination MAC for layer 2 routing.
The architecture of the OPL
The Xilinx MACs use the AXI-Stream protocol for transmitting data. The MACs use a 156.25 MHz internal clock, and this design runs at that rate, without gearboxes, for simplicity.
AXI Stream and pipeline depth
A sample of the data on an AXI stream bus is given by the following type:
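The sketch below illustrates what such a sample type might look like; it is not the module's actual definition. The field names follow the AXI4-Stream signal names, the 64-bit data width is inferred from the footnote's figure of 64B over 8 clocks, and plain `Data.Word` types stand in for CλaSH bit vectors. TUSER is left out for brevity, and `validBytes` is a hypothetical helper.

```haskell
import Data.Bits (popCount)
import Data.Word (Word8, Word64)

-- One clock cycle's worth of stream data (a "flit").
-- Widths are assumptions, not taken from the original module.
data AxiFlit = AxiFlit
  { tdata :: Word64  -- 8 data bytes per clock (assumed 64-bit bus)
  , tkeep :: Word8   -- one bit per valid byte in tdata
  , tlast :: Bool    -- set on the final flit of a packet
  } deriving (Eq, Show)

-- TVALID is modelled with Maybe: Nothing means no data this cycle.
type AxiSample = Maybe AxiFlit

-- Number of valid bytes carried by a sample (hypothetical helper).
validBytes :: AxiSample -> Int
validBytes Nothing  = 0
validBytes (Just f) = popCount (tkeep f)
```

Modelling TVALID as `Maybe` keeps an idle cycle from being mistaken for data.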
Our OPL module has 8 cycles of latency. Three clocks are required to get the source MAC. Four more clocks are needed to interleave the 4 ports' access to the MAC->PORT mapping module. A final clock is used to simplify the output logic. The source MAC is required in order to act as a learning switch: we may need to update our MAC->PORT mapping with this source, so the lookup cannot begin until it is known.[1]
The pipeline type contains the AXI stream connections within the module, as well as all the metadata the OPL has calculated so far. The module is a “conveyor belt” design that moves the data in a PipelineStage forward every cycle, modifying the data as it goes.
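As a rough illustration of the conveyor-belt idea, the sketch below models the pipeline as a fixed-depth list that shifts by one position per clock. The record fields, `emptyPipeline`, and `step` are hypothetical names, and the real module also rewrites the metadata at each stage, which this sketch omits.

```haskell
import Data.Word (Word8, Word64)

-- Hypothetical stage contents: the raw flit plus metadata gathered so far.
data PipelineStage = PipelineStage
  { flitData :: Word64       -- stream data, carried along unchanged
  , isFirst  :: Bool         -- is this the first flit of a packet?
  , srcPort  :: Word8        -- one-hot ingress port
  , dstPorts :: Maybe Word8  -- filled in once the lookup has replied
  } deriving (Eq, Show)

-- A fixed-depth belt of optional stages; Nothing is a bubble.
type Pipeline = [Maybe PipelineStage]

-- Depth matches the 8 cycles of latency described in the text.
depth :: Int
depth = 8

emptyPipeline :: Pipeline
emptyPipeline = replicate depth Nothing

-- One clock tick: push the new input in at the front and return the
-- oldest stage, which falls off the back, as this cycle's output.
step :: Pipeline -> Maybe PipelineStage -> (Pipeline, Maybe PipelineStage)
step pipe input = (input : init pipe, last pipe)
```

With this model, a stage pushed in on one tick is returned by `step` eight ticks later.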
This writing style keeps all the data and metadata together. The downside is that the datapaths in CλaSH are much wider than would be required for just buffering the AXI stream for 8 clocks. The CλaSH compiler did not attempt any wiring simplification, but the unused wires should be removed at elaboration time.
The Haskell Maybe type maps exactly to a Verilog type with an additional valid wire. If a Maybe type is exposed, it is easy enough to write combinatorial packers and unpackers that maintain idiomatic styles in both languages.
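A minimal sketch of such a packer and unpacker, assuming an 8-bit payload with the valid wire placed in bit 8. The names `packMaybe` and `unpackMaybe` and the bit position are illustrative choices, not taken from the module.

```haskell
import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.Word (Word8, Word16)

-- Pack: bit 8 is the valid wire, bits 7..0 carry the payload.
packMaybe :: Maybe Word8 -> Word16
packMaybe Nothing  = 0
packMaybe (Just x) = (1 `shiftL` 8) .|. fromIntegral x

-- Unpack: the payload is only meaningful when the valid bit is set.
unpackMaybe :: Word16 -> Maybe Word8
unpackMaybe w
  | testBit w 8 = Just (fromIntegral (w .&. 0xFF))
  | otherwise   = Nothing
```

On the Verilog side, the same 9 wires would be read as a data bus plus a valid signal.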
Packet routing functions
Because the destination mapper is a bottleneck in our design, we disabled support for the NetFPGA host ports. We can use maybe to set the DMA destinations to zero only if they were valid.
Next we route the packets based on the metadata in the current pipeline stage. Haskell pattern matching makes this kind of if-else logic clear. Packets addressed to the broadcast MAC are sent to all ports. Otherwise, if the MAC was not found in the lookup table, we send to all ports but the source port. If the lookup succeeds, we use that routing.
Without the INLINE declaration, every Haskell function generates a new Verilog module. For small functions like this, it is simpler to fold all the logic into one module.
applyTcamRpy looks for the result of a lookup in the current pipeline stage if this flit is the first one in a packet. Otherwise, we replicate the destination from the previous flit. This ensures that packets are not fragmented.
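One way this could look, assuming the destination is a one-hot byte with the physical MAC ports on the even bits and the host (DMA) ports on the odd bits. That encoding, `macPortMask`, and `stripDma` are assumptions made for this sketch.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Assumed one-hot encoding: even bits are MAC ports, odd bits are DMA ports.
macPortMask :: Word8
macPortMask = 0x55

-- Clear the DMA bits of a destination, but only when one is present.
-- An absent destination stays absent.
stripDma :: Maybe Word8 -> Maybe Word8
stripDma = maybe Nothing (Just . (.&. macPortMask))
```

This use of `maybe` is equivalent to `fmap (.&. macPortMask)`; either form leaves `Nothing` untouched.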
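The routing rules and the INLINE pragma described above can be sketched as follows. The function name `route`, its argument order, and the one-hot port encoding (four MAC ports on the even bits) are assumptions for illustration.

```haskell
import Data.Bits (complement, (.&.))
import Data.Word (Word8, Word64)

broadcastMac :: Word64
broadcastMac = 0xFFFFFFFFFFFF

-- Assumed: four MAC ports, one-hot encoded on the even bits.
allPorts :: Word8
allPorts = 0x55

-- Destination MAC, one-hot source port and lookup result
-- give the one-hot set of output ports.
{-# INLINE route #-}
route :: Word64 -> Word8 -> Maybe Word8 -> Word8
route dst src result = case (dst == broadcastMac, result) of
  (True,  _)          -> allPorts                     -- broadcast
  (False, Nothing)    -> allPorts .&. complement src  -- unknown: flood
  (False, Just ports) -> ports                        -- known: use lookup
```

The three case arms correspond one-to-one to the three rules in the text, which is the readability benefit pattern matching brings here.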
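A simplified stand-in for this behaviour is sketched below. `applyReply` and `destinations` are hypothetical names, and the real applyTcamRpy operates on whole pipeline stages rather than on the three bare values used here.

```haskell
import Data.Word (Word8)

-- First flit of a packet: take the lookup reply seen this cycle.
-- Later flits: inherit the destination chosen for the previous flit.
applyReply :: Bool -> Maybe Word8 -> Maybe Word8 -> Maybe Word8
applyReply isFirst reply prevDst
  | isFirst   = reply
  | otherwise = prevDst

-- Run the rule over a stream of (isFirst, reply) pairs.
destinations :: [(Bool, Maybe Word8)] -> [Maybe Word8]
destinations = drop 1 . scanl (\prev (f, r) -> applyReply f r prev) Nothing
```

For example, `destinations [(True, Just 4), (False, Nothing), (False, Just 16), (True, Just 1)]` gives `[Just 4, Just 4, Just 4, Just 1]`: the stray reply on the third flit is ignored, so the packet is not split across ports.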
The lookup pipeline
The final function is also the largest. At 100 lines of Haskell, it declares the action on each pipeline stage. The type is
The body is a large where block defining the next values of the pipeline stages up to s6, and the connections to the later modules.
Our first act is to update the metadata in the pipeline as soon as flits enter. We need to segment packets and extract the dstMAC and part of the srcMAC depending on the current flit number.
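To make the flit arithmetic concrete, the sketch below assumes a 64-bit bus with the first byte on the wire in the low 8 bits of TDATA. Flit 0 then holds the 6-byte dstMAC plus the first 2 bytes of the srcMAC, and flit 1 holds the remaining 4 bytes. The function names and the byte ordering are assumptions.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- Flit index within a packet: restart after a flit with TLAST set.
nextFlitNo :: Int -> Bool -> Int
nextFlitNo n isLast = if isLast then 0 else n + 1

-- dstMAC: the low 48 bits of flit 0.
dstMac :: Word64 -> Word64
dstMac flit0 = flit0 .&. 0xFFFFFFFFFFFF

-- srcMAC: top 16 bits of flit 0 joined with the low 32 bits of flit 1.
srcMac :: Word64 -> Word64 -> Word64
srcMac flit0 flit1 =
  (flit0 `shiftR` 48) .|. ((flit1 .&. 0xFFFFFFFF) `shiftL` 16)
```

Because the srcMAC straddles two flits, it is only complete once the second flit has arrived.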
By stage 2 we know the full srcMAC and can prepare to make a lookup request. If the packet uses the EtherType field as a length, we trust it. Otherwise we fall back on a length that a prior module may have filled in. Failing that, we assume the maximum length in wide use. This also shows the general form of performing a calculation only if this is the first flit in a packet, as determined earlier.
The next stages are identical. An outside module controls which of the 4 ports the reply from the mapping module affects, so we repeat this block 4 times to ensure we catch the reply.
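The fallback chain for the packet length can be sketched with `<|>` and `fromMaybe`. In IEEE 802.3, an EtherType field of 1500 or less is a length. The 1518-byte default and all names here are assumptions, since the text does not state the exact value used.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)
import Data.Word (Word16)

-- An EtherType field of 1500 or less is a length, not a protocol number.
lengthFromEtype :: Word16 -> Maybe Word16
lengthFromEtype e
  | e <= 1500 = Just e
  | otherwise = Nothing

-- Assumed default: the standard maximum untagged Ethernet frame size.
assumedMaxLen :: Word16
assumedMaxLen = 1518

-- Trust the EtherType, else an upstream module's value, else the default.
packetLength :: Word16 -> Maybe Word16 -> Word16
packetLength etype upstream =
  fromMaybe assumedMaxLen (lengthFromEtype etype <|> upstream)

-- General form: recompute a field on the first flit only,
-- otherwise keep the value carried from the previous flit.
onFirstFlit :: Bool -> a -> a -> a
onFirstFlit isFirst new old = if isFirst then new else old
```

`onFirstFlit` is the shape the text refers to: every per-packet field is recomputed on the first flit and held for the rest.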
Finally we put all our metadata into the TUSER part of the pipeline.
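A sketch of packing the metadata, assuming the low 32 bits of TUSER hold the packet length in bits 15..0, the one-hot source port in bits 23..16, and the one-hot destination ports in bits 31..24. That layout and the name `packTuser` are assumptions and should be checked against the NetFPGA documentation.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8, Word16, Word32)

-- Assumed layout of the low 32 bits of TUSER:
--   [15:0]  packet length in bytes
--   [23:16] one-hot source port
--   [31:24] one-hot destination ports
packTuser :: Word16 -> Word8 -> Word8 -> Word32
packTuser len src dst =
      fromIntegral len
  .|. (fromIntegral src `shiftL` 16)
  .|. (fromIntegral dst `shiftL` 24)
```

For a 64-byte packet from port 0x01 flooded to ports 0x54, this gives `0x54010040`.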
The only remaining logic is to use an outside signal to send mapping lookup requests at the right time.
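One plausible shape for this gating is sketched below, assuming the outside signal is a one-hot `portSelect` naming the port whose turn it is on the current clock. The signal name, the request format, and `lookupRequest` are all hypothetical.

```haskell
import Data.Word (Word8, Word64)

-- A request carries the (srcMAC, dstMAC) pair: the source is learned,
-- the destination is looked up. The pair format is assumed.
type Request = (Word64, Word64)

-- Issue a request only on the first flit of a packet, and only in the
-- time slot that the outside scheduler has given to this source port.
lookupRequest :: Word8 -> Bool -> Word8 -> Request -> Maybe Request
lookupRequest portSelect isFirst srcPort macs
  | isFirst && portSelect == srcPort = Just macs
  | otherwise                        = Nothing
```

Returning `Nothing` outside the assigned slot is what lets four ports share a single mapping module without collisions.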
And whilst fairly terse, that is the full lookup logic for a 10G 4-port L2 switch. The benefits of maintaining a clear state machine may be felt even more when attempting low-latency L3 switching.
Outside module declaration
We define the initial state of the module as an empty pipeline (this also allows a reset wire to fully reset the module).
Next we use an ANNotation to subdivide the ports into the logic we need for outside interfacing, declare a clock and a reset that may be async, and mark this function as the compile target.
[1] The bandwidth of our switch is limited by the mapping lookup rate and this latency. As it turns out, we could not add more ports, as we are limited in both latency and bandwidth to the mapping module. The mapper is fully scheduled, with no free clocks for additional lookups. Simultaneously, the OPL is the same length as the width of the output port MAC. If it took 9 clocks to identify the destination, the first 64B of the packet would have to be delivered to the MAC without a destination.