Performance

Opal Kelly’s FrontPanel consists of HDL modules within the FPGA, firmware on the USB microcontroller (or PCIe bridge device), and an API on the PC that have been optimized for both performance and a clean abstraction.

Achieving the highest level of performance for your particular application requires an understanding of the components being used and how certain things affect performance.  By following a few simple strategies and applying these notes, your application will be a top performer and still benefit from the ease of use and flexible abstraction that only FrontPanel provides.

Your Mileage May Vary

Performance metrics vary significantly based on a number of factors including, but certainly not limited to, CPU performance, memory performance, USB chipset, operating system, driver configuration, etc. We have posted performance results from our PipeTest sample for some representative computers but even the same hardware can produce different results as the OS and software evolve.

Measured Performance

We have posted a few performance benchmarks taken with our PipeTest sample. These benchmarks are captured by running the application as shown below. For each, we’ve included some details of the computer used to perform the benchmark.

PipeTest pipetest.bit bench

The console output for the PipeTest benchmark includes the following acronyms:

  • Transfer Size (TS) – The total size of the benchmark transfer. We set the transfer size appropriately to attempt to get a long enough execution time for accurate timing statistics. Multiple API calls may be required to reach this total transfer size.
  • Segment Size (SS) – This corresponds to the length parameter passed to the corresponding FrontPanel API methods used to perform the transfer.
  • Block Size (BS) – This corresponds to the blockSize parameter passed to the corresponding FrontPanel API methods used to perform the transfer.

Note: Block sizes of 0 show the performance for the non-block-throttled pipe variants WriteToPipeIn and ReadFromPipeOut.
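To make the relationship between these three quantities concrete, here is a small illustrative model (plain Python, not SDK code; the example numbers are hypothetical, not benchmark results):

```python
# Illustrative model of the PipeTest benchmark parameters.
# TS: total transfer size; SS: length passed per API call; BS: blockSize per call.

def benchmark_breakdown(ts, ss, bs):
    """Return (api_calls, blocks_per_call) implied by the benchmark parameters.
    A block size of 0 denotes the non-block-throttled pipe variants."""
    api_calls = ts // ss                   # multiple calls may be needed to reach TS
    blocks_per_call = 0 if bs == 0 else -(-ss // bs)  # ceiling division
    return api_calls, blocks_per_call

# Example: a 64 MiB total transfer issued in 1 MiB segments of 1024-byte blocks
# requires 64 API calls, each moving 1024 blocks.
calls, blocks = benchmark_breakdown(64 * 2**20, 2**20, 1024)
print(calls, blocks)  # → 64 1024
```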

Wires and Triggers

Wires and triggers provide the most basic form of communication between the FPGA and the PC.  From a performance perspective, wires can be read or written several thousand times per second.  All WireIns are read simultaneously, regardless of which ones you are interested in.  Similarly, all WireOuts are written simultaneously.

Activating a TriggerIn is a very fast operation and can operate at several thousand times per second.  Only one trigger is written per call.  Updating TriggerOuts is similar to reading all WireOuts: all TriggerOuts are read simultaneously.

Since Wire and Trigger updates are always blocking API calls, these measurements provide some indication of the latency performance of the device.

Measured Performance (CPS = Calls Per Second)

API CALL             USB 3.0 (CPS)   USB 2.0 (CPS)   PCIe (CPS)
UpdateWireIns        5,000+          1,000+          4,000+
UpdateWireOuts       4,000+          800+            3,000+
ActivateTriggerIn    8,000+          2,000+          66,000+
UpdateTriggerOuts    4,000+          800+            3,000+
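Because these are blocking calls, the reciprocal of each CPS figure gives a rough upper bound on per-call latency. A quick conversion, using two of the lower-bound figures from the table as examples:

```python
# Convert calls-per-second figures into approximate per-call latency.
# These are rough upper bounds: "5,000+ CPS" means at most ~200 us per call.

def latency_us(calls_per_second):
    """Approximate per-call latency in microseconds."""
    return 1_000_000 / calls_per_second

print(round(latency_us(5000)))  # UpdateWireIns over USB 3.0: ~200 us per call
print(round(latency_us(1000)))  # UpdateWireIns over USB 2.0: ~1000 us per call
```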

Pipes (Bulk Transfers)

Pipes are the fastest way to transmit or receive bulk data.  Due to overhead, performance is best with long transfers.  Each time you perform a pipe transfer, several layers of setup are required including those at the firmware level, API level, and operating system level.  Therefore, it is best to design around using long transfers, if possible.  This generally means using large buffer sizes on the FPGA and relying on external memory when possible.

Low-latency, high-bandwidth transfers present a special challenge to any protocol and USB (and therefore FrontPanel) is no different.  In this case, the two goals are at odds: trying to perform many operations and still achieve high bandwidth.  The problem is that the overhead associated with setting up each transfer cuts into the time available to perform the data transfer.
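The effect of that fixed per-transfer overhead can be sketched with a simple model. The overhead and raw-rate numbers below are made up purely for illustration; they are not measured FrontPanel figures:

```python
# Fixed-overhead transfer model: each pipe call pays a constant setup cost
# before data moves at the link's raw rate. Parameter values are illustrative.

def effective_mbps(transfer_bytes, overhead_s=250e-6, raw_mbps=340.0):
    """Effective throughput in MB/s for a single transfer of the given size."""
    transfer_mb = transfer_bytes / 1e6
    total_time = overhead_s + transfer_mb / raw_mbps
    return transfer_mb / total_time

# Small transfers are dominated by the setup cost; large transfers
# amortize it and approach the raw rate.
print(effective_mbps(1024))       # a 1 KB transfer achieves only a few MB/s
print(effective_mbps(16 * 2**20)) # a 16 MiB transfer approaches the raw rate
```

This is why the guidance above favors large buffers: doubling the transfer size halves the relative cost of the setup overhead.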

It is important to note that Windows, Linux, and Mac OS X are not real-time operating systems.  They are complex systems that may have many other processes taking higher priority at any given time.  Therefore, it is often the case that simple operations (like moving a window) dramatically reduce transfer bandwidth.  This should be a consideration when designing the buffering for any bandwidth-dependent application.

NOTE: Pipes in FrontPanel-3 are actually a subset of Block-Throttled Pipes where the EP_READY signal is always asserted, thus disabling any throttling.  Also, block sizes are always 1024 bytes except for the last block which may be smaller to account for the total length of the transfer.  Block sizes are 64 bytes when the device is enumerated at full-speed.
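The blocking described in the note can be modeled as follows (a small Python sketch, not SDK behavior verified at the wire level):

```python
# Model of how a pipe transfer is carved into blocks: fixed-size blocks
# (1024 bytes, or 64 bytes at full speed) with a smaller final block
# absorbing any remainder of the total transfer length.

def block_layout(total_bytes, block_size=1024):
    """Return the list of block sizes used to cover a transfer."""
    full, remainder = divmod(total_bytes, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # last block may be smaller
    return blocks

print(block_layout(5000))  # → [1024, 1024, 1024, 1024, 904]
```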

Block-Throttled Pipes (Bulk Transfers)

Block-Throttled Pipes are available for USB 3.0 implementations of FrontPanel.  They provide performance equivalent to the standard pipe, except that the FPGA can throttle the data transfer at the block level.  The block size is programmable by the user, with the highest performance achieved at the largest (1,024-byte) block size.

BTPipes are an excellent way to achieve high performance with smaller buffer sizes because the FPGA can negotiate the transfer at a low level without incurring the significant overhead of setting up a new transfer for each small buffer block.

Isochronous Transfers?

FrontPanel does not support USB isochronous transfers.  It is true that isochronous transfers can negotiate for guaranteed bandwidth on the USB which can be very helpful when trying to build a system that must deliver certain performance to the end-user.  However, this guarantee comes at a significant price: isochronous transfers do not provide the same level of error-detection and error-correction that the more reliable USB bulk transfers provide.  Furthermore, the guarantee is only for bus bandwidth and says nothing about the operating system’s capabilities.

If an error occurs during the transmission of a bulk transfer, the host will request that the missing packet be repeated.  The host will also properly reconstitute the transmission so that everything is properly sequenced.

With isochronous transfers, the bandwidth and latency requirements trump delivery accuracy.  Therefore, it is possible that some data may be lost in this pursuit.  Isochronous transfers were created for things such as multimedia content that requires on-time delivery.  But if the host is too busy or something interrupts the transfer, a few missing frames of video or a few milliseconds of audio are considered expendable.