A Look at the FPGA Targeting of this Versatile Toolkit
James Reinders, Editor Emeritus, The Parallel Universe
In this article, we’ll take a firsthand look at how to use Intel® Arria® 10 FPGAs with the OpenVINO™ toolkit (which stands for open visual inference and neural network optimization). The OpenVINO toolkit has much to offer, so I’ll start with a high-level overview showing how it helps develop applications and solutions that emulate human vision using a common API. Intel supports targeting of CPUs, GPUs, Intel® Movidius™ hardware including their Neural Compute Sticks, and FPGAs with the common API. I especially want to highlight another way to use FPGAs that doesn’t require knowledge of OpenCL* or VHDL* to get great performance. However, like any effort to get maximum performance, it doesn’t hurt to have some understanding about what’s happening under the hood. I’ll shed some light on that to satisfy your curiosity―and to help you survive the buzzwords if you have to debug your setup to get things working.
We’ll start with a brief introduction to the OpenVINO toolkit and its ability to support vision-oriented applications across a variety of platforms using a common API. Then we’ll take a look at the software stack needed to put the OpenVINO toolkit to work on an FPGA. This will define key vocabulary terms we encounter in documentation and help us debug the machine setup should the need arise. Next, we’ll take the OpenVINO toolkit for a spin with a CPU and a CPU+FPGA. I’ll discuss why “heterogeneous” is a key concept here (not everything runs on the FPGA). Specifically, we’ll use a high-performance Intel® Programmable Acceleration Card with an Intel Arria® 10 GX FPGA. Finally, we’ll peek under the hood. I’m not the type to just drive a car and never see what’s making it run. Likewise, my curiosity about what’s inside the OpenVINO toolkit when targeting an FPGA is partially addressed by a brief discussion of some of the magic inside.
The Intel Arria 10 GX FPGAs I used are not the sort of FPGAs that show up in $150 FPGA development kits. (I have more than a few of those.) Instead, they’re PCIe cards costing several thousand dollars each. To help me write this article, Intel graciously gave me access for a few weeks to a Dell EMC PowerEdge* R740 system, featuring an Intel Programmable Acceleration Card with an Arria 10 GX FPGA. This gave me time to check out the installation and usage of the OpenVINO toolkit on FPGAs instead of just CPUs.
The OpenVINO Toolkit
To set the stage, let’s discuss the OpenVINO toolkit and its ability to support vision-oriented applications across a variety of platforms using a common API. Intel recently renamed the Intel® Computer Vision SDK as the OpenVINO toolkit. Looking at all that’s been added, it’s not surprising Intel wanted a new name to go with all the new functionality. The toolkit includes three new APIs: the Deep Learning Deployment toolkit, a common deep learning inference toolkit, and optimized functions for OpenCV* and OpenVX*, with support for the ONNX*, TensorFlow*, MXNet*, and Caffe* frameworks.
The OpenVINO toolkit offers software developers a single toolkit for applications that need human-like vision capabilities. It does this by supporting deep learning, computer vision, and hardware acceleration with heterogeneous support—all in a single toolkit. The OpenVINO toolkit is aimed at data scientists and software developers working on computer vision, neural network inference, and deep learning deployments who want to accelerate their solutions across multiple hardware platforms. This should help developers bring vision intelligence into their applications from edge to cloud. Figure 1 shows potential performance improvements using the toolkit.
Accuracy changes can occur with FP16. The benchmark results reported in this deck may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system, or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. For more complete information about the performance and benchmark results, visit www.intel.com/benchmarks. Configuration: Intel® Core™ i7 processor 6700 at 2.90 GHz fixed. GPU GT2 at 1.00 GHz fixed. Internal ONLY testing performed 6/13/2018, test v3 15.21. Ubuntu* 16.04 OpenVINO™ toolkit 2018 RC4, Intel® Arria 10 FPGA 1150GX. Tests were based on various parameters such as model used (these are public), batch size, and other factors. Different models can be accelerated with different Intel® hardware solutions, yet use the same Intel® Software Tools. Benchmark source: Intel Corporation.
Figure 1 – Performance improvement using the OpenVINO toolkit
While it’s clear that Intel has included optimized support for Intel® hardware, top-to-bottom support for OpenVX APIs provides a strong non-Intel connection, too. The toolkit supports both OpenCV and OpenVX. Wikipedia sums it up as follows: “OpenVX is complementary to the open source vision library OpenCV. OpenVX, in some applications, offers a better optimized graph management than OpenCV.” The toolkit includes a library of functions, pre-optimized kernels, and optimized calls for both OpenCV and OpenVX.
The OpenVINO toolkit offers specific capabilities for CNN-based deep learning inference on the edge. It also offers a common API that supports heterogeneous execution across CPUs and computer vision accelerators including GPUs, Intel Movidius hardware, and FPGAs.
Vision systems hold incredible promise to change the world and help us solve problems. The OpenVINO toolkit can help in the development of high-performance computer vision and deep learning inference solutions—and, best of all, it’s a free download.
FPGA Software Stack, from the FPGA up to the OpenVINO Toolkit
Before we jump into using the OpenVINO toolkit with an FPGA, let’s walk through what software had to be installed and configured to make this work. I’ll lay a foundational vocabulary and try not to dwell too much on the underpinnings. In the final section of this article, we’ll revisit to ponder some of the under-the-hood aspects of the stack. For now, it’s all about knowing what has to be installed and working.
Fortunately, most of what we need for the OpenVINO toolkit to connect to FPGAs is collected in a single install called the Intel Acceleration Stack, which can be downloaded from the Intel FPGA Acceleration Hub. All we need is the Runtime version (619 MB in size). There’s also a larger development version (16.9 GB), which we could also use because it includes the Runtime. This is much like the choice of installing a runtime for Java* or a complete Java Development Kit. The choice is ours. The Acceleration Stack for Runtime includes:
- The FPGA programmer (called Intel® Quartus® Prime Pro Edition Programmer Only)
- The OpenCL runtime (Intel® FPGA Runtime Environment for OpenCL)
- The Intel FPGA Acceleration Stack, which includes the Open Programmable Acceleration Engine (OPAE). OPAE is an open-source project that has created a software framework for managing and accessing programmable accelerators.
I know from personal experience that there are a couple of housekeeping details that are easy to forget when setting up an FPGA environment: the firmware for the FPGA and the OpenCL Board Support Package (BSP). Environmental setup for an FPGA was a new world for me, and reading through FPGA user forums confirmed that I’m not alone. Hopefully, the summary I’m about to walk through, “up-to-date acceleration stack, up-to-date firmware, up-to-date OpenCL with BSP,” can be a checklist to help you know what to research and assure on your own system.
FPGA Board Firmware: Be Up to Date
My general advice about firmware is to find the most up-to-date version and install it. I say the same thing about BIOS updates and firmware for any PCIe card. Firmware comes from the board maker (in my case Intel, for the Intel® Programmable Acceleration Card [PAC] with an Arria® 10 GX FPGA). Intel actually has a nice chart showing which firmware is compatible with which release of the Acceleration Stack. Updating to the most recent Acceleration Stack requires the most recent firmware. That’s what I did. You can check the installed firmware version with the command sudo fpgainfo fme.
OpenCL BSP: Be Up to Date
You can hardly use OpenCL and not worry about having the right BSP. BSPs originally served in the embedded world to connect boards and real-time operating systems―which certainly predates OpenCL. However, today, for FPGAs, a BSP is generally a topic of concern because it connects an FPGA in a system to OpenCL. Because support for OpenCL can evolve with a platform, it’s essential to have the latest version of a BSP for our particular FPGA card. Intel integrates the BSPs with their Acceleration Stack distributions, which is fortunate because this will keep the BSP and OpenCL in sync if we just keep the latest software installed. I took advantage of this method, following the instructions to select the BSP for my board. This process included installing OpenCL itself with the BSP using the aocl install command (the name of which is an abbreviation of Altera OpenCL*).
Is the FPGA Ready?
When we can type aocl list-devices and get a good response, we’re ready. If not, then we need to pause and figure out how to get our FPGA recognized and working. The three things to check:
- Install the latest Acceleration Stack software
- Verify firmware is up-to-date
- Verify that OpenCL is installed with the right BSP
I goofed on the last two, and lost some time until I corrected my error―so I was happy when I finally saw:
Vendor: Intel Corp
Physical Dev Name Status Information
pac_a10_eb00000 Passed PAC Arria 10 Platform
FPGA temperature = 57 degrees C.
DIAGNOSTIC_PASSED
Figure 2 – Intel® Programmable Acceleration Card with an Intel Arria 10 GX FPGA
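The three-item checklist can be turned into a small readiness check. This is a minimal sketch, not an official Intel script: the function names are hypothetical, and it assumes that a DIAGNOSTIC_PASSED line in the aocl output (as in the listing above) is a reasonable signal that the board is usable. It only uses the two tools already named in this article, aocl and fpgainfo.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any Intel tool.

# Return 0 if the text on stdin contains the DIAGNOSTIC_PASSED marker.
diagnostic_passed() {
    grep -q 'DIAGNOSTIC_PASSED'
}

# Return 0 if the named tool is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Check that the tools are installed, then ask aocl about the board.
check_fpga_ready() {
    for tool in aocl fpgainfo; do
        if ! have_tool "$tool"; then
            echo "missing: $tool (is the Acceleration Stack environment set up?)"
            return 1
        fi
    done
    if aocl list-devices 2>&1 | diagnostic_passed; then
        echo "FPGA ready"
    else
        echo "no DIAGNOSTIC_PASSED from aocl; check firmware and BSP"
        return 1
    fi
}
```

Calling check_fpga_ready after setting up the Acceleration Stack environment prints either "FPGA ready" or a hint about which of the three items to revisit.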
The OpenVINO Toolkit Targeting CPU+FPGA
After making sure that we’ve installed the FPGA Acceleration Stack, updated our board firmware, and activated OpenCL with the proper BSP, we’re ready to install the OpenVINO toolkit. I visited the OpenVINO toolkit website to obtain a prebuilt toolkit by registering and downloading “OpenVINO toolkit for Linux* with FPGA Support v2018R3.” The complete offline download package was 2.3 GB. Installation was simple. I tried both the command-line installer and the GUI installer (setup_GUI.sh). The GUI installer uses X11 to pop up windows and was a nicer experience.
We’ll start by taking the OpenVINO toolkit for a spin on a CPU, and then add the performance of an Intel Programmable Acceleration Card with an Arria 10 GX FPGA.
Intel has packaged a few demos to showcase OpenVINO toolkit usage, including SqueezeNet. SqueezeNet is a small CNN architecture that achieves AlexNet*-level accuracy on ImageNet* with 50x fewer parameters. The creators said it well in their paper: “It’s no secret that much of deep learning is tied up in the hell that is parameter tuning. [We make] a case for increased study into the area of convolutional neural network design in order to drastically reduce the number of parameters you have to deal with.” Intel’s demo uses a Caffe SqueezeNet model―helping show how the OpenVINO toolkit connects with popular platforms.
I was able to run SqueezeNet on the CPU by typing:
cd /opt/intel/computer_vision_sdk_fpga_<version>/deployment_tools/demo
./demo_squeezenet_download_convert_run.sh
I was able to run SqueezeNet on the FPGA by typing:
cd /opt/intel/computer_vision_sdk_fpga_<version>/deployment_tools/demo
./demo_squeezenet_download_convert_run.sh -d HETERO:FPGA,CPU
I said “FPGA,” but you’ll note that I actually typed HETERO:FPGA,CPU. That’s because, technically, the FPGA is asked to run the core of the neural network (inferencing), but not our entire program. The Inference Engine has a very nice error message to help us understand which parts of what we’ve specified still run on the CPU. If I try to target the FPGA alone by typing:
./demo_squeezenet_download_convert_run.sh -d FPGA
I’ll be told:
Graph is not supported on FPGA plugin due to existence of layer (Name:prob, Type: SoftMax) in topology. Most likely you need to use heterogeneous plugin instead of FPGA plugin directly.
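To make the device string less of a magic incantation, here is a minimal sketch of a helper that builds it from a list of devices. The helper name hetero_device is hypothetical. It relies on my understanding that the devices after HETERO: are listed in priority order, so layers the first device can’t handle (like the SoftMax layer in the message above) fall through to the next one.

```shell
#!/bin/sh
# Sketch only: hetero_device is a hypothetical helper, not an OpenVINO tool.
# hetero_device FPGA CPU  prints  HETERO:FPGA,CPU
hetero_device() {
    if [ "$#" -eq 0 ]; then
        echo "usage: hetero_device DEVICE [FALLBACK...]" >&2
        return 1
    fi
    if [ "$#" -eq 1 ]; then
        # A single device needs no HETERO prefix.
        printf '%s\n' "$1"
        return 0
    fi
    devices=$1
    shift
    # Append each fallback device, keeping the order given by the caller.
    for d in "$@"; do
        devices="$devices,$d"
    done
    printf 'HETERO:%s\n' "$devices"
}
```

With this in place, the demo could be launched as ./demo_squeezenet_download_convert_run.sh -d "$(hetero_device FPGA CPU)".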
This simple demo example will run slower on an FPGA because the demo is so brief that the overhead of FPGA setup dominates the runtime. To overcome this, I did the following:
export myDIR=/opt/intel/computer_vision_sdk_fpga_2018.3.343
cd $myDIR/deployment_tools/demo/
# Load the Intel-supplied SqueezeNet bitstream onto the FPGA (device acl0)
aocl program acl0 $myDIR/a10_dcp_bitstreams/2-0-1_RC_FP11_SqueezeNet.aocx
export myPIC=$IE_INSTALL/demo/car.png
# 100 iterations on FPGA+CPU, then the same 100 iterations on the CPU alone
csa -m squeezenet1.1.xml -i $myPIC -d HETERO:FPGA,CPU -ni 100 -nireq 3
csa -m squeezenet1.1.xml -i $myPIC -ni 100 -nireq 3
These commands let me avoid the redundant commands in the script, since I know I’ll run twice. I manually increased the iteration count (the -ni parameter) to simulate a more realistic workload that overcomes the FPGA setup costs of a single run. This simulates what I’d expect in a long-running or continuous inferencing situation that would be appropriate with an FPGA-equipped system in a data center.
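To see where the FPGA setup cost stops dominating, it helps to run the same benchmark at several iteration counts on both targets. The following is a minimal sketch that only prints the commands rather than running them. The function name sweep_cmds and the SAMPLE variable are hypothetical; SAMPLE defaults to the csa command name used above. It uses only the flags already shown in this article, and it assumes that -d CPU is equivalent to leaving off -d as I did for the CPU-only run.

```shell
#!/bin/sh
# Sketch only: prints one benchmark command per (device, iteration count) pair.
# SAMPLE is a hypothetical override for the "csa" command name used above.
sweep_cmds() {
    model=$1
    image=$2
    for dev in CPU HETERO:FPGA,CPU; do
        for ni in 1 10 100 1000; do
            echo "${SAMPLE:-csa} -m $model -i $image -d $dev -ni $ni -nireq 3"
        done
    done
}

# Print the eight commands for the SqueezeNet model and demo image.
sweep_cmds squeezenet1.1.xml car.png
```

Piping the output to sh would run the sweep; printing first makes it easy to review the commands before spending time on the hardware.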
On my system, the CPU did an impressive 368 frames per second (FPS), but the version that used the FPGA was even more impressive at 850 FPS. I’m told that the FPGA can outstrip the CPU by even more than that for more substantial inferencing workloads, but I’m impressed with this showing. By the way, the CPU that I used was a dual-socket Intel® Xeon® Silver processor with eight cores per socket and hyperthreading. Beating such CPU horsepower is fun.
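Put as a ratio, those two throughput numbers work out to a speedup of roughly 2.3x. The arithmetic can be sketched with a one-line helper; the function name speedup is just an illustration.

```shell
#!/bin/sh
# Ratio of two throughput figures, rounded to two decimals.
speedup() {
    LC_ALL=C awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# FPGA+CPU frames per second divided by CPU-only frames per second.
speedup 850 368   # prints 2.31
```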
What Runs on the FPGA? A Bitstream
What I would call a “program” is usually called a “bitstream” when talking about an FPGA. Therefore, FPGA people will ask, “What bitstream are you running?” The demo_squeezenet_download_convert_run.sh script hid the magic of creating and loading a bitstream. Compiling a bitstream isn’t fast, while loading one is pretty fast, but neither needs to happen on every run because, once loaded on the FPGA, a bitstream remains available for future runs. The aocl program acl0… command that I issued loads the bitstream, which was supplied by Intel for supported neural networks. I didn’t technically need to reload it, but I chose to expose that step to ensure the commands would work even if I had run other programs on the FPGA in between.
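Because the load step is explicit, it’s easy to wrap it with a sanity check so that a mistyped bitstream path fails early. This is a minimal sketch: load_bitstream and the DRY_RUN variable are hypothetical names, and the only real command it issues is the aocl program call already described, with acl0 as the default device name.

```shell
#!/bin/sh
# Sketch only: load_bitstream and DRY_RUN are hypothetical names.
# Program the FPGA with a bitstream, but only if the .aocx file is readable.
# Set DRY_RUN=1 to print the command instead of running it.
load_bitstream() {
    aocx=$1
    device=${2:-acl0}   # acl0 is the device name used in the text
    if [ ! -r "$aocx" ]; then
        echo "bitstream not found: $aocx" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "aocl program $device $aocx"
    else
        aocl program "$device" "$aocx"
    fi
}
```

A dry run such as DRY_RUN=1 load_bitstream path/to/file.aocx shows the exact command that would be issued without touching the card.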
Wait…Is that All?
The thing I liked about using the OpenVINO toolkit with an FPGA was that I could easily say, “Hey, when are you going to tell me more?” Let’s review what we’ve covered:
- If we have a computer vision application, and we can train it using any popular platform (like Caffe), then we can deploy the trained network with the OpenVINO toolkit on a wide variety of systems.
- Getting an FPGA working means installing the right Acceleration Stack, updating board firmware, getting OpenCL installed with the right BSP, and following the OpenVINO toolkit Inference Engine steps to use the appropriate FPGA bitstream for our neural net.
- And then it just works.
Sorry, there’s no need to discuss OpenCL or VHDL programming. (You can always read my article on OpenCL programming in Issue 31 of The Parallel Universe.)
For computer vision, the OpenVINO toolkit, with its Inference Engine, lets us leave the coding to FPGA experts―so we can focus on our models.
Inside FPGA Support for the OpenVINO Toolkit
There are two very different under-the-hood things that made the OpenVINO toolkit targeting an FPGA very successful:
- An abstraction that spans devices but includes FPGA support
- Very cool FPGA support
The abstraction I speak of is Intel’s Model Optimizer and its usage by the Intel Inference Engine. The Model Optimizer is a cross-platform, command-line tool that:
- Facilitates the transition between the training and deployment environment
- Performs static model analysis
- Adjusts deep learning models for optimal execution on end-point target devices
Figure 3 shows the process of using the Model Optimizer, which starts with a network model trained using a supported framework, and the typical workflow for deploying a trained deep learning model.
Figure 3 – Using the Model Optimizer
The inference engine in our SqueezeNet example simply sends the work to the CPU or the FPGA based on our command. The intermediate representation (IR) that came out of the Model Optimizer can be used by the inferencing engine to process on a variety of devices including CPUs, GPUs, Intel Movidius hardware, and FPGAs. Intel had also done the coding work to create an optimized bitstream for the FPGA that uses the IR to configure itself to handle our network, which brings us to my second under-the-hood item.
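In practice, the IR is what we hand to the Inference Engine with the -m flag, as with squeezenet1.1.xml earlier. As I understand it, the IR is a pair of files, an .xml file describing the topology and a .bin file holding the weights, named after the input model by default. The sketch below simply derives those two expected file names; the helper name ir_files is hypothetical, and the naming rule is an assumption worth checking against the Model Optimizer documentation for your release.

```shell
#!/bin/sh
# Sketch only: ir_files is a hypothetical helper.
# Assumption: the IR is <name>.xml plus <name>.bin, named after the input model.
ir_files() {
    model=$1
    outdir=${2:-.}
    base=$(basename "$model")
    name=${base%.*}          # strip the final extension, e.g. .caffemodel
    printf '%s/%s.xml\n%s/%s.bin\n' "$outdir" "$name" "$outdir" "$name"
}

# Expected IR files for a Caffe SqueezeNet model written to ./ir
ir_files squeezenet1.1.caffemodel ir
```

Checking that both files exist before launching an inference run is a cheap way to catch a failed or skipped conversion step.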
The very cool FPGA support is a collection of carefully tuned codes written by FPGA experts. They’re collectively called the Deep Learning Accelerator (DLA) for FPGAs, and they form the heart of the FPGA acceleration for the OpenVINO toolkit. Using the DLA gives us software programmability that’s close to the efficiency of custom hardware designs, thanks to those expert FPGA programmers who worked hard to handcraft it. (If you want to learn more about the DLA, I recommend the team’s paper, “DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration.” They describe their work as “a methodology to achieve software ease-of-use with hardware efficiency by implementing a domain-specific, customizable overlay architecture.”)
Wrapping Up and Where to Learn More
I want to thank the folks at Intel for granting me access to systems with Arria 10 FPGA cards. This enabled me to evaluate firsthand the ease with which I was able to exploit heterogeneous parallelism and FPGA-based acceleration. I’m a need-for-speed type of programmer, and the FPGA access satisfied my craving for speed without making me use any knowledge of FPGA programming.
I hope you found this walkthrough interesting and useful. And I hope sharing the journey as FPGA capabilities get more and more software support is exciting to you, too.
Here are a few links to help you continue learning and exploring these possibilities:
- OpenVINO toolkit main website (source/GitHub site is here)
- OpenVINO toolkit Inference Engine Developer Guide
- Intel Acceleration Stack can be downloaded from the Intel FPGA Acceleration Hub
- ONNX, an open format to represent deep learning models
- Deep Learning Deployment Toolkit Beta from Intel
- “FPGA Programming with the OpenCL™ Platform,” by James Reinders and Tom Hill, The Parallel Universe, Issue 31
- Official OpenCL standards information
- Intel FPGA product information: Intel® Cyclone® 10 LP, Intel® Arria® 10, and Intel® Stratix® 10
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.